Let's start out with a simple micro benchmark:
using System;
using System.Threading;

class Program
{
    public static void Main()
    {
        int start = Environment.TickCount;
        double[] d = new double[1000];
        for (int i = 0; i < 1000000; i++)
        {
            for (int j = 0; j < d.Length; j++)
            {
                d[j] = (double)(3.0 * d[j]);
            }
        }
        int end = Environment.TickCount;
        Console.WriteLine(end - start);
    }
}
On my system this takes about 7 seconds when run in optimized mode (i.e. not in the debugger).
Here's the optimized x86 code generated by the 2.0 CLR JIT for the body of the inner loop:
fld qword ptr [ecx+edx*8+8] ; d[j]
fmul dword ptr ds:[007B1230h] ; * 3.0
fstp qword ptr [esp] ; (double)
fld qword ptr [esp] ; (double)
fstp qword ptr [ecx+edx*8+8] ; d[j] =
The first thing that jumps out is that the double cast takes two x87 instructions, a store and a load. Part of the reason the cast is expensive is that the value has to leave the FPU, go to main memory, and come back into the FPU. In this particular case it turns out to be very expensive, because esp happens not to be 8-byte aligned.
Making a seemingly unrelated change can make the micro benchmark much faster: just adding the following two lines at the top of the Main method makes the loop run in about 2.3 seconds on my system:
double dv = 0.0;
Interlocked.CompareExchange(ref dv, dv, dv);
The reason for this performance improvement becomes clear when we look at the method prologue in the new situation:
push ebp
mov ebp,esp
and esp,0FFFFFFF8h
push edi
push esi
push ebx
sub esp,14h
This results in an 8-byte aligned esp pointer, so the fstp/fld instructions run much faster. It looks like a "bug" in the JIT that it doesn't align the stack in the first scenario.
Of course, the much more obvious question is: why does the cast generate code at all? Isn't a double already a double?
Before answering this question, let's first look at another minor change to the micro benchmark. Let's remove the Interlocked.CompareExchange() again and change the inner loop body to the following:
double v = 3.0 * d[j];
d[j] = (double)v;
With this change, the loop now takes just 1 second on my system. When we look at the x86 code generated by the JIT, it becomes obvious why:
fld qword ptr [ecx+edx*8+8]
fmul dword ptr ds:[002A1170h]
fstp qword ptr [ecx+edx*8+8]
The redundant fstp/fld instructions are gone.
Back to the question of why the cast isn't always optimized away. The reason lies in the fact that the x87 FPU internally uses an extended 80-bit representation for floating point numbers. When you explicitly cast to a double, the ECMA CLI specification requires that the value be converted from the internal representation to the IEEE 64-bit representation. In the modified loop we're already storing the value to memory, and that store necessarily narrows it to the 64-bit representation, making the extra fstp/fld unnecessary.
Finally, in x64 mode all three variations of the benchmark take 1 second on my system. This is because the x64 CLR JIT uses SSE instructions that internally work on the IEEE 64-bit representation of doubles, so the cast is optimized away in every case.
For completeness, here's the code generated by the x64 JIT for the inner loop body:
movsd xmm0,mmword ptr [rcx]
mulsd xmm0,mmword ptr [000000C0h]
movsd mmword ptr [rcx],xmm0