Tuesday, August 07, 2007

Floating Point "Redundant" Cast Performance on the CLR

Let's start out with a simple micro benchmark:

using System;
using System.Threading;

class Program
{
    public static void Main()
    {
        int start = Environment.TickCount;
        double[] d = new double[1000];
        for (int i = 0; i < 1000000; i++)
            for (int j = 0; j < d.Length; j++)
                d[j] = (double)(3.0 * d[j]);
        int end = Environment.TickCount;
        Console.WriteLine(end - start);
    }
}

On my system this takes about 7 seconds when run in optimized mode (i.e. not in the debugger).

Here's the optimized x86 code generated by the 2.0 CLR JIT for the body of the inner loop:

fld   qword ptr [ecx+edx*8+8]      ; d[j]
fmul  dword ptr ds:[007B1230h]     ; * 3.0 
fstp  qword ptr [esp]              ; (double) 
fld   qword ptr [esp]              ; (double) 
fstp  qword ptr [ecx+edx*8+8]      ; d[j] = 

The first thing that jumps out is that the double cast takes two x87 instructions, a store and a load. Part of the reason the cast is expensive is that the value has to leave the FPU, go to main memory, and come back into the FPU. In this particular case it turns out to be very expensive, because esp happens not to be 8-byte aligned.

A seemingly unrelated change makes the micro benchmark much faster: just adding the following two lines at the top of the Main method makes the loop run in about 2.3 seconds on my system:

double dv = 0.0;
Interlocked.CompareExchange(ref dv, dv, dv);

The reason for this performance improvement becomes clear when we look at the method prologue in the new situation:

push  ebp 
mov   ebp,esp 
and   esp,0FFFFFFF8h 
push  edi 
push  esi 
push  ebx 
sub   esp,14h

The and esp,0FFFFFFF8h instruction aligns esp to an 8-byte boundary, so the fstp/fld pair now operates on an aligned stack slot and runs much faster. It looks like a "bug" in the JIT that it doesn't align the stack in the first scenario.

Of course, the much more obvious question is: why does the cast generate code at all? Isn't a double already a double?

Before answering this question, let's first look at another minor change to the micro benchmark. Let's remove the Interlocked.CompareExchange() again and change the inner loop body to the following:

double v = 3.0 * d[j];
d[j] = (double)v;

With this change, the loop now takes just 1 second on my system. When we look at the x86 code generated by the JIT, it becomes obvious why:

fld   qword ptr [ecx+edx*8+8] 
fmul  dword ptr ds:[002A1170h] 
fstp  qword ptr [ecx+edx*8+8]

The redundant fstp/fld instructions are gone.

Back to the question of why the cast isn't always optimized away. The reason lies in the fact that the x87 FPU internally uses an extended 80-bit representation for floating point numbers. When you explicitly cast to a double, the ECMA CLI specification requires a conversion from that internal representation to the IEEE 64-bit representation. Of course, in this scenario we're storing the value to memory anyway, which already implies a conversion to the 64-bit representation, so the extra fstp/fld is unnecessary.
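The requirement is visible at the IL level: the C# compiler turns the explicit cast into a conv.r8 instruction, which the JIT may only drop when it can prove the value is already in 64-bit form. Here's a minimal sketch; the IL in the comments is the general shape such a statement compiles to (I'm assuming the 2.0 compiler here, exact opcodes may vary):

```csharp
using System;

class CastDemo
{
    static void Main()
    {
        double[] d = { 1.0 };

        // C#:  d[0] = (double)(3.0 * d[0]);
        // IL (roughly): ldc.r8 3. / ldelem.r8 / mul / conv.r8 / stelem.r8
        // The conv.r8 comes from the explicit cast; per ECMA-335 it must
        // round a possibly extended-precision value to 64-bit IEEE, which
        // is what forces the fstp/fld pair on the x87.
        d[0] = (double)(3.0 * d[0]);

        Console.WriteLine(d[0]);
    }
}
```

In the faster variant that stores through a local first, the compiler presumably no longer emits the conv.r8, which would explain why the fstp/fld pair disappears.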

Finally, in x64 mode all three variations of the benchmark take 1 second on my system. This is because the x64 CLR JIT uses SSE2 instructions that work directly on the IEEE 64-bit representation of doubles, so the cast is optimized away in all situations here.

For completeness, here's the code generated by the x64 JIT for the inner loop body:

movsd  xmm0,mmword ptr [rcx] 
mulsd  xmm0,mmword ptr [000000C0h] 
movsd  mmword ptr [rcx],xmm0 