Most compilers have some (or in some cases many)
intrinsic
functions. HotSpot has a number of them (see
here, search for "intrinsics known to the runtime") as does the CLR JIT.
IKVM has had a couple as well (System.arraycopy(),
AtomicReferenceFieldUpdater.newUpdater(),
String.toCharArray()). These were sort of hacked into the compiler, and I
finally decided to clean that up a little and add more scalable support for
adding intrinsics. The trigger for this was that I added four more
intrinsics:
Float.floatToRawIntBits(),
Float.intBitsToFloat(),
Double.doubleToRawLongBits() and
Double.longBitsToDouble().
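As a quick reminder of what these four methods do, here is a minimal sketch (the class name BitsDemo is made up for illustration). Each call reinterprets the IEEE 754 bit pattern of a value as an integer of the same width, or the other way around, without any numeric conversion:

```java
// Illustrative only: BitsDemo is a hypothetical class name, not part of IKVM.
class BitsDemo {
    public static void main(String[] args) {
        // 1.0f has sign 0, biased exponent 127 and mantissa 0.
        int floatBits = Float.floatToRawIntBits(1.0f);
        System.out.println(Integer.toHexString(floatBits));       // 3f800000
        // Reinterpreting the same bits as a float gives the value back.
        System.out.println(Float.intBitsToFloat(floatBits));      // 1.0

        // 1.0 as a double has biased exponent 1023 and mantissa 0.
        long doubleBits = Double.doubleToRawLongBits(1.0);
        System.out.println(Long.toHexString(doubleBits));         // 3ff0000000000000
        System.out.println(Double.longBitsToDouble(doubleBits));  // 1.0
    }
}
```

Because nothing is computed, a JIT can in principle reduce each call to a register move or a store/load pair, which is what makes these methods good candidates for intrinsics.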
Benchmark
Here's a micro benchmark:
public class test {
    public static void main(String[] args) {
        long sum = 1;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            sum += Double.doubleToRawLongBits(sum);
        }
        long end = System.currentTimeMillis();
        System.out.println(end - start);
        System.out.println(sum);
    }
}
Here are the results (elapsed time in milliseconds, as printed by the benchmark):
| Runtime | x86 (aligned) | x86 (unaligned) | x64 |
|---|---|---|---|
| JDK 1.6 HotSpot Server VM | 287 | | 109 |
| JDK 1.6 HotSpot Client VM | 335 | | |
| IKVM 0.36 .NET 1.1 | 479 | 565 | |
| IKVM 0.36 .NET 2.0 | 570 | 704 | 124 |
| IKVM 0.37 | 338 | 468 | 101 |
Since the x86 .NET results are highly sensitive to whether the double on
the stack happens to be aligned or not, I included both results.
Implementation
Here's the MSIL that IKVM generates for the loop:
IL_000b: ldloc.2
IL_000c: ldc.i4 0x989680
IL_0011: bge IL_0028
IL_0016: ldloc.0
IL_0017: ldloc.0
IL_0018: conv.r8
IL_0019: ldloca.s V_3
IL_001b: call int64 [IKVM.Runtime]IKVM.Runtime.DoubleConverter::ToLong(float64, valuetype [IKVM.Runtime]IKVM.Runtime.DoubleConverter&)
IL_0020: add
IL_0021: stloc.0
IL_0022: ldloc.2
IL_0023: ldc.i4.1
IL_0024: add
IL_0025: stloc.2
IL_0026: br.s IL_000b
The conversion isn't actually inlined in the MSIL. Instead, a local variable of
value type IKVM.Runtime.DoubleConverter is added to the method, and the code
calls a static method on that type, passing the value to be converted and a
reference to the local variable. Here's the code for IKVM.Runtime.DoubleConverter:
[StructLayout(LayoutKind.Explicit)]
public struct DoubleConverter
{
    [FieldOffset(0)]
    private double d;
    [FieldOffset(0)]
    private long l;

    public static long ToLong(double value, ref DoubleConverter converter)
    {
        converter.d = value;
        return converter.l;
    }

    public static double ToDouble(long value, ref DoubleConverter converter)
    {
        converter.l = value;
        return converter.d;
    }
}
It uses the .NET feature that allows you to explicitly control the layout
of a struct to overlay the double and long fields. Note that this
construct is fully verifiable.
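Java has no explicit struct layout, but the same "write through one view, read through the other" idea can be sketched with a ByteBuffer. This is only an analogy to illustrate the overlay; the class OverlayConverter is hypothetical and is not how IKVM or HotSpot implement the conversion:

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of IKVM.
final class OverlayConverter {
    // Eight bytes of shared storage, playing the role of the overlaid
    // double and long fields at offset 0.
    private final ByteBuffer storage = ByteBuffer.allocate(8);

    long toLong(double value) {
        storage.putDouble(0, value);   // write the bytes as a double
        return storage.getLong(0);     // read the same bytes back as a long
    }

    double toDouble(long value) {
        storage.putLong(0, value);     // write the bytes as a long
        return storage.getDouble(0);   // read the same bytes back as a double
    }

    public static void main(String[] args) {
        OverlayConverter c = new OverlayConverter();
        System.out.println(c.toLong(1.0) == Double.doubleToRawLongBits(1.0)); // true
        System.out.println(c.toDouble(0x4000000000000000L));                  // 2.0
    }
}
```

As with the struct, both accessors touch the same eight bytes, so no arithmetic conversion takes place; only the interpretation of the bits changes.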
For comparison, the standard
System.BitConverter.DoubleToInt64Bits() uses unsafe code and looks
something like this:
public static unsafe long DoubleToInt64Bits(double value)
{
    return *((long*)&value);
}
For some reason (probably because it isn't verifiable) the JIT doesn't
like this so much and doesn't inline this method.
JIT Code
Here's the x86 code generated by the .NET 2.0 SP1 JIT:
049E15CE cmp ebx,989680h
049E15D4 jge 049E1600
049E15D6 lea ecx,[esp+8]
049E15DA mov dword ptr [esp+10h],esi
049E15DE mov dword ptr [esp+14h],edi
049E15E2 fild qword ptr [esp+10h]
049E15E6 fstp qword ptr [esp+10h]
049E15EA fld qword ptr [esp+10h]
049E15EE fstp qword ptr [ecx]
049E15F0 mov eax,dword ptr [ecx]
049E15F2 mov edx,dword ptr [ecx+4]
049E15F5 add eax,esi
049E15F7 adc edx,edi
049E15F9 mov esi,eax
049E15FB mov edi,edx
049E15FD inc ebx
049E15FE jmp 049E15CE
Here's the x64 code generated by the .NET 2.0 SP1 JIT:
00000642805B8A90 cmp ecx,989680h
00000642805B8A96 jge 00000642805B8AB1
00000642805B8A98 cvtsi2sd xmm0,rdi
00000642805B8A9D lea rax,[rsp+20h]
00000642805B8AA2 movsd mmword ptr [rax],xmm0
00000642805B8AA6 mov rax,qword ptr [rax]
00000642805B8AA9 add rdi,rax
00000642805B8AAC add ecx,1
00000642805B8AAF jmp 00000642805B8A90
In both cases the construct is inlined properly. It is also obvious why the
x64 code is so much faster: it uses SSE (as we've seen before) and only uses
one memory store/load combination.
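To relate the listings back to the benchmark source, here is a sketch of one loop iteration with the implicit steps written out (the class and method names are made up for illustration). The long-to-double widening of sum corresponds to conv.r8 in the MSIL, fild on x86 and cvtsi2sd on x64; the bit reinterpretation is the store of the double followed by the integer load from the same stack slot:

```java
// Illustrative only: LoopStepDemo and step() are hypothetical names.
class LoopStepDemo {
    // One iteration of the benchmark loop body with the conversions made explicit.
    static long step(long sum) {
        double asDouble = (double) sum;                    // numeric conversion: long -> double
        long bits = Double.doubleToRawLongBits(asDouble);  // reinterpret the 64 bits as a long
        return sum + bits;                                 // the 64-bit add
    }

    public static void main(String[] args) {
        // First iteration of the benchmark: sum starts at 1, and 1.0 has
        // the bit pattern 0x3ff0000000000000.
        System.out.println(step(1L)); // 4607182418800017409
    }
}
```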
HotSpot
For completeness, here's the code generated by HotSpot x64:
0000000002772EA0 cvtsi2sd xmm0,r11
0000000002772EA5 add ebp,10h
0000000002772EA8 movsd mmword ptr [rsp+20h],xmm0
0000000002772EAE mov r10,qword ptr [rsp+20h]
0000000002772EB3 add r10,r11
0000000002772EB6 cvtsi2sd xmm0,r10
0000000002772EBB movsd mmword ptr [rsp+20h],xmm0
0000000002772EC1 mov r11,qword ptr [rsp+20h]
0000000002772EC6 add r11,r10
0000000002772EC9 cvtsi2sd xmm0,r11
0000000002772ECE movsd mmword ptr [rsp+20h],xmm0
0000000002772ED4 mov r10,qword ptr [rsp+20h]
0000000002772ED9 add r10,r11
[...]
0000000002772FC0 cvtsi2sd xmm0,r10
0000000002772FC5 movsd mmword ptr [rsp+20h],xmm0
0000000002772FCB mov r11,qword ptr [rsp+20h]
0000000002772FD0 add r11,r10
0000000002772FD3 cmp ebp,r9d
0000000002772FD6 jl 0000000002772EA0
It actually unrolled the loop 16 times (which appears not to be helping in
this case), but otherwise the generated code is pretty similar to what we saw
on the CLR. Of course, in HotSpot Double.doubleToRawLongBits() is also an
intrinsic, because in Java the only alternative would be to write it in
native code, and the JNI transition would add significant overhead in this
case.