It's been a while since I've done an in-depth investigation of a
microbenchmark, and the recent work I did on AtomicReferenceFieldUpdater to
make it work in partial trust also has a nice performance impact. So let's
investigate that.
The Microbenchmark
import java.util.concurrent.atomic.*;

class Test
{
    volatile Object field;

    public static void main(String[] args)
    {
        AtomicReferenceFieldUpdater upd =
            AtomicReferenceFieldUpdater.newUpdater(Test.class, Object.class, "field");
        Test obj = new Test();
        for (int j = 0; j < 5; j++)
        {
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000000; i++)
                upd.compareAndSet(obj, null, null);
            long end = System.currentTimeMillis();
            System.out.println(end - start);
        }
    }
}
The Results
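(times in milliseconds, as printed by the benchmark)

|                | IKVM 0.34 | IKVM 0.36 | IKVM 0.37 | JDK 1.6 |
|----------------|-----------|-----------|-----------|---------|
| .NET 1.1       | 36808     | 41453     |           |         |
| .NET 2.0 / x86 | 75647     | 5776      | 561       | 321     |
| .NET 2.0 / x64 |           | 5081      | 512       | 245     |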
The Differences
The first thing that jumps out is that the IKVM 0.34 results show that
.NET 2.0 reflection is much slower than .NET 1.1 reflection. In IKVM 0.36 the
reflection implementation changed to take advantage of DynamicMethod when
running on .NET 2.0, so there we see a big improvement in performance on
.NET 2.0.
IKVM 0.37 has the new AtomicReferenceFieldUpdater optimization that no
longer uses reflection (if it can figure out at compile time what to do),
which again yields a big performance improvement.
Finally, HotSpot manages to beat IKVM by a factor of two. There is no
difference between HotSpot client and server modes for this benchmark (on
JDK 1.6).
The Compiler
Let's look at some C# pseudo code that shows what ikvmc 0.37 generates
for the above benchmark:
using java.util.concurrent.atomic;
using System.Threading;

class Test
{
    volatile object field;

    // Pseudo code: the generated class name is not a valid C# identifier.
    private sealed class __ARFU_fieldLjava/lang/Object; : AtomicReferenceFieldUpdater
    {
        public override bool compareAndSet(object obj, object expect, object update)
        {
            // CompareExchange returns the original value of the field;
            // the swap happened if that value equals expect.
            return expect == Interlocked.CompareExchange(ref ((Test)obj).field,
                (object)update, (object)expect);
        }

        // ...other methods omitted...
    }

    static void main(string[] args)
    {
        AtomicReferenceFieldUpdater upd = new __ARFU_fieldLjava/lang/Object;();
        // ...rest of method omitted...
    }
}
The bytecode compiler only does this optimization if the arguments to
newUpdater are constants and match up with a volatile instance reference
field in the current class.
The reason this optimization first showed up in IKVM 0.37 is that it
requires the generic version of Interlocked.CompareExchange. In this
particular example the non-generic version would have worked, but in the real
world nearly all uses of AtomicReferenceFieldUpdater are on fields that have a
more specific type than Object, and a ref to such a field cannot be passed as
a ref object argument.
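For readers more at home in Java, the trick in the generated compareAndSet (compare the value returned by the exchange against expect) can be sketched with java.lang.invoke.VarHandle, available since Java 9, whose compareAndExchange also returns the previous value. This is only an analogy and not what ikvmc emits; the class name WitnessCas is made up for the example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical illustration class; not part of IKVM or the JDK.
class WitnessCas {
    volatile Object field;

    // Handle to the 'field' instance field, looked up once.
    private static final VarHandle FIELD;
    static {
        try {
            FIELD = MethodHandles.lookup()
                    .findVarHandle(WitnessCas.class, "field", Object.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // compareAndExchange returns the value that was in the field before the
    // operation; the swap succeeded exactly when that value is 'expect'.
    static boolean compareAndSet(WitnessCas obj, Object expect, Object update) {
        return expect == FIELD.compareAndExchange(obj, expect, update);
    }

    public static void main(String[] args) {
        WitnessCas obj = new WitnessCas();
        Object a = new Object();
        System.out.println(compareAndSet(obj, null, a)); // field was null, so the swap happens
        System.out.println(compareAndSet(obj, null, a)); // field now holds 'a', so no swap
    }
}
```

The boolean result is derived from the returned value rather than from a separate success flag, which is the same shape as the C# pseudo code above.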
The Assembly
So why is HotSpot twice as fast? I modified the test slightly to make the
generated assembly code easier to read by making it an infinite loop. Here's
the x64 code for the loop:
00000000028C2690 mov r11,qword ptr [r8+10h]
00000000028C2694 mov r10,1026DD08h
00000000028C269E cmp r11,r10
00000000028C26A1 jne 00000000028C2773
00000000028C26A7 mov r10,qword ptr [r8+20h]
00000000028C26AB test r10,r10
00000000028C26AE jne 00000000028C273B
00000000028C26B4 mov r10,qword ptr [r8+28h]
00000000028C26B8 mov r11,r9
00000000028C26BB add r11,r10
00000000028C26BE xor eax,eax
00000000028C26C0 xor r10d,r10d
00000000028C26C3 lock cmpxchg qword ptr [r11],r10
00000000028C26C8 sete r12b
00000000028C26CC movzx r12d,r12b
00000000028C26D0 mov r10,r11
00000000028C26D3 shr r10,9
00000000028C26D7 mov r11,589FF80h
00000000028C26E1 mov byte ptr [r11+r10],0
00000000028C26E6 test dword ptr [160000h],eax
00000000028C26EC jmp 00000000028C2690
HotSpot did its thing and was able to inline the virtual compareAndSet
method. I'm pretty sure that HotSpot doesn't have special support
for AtomicReferenceFieldUpdater, but this is simply the normal HotSpot
devirtualization optimization at work. The lock cmpxchg instruction is the
result of HotSpot having intrinsic support for
sun.misc.Unsafe.compareAndSwapObject.
Let's go over the assembly instructions in detail:
00000000028C2690 mov r11,qword ptr [r8+10h]
00000000028C2694 mov r10,1026DD08h
00000000028C269E cmp r11,r10
00000000028C26A1 jne 00000000028C2773
This looks like a HotSpot virtual method inline guard. It's checking to make
sure that the object is of the expected type (if it isn't, the inlined
virtual method may not be correct anymore).
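As a rough mental model only (this is not HotSpot's implementation), a guarded inline behaves like the hand-written Java below: compare the receiver's class with the one the compiler expects, run a copy of the method body when it matches, and otherwise take a slow path. The types Updater and ProfiledUpdater and the method callSite are hypothetical, and in this sketch the slow path is simply an ordinary virtual call.

```java
// Hypothetical stand-ins for the updater class hierarchy; illustration only.
abstract class Updater {
    abstract boolean compareAndSet(Object[] cell, Object expect, Object update);
}

final class ProfiledUpdater extends Updater {
    @Override
    boolean compareAndSet(Object[] cell, Object expect, Object update) {
        // Single-threaded stand-in for the real atomic operation.
        if (cell[0] != expect) return false;
        cell[0] = update;
        return true;
    }
}

class GuardedInlineSketch {
    static int slowPathCalls = 0; // counts how often the guard failed

    static boolean callSite(Updater upd, Object[] cell, Object expect, Object update) {
        // Guard: plays the role of the cmp/jne pair that checks the type.
        if (upd.getClass() == ProfiledUpdater.class) {
            // Body of ProfiledUpdater.compareAndSet, copied inline.
            if (cell[0] != expect) return false;
            cell[0] = update;
            return true;
        }
        // Guard failed: in this sketch, fall back to a normal virtual call.
        slowPathCalls++;
        return upd.compareAndSet(cell, expect, update);
    }
}
```

As long as the guard holds, the call site never performs a virtual dispatch, which is what lets the rest of the loop body be compiled as straight-line code.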
00000000028C26A7 mov r10,qword ptr [r8+20h]
00000000028C26AB test r10,r10
00000000028C26AE jne 00000000028C273B
I'm not sure what this is for. Some field in the AtomicReferenceFieldUpdater object is tested for null.
00000000028C26B4 mov r10,qword ptr [r8+28h]
The offset to the field is loaded from the AtomicReferenceFieldUpdater
object.
00000000028C26B8 mov r11,r9
The passed-in object reference is moved from r9 to r11.
00000000028C26BB add r11,r10
Add the field offset to the object reference. We now have the address of
the memory location we want to update in r11.
00000000028C26BE xor eax,eax
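Clear rax to represent the passed-in null value of the expect argument.
The disassembler shows the register as eax because that is how the
instruction is encoded: on x64, a write to a 32-bit register zero-extends
into the upper 32 bits, so this instruction clears the full 64-bit rax
register.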
00000000028C26C0 xor r10d,r10d
r10 is cleared and represents the passed-in null value of the update
argument.
00000000028C26C3 lock cmpxchg qword ptr [r11],r10
The actual interlocked compare and exchange instruction. The qword at the
memory location in r11 is compared with rax and, if it matches, r10 is
written to it. Since I'm on a dual core machine, the lock prefix is applied.
Locking the bus is expensive, so HotSpot omits it when running on a single
core machine.
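The semantics of cmpxchg can be spelled out as a small single-threaded model in Java. This is a sketch of the instruction's documented behavior, not something HotSpot contains; the class CmpxchgSketch and its fields are hypothetical, and real hardware performs the whole step atomically when the lock prefix is present.

```java
// Hypothetical single-threaded model of 'cmpxchg [mem], src'.
class CmpxchgSketch {
    long rax;   // accumulator: holds the expected value going in
    boolean zf; // zero flag: true when the exchange happened

    void cmpxchg(long[] mem, int index, long src) {
        if (mem[index] == rax) {
            // Match: store the source operand and set the zero flag.
            mem[index] = src;
            zf = true;
        } else {
            // Mismatch: load the current memory value into the accumulator.
            rax = mem[index];
            zf = false;
        }
    }

    public static void main(String[] args) {
        CmpxchgSketch cpu = new CmpxchgSketch();
        long[] mem = { 0 };   // the field currently holds null (0)
        cpu.rax = 0;          // expect == null
        cpu.cmpxchg(mem, 0, 0); // update == null, as in the benchmark
        System.out.println(cpu.zf);
    }
}
```

In the benchmark both expect and update are null, so every iteration takes the matching branch and rewrites the same value.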
00000000028C26C8 sete r12b
00000000028C26CC movzx r12d,r12b
The cmpxchg instruction sets the zero flag if it was successful. These
two instructions copy the zero flag into the r12 register (it is set to 0 or
1 to represent either false or true). Since the result isn't actually used
in this case, this could have been optimized away.
00000000028C26D0 mov r10,r11
00000000028C26D3 shr r10,9
00000000028C26D7 mov r11,589FF80h
00000000028C26E1 mov byte ptr [r11+r10],0
This is a little interesting. It takes the address of the field that was
just (potentially) updated and shifts it to the right by 9 bits and uses
that value to index a static table and clear the corresponding byte. This is
a GC write barrier. The GC consults the table (known as a card table) to
know what objects in older generations it needs to scan when doing a GC of a
younger generation.
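The arithmetic of this write barrier can be sketched in Java. A shift by 9 bits means each table byte covers 512 bytes of heap, and the listing stores 0 to mark a card. Everything else here is an assumption for illustration: the class CardTableSketch, the explicit heapBase subtraction (the listing instead adds the shifted address to a fixed table address), and the value used for a clean card.

```java
import java.util.Arrays;

// Hypothetical model of a card table; not HotSpot code.
class CardTableSketch {
    static final int CARD_SHIFT = 9;               // from 'shr r10,9'
    static final int CARD_SIZE = 1 << CARD_SHIFT;  // 512 bytes per card
    static final byte DIRTY = 0;                   // the value the listing stores
    static final byte CLEAN = (byte) 0xFF;         // assumed clean marker

    final long heapBase;
    final byte[] cards;

    CardTableSketch(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        // One byte per card, rounding the heap size up to whole cards.
        this.cards = new byte[(int) ((heapSize + CARD_SIZE - 1) >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
    }

    // Which card covers the given address.
    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    // What the four instructions do: mark the card containing the field.
    void writeBarrier(long fieldAddress) {
        cards[cardIndex(fieldAddress)] = DIRTY;
    }

    boolean isDirty(int card) {
        return cards[card] == DIRTY;
    }
}
```

During a young-generation collection, a collector using such a table only has to scan the 512-byte regions whose cards are marked, instead of the whole older generation.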
00000000028C26E6 test dword ptr [160000h],eax
This seemingly useless test is part of a mechanism used by the VM to
suspend the thread at this instruction (a safepoint). When the VM
wants to suspend all threads (for a GC) it unmaps the safepoint polling
memory page (in this case at 0x160000) and waits for all threads to suspend.
Each thread running compiled Java code will eventually run this instruction
and cause a page fault. Inside the page fault handler it is detected that a
safepoint thread suspend is requested, and the thread calls the VM to suspend
itself.
00000000028C26EC jmp 00000000028C2690
Branch to the top and start over again.
The Conclusion
The .NET Framework JIT doesn't inline virtual methods and
Interlocked.CompareExchange is not a JIT intrinsic, so there the story is
pretty straightforward. Each loop iteration calls
Interlocked.CompareExchange, which in turn calls the GC write barrier
function. This is why HotSpot is able to beat IKVM 0.37 by a factor of two.
Of course, when you're coding in C# you can write the microbenchmark to
call Interlocked.CompareExchange directly:
using System;
using System.Threading;

class Test
{
    volatile object field;

    static void Main(string[] args)
    {
        Test obj = new Test();
        for (int j = 0; j < 5; j++)
        {
            int start = Environment.TickCount;
            for (int i = 0; i < 10000000; i++)
                Interlocked.CompareExchange(ref obj.field, null, null);
            int end = Environment.TickCount;
            Console.WriteLine(end - start);
        }
    }
}
This runs in 265 milliseconds, which goes to show that in this case all
the fancy footwork that HotSpot does can almost be matched simply by having
by-ref argument passing in your language. Of course, the CLR JIT
isn't perfect. When you change the field type to string the running time
increases to 436 milliseconds, because the invocation of a generic method
goes through a stub that makes sure that the method instantiation exists.
Here it would probably pay to teach the JIT about the generic methods in
System.Threading.Interlocked.