# Thursday, 31 January 2008
How to Disassemble an AtomicReferenceFieldUpdater

It's been a while since I've done an in-depth investigation of a microbenchmark. The recent work I did on AtomicReferenceFieldUpdater to make it work in partial trust also has a nice performance impact, so let's investigate that.

The Microbenchmark

import java.util.concurrent.atomic.*;

class Test {
  volatile Object field;

  public static void main(String[] args) {
    AtomicReferenceFieldUpdater upd =
      AtomicReferenceFieldUpdater.newUpdater(Test.class, Object.class, "field");
    Test obj = new Test();
    for (int j = 0; j < 5; j++) {
      long start = System.currentTimeMillis();
      for (int i = 0; i < 10000000; i++)
        upd.compareAndSet(obj, null, null);
      long end = System.currentTimeMillis();
      System.out.println(end - start);
    }
  }
}

The Results

Times are in milliseconds for 10,000,000 compareAndSet calls.

                  IKVM 0.34   IKVM 0.36   IKVM 0.37   JDK 1.6
.NET 1.1              36808       41453           -         -
.NET 2.0 / x86        75647        5776         561       321
.NET 2.0 / x64            -        5081         512       245

The Differences

The first thing that jumps out is that, according to the IKVM 0.34 results, .NET 2.0 reflection is much slower than .NET 1.1 reflection. In IKVM 0.36 the reflection implementation changed to take advantage of DynamicMethod when running on .NET 2.0, so there we see a big improvement in performance when running on .NET 2.0.

IKVM 0.37 has the new AtomicReferenceFieldUpdater optimization that no longer uses reflection (if it can figure out at compile time what to do). This again yields a big performance improvement.

Finally, HotSpot manages to beat IKVM by a factor of two. There is no difference between HotSpot client and server modes for this benchmark (on JDK 1.6).

The Compiler

Let's look at some C# pseudo code that shows what ikvmc 0.37 generates for the above benchmark:

using java.util.concurrent.atomic;
using System.Threading;

class Test {
  volatile object field;

  private sealed class __ARFU_fieldLjava/lang/Object; : AtomicReferenceFieldUpdater {
    public override bool compareAndSet(object obj, object expect, object update) {
      return expect == Interlocked.CompareExchange(ref ((Test)obj).field,
                                                   (object)update, (object)expect);
    }
    // ...other methods omitted...
  }

  static void main(string[] args) {
    AtomicReferenceFieldUpdater upd = new __ARFU_fieldLjava/lang/Object;();
    // ...rest of method omitted...
  }
}
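
The `expect == Interlocked.CompareExchange(...)` comparison in the pseudo code above converts a compare-exchange that returns the previous value into one that returns a success flag. Note that Interlocked.CompareExchange takes the new value before the comparand, which is why update is passed ahead of expect. The same conversion can be sketched in Java, assuming Java 9 or later for AtomicReference.compareAndExchange. The class and helper names are made up for illustration, and this is not the code ikvmc generates.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasAsBoolean {
    // Hypothetical helper: turn a "returns the value it found" CAS into a
    // "returns success" CAS, mirroring the comparison in the pseudo code.
    static <T> boolean compareAndSet(AtomicReference<T> ref, T expect, T update) {
        // compareAndExchange (Java 9+) returns the value that was present;
        // the swap happened exactly when that value is the expected reference.
        return expect == ref.compareAndExchange(expect, update);
    }

    public static void main(String[] args) {
        Object a = new Object();
        Object b = new Object();
        AtomicReference<Object> ref = new AtomicReference<>(a);
        System.out.println(compareAndSet(ref, a, b)); // true: a was present
        System.out.println(compareAndSet(ref, a, b)); // false: b is present now
        System.out.println(ref.get() == b);           // true
    }
}
```

The comparison is by reference identity, which matches what the cmpxchg instruction does with the raw pointer values.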

The bytecode compiler only does this optimization if the arguments to newUpdater are constants and match up with a volatile instance reference field in the current class.

The reason this optimization only first showed up in IKVM 0.37 is that it requires the generic version of Interlocked.CompareExchange. In this particular example the non-generic version would have worked, but in the real world nearly all uses of AtomicReferenceFieldUpdater are on fields that have a more specific type than Object.
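
To make the "more specific type than Object" point concrete, here is a minimal sketch of the kind of code the conditions above describe: a volatile instance reference field in the current class, and a newUpdater call whose arguments are all constants. The LockFreeStack class and its members are hypothetical, and whether ikvmc applies the optimization to exactly this shape is an assumption based on the stated conditions.

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

// Hypothetical lock-free stack; every name here is illustrative only.
public class LockFreeStack {
    static final class Node {
        final String value;
        Node next;
        Node(String value) { this.value = value; }
    }

    // Volatile instance reference field of the current class, typed as Node
    // rather than Object.
    private volatile Node head;

    // Class literals and a string literal that name the field above.
    private static final AtomicReferenceFieldUpdater<LockFreeStack, Node> HEAD =
        AtomicReferenceFieldUpdater.newUpdater(LockFreeStack.class, Node.class, "head");

    public void push(String value) {
        Node node = new Node(value);
        Node current;
        do {
            current = head;
            node.next = current;
        } while (!HEAD.compareAndSet(this, current, node)); // retry if head moved
    }

    public String pop() {
        Node current;
        do {
            current = head;
            if (current == null) return null; // empty stack
        } while (!HEAD.compareAndSet(this, current, current.next));
        return current.value;
    }

    public static void main(String[] args) {
        LockFreeStack stack = new LockFreeStack();
        stack.push("a");
        stack.push("b");
        System.out.println(stack.pop()); // b
        System.out.println(stack.pop()); // a
        System.out.println(stack.pop()); // null
    }
}
```

Because the field is declared as Node, a direct translation needs a compare-exchange on a Node-typed location, which is where the generic Interlocked.CompareExchange comes in.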

The Assembly

So why is HotSpot twice as fast? I modified the test slightly to make the generated assembly code easier to read by making it an infinite loop. Here's the x64 code for the loop:

00000000028C2690   mov          r11,qword ptr [r8+10h]
00000000028C2694   mov          r10,1026DD08h
00000000028C269E   cmp          r11,r10
00000000028C26A1   jne          00000000028C2773
00000000028C26A7   mov          r10,qword ptr [r8+20h]
00000000028C26AB   test         r10,r10
00000000028C26AE   jne          00000000028C273B
00000000028C26B4   mov          r10,qword ptr [r8+28h]
00000000028C26B8   mov          r11,r9
00000000028C26BB   add          r11,r10
00000000028C26BE   xor          eax,eax
00000000028C26C0   xor          r10d,r10d
00000000028C26C3   lock cmpxchg qword ptr [r11],r10
00000000028C26C8   sete         r12b
00000000028C26CC   movzx        r12d,r12b
00000000028C26D0   mov          r10,r11
00000000028C26D3   shr          r10,9
00000000028C26D7   mov          r11,589FF80h
00000000028C26E1   mov          byte ptr [r11+r10],0
00000000028C26E6   test         dword ptr [160000h],eax
00000000028C26EC   jmp          00000000028C2690

HotSpot did its thing and was able to inline the virtual compareAndSet method. I'm pretty sure that HotSpot doesn't have special support for AtomicReferenceFieldUpdater; this is simply the normal HotSpot devirtualization optimization at work. The lock cmpxchg instruction is the result of HotSpot having intrinsic support for sun.misc.Unsafe.compareAndSwapObject.

Let's go over the assembly instructions in detail:

00000000028C2690   mov          r11,qword ptr [r8+10h]
00000000028C2694   mov          r10,1026DD08h
00000000028C269E   cmp          r11,r10
00000000028C26A1   jne          00000000028C2773

This looks like a HotSpot virtual method inline guard. It's checking to make sure that the object is of the expected type (if it isn't, the inlined virtual method may not be correct anymore).
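
As a rough source-level picture of such a guard, the sketch below checks the receiver's exact class and runs a pasted-in copy of the method body when it matches. All class names are hypothetical, and the fallback here is just an ordinary virtual call; what HotSpot actually does at the jne target is not shown in the listing, so the sketch makes no claim about it.

```java
import java.util.concurrent.atomic.AtomicReference;

public class InlineGuardSketch {
    // Hypothetical stand-in for an abstract updater type.
    static abstract class Updater {
        abstract boolean compareAndSet(AtomicReference<Object> target, Object expect, Object update);
    }

    // The implementation the compiled code expects to see.
    static final class ExpectedImpl extends Updater {
        @Override
        boolean compareAndSet(AtomicReference<Object> target, Object expect, Object update) {
            return target.compareAndSet(expect, update);
        }
    }

    // Any other implementation; it counts calls so the fallback is observable.
    static final class OtherImpl extends Updater {
        int calls;
        @Override
        boolean compareAndSet(AtomicReference<Object> target, Object expect, Object update) {
            calls++;
            return target.compareAndSet(expect, update);
        }
    }

    // Conceptual shape of the guarded call site.
    static boolean guardedCall(Updater upd, AtomicReference<Object> target,
                               Object expect, Object update) {
        if (upd.getClass() == ExpectedImpl.class) {
            // Guard passed: body of ExpectedImpl.compareAndSet, no virtual dispatch.
            return target.compareAndSet(expect, update);
        }
        // Guard failed: this sketch simply makes the normal virtual call.
        return upd.compareAndSet(target, expect, update);
    }

    public static void main(String[] args) {
        AtomicReference<Object> target = new AtomicReference<>(null);
        OtherImpl other = new OtherImpl();
        System.out.println(guardedCall(new ExpectedImpl(), target, null, "x")); // true
        System.out.println(guardedCall(other, target, "x", "y"));               // true
        System.out.println(other.calls);                                        // 1
    }
}
```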

00000000028C26A7   mov          r10,qword ptr [r8+20h]
00000000028C26AB   test         r10,r10
00000000028C26AE   jne          00000000028C273B

I'm not sure. Some field in the AtomicReferenceFieldUpdater object is tested for null.

00000000028C26B4   mov          r10,qword ptr [r8+28h]

The offset to the field is loaded from the AtomicReferenceFieldUpdater object.

00000000028C26B8   mov          r11,r9

The passed in object reference is moved from r9 to r11.

00000000028C26BB   add          r11,r10

Add the field offset to the object reference. We now have the address of the memory location we want to update in r11.

00000000028C26BE   xor          eax,eax

Clear rax to represent the passed in null value of the expect argument. The disassembler shows the register as eax because that is what the instruction encodes: on x64, writing a 32-bit register zero-extends into the upper 32 bits, so this instruction clears the full 64-bit rax register.

00000000028C26C0   xor          r10d,r10d

r10 is cleared and represents the passed in null value of the update argument.

00000000028C26C3   lock cmpxchg qword ptr [r11],r10

The actual interlocked compare and exchange instruction. The qword at memory location r11 is compared with rax and, if it matches, r10 is written to it. Since I'm on a dual-core machine, the lock prefix is applied. Locking the bus is expensive, so HotSpot omits it when running on a single-core machine.

00000000028C26C8   sete         r12b
00000000028C26CC   movzx        r12d,r12b

The cmpxchg instruction sets the zero flag if it was successful. These two instructions copy the zero flag into the r12 register (it is set to 0 or 1 to represent either false or true). Since the result isn't actually used in this case, this could have been optimized away.

00000000028C26D0   mov          r10,r11
00000000028C26D3   shr          r10,9
00000000028C26D7   mov          r11,589FF80h
00000000028C26E1   mov          byte ptr [r11+r10],0

This is a little interesting. It takes the address of the field that was just (potentially) updated, shifts it to the right by 9 bits and uses that value to index a static table and clear the corresponding byte, so each byte in the table covers 512 bytes of memory. This is a GC write barrier. The GC consults the table (known as a card table) to know which objects in older generations it needs to scan when doing a GC of a younger generation.
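
The card marking arithmetic can be modelled in a few lines of Java. This is only a sketch: the class, the heap base and size, and the value used for a clean card are assumptions made for illustration. The sketch also subtracts the heap base explicitly, whereas the assembly indexes directly off a constant table address.

```java
import java.util.Arrays;

public class CardTableSketch {
    static final int CARD_SHIFT = 9;        // 2^9 = 512 bytes of heap per card
    static final byte DIRTY = 0;            // the barrier in the listing stores 0
    static final byte CLEAN = (byte) 0xFF;  // assumed clean marker for this sketch

    final long heapBase;
    final byte[] cards;

    CardTableSketch(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        // One byte per 512-byte card, rounded up.
        this.cards = new byte[(int) ((heapSize + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT)];
        Arrays.fill(cards, CLEAN);
    }

    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    // Equivalent of the shr / mov byte ptr pair: mark the card holding the field.
    void writeBarrier(long fieldAddress) {
        cards[cardIndex(fieldAddress)] = DIRTY;
    }

    boolean isDirty(long address) {
        return cards[cardIndex(address)] == DIRTY;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(0x10000L, 4096); // 8 cards
        table.writeBarrier(0x10000L + 0x3F0);              // offset 1008 lies in card 1
        System.out.println(table.cardIndex(0x10000L + 0x3F0)); // 1
        System.out.println(table.isDirty(0x10000L + 512));     // true, same card
        System.out.println(table.isDirty(0x10000L + 1024));    // false, card 2
    }
}
```

The barrier is unconditional, which matches the listing: the card is marked whether or not the compare and exchange actually stored anything.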

00000000028C26E6   test         dword ptr [160000h],eax

This seemingly useless test is a safepoint poll, part of a mechanism used by the VM to suspend the thread at this instruction. When the VM wants to suspend all threads (for a GC) it makes the safepoint polling memory page (in this case at 0x160000) unreadable and waits for all threads to suspend. Each thread running compiled Java code will eventually run this instruction and cause a page fault. The page fault handler detects that a safepoint thread suspend is requested, and the thread calls the VM to suspend itself.
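
The cooperative nature of this scheme can be illustrated with a small Java model. Note the swap in technique: the sketch polls a volatile flag and parks on a latch, whereas the real mechanism reads a memory page and relies on a page fault. All names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class SafepointPollSketch {
    // Stand-in for the polling page: false means readable, true means protected.
    private volatile boolean safepointRequested;
    private volatile boolean done;
    private final AtomicLong iterations = new AtomicLong();
    private final CountDownLatch parked = new CountDownLatch(1);
    private final CountDownLatch resume = new CountDownLatch(1);

    // The poll that the loop executes once per iteration.
    private void poll() throws InterruptedException {
        if (safepointRequested) {
            parked.countDown(); // tell the "VM" this thread has stopped
            resume.await();     // stay suspended until the "VM" releases us
        }
    }

    private void workerLoop() {
        try {
            while (!done) {
                iterations.incrementAndGet();
                poll();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the worker made no progress while held at the safepoint.
    static boolean demo() {
        SafepointPollSketch vm = new SafepointPollSketch();
        Thread worker = new Thread(vm::workerLoop);
        worker.start();
        try {
            vm.safepointRequested = true; // "protect the page"
            vm.parked.await();            // wait for the worker to reach its poll
            long before = vm.iterations.get();
            Thread.sleep(20);
            long after = vm.iterations.get();
            vm.safepointRequested = false;
            vm.done = true;
            vm.resume.countDown();        // let the worker continue and exit
            worker.join();
            return before == after;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

The appeal of the page-based version is that the poll costs a single load on the fast path, with no compare and branch in the compiled loop.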

00000000028C26EC   jmp          00000000028C2690

Branch to the top and start over again.

The Conclusion

The .NET Framework JIT doesn't inline virtual methods and Interlocked.CompareExchange is not a JIT intrinsic, so there the story is pretty straightforward. Each loop iteration calls Interlocked.CompareExchange, which in turn calls the GC write barrier function. This is why HotSpot is able to beat IKVM 0.37 by a factor of two.

Of course, when you're coding in C# you can write the microbenchmark to call Interlocked.CompareExchange directly:

using System;
using System.Threading;

class Test {
  volatile object field;

  static void Main(string[] args) {
    Test obj = new Test();
    for (int j = 0; j < 5; j++) {
      int start = Environment.TickCount;
      for (int i = 0; i < 10000000; i++)
        Interlocked.CompareExchange(ref obj.field, null, null);
      int end = Environment.TickCount;
      Console.WriteLine(end - start);
    }
  }
}

This runs in 265 milliseconds, which goes to show that in this case all the fancy footwork that HotSpot does can almost be matched simply by having by-ref argument passing in your language. Of course, the CLR JIT isn't perfect. When you change the field type to string, the running time increases to 436 milliseconds, because the invocation of a generic method goes through a stub that makes sure that the method instantiation exists. Here it would probably pay to teach the JIT about the generic methods in System.Threading.Interlocked.

Thursday, 31 January 2008 10:50:35 (W. Europe Standard Time, UTC+01:00)