Back in 2003 I described how you could theoretically use value types to efficiently inject behavior into a generic type. Recently I was thinking about this again and decided to look at the code generated by the JIT.
Let's look at a very simple case:
interface IOperation
{
void DoIt();
}
struct Dummy : IOperation
{
public void DoIt() { }
}
public class Program
{
static void Gen<T>() where T : IOperation
{
default(T).DoIt();
}
static void Main()
{
for (; ; )
{
Gen<Dummy>();
}
}
}
The .NET 2.0 SP1 x86 JIT generates the following code for the Gen<Dummy>()
method:
sub esp,8
xor eax,eax
mov dword ptr [esp],eax
mov dword ptr [esp+4],eax
mov byte ptr [esp],0
movsx eax,byte ptr [esp]
mov byte ptr [esp+4],al
add esp,8
ret
The good news is that the Dummy.DoIt()
method has been inlined. The bad news is that every single instruction apart from the ret
is unnecessary.
In order to understand what is going on, we need to look at the IL generated by the C# compiler for the Gen<T>()
method and we also need to know that the CLR treats zero length types as having a length of one byte.
Here's the IL for Gen<T>()
:
.maxstack 1
.locals init ([0] !!T CS$0$0000, [1] !!T CS$0$0001)
ldloca.s CS$0$0000
initobj !!T
ldloc.0
stloc.1
ldloca.s CS$0$0001
constrained. !!T
callvirt instance void IOperation::DoIt()
ret
For some reason the default(T).Doit()
construct results in two local variables being created. Now the JIT code above is starting to make a little sense (although it still isn't an excuse why it isn't optimized away):
sub esp,8 xor eax,eax mov dword ptr [esp],eax mov dword ptr [esp+4],eax |
.locals init ([0] !!T CS$0$0000, [1] !!T CS$0$0001) |
Reserve and initialize space on the stack for the two local variables. Note that they are both unnecessarily padded, so we need 8 bytes intsead of 4. |
mov byte ptr [esp],0 |
ldloca.s CS$0$0000 initobj !!T |
Initialize the first local variable. |
movsx eax,byte ptr [esp] |
ldloc.0 |
Read the value of the first local variable. |
mov byte ptr [esp+4],al |
stloc.1 |
Store the value we just read in the second local variable. |
|
ldloca.s CS$0$0001 constrained. !!T callvirt instance void IOperation::DoIt() |
Inlined (empty) DoIt method. |
add esp,8 ret |
ret |
Unreserve stack space and return to caller. |
We can improve the code a little by changing the Gen<T>()
method body as follows:
T t = default(T);
t.DoIt();
Now the C# compiler won't generate the extra local variable and the x86 code looks a little better:
push eax
xor eax,eax
mov dword ptr [esp],eax
mov byte ptr [esp],0
pop ecx
ret
For completeness here's the x64 code for the original Gen<Dummy>()
method:
sub rsp,38h
xor eax,eax
mov byte ptr [rsp+21h],al
xor eax,eax
mov byte ptr [rsp+20h],al
xor eax,eax
mov byte ptr [rsp+21h],al
movzx eax,byte ptr [rsp+21h]
mov byte ptr [rsp+20h],al
lea rcx,[rsp+20h]
mov rax,6428001C078h
call rax
nop
add rsp,38h
rep ret
This is even worse than the x86 code, because the Dummy.DoIt()
method isn't even inlined.
NGEN
The standard excuse of why a JIT doesn't do some "obvious" optimizations is of course that it runs while your program is supposed to be running, so it can't take a lot of time to analyze and optimize the code. This is true to some extent (and JIT performance is one reason why running your managed code on x64 takes so much longer to start up), but it doesn't apply to NGEN and from what I understand today's NGEN doesn't do any more optimizations than the JIT compiler. I tested this specific example (for x64) and the NGEN code is indeed identical to the JIT code. This is unfortunate and hopefully, someday we'll see more sophisticated optimizations specific to NGEN as well.
Conclusion
This trick of using value types in generic algorithms appears to work reasonably well on x86, but there is still room for improvement in the JIT. Rumour has it that in the next .NET 2.0 service pack (due out this summer?) there will be an update of the JIT that finally supports inlining of methods that have value type arguments, let's hope that this JIT update will also include other optimizations as well. Let's also hope that this update includes a better version of the x64 JIT, because as this example shows yet again the x64 JIT still considerably lags behind the x86 JIT.
P.S. Here's the Mono 1.9 x86 JIT code for the original Gen<Dummy>()
:
push ebp
mov ebp,esp
sub esp,8
mov byte ptr [ebp-8],0
mov byte ptr [ebp-4],0
mov byte ptr [ebp-8],0
movsx eax,byte ptr [ebp-8]
mov byte ptr [ebp-4],al
lea eax,[ebp-4]
push eax
call 02A20278
pop ecx
leave
ret