# Tuesday, November 10, 2009
« New Development Snapshot | Main | Running Eclipse with NGEN »
Going Crazy with Generics, or The Story of ThreadLocal(Of T)

Jon Skeet recently blogged about the performance of [ThreadStatic] versus the new .NET 4.0 ThreadLocal<T>. I was surprised to see that ThreadLocal<T> was faster than [ThreadStatic], because ThreadLocal<T> uses [ThreadStatic] as the underlying primitive.

How do you go from a static field to a per instance field? It's simple, once you think of it. You (ab)use generic types. Here's a simplified ThreadLocal<T>:

public class ThreadLocal<T>
  HolderBase holder;
  static int count;
  static Type[] types = new Type[] { typeof(C1), typeof(C2), typeof(C3) };

  abstract class HolderBase
    internal abstract T Value { get; set; }

  class C1 { }
  class C2 { }
  class C3 { }

  class Holder : HolderBase
    static T val;

    internal override T Value
      get { return val; }
      set { val = value; }

  public ThreadLocal()
    holder = MakeHolder(Interlocked.Increment(ref count) - 1);

  HolderBase MakeHolder(int index)
    Type t1 = types[index % 3];
    Type t2 = types[(index / 3) % 3];
    Type t3 = types[index / 9];
    Type type = typeof(Holder<,,>).MakeGenericType(typeof(T), t1, t2, t3);
    return (HolderBase)Activator.CreateInstance(type);

  public T Value
    get { return holder.Value; }
    set { holder.Value = value; }

The real ThreadLocal<T> type in .NET 4.0 beta 2 is much more complex, because it has to deal with recycling the types and protecting against returning a value from a recycled type. It also uses a higher base counting system  to number the types, the maximum number of types generated (per T) is 4096 in beta 2. After you allocate more than that, it falls back to using a holder type that uses Thread.SetData().

I'm not sure what to make of this. It's a clever trick, but I think it ultimately is too clever. I benchmarked a simpler approach using arrays (where each ThreadLocal<T> simply allocated an index in the [ThreadStatic] array) and it was a little bit faster and doesn't suffer from the downsides of creating a gazillion types (which probably take more memory and those types stay around until the AppDomain is destroyed).

Finally a tip for Microsoft, move the Cn types out of ThreadLocal, because currently they are also generic (due to the fact that C# automatically makes nested types generic based on the outer type's generic type parameters) and that is unnecessarily wasteful.

Tuesday, November 10, 2009 7:03:31 AM (W. Europe Standard Time, UTC+01:00)  #    Comments [6]
Sunday, September 25, 2011 9:42:53 AM (W. Europe Daylight Time, UTC+02:00)
I can confirm your results. As strange as this may sound, indexing into thread-local arrays is faster than accessing a direct generic reference containing a static thread-local variable.

I implemented your generics trick, but <a href="http://sasa.hg.sourceforge.net/hgweb/sasa/sasa/file/586cf322c4cc/Sasa.TM/ThreadScoped.cs">significantly simplified and improved it</a>, such that it doesn't suffer from any of the issues you name, and it's simpler than the array version, with more features. I also implemented the <a href="http://sasa.hg.sourceforge.net/hgweb/sasa/sasa/file/3387d040ceaf/Sasa.TM/ThreadScoped.cs">array version</a>.

The advantages of the generic version: implementation is simpler, uses less memory, can grow incrementally limited only by memory, and can easily reuse instances.

The advantages of the array approach: more than twice as fast as the generics version, by my benchmarks. The array version has a serious limitation in that allocation will get slower over time since we have to search for selectors, and that there's no simple way I can think of to allow allocation to grow without limit. That's a tough sell, but the performance advantage is seriously compelling.
Monday, September 26, 2011 3:18:11 PM (W. Europe Daylight Time, UTC+02:00)
For .NET 4.5 (the preview included with the Windows 8 Developer Preview) Microsoft moved to an array based solution (probably because they wanted to add the Values property).
Tuesday, September 27, 2011 6:24:35 AM (W. Europe Daylight Time, UTC+02:00)
You may be interested in my post on this subject, where I explain my implementation and provide benchmarks:


I didn't cover the array-backed version though. You'll be happy to know that your intuition is correct that there's no way that thread-local instances could be faster than thread-local static data. I suspect there were some confounding factors in Skeet's original tests that lead to those results.
Tuesday, September 27, 2011 6:46:11 AM (W. Europe Daylight Time, UTC+02:00)
Your solution of recursing the index type is clever, but unfortunately on the current CLR is doesn't scale very well. Try this code:

List<object> list = new List<object>();
for (int i = 0; i < 10000; i++)

It takes a long time to run and eventually dies with a stack overflow.
Tuesday, September 27, 2011 11:13:31 PM (W. Europe Daylight Time, UTC+02:00)
Actually, it doesn't take much time for me, and it doesn't generate a stack overflow, it generates a System.TypeLoadException because the generic instantiation "type depth" is too deep, at i = 99. Bizarre. I don't use recursion in the allocation, it's a loop, so stack overflow wouldn't make sense.

In any case, the same solution I proposed would work with a flatter hierarchy across N type parameters instead of just one, so we could get the scaling desired. If generic recursion depth of n=99 is the limit, with two parameters we can construct 99^2 instances due to permutations, and with 3 it's 99^3, and so on. You just increment the indexes using simple counting, ie. up to i mod 99, increment first index, then overflow to the second, etc.

Thanks for the heads up! I'll fix it soon to use this new scheme. It's hard to predict where CLR limitations might crop up.
Wednesday, September 28, 2011 6:10:18 AM (W. Europe Daylight Time, UTC+02:00)
Don't mean to keep posting here, but thought you might be interested that I addressed your concern [1], and added tests to verify that I can create more than 10,000 unique instances. Theoretically, I can create almost a million (99^3), but I'm not sure if I'll run into other CLR limitations before then. If you have any thoughts, let me know!

[1] http://higherlogics.blogspot.com/2011/09/threadscoped-next-generation.html
Comments are closed.