# Tuesday, 10 November 2009
« New Development Snapshot | Main | Running Eclipse with NGEN »
Going Crazy with Generics, or The Story of ThreadLocal(Of T)

Jon Skeet recently blogged about the performance of [ThreadStatic] versus the new .NET 4.0 ThreadLocal<T>. I was surprised to see that ThreadLocal<T> was faster than [ThreadStatic], because ThreadLocal<T> uses [ThreadStatic] as the underlying primitive.

How do you go from a static field to a per instance field? It's simple, once you think of it. You (ab)use generic types. Here's a simplified ThreadLocal<T>:

public class ThreadLocal<T>
{
  HolderBase holder;
  static int count;
  static Type[] types = new Type[] { typeof(C1), typeof(C2), typeof(C3) };

  abstract class HolderBase
  {
    internal abstract T Value { get; set; }
  }

  class C1 { }
  class C2 { }
  class C3 { }

  class Holder : HolderBase
  {
    [ThreadStatic]
    static T val;

    internal override T Value
    {
      get { return val; }
      set { val = value; }
    }
  }

  public ThreadLocal()
  {
    holder = MakeHolder(Interlocked.Increment(ref count) - 1);
  }

  HolderBase MakeHolder(int index)
  {
    Type t1 = types[index % 3];
    Type t2 = types[(index / 3) % 3];
    Type t3 = types[index / 9];
    Type type = typeof(Holder<,,>).MakeGenericType(typeof(T), t1, t2, t3);
    return (HolderBase)Activator.CreateInstance(type);
  }

  public T Value
  {
    get { return holder.Value; }
    set { holder.Value = value; }
  }
}

The real ThreadLocal<T> type in .NET 4.0 beta 2 is much more complex, because it has to deal with recycling the types and protecting against returning a value from a recycled type. It also uses a higher base counting system  to number the types, the maximum number of types generated (per T) is 4096 in beta 2. After you allocate more than that, it falls back to using a holder type that uses Thread.SetData().

I'm not sure what to make of this. It's a clever trick, but I think it ultimately is too clever. I benchmarked a simpler approach using arrays (where each ThreadLocal<T> simply allocated an index in the [ThreadStatic] array) and it was a little bit faster and doesn't suffer from the downsides of creating a gazillion types (which probably take more memory and those types stay around until the AppDomain is destroyed).

Finally a tip for Microsoft, move the Cn types out of ThreadLocal, because currently they are also generic (due to the fact that C# automatically makes nested types generic based on the outer type's generic type parameters) and that is unnecessarily wasteful.

Tuesday, 10 November 2009 07:03:31 (W. Europe Standard Time, UTC+01:00)  #    Comments [6]
Sunday, 25 September 2011 09:42:53 (W. Europe Daylight Time, UTC+02:00)
I can confirm your results. As strange as this may sound, indexing into thread-local arrays is faster than accessing a direct generic reference containing a static thread-local variable.

I implemented your generics trick, but <a href="http://sasa.hg.sourceforge.net/hgweb/sasa/sasa/file/586cf322c4cc/Sasa.TM/ThreadScoped.cs">significantly simplified and improved it</a>, such that it doesn't suffer from any of the issues you name, and it's simpler than the array version, with more features. I also implemented the <a href="http://sasa.hg.sourceforge.net/hgweb/sasa/sasa/file/3387d040ceaf/Sasa.TM/ThreadScoped.cs">array version</a>.

The advantages of the generic version: implementation is simpler, uses less memory, can grow incrementally limited only by memory, and can easily reuse instances.

The advantages of the array approach: more than twice as fast as the generics version, by my benchmarks. The array version has a serious limitation in that allocation will get slower over time since we have to search for selectors, and that there's no simple way I can think of to allow allocation to grow without limit. That's a tough sell, but the performance advantage is seriously compelling.
Sandro
Monday, 26 September 2011 15:18:11 (W. Europe Daylight Time, UTC+02:00)
For .NET 4.5 (the preview included with the Windows 8 Developer Preview) Microsoft moved to an array based solution (probably because they wanted to add the Values property).
Tuesday, 27 September 2011 06:24:35 (W. Europe Daylight Time, UTC+02:00)
You may be interested in my post on this subject, where I explain my implementation and provide benchmarks:

http://higherlogics.blogspot.com/2011/09/faster-thread-local-data-with.html

I didn't cover the array-backed version though. You'll be happy to know that your intuition is correct that there's no way that thread-local instances could be faster than thread-local static data. I suspect there were some confounding factors in Skeet's original tests that lead to those results.
Sandro
Tuesday, 27 September 2011 06:46:11 (W. Europe Daylight Time, UTC+02:00)
Your solution of recursing the index type is clever, but unfortunately on the current CLR is doesn't scale very well. Try this code:

List<object> list = new List<object>();
for (int i = 0; i < 10000; i++)
list.Add(Sasa.Concurrency.ThreadScoped.Create(42));

It takes a long time to run and eventually dies with a stack overflow.
Tuesday, 27 September 2011 23:13:31 (W. Europe Daylight Time, UTC+02:00)
Actually, it doesn't take much time for me, and it doesn't generate a stack overflow, it generates a System.TypeLoadException because the generic instantiation "type depth" is too deep, at i = 99. Bizarre. I don't use recursion in the allocation, it's a loop, so stack overflow wouldn't make sense.

In any case, the same solution I proposed would work with a flatter hierarchy across N type parameters instead of just one, so we could get the scaling desired. If generic recursion depth of n=99 is the limit, with two parameters we can construct 99^2 instances due to permutations, and with 3 it's 99^3, and so on. You just increment the indexes using simple counting, ie. up to i mod 99, increment first index, then overflow to the second, etc.

Thanks for the heads up! I'll fix it soon to use this new scheme. It's hard to predict where CLR limitations might crop up.
Sandro
Wednesday, 28 September 2011 06:10:18 (W. Europe Daylight Time, UTC+02:00)
Don't mean to keep posting here, but thought you might be interested that I addressed your concern [1], and added tests to verify that I can create more than 10,000 unique instances. Theoretically, I can create almost a million (99^3), but I'm not sure if I'll run into other CLR limitations before then. If you have any thoughts, let me know!

[1] http://higherlogics.blogspot.com/2011/09/threadscoped-next-generation.html
Sandro
Name
E-mail
Home page

I apologize for the lameness of this, but the comment spam was driving me nuts. In order to be able to post a comment, you need to answer a simple question. Hopefully this question is easy enough not to annoy serious commenters, but hard enough to keep the spammers away.

Anti-Spam Question: What method on java.lang.System returns an object's original hashcode (i.e. the one that would be returned by java.lang.Object.hashCode() if it wasn't overridden)? (case is significant)

Answer:  
Comment (HTML not allowed)  

Live Comment Preview