The BIG performance difference between ArrayPools in .NET

Eugene Peshkov
13 min read · Apr 12, 2023

Automatic garbage collection simplifies program development by eliminating the need to track object lifetimes and free memory manually. However, to keep the garbage collector a useful tool rather than your main enemy on the way to high performance, it sometimes makes sense to help it by optimizing frequent allocations and the allocation of large objects.

To reduce allocations, modern .NET provides Span<T>/Memory<T>, stackalloc with Span<T> support, structures, and other tools. But if you cannot do without an object on the heap, for example because the object is too large for the stack or is used in asynchronous code, that object can be reused. And for the largest objects, arrays, .NET ships several built-in implementations of ArrayPool<T>.

In this article, I will talk about the internals of ArrayPool<T> implementations in .NET, the pitfalls that can make pooling ineffective, concurrent data structures, and the pooling of objects other than arrays.

Allocator vs Pool

A pool can be viewed as an object allocator: both expose the same interface with two methods that play the roles of new and delete. A good native allocator also reuses memory: on delete, a piece of memory is not immediately given back to the operating system but is reused for new objects inside the program. In other words, it works like a pool.

This raises the question: why build a managed-object pool instead of switching to a native allocator?

- It requires fewer changes to the code that uses those objects. Also, some APIs take ArraySegment<T> rather than Memory<T>.
- This helps keep the code cross-platform. Using a third-party allocator usually means bundling a native library.
- Managed objects can hold references to other managed objects. Referencing managed objects from memory that is not managed by the garbage collector is not allowed: when the heap is defragmented and objects are moved, such references will not be updated and will become invalid.

The difference between a pool and an allocator is that a pool may not retain some of the objects, for example when its capacity is exceeded; the garbage collector then collects them. An allocator, on the other hand, must guarantee that there are no memory leaks.

ArrayPool<T>

There are two different implementations of the abstract class ArrayPool<T> built into .NET. Since it is impossible to store arrays for each size individually (there would be too many), when you call .Rent(N) the pool returns an array of size N or larger. Inside, the pool stores arrays with lengths equal to powers of two.
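
For example, a minimal illustration of this rounding behavior (the exact bucket sizes are an implementation detail, so treat the 1024 below as typical rather than guaranteed):

using System.Buffers;

var pool = ArrayPool<byte>.Shared;
byte[] buffer = pool.Rent(1000);   // ask for at least 1000 bytes
Console.WriteLine(buffer.Length);  // 1024: the next power-of-two bucket
pool.Return(buffer);               // hand the array back for reuse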

The first implementation backs pools created with ArrayPool<T>.Create()/ArrayPool<T>.Create(maxArrayLength, maxArraysPerBucket). The second is the static ArrayPool<T>.Shared. You can also write your own implementation of ArrayPool<T>.

Let’s start analyzing the differences between the pools obtained through ArrayPool<T>.Shared and ArrayPool<T>.Create(…) with a benchmark. In addition to these implementations, let’s also test an implementation that doesn’t reuse anything, but simply allocates new arrays and leaves them to the GC.

// Threads = 16
// Iterations = 64*1024
// ArraySize = 1024
[Benchmark]
public void ArrayPoolConcurrent()
{
    var tasks = new Task[Threads];
    for (int i = 0; i < Threads; i++)
    {
        tasks[i] = Task.Run(() =>
        {
            for (int j = 0; j < Iterations; j++)
            {
                var arr = pool.Rent(ArraySize);

                // emulation of O(ArraySize) workload
                // to add a reasonable delay between .Rent and .Return
                Random.Shared.NextBytes(arr);
                pool.Return(arr);
            }
        });
    }
    Task.WaitAll(tasks);
}

|   Pool |      Mean |     Allocated |
|-------:|----------:|--------------:|
| Create | 170.09 ms |       2.77 KB |
| Shared |  14.96 ms |       2.41 KB |
|    new |  69.77 ms | 1072085.02 KB |

Non-pooling code (NextBytes) takes ~13 ms, which was measured separately.

The pool created through ArrayPool<byte>.Create() was slower even than plain allocation (the new operator), and much slower than ArrayPool<byte>.Shared, which adds little on top of the ~13 ms workload. But don’t jump to conclusions from a single benchmark that only tests a special case and write this pool implementation off; let’s investigate how these pools are organized internally and why the results look this way.

Note that a pool is only an auxiliary component, so you should benchmark the algorithm or service in which the pool is used, to see whether reusing objects actually improved performance. And the fastest pooling implementation is the one with no synchronization with other threads at all. In single-threaded code it’s easier to store an array in a class field and always use it, without any pooling. And if you need to store multiple objects in a single-threaded algorithm, the usual Queue<T>/Stack<T> will do.

ArrayPool<T>.Create()

The methods .Create() and .Create(maxArrayLength, maxArraysPerBucket) create an implementation internally named ConfigurableArrayPool<T>. You need to be careful here: the default maximum array length in this pool is only 1024 * 1024 elements, and arrays above that limit are simply allocated and never saved in the pool. So, if an ArrayPool is created for large arrays, you will have to override the parameters, as shown below.
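
For example, a pool intended for large buffers might be configured like this (the numbers are illustrative, not a recommendation):

using System.Buffers;

// without explicit parameters, arrays longer than 1024 * 1024 elements
// would bypass this pool entirely
var largePool = ArrayPool<byte>.Create(
    maxArrayLength: 16 * 1024 * 1024, // the largest array the pool will keep
    maxArraysPerBucket: 8);           // capacity of each per-size bucket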

The implementation of ConfigurableArrayPool<T> is very simple:
- arrays in the pool are grouped by size (sizes are always powers of two)
- arrays of the same size are stored in a list (array-based)
- each list is protected against multi-threaded access by a SpinLock

Due to the locking, this pool implementation does not scale across multiple threads. As a result, its main use case is large arrays, where the rent/return time is negligible compared to the time spent processing the data that fills those arrays.

ArrayPool<T>.Shared

This is a static pool shared by all code in the program. The implementation is called TlsOverPerCoreLockedStacksArrayPool<T> and currently has no settings. The maximum size of the array in the pool is 2³⁰ elements. Tls in the name means ThreadLocalStorage, so you can guess what makes this pool run fast in a multithreaded environment.

This pool implements two-level storage of objects. The first level is a thread-local set of arrays stored in a [ThreadStatic] field; access to this local part of the pool requires no synchronization with other threads. However, only one array of each size may be stored locally. Using a static field here is possible because the pool is global, i.e., it exists as a single instance. In a non-static pool, we would have to use ThreadLocal<T[]> for this optimization.
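
A simplified sketch of the first level (not the actual runtime code; BucketIndex and RentFromSharedLevel are hypothetical helpers standing in for the real internals):

// first level: a [ThreadStatic] per-size slot, checked without any locks
[ThreadStatic]
private static byte[]?[]? t_tlsBuckets; // one slot per size bucket

public byte[] Rent(int minimumLength)
{
    int bucket = BucketIndex(minimumLength); // log2-based bucket selection
    var tls = t_tlsBuckets;
    if (tls != null)
    {
        var arr = tls[bucket];
        if (arr != null)
        {
            tls[bucket] = null; // take the array out of the local slot
            return arr;
        }
    }
    // fall back to the shared, per-core locked second level
    return RentFromSharedLevel(bucket);
}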

The second level is shared between threads. But unlike ConfigurableArrayPool<T>, the pool stores not one array list per size, but several: one per logical core (up to 64), each protected by a separate lock. This reduces contention, since threads now take different locks instead of queuing up on one. Instead of SpinLock, the Shared pool implementation uses an ordinary lock/Monitor.

Speed — Memory tradeoff

The thread-local slot optimization has a price: a large number of large arrays can pile up in the pool, bloating application memory. In the example below, the first thread allocates a new array and returns it to the pool, where it lands in the first thread’s thread-local slot. If you then try to get an array of the same size from another thread, there is no reuse and a new array is allocated. For large arrays, it may therefore be more profitable to use a pooling implementation with a single set of objects shared by all threads.

// thread 1
var pool = ArrayPool<long>.Shared;
var arr1 = pool.Rent(1024 * 1024 * 1024);
pool.Return(arr1);

Task.Run(() =>
{
    // thread 2
    var arr2 = pool.Rent(1024 * 1024 * 1024);
    Console.WriteLine(arr1 == arr2);
}).Wait();

Memory cleanup on GC

Partly to solve the previous problem, ArrayPool<T> cleans up memory on garbage collection. Via a finalizer trick, the pool is notified about garbage collections and periodically discards excess arrays to help free up memory.
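
The trick itself (the runtime uses an internal Gen2GcCallback helper; below is a simplified stand-alone sketch of the same idea, with GcNotifier and TrimExcessArrays as made-up names):

internal sealed class GcNotifier
{
    private readonly Action _callback;

    // usage: create and drop the reference, e.g. new GcNotifier(TrimExcessArrays);
    public GcNotifier(Action callback) => _callback = callback;

    ~GcNotifier()
    {
        _callback(); // runs when a garbage collection picks this object up
        GC.ReRegisterForFinalize(this); // resurrect so the next GC notifies us too
    }
}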

This behavior is not always desirable. Premature removal of large arrays in the Large Object Heap can lead to excessive heap fragmentation. This is another reason to think about using a different mechanism to reuse large arrays.

Penalty for not returning rented arrays

The benchmark we started with only “breaks through” the first, thread-local level of ArrayPool<T>.Shared pooling. The question arises: how performant is the second level, the per-core locked stacks, especially considering that it involves locking. To measure it, let’s make a benchmark that uses two arrays from the pool at once.

// Threads = 16
// Iterations = 1024
// ArraySize = 1024
[Benchmark]
public void ArrayPoolConcurrent_TwoArrays()
{
    var tasks = new Task[Threads];
    for (int i = 0; i < Threads; i++)
    {
        tasks[i] = Task.Run(() =>
        {
            for (int j = 0; j < Iterations; j++)
            {
                var arr1 = pool.Rent(ArraySize);
                var arr2 = pool.Rent(ArraySize);
                Random.Shared.NextBytes(arr1);
                Random.Shared.NextBytes(arr2);
                pool.Return(arr2);
                pool.Return(arr1);
            }
        });
    }
    Task.WaitAll(tasks);
}

|      Pool |   Mean |     Allocated |
|----------:|-------:|--------------:|
| Allocator | 138 ms | 2146306.76 KB |
|    Create | 230 ms |       2.88 KB |
|    Shared |  33 ms |       2.53 KB |

The non-pooling workload took 24 ms, so the second level of the Shared pool is not free, but the pool is still quite fast. This is achieved because threads in the benchmark take arrays from different lists and acquire different locks, so contention does not occur.

However, each thread does not always use only its own list. If .Rent(N) finds no matching array in the current core’s list, it scans all the other lists (round-robin), and only after that allocates a new array. So if you don’t put arrays back into the pool, each call to .Rent(N) will grab all the locks in turn, resulting in heavy contention.

|            Pool |   Mean | Lock Contentions |     Allocated |
|----------------:|-------:|-----------------:|--------------:|
|       Allocator | 138 ms |                - | 2146306.76 KB |
|          Shared |  33 ms |                - |       2.53 KB |
| Shared_NoReturn | 968 ms |        3666.6667 | 2144145.85 KB |

There is some “protection” against this issue. The second level of the pool (PerCoreLockedStacks) is initialized only the first time an array returns to it. If no arrays are returned to the pool anywhere in the program, .Rent(N) will allocate a new array without capturing locks.

Also, the Shared pool capacity is relatively small: 8 arrays of each size in each PerCoreLockedStacks instance (i.e., at most 512 per size). If you need many arrays, each used long-term, pooling will have no effect, because new arrays will inevitably be created while threads keep acquiring locks in the hope of finding something in the empty pool.

Diagnostics

System.Buffers.ArrayPoolEventSource events are provided to monitor the standard ArrayPool<T> implementations. You can consume them through PerfView, dotnet trace, EventListener, and other means. The main event worth watching is BufferAllocated: if many new arrays are allocated, pooling is ineffective. The lock contention problem in the Shared pool can be detected from Microsoft-Windows-DotNETRuntime/Contention/Start events.
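
For example, a minimal in-process listener might look like this (ArrayPoolListener is an illustrative name; only the event source name and the BufferAllocated event come from the runtime):

using System.Diagnostics.Tracing;

// usage: keep an instance alive, e.g. using var listener = new ArrayPoolListener();
sealed class ArrayPoolListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Buffers.ArrayPoolEventSource")
            EnableEvents(source, EventLevel.Verbose);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName == "BufferAllocated" && e.Payload != null)
            Console.WriteLine($"BufferAllocated: {string.Join(", ", e.Payload)}");
    }
}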

PerfView stacktrace example:

Event Microsoft-Windows-DotNETRuntime/Contention/Start
+ module coreclr <<coreclr!?>>
+ module System.Private.CoreLib.il <<System.Private.CoreLib.il!System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1[System.Byte].Rent(int32)>>
|+ module app <<app!PoolExperiments.<NoReturningMethod>b__24_0()>>

Alas, there are no events for inefficient use of SpinLock. But inefficient use of ConfigurableArrayPool<T> can still be detected by profiling or dump analysis.

dotnet dump stacktrace example:

Call Site
[HelperMethodFrame: 0000004e2f77f488] System.Threading.Thread.SleepInternal(Int32)
System.Threading.Thread.Sleep(Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs @ 375]
System.Threading.SpinWait.SpinOnceCore(Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinWait.cs @ 196]
System.Threading.SpinLock.ContinueTryEnter(Int32, Boolean ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/SpinLock.cs @ 359]
System.Buffers.ConfigurableArrayPool`1+Bucket[[System.Byte, System.Private.CoreLib]].Rent() [/_/src/libraries/System.Private.CoreLib/src/System/Buffers/ConfigurableArrayPool.cs @ 205]
System.Buffers.ConfigurableArrayPool`1[[System.Byte, System.Private.CoreLib]].Rent(Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Buffers/ConfigurableArrayPool.cs @ 88]
PoolExperiments.<ConfigurableArrayPool_SpinLock>b__21_0() [Program.cs @ 162]

Not only arrays

So far, we’ve only talked about implementations of ArrayPool<T> — array pools. Sometimes you need to reuse not only arrays, but other objects as well. And their pooling also has its own subtleties.

The second level of ArrayPool<T>.Shared for arrays of the same size is essentially a thread-safe pool of identical objects. So why not take it as an object pool? In some scenarios this approach works well, but most non-array objects are small. On the one hand, this raises the performance requirements for the pool: you can no longer hide the rent/return time inside the workload. On the other hand, for small objects we can avoid cross-thread synchronization entirely and create a [ThreadStatic] Stack<T>/Queue<T> pool local to each thread, as in the sketch below. This variant isn’t suitable for objects that can migrate between threads: there’s a risk of excessive allocations in one thread and pool overflow in another, making it no better than plain allocation (assuming the pool size is limited; otherwise it becomes a memory leak).
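
A minimal sketch of such a per-thread pool, assuming objects are rented and returned on the same thread (ThreadLocalPool and MaxSize are illustrative names):

using System.Collections.Generic;

public static class ThreadLocalPool<T> where T : class, new()
{
    private const int MaxSize = 16; // capacity limit, so the pool can't leak

    [ThreadStatic]
    private static Stack<T>? t_items;

    public static T Rent()
    {
        var items = t_items;
        return items is { Count: > 0 } ? items.Pop() : new T();
    }

    public static void Return(T item)
    {
        var items = t_items ??= new Stack<T>();
        if (items.Count < MaxSize) // when full, drop the object to the GC
            items.Push(item);
    }
}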

Object pooling is also often implemented on top of ConcurrentQueue<T>. For example, this is how DefaultObjectPool<T> from the Microsoft.Extensions.ObjectPool package is implemented, with the difference that it also has a first level: the _fastItem field, which stores a single object. This is an instance field, not static and not [ThreadStatic]/ThreadLocal, and it is handled with atomic instructions (Interlocked).
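
Typical usage looks roughly like this (the pooled type and capacity are arbitrary examples):

using System.Collections.Generic;
using Microsoft.Extensions.ObjectPool;

var pool = new DefaultObjectPool<List<int>>(
    new DefaultPooledObjectPolicy<List<int>>(),
    maximumRetained: 64);

var list = pool.Get(); // checks the _fastItem slot first, then the queue
try
{
    list.Add(42);
}
finally
{
    list.Clear();      // reset state before handing the object back
    pool.Return(list);
}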

Instead of ConcurrentQueue<T>, other data structures can be used: ConcurrentStack<T> or ConcurrentBag<T>. Using a stack makes no sense: it allocates a linked-list node per element and scales worse, because while ConcurrentQueue<T> is pulled apart by threads from two different ends, all the load on a stack falls on a single end, and the LIFO order provides no benefit to a pool. ConcurrentBag<T> is faster at adding and removing elements from a single thread thanks to the ThreadLocal<> inside, but it still doesn’t catch up to the performance of the algorithm from ArrayPool<T>.Shared.

Several pool implementations were made for the test: a naive one over a single concurrent collection, a global Stack<T> under a lock (with and without a _fastItem slot), a separate ThreadLocal Stack<T> per thread, DefaultObjectPool<T> (and its modification with ThreadLocal), and ArrayPool<T>.Shared for comparison. A 1 KB array was used as the pooled object to unify the code with the ArrayPool<T> benchmark.

|                Pool |   Mean | Lock Contentions |   Allocated |
|--------------------:|-------:|-----------------:|------------:|
|       StackWithLock | 818 ms |        6476.0000 |     3.87 KB |
|  StackWithLock+Slot | 685 ms |        3200.0000 |     3.30 KB |
|  ConcurrentStackObj | 540 ms |                - | 65539.59 KB |
|  ConcurrentQueueObj | 455 ms |                - |     3.82 KB |
|   DefaultObjectPool | 320 ms |                - |     2.99 KB |
| ThreadLocObjectPool | 208 ms |                - |     2.87 KB |
|    ConcurrentBagObj |  77 ms |                - |     2.51 KB |
|              Shared |  35 ms |           0.0667 |     2.49 KB |
|   ThreadStaticStack |  27 ms |                - |     2.44 KB |

Lock-freedom alone is not enough for a data structure to be fast: the difference from ordinary locking turned out to be only ~2x. Atomic instructions are not magic; they are faster than locks, but atomic writes scale poorly. So for the best performance you should use the same techniques as ArrayPool<T>.Shared, namely thread-local slots and sharding of the shared storage between threads, to minimize any thread synchronization.

Performance-critical places within dotnet/runtime have their own pools. For example, PoolingAsyncValueTaskMethodBuilder, used to reduce the number of allocations in asynchronous code, implements a two-level pool: a [ThreadStatic] slot plus per-core slots handled with Interlocked.
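
A method can opt into that builder explicitly since C# 10 (whether this pays off depends on the workload, so treat this as an illustration rather than a recommendation; Reader and ReadChunkAsync are made-up names):

using System.IO;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

class Reader
{
    [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder<>))]
    public async ValueTask<int> ReadChunkAsync(Stream stream, byte[] buffer)
    {
        // the state machine box for this method comes from a pool
        // instead of being allocated on every call
        return await stream.ReadAsync(buffer);
    }
}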

Bounded queue

You may want to make your own pool. For example, the locks inside the standard Shared pool may cause high tail latency in service metrics (two threads can end up sharing one LockedStack, since sharding is done by ProcessorId). Or you may need an efficient pool for objects with storage shared between threads.

One option is to take the ArrayPool<T>.Shared design and replace the LockedStacks with a number of ConcurrentQueue<T> instances. But an important detail of the ArrayPool<T>.Shared implementation is the limit on the number of elements stored in the pool. In pools based on ConcurrentQueue<T>, this is implemented with an additional item counter updated through Interlocked. First, this slows down the queue by adding another synchronization point. Second, it is not aesthetically pleasing: ConcurrentQueue<T> is built on top of the internal class ConcurrentQueue<T>.Segment, which is itself a queue with limited capacity. So with the counter approach, a bounded queue is wrapped into an unbounded one, and then a bounded queue is reimplemented on top of that.
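
A sketch of that counter approach (CountedQueuePool is an illustrative name, not the implementation from the benchmark repo):

using System.Collections.Concurrent;
using System.Threading;

public sealed class CountedQueuePool<T> where T : class, new()
{
    private readonly ConcurrentQueue<T> _items = new();
    private readonly int _capacity;
    private int _count;

    public CountedQueuePool(int capacity) => _capacity = capacity;

    public T Rent()
    {
        if (_items.TryDequeue(out var item))
        {
            Interlocked.Decrement(ref _count);
            return item;
        }
        return new T();
    }

    public void Return(T item)
    {
        // the extra synchronization point: reserve a slot before enqueueing
        if (Interlocked.Increment(ref _count) <= _capacity)
            _items.Enqueue(item);
        else
            Interlocked.Decrement(ref _count); // pool is full: drop the object
    }
}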

I’d like to extract ConcurrentQueue<T>.Segment into a separate class and use it to implement the pool. And I’m not the only one: there is a long-standing proposal about this on dotnet/runtime. The ConcurrentQueue<T>.Segment class itself separates easily from ConcurrentQueue<T> and is ready to use right away.

Let’s test this implementation. With one LockedStack or queue per logical core, there is no difference between the variants. But if you reduce the sharding (below, to ProcessorCount / 8 data structures), the difference between the options becomes visible. This shows how important the assumption baked into the ArrayPool<T>.Shared design, that different threads use different locks, really is, and that the seemingly lightweight Interlocked counter around the queue does add overhead. If a pool with unlimited capacity suits you, then plain ConcurrentQueue<T> is a good structure for this task.

Per-core data structures:

|                    Pool |  Mean |
|------------------------:|------:|
|        BoundedQueuePool | 33 ms |
|         ConcurrentQueue | 33 ms |
|                  Shared | 33 ms |
| ConcurrentQueue+Counter | 33 ms |

(ProcessorCount / 8) data structures:

|                    Pool |   Mean | Lock Contentions |
|------------------------:|-------:|-----------------:|
|        BoundedQueuePool |  65 ms |                - |
|         ConcurrentQueue |  65 ms |                - |
|          Shared_Limited | 118 ms |            354.4 |
| ConcurrentQueue+Counter |  75 ms |                - |

Conclusion

Object pooling helps reduce allocations and garbage collection load, but pools themselves are complex data structures, and an unsuccessful or ill-suited pool implementation for a particular load profile can ruin performance. Even staying within the standard implementations: for small arrays needed for a short time, prefer the scalable ArrayPool<T>.Shared; for large arrays, prefer a pool created through ArrayPool<T>.Create(..., ...), which is roomier and more memory-efficient because it has no per-thread separation.

Also, if you want to pool every object in the program, think twice. Excessive pooling complicates the code considerably, especially once manual reference counting and other non-trivial ways of tracking object lifetimes appear. And none of it eliminates the risk of using an object that has already been returned to the pool without noticing. In some cases it’s easier to rely on garbage collection, or to reduce allocations with other tools: structures, stackalloc, Span<T>/Memory<T>, or scatter-gather IO.

Besides array and object pooling, there are more complex cases, such as dynamic data structures — lists, hash tables; reusing objects within concurrent data structures. But that’s another story altogether.

Links

https://github.com/epeshk/arraypool-examples — repo with benchmarks
