Benchmarking Adventures Part 1 — Avoiding boxing


I recently stumbled on an article by Michael Shpilt that describes an implementation of the pipeline pattern in C#. A pipeline is defined as a chain of processing steps.

In the context of this article, the pipeline API will simply be made of a builder and a pipeline:

It turns out that the .NET Disruptor is already a chain of processing steps. Implementing the pipeline API using the Disruptor is simply a matter of providing a builder that can generate event handlers for each step:

  • Single-threaded steps can be implemented using IEventHandler<T>.
  • Multi-threaded steps can be implemented using worker pools and IWorkHandler<T>.

However, a Disruptor instance has one event type to move data between handlers. Thus, implementing the pipeline API using the Disruptor implies defining a general-purpose event type that is used to exchange input and output data between steps. And because each step can have different input and output data types, it is not possible to create a simple generic PipelineStepEvent<T>.

I decided to create a very simple Disruptor-based pipeline implementation in which the event type only contains a simple object reference. The values are casted or boxed on read and write:

This was good enough for a first functional POC, but you can imagine that I did not feel comfortable with the idea of boxing and unboxing value types on each step.

Therefore, I decided to add an extra field in my event to store value types:

This field can be easily read and written using System.Runtime.CompilerServices.Unsafe:

Now the questions are: How do I know that my value can be stored in the _valueStorage field? And how can I make the test fast enough to avoid generating overhead that could negate the _valueStorage field benefits?

Solution 1: The not-so-naive way

The simplest solution is to use a hard coded list of supported types in the Read and Write methods:

This solution covers most used value types but cannot prevent boxing for custom structs. Yet, it is quite powerful: in a generic method, the tests on typeof(T) are evaluated at compile time by the JIT and the useless branches are removed. It is a well-known trick that is already used in many performance oriented codebases, as in the ZeroLog library.

This JIT behavior can easily be observed using SharpLab.

Image for post
Image for post

Of course, the optimization could be compiler-specific, so it is always better to create a benchmark to measure the performance of the Read and Write methods. Using BenchmarketDotNet can really be helpful here: the job attributes allow to test the code on multiple runtimes, the MemoryDiagnoser exposes allocation statistics and the DisassemblyDiagnoser lets us verify the generated assembly.

To create a comparison reference, I added a generic event type with a simple generic field:

Here are the benchmark results:

This solution is obviously not as fast as the typed version, but it is clearly fast enough. And it is not generating any garbage, which was my first concern. I verified in the benchmark output that the assembly code for this method does not contain conditional instructions. This implies that the branch is effectively removed by the JIT. I also verified that the Read and Write performance characteristics were similar for reference types.

Solution 2: RuntimeHelpers

The previous solution was efficient but limited to a known list of value types. However, .NET Core happens to expose a method that provides exactly the needed information: RuntimeHelpers.IsReferenceOrContainsReferences. This method is a JIT intrinsic that will be implemented as return true or return false for the specified generic type. If you are curious, the code generation for the method is quite simple and can be found in jitinterface.cpp.

The new Read method becomes:

The size test with Unsafe.SizeOf is required here because the value type storage has obviously a limited size. I arbitrary decided to use a storage size of 16 because it is a very small overhead but at the same time it is enough for most known value type, including decimal and Guid.

Here are the benchmark results:

Again, this solution is fast enough and not generating any allocations.

Solution 3: TypeCache

The previous solution was almost perfect, but it is not available in the .NET Framework. Although migrating to .NET Core is a good idea for any performance oriented codebase, or for that matter any codebase, most open source libraries still need to support the .NET Framework.

Again, I went for the simplest solution: reimplementing IsReferenceOrContainsReferences with reflection and exposing the result in a static class:

Of course, I benchmarked this solution:

Ok, there is something wrong here. It could be that the JIT cannot optimize the access to the field CanUsePadding16 because its value can be changed using reflection. Yet I have already observed situations where the JIT was able to inline static readonly field values.

Therefore, I suspected something else: The fact that TypeCache has a static constructor might introduce checks to invoke the constructor that could negatively impact the benchmark performance. To verify this hypothesis, I added an initialization method that forces the static constructor invocation before the Read or Write methods are jitted:

Forcing the static constructor to run can help here because then the JIT knows it does not need to introduce the invocation test. And indeed, the benchmark looked much better once I added a call to Init in the setup:

It turns out, there is another way to solve this issue. If you remove the static constructor to a type, it can be marked with the beforefieldinit flag, and the static constructor invocation behavior changes.

So I removed Init and slightly changed TypeCache so that it could be marked with the beforefieldinit flag:

Here are the benchmark results for the new version:

We now have our solution for .NET Framework or .NET Standard projects!


This article is longer than I expected. It is curious how optimizing such simple methods as the ones presented here can make you deal with low-level concepts like JIT intrinsics or the beforefieldinit flag. Of course, this kind of over-hasty micro-optimization is a very bad practice on a real project. Even when working on very performance-sensitive applications, I rarely analyze my code that much. And when I do so, it is always after profiling or performance testing has pointed me in the right direction.

Yet, something still bothers me in my pipeline implementation: the API is Task-based and thus every Execute invocation creates a new Task. I might try to replace Task by ValueTask or a custom awaitable. That is a story for another article!

Written by

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store