Improving .NET Disruptor performance — Part 1
This is the first part of a series of posts on the .NET Disruptor performance:
- Part 2.
- Part 3.
Some background
In 2016 our team gave the .NET Disruptor a major upgrade, porting it from Java version 2.10.0 to version 3.3.4. The team consisted of @MendelMonteiro, @AntoineB0 and me. We also used our time to analyze the Disruptor performance, first by porting and running all the performance tests, then by doing a few dedicated code reviews, and finally by using profilers.
The performance analysis was quite hard for two reasons:
- It was tricky to make sense of the throughput test results. Because of the batching ability of the Disruptor, a slowdown in specific parts of the code can increase the throughput. Conversely, an improvement that reduces latency can also reduce the throughput. I suspect that I rejected a few valid code changes back then because I thought they were hurting the Disruptor performance.
- The Disruptor was already heavily optimized. It had no obvious bottleneck, the code was clean and simple and used appropriate data structures. The standard .NET profilers like dotTrace were not really suited to find possible performance improvements.
The overall performance was already greatly improved by the design changes from the 3.3.4 version. However we were puzzled by the fact that the results of a few .NET performance tests seemed to be far behind their Java equivalents. We decided to use VTune to profile the .NET and the Java tests and we also compared the assembly code from both versions to try to understand the performance gap. We were able to find a bad padding issue in the .NET version that was causing false sharing, but in the end we could not explain all the differences.
After a few days we stopped working on the subject, out of time and out of ideas. However, I was clearly not satisfied with the situation…
A new start
In 2018 I finally found the time and motivation to work on the .NET Disruptor performance again. I had been regularly porting changes from the Java version and I was much more familiar with the code base.
This time, I knew that I should not solely rely on the existing performance tests. The fact that the batching behavior of the Disruptor was not visible in the test results was clearly a problem. My teammate @MendelMonteiro added the batching percentage and the average batch size to the test results. It was clearly a nice improvement, as it allowed us to quickly make sense of a throughput increase or decrease.
I also added a project for BenchmarkDotNet measurements to the solution and I decided that I would always try to start with specific benchmark results before looking at the performance tests.
Finally, I chose to look at the generated machine code more often to understand the impact of code changes.
Step 1: Array access in RingBuffer
One of the first possible tweaks I knew of was improving the performance around array accesses. There are two places in the Java codebase where array accesses are performed using `sun.misc.Unsafe`: in `RingBuffer` and in `MultiProducerSequencer`. In both cases `Unsafe` is employed to improve performance.
The .NET `RingBuffer` used a standard array access to return events in the indexer. This array access introduced two costs: bounds checks and dynamic type checks. As @mikeb2701 pointed out to me, a bounds check also incurs a read of the array length. This extra read can potentially cause cache misses due to false sharing.
For reference, here is the original array access based indexer:
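The snippet below is a reconstruction of its shape, not the actual Disruptor code; the type and field names (`RingBufferSketch`, `_entries`, `_indexMask`) are assumptions:

```csharp
// Reconstruction of the indexer shape; type and field names are assumptions
public sealed class RingBufferSketch<T> where T : class
{
    private readonly T[] _entries;
    private readonly long _indexMask;

    public RingBufferSketch(int size)
    {
        // size must be a power of two so that the mask can replace a modulo
        _entries = new T[size];
        _indexMask = size - 1;
    }

    public T this[long sequence]
    {
        get
        {
            // Standard array access: the JIT emits a bounds check here,
            // which also implies a read of the array length
            return _entries[(int)(sequence & _indexMask)];
        }
    }
}
```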
And here is the corresponding machine code:
Even if you are not familiar with x86 assembly, you can notice that this code is not really compact. It contains multiple conditional instructions (`cmp` and `test`), as expected. There is also a call, but it is not part of the happy path.
Blittable value type arrays can easily be pinned to obtain a pointer that allows insanely fast lookups, but that cannot be done with reference type arrays. To avoid bounds checks, I first tried to use `Unsafe` from `System.Runtime.CompilerServices.Unsafe`:
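The `Unsafe` based version looked roughly like this (a sketch under the same assumed type and field names, not the exact original code):

```csharp
using System.Runtime.CompilerServices;

public sealed class RingBufferSketch<T> where T : class
{
    private readonly T[] _entries;
    private readonly long _indexMask;

    public RingBufferSketch(int size)
    {
        _entries = new T[size];
        _indexMask = size - 1;
    }

    public unsafe T this[long sequence]
    {
        get
        {
            // Read through the element address to bypass the bounds check.
            // Note that "ref _entries[...]" still compiles to ldelema,
            // which is where the hidden JIT helper call comes from.
            return Unsafe.Read<T>(Unsafe.AsPointer(ref _entries[(int)(sequence & _indexMask)]));
        }
    }
}
```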
I added a benchmark to make sure that the new code was faster and the gain was notable:
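The benchmark was along these lines (a standalone sketch of the comparison, not the actual benchmark project; the type and member names are made up):

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Event { public long Value; }

public class IndexerBenchmarks
{
    private readonly Event[] _entries = new Event[1024];
    private readonly long _indexMask = 1023;
    private long _sequence = 42;

    public IndexerBenchmarks()
    {
        for (var i = 0; i < _entries.Length; i++)
            _entries[i] = new Event();
    }

    [Benchmark(Baseline = true)]
    public Event ArrayAccess()
        => _entries[(int)(_sequence & _indexMask)];

    [Benchmark]
    public unsafe Event UnsafeAccess()
        => Unsafe.Read<Event>(Unsafe.AsPointer(ref _entries[(int)(_sequence & _indexMask)]));
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IndexerBenchmarks>();
}
```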
I was quite satisfied with this easy enhancement but I did not think it would significantly impact the existing end-to-end performance tests. To my surprise, the gain was also important in those tests:
I decided to look at the machine code to understand the differences between the array access code and the `Unsafe` based code. The new machine code was not as compact as I had imagined:
Again, even if you are not familiar with x86 code, you can notice a few differences. The conditional instructions are gone, which is a good thing. However, there is now an extra call in the hot path, which is unexpected. This call appears to be a JIT helper used to get the address of an array element. This helper performs a null check, a bounds check and a type check. The checks might be optimized, but they are still present, hidden in an extra call! There is obviously further room for improvement here.
After a few investigations, I discovered that inlining the `RingBuffer` indexer with `[MethodImpl(MethodImplOptions.AggressiveInlining)]` had two effects: it would make the call much faster and it would make the array access based version faster than the `Unsafe` based version. The `Unsafe` based optimization was now useless. This was a good performance lesson: there is a lot to be gained by removing method calls, either by manually inlining them or by specifying `MethodImplOptions.AggressiveInlining`. Removing method calls should often be the first step when micro-optimizing an algorithm.
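In code, that inlining hint is just an attribute on the indexer getter (a sketch with assumed type and field names):

```csharp
using System.Runtime.CompilerServices;

public sealed class RingBufferSketch<T> where T : class
{
    private readonly T[] _entries;
    private readonly long _indexMask;

    public RingBufferSketch(int size)
    {
        _entries = new T[size];
        _indexMask = size - 1;
    }

    public T this[long sequence]
    {
        // Asks the JIT to inline the getter into its callers,
        // removing the method call overhead
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => _entries[(int)(sequence & _indexMask)];
    }
}
```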
The machine code and performance results of the `Unsafe` based version were clearly disappointing. So I decided to try another approach: write a custom indexer in IL, inspired by the ancestor of `Span<T>`, slice.net. I did not want to add a new dependency to the project, but my teammate @Lucas_Trz had recently created a Fody add-in that makes it possible to inject arbitrary IL into assemblies at compile time: InlineIL.Fody. This add-in is only used during compilation and is not referenced by the generated assembly.
Here is the helper method that I used in the indexer:
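A sketch of that helper, using InlineIL.Fody's emit API; the `ElementOffset` constant is a placeholder for the runtime-specific offset of the first element of an object array, not a value from the actual code:

```csharp
using System.Runtime.CompilerServices;
using InlineIL;
using static InlineIL.IL.Emit;

internal static class Util
{
    // Placeholder: the offset of the first element of an object array
    // depends on the runtime and the platform bitness
    private const int ElementOffset = 16;

    // Sketch: reads array[index] without bounds or type checks
    // by computing the element address manually in raw IL
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static T Read<T>(object array, int index)
        where T : class
    {
        IL.DeclareLocals(false, typeof(byte).MakeByRefType());

        Ldarg(nameof(array));
        Stloc_0();            // reinterpret the array reference as a byref
        Ldloc_0();

        Ldarg(nameof(index));
        Sizeof(typeof(object));
        Mul();                // index * sizeof(object reference)

        Ldc_I4(ElementOffset);
        Add();                // + offset of the first element

        Add();                // = element address
        Ldobj(typeof(T));     // load the reference stored at that address

        return IL.Return<T>();
    }
}
```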
I added a new benchmark including the three indexer versions:
The gain was also visible in the performance tests:
And this time the generated machine code was much cleaner:
The machine code was more compact and did not contain conditional instructions or calls. This is obviously what I wanted to get with the `Unsafe` based version.
Step 2: Array access in MultiProducerSequencer
I said earlier that blittable value type arrays could be pinned to obtain a pointer that allows fast lookups. That is true, but pinning an array has a cost, and pinning an array for only one lookup is clearly a bad idea.
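For illustration, a per-lookup pin looks like this (a standalone sketch, not the Disruptor code):

```csharp
public static class PinnedLookup
{
    // The fixed statement pins the array for the duration of the block;
    // pinning and unpinning around a single lookup is where the cost lies
    public static unsafe int ReadWithPin(int[] values, int index)
    {
        fixed (int* pointer = values)
        {
            return pointer[index];
        }
    }
}
```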
Here is the machine code for an array lookup:
And this is the machine code for a pinning and a pointer lookup:
For some reason, `MultiProducerSequencer` was pinning its underlying array for every read or write. I decided to start by removing the pin and using a standard array access. The performance gain was significant:
I think that using a pinned array was a good idea. However, the array should be pinned in the `MultiProducerSequencer` constructor, either by using `GCHandle` or by allocating an off-heap buffer with `Marshal.AllocHGlobal`. Both approaches require adding a finalizer to `MultiProducerSequencer` and implementing `IDisposable`. The `RingBuffer` and `Disruptor` would then need to implement `IDisposable` too, in order to dispose the underlying sequencer. I might decide to do this later, but I am not sure it is worth the effort. It is probably a good candidate for an open discussion in a GitHub issue.
The next step
These first little victories convinced me that a few performance improvements could still be found in this stable and solid code base. I decided that my next target would be to reduce the number of interface method calls. But that is a story for another post!
Many thanks to @Lucas_Trz and @MendelMonteiro for the reviews, and to @mikeb2701 for answering my questions about the Java Disruptor.