Improving .NET Disruptor performance — Part 3: Introducing the ValueDisruptor

Olivier Coanet
Aug 27, 2018


This is the third part of a series of posts on .NET Disruptor performance:
- Part 1.
- Part 2.

Context

The Disruptor data is stored in reference type objects, the events. The events are preallocated and stored in a contiguous array called the ring buffer. They are instantiated together when the Disruptor is created, so they tend to end up packed in the same heap segment. Because the events are processed sequentially, they benefit from good data prefetching. This is one of the reasons the Disruptor is fast.
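As a rough illustration, the preallocation boils down to filling the ring buffer array in one loop. This is a simplified sketch with a hypothetical MyEvent type, not the actual Disruptor code:

public class MyEvent
{
    public long Price;
    public long Quantity;
}

public static class RingBufferAllocation
{
    // All events are created in one tight loop at startup, so the instances
    // tend to be allocated next to each other and end up in the same segment.
    public static MyEvent[] Allocate(int ringBufferSize)
    {
        var entries = new MyEvent[ringBufferSize];
        for (var i = 0; i < entries.Length; i++)
        {
            entries[i] = new MyEvent();
        }
        return entries;
    }
}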

Why not a struct

Given that the events are preallocated, inheritance is not an option anyway, so one of the main benefits of reference types is irrelevant here. And to make prefetching effective, the events need to be as packed as possible. So, the event type seems like a very good candidate for a struct. However, the event type is a reference type in the current Disruptor design, mainly for two reasons.

First, the .NET Disruptor has always been a very accurate port of the Java version. And of course, the event type is a reference type in Java, because custom value types are not available in Java.

Also, the events are mutable and generally quite large. Value type events should only be passed and returned by-ref to avoid slow copies and to support mutability. This makes value type events almost impossible to implement without ref returns, which have only been available since C# 7.0.
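To illustrate, here is the kind of ref returning indexer that C# 7.0 enables. This is a hypothetical sketch, not the actual Disruptor code:

public sealed class EventStore<T>
    where T : struct
{
    private readonly T[] _entries;

    public EventStore(int size)
    {
        _entries = new T[size];
    }

    // The ref return gives the caller a reference to the array slot, so the
    // struct can be read and mutated in place, without any copy.
    public ref T this[int index] => ref _entries[index];
}

A caller can then write ref var e = ref store[index]; and mutate the event directly in the array, without any intermediate copy.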

Note that excellent packing in arrays is not the only advantage offered by value types. Structs also have less overhead because they have no object headers. You only pay the memory cost of your struct members, possibly with some padding. And obviously, reading structs from an array removes a level of indirection.

Yet, choosing a struct for events has one major flaw: using reference types for nested members becomes inefficient. Because sub-objects will be allocated outside the event array, they will not be stored contiguously with the struct data. Therefore, using sub-objects weakens data locality and data prefetching.
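For example, with a hypothetical struct event like this one, only the reference to the string is stored inline in the ring buffer array; the string object itself is allocated elsewhere on the heap:

public struct TradeEvent
{
    public long Price;     // stored inline in the ring buffer array
    public long Quantity;  // stored inline in the ring buffer array
    public string Symbol;  // only the reference is inline, the string object
                           // lives somewhere else on the heap
}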

So, creating a new value type Disruptor could be beneficial for some use cases, but it cannot become a systematic replacement for the existing reference type Disruptor.

Also, the real performance gain could be quite small, and creating this Disruptor could introduce complexity and maintenance costs. I thought that one deciding factor for starting the development would be whether a value type Disruptor could actually improve the memory layout of the events.

How effective is the current packing?

I remember that when I explained to @federicolois that “the Disruptor events are packed because they are allocated together at the startup of the program”, he was very doubtful. Of course, he was right: many things can happen that could prevent the instances from being stored contiguously in memory. First, a GC could kick in during the allocation phase and move around part of the instances. Then, another thread could be allocating at the same time and create many gaps in the memory layout.

In my applications, the setup sequence is single-threaded to prevent this last issue. So, I supposed that the instances were mostly packed. But how could I be sure? There was only one way to find out: measure it!

I decided to try CLR MD. CLR MD is an API for analysing processes or memory dumps. It can be used to introspect running processes by scanning threads, memory segments and objects.

As a first approach, I decided to create a small program that would track the memory location of the instances of a given type. It was a good first step because the event type is often only used in the Disruptor and because many applications have only one Disruptor instance. I was truly impressed by the simplicity and the potential of the CLR MD API. In a matter of minutes, I had a tool that could attach itself to a running application and display memory layout information.
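The core of it boils down to a handful of CLR MD calls. Here is a simplified sketch, not the actual HeapWalker code; the real tool does more grouping and reporting:

using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime;

public static class TypeAddressScanner
{
    // Simplified sketch: attach to a running process and print the address of
    // every instance of a given type, to see how packed the instances are.
    public static void PrintAddresses(int pid, string typeName)
    {
        using (var dataTarget = DataTarget.AttachToProcess(pid, 5000, AttachFlag.Passive))
        {
            var runtime = dataTarget.ClrVersions.First().CreateRuntime();

            foreach (var address in runtime.Heap.EnumerateObjectAddresses())
            {
                var type = runtime.Heap.GetObjectType(address);
                if (type?.Name == typeName)
                    Console.WriteLine("0x{0:X}", address);
            }
        }
    }
}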

This little program was already helpful, but it was unusable on applications with multiple Disruptor instances. Also, the code assumed that the event instance locations were always sequential, that is, that the memory location of every instance comes after the memory location of the instances allocated before it. That should be mostly true, but my goal was to know the real positions of the event instances, not to guess them. And the only way to get them is to read them from the event array. So, I added an option to my program that specifically looks for Disruptor event arrays and displays memory layout information for them.

The code is available here on GitHub.

Here is the result in one application:

D:\ABC\HeapWalker>Disruptor.HeapWalker.exe 4280 --scan-events
PID: 4280
Runtime: Desktop v4.7.3130.00
Scanning for ring buffers
Found instance of Disruptor.RingBuffer<XXXXXXXXXX>
Ring buffer size: 65568 (65536 events + 32 padding)
Segments count: 2
Offset: 328 Count: 65534
Offset: 952364232 Count: 1

Here there is only one big gap of almost 1GB, so the events are located in two groups, but perfectly packed inside each group. The event size is reasonably small. This situation is quite good for Disruptor usage.

Here is the result for another application:

Found instance of Disruptor.RingBuffer<YYYYYYYYYY>
Ring buffer size: 8224 (8192 events + 32 padding)
Segments count: 1
Offset: 584 Count: 8189
Offset: 608 Count: 1
Offset: 3184 Count: 1

Here the event size is also quite small, and the events are almost perfectly packed. This is a really good situation for the Disruptor.

Here is the result for a last application:

Found instance of Disruptor.RingBuffer<ZZZZZZZZZZ>
Ring buffer size: 8224 (8192 events + 32 padding)
Segments count: 2
Offset: 6040 Count: 3
Offset: 8576 Count: 4498
Offset: 8592 Count: 3653
Offset: 8600 Count: 2
Offset: 8616 Count: 2
30 remaining offsets for 33 events, max: 291768568, avg: 47325932

Here the situation looks quite bad. First, the events are fairly large. Of course, large events are not necessarily an issue; it depends on your Disruptor usage. But I can tell you that for this application, something should really be done to reduce the event size. Then, there are many gaps between events, the biggest being nearly 280 MB. And what is really strange is that the event offsets are bimodal. I don’t even know how that is possible.

Although those results are not that bad overall, they convinced me that the current memory layout of the Disruptor events is too random, especially for very large event types.

How is LMAX solving this issue?

I am trying to improve the Disruptor event packing. But surely, the people behind the Disruptor must have already thought about it and found a solution, right? It turns out, LMAX did find a solution for this issue. Yet, it is rather extreme: they use an off-heap Disruptor.

The off-heap Disruptor design is quite simple: the ring buffer consists of a very large byte array, and the events are segments of this array. Wrapper types are created to read and write data in the array segments in a strongly typed manner. Those wrappers can be handwritten or generated. They are very similar to the types generated by serializers like FlatBuffers, Cap’n Proto or SBE.
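To give an idea of the wrapper style, here is a hypothetical C# flyweight over a byte array segment. This is not the LMAX code, and real implementations avoid the temporary arrays created by BitConverter:

using System;

public struct TradeEventView
{
    public const int Size = 16;

    private readonly byte[] _buffer;
    private readonly int _offset;

    // The event is just a strongly typed view over a fixed-size segment of
    // one large byte array; no object is allocated per event.
    public TradeEventView(byte[] buffer, int index)
    {
        _buffer = buffer;
        _offset = index * Size;
    }

    public long Price
    {
        get { return BitConverter.ToInt64(_buffer, _offset); }
        set { Array.Copy(BitConverter.GetBytes(value), 0, _buffer, _offset, sizeof(long)); }
    }

    public long Quantity
    {
        get { return BitConverter.ToInt64(_buffer, _offset + 8); }
        set { Array.Copy(BitConverter.GetBytes(value), 0, _buffer, _offset + 8, sizeof(long)); }
    }
}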

This design seems to be the best option for Java performance-wise even though it incurs a good deal of extra complexity. I think that a value type .NET Disruptor could offer the same performance characteristics while being much simpler to use.

Design considerations

Once I decided that I would try to implement a struct based Disruptor, the ValueDisruptor, I had a few design options. On the one hand, I could create a new project from scratch and use the opportunity to introduce many improvements that I am too lazy to make in the current codebase because they would be breaking changes. On the other hand, I could add the ValueDisruptor to the current project and try to reuse as much code as possible. I chose the latter, for the code reuse of course, but also because I wanted to have only one project to maintain and one NuGet package to release.

I needed to introduce a few new types and interfaces, but the design was really straightforward.

A new ring buffer:
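Here is a simplified, self-contained sketch of its shape; the exact code is in the ValueDisruptor branch, and the members shown here are simplified:

public sealed class ValueRingBuffer<T>
    where T : struct
{
    private readonly T[] _entries;
    private readonly long _indexMask;

    public ValueRingBuffer(int size)
    {
        // The size must be a power of two so that the sequence can be mapped
        // to an array index with a simple mask.
        _entries = new T[size];
        _indexMask = size - 1;
    }

    // The event is returned by-ref, directly from the storage array.
    public ref T this[long sequence]
    {
        get { return ref _entries[(int)(sequence & _indexMask)]; }
    }
}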

In the actual code, the array access goes through Util.ReadValue, a helper method that reads an event from the underlying ring buffer array without bounds checks or type checks. It is a slightly modified version of the helper that I introduced in my first article, and it is also written using InlineIL.Fody.

I could reuse a good part of the existing RingBuffer<T>, which I moved into a non-generic base type, RingBuffer.

A new event handler:
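A minimal sketch of the handler interface, again simplified; the exact signature is in the branch:

public interface IValueEventHandler<T>
    where T : struct
{
    // The event is passed by-ref, so the handler can read and even mutate it
    // without copying the struct.
    void OnEvent(ref T data, long sequence, bool endOfBatch);
}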

And of course, a new Disruptor:
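And again a simplified sketch of its shape, with most of the wiring omitted:

using System.Collections.Generic;

public class ValueDisruptor<T>
    where T : struct
{
    private readonly ValueRingBuffer<T> _ringBuffer;
    private readonly List<IValueEventHandler<T>> _handlers = new List<IValueEventHandler<T>>();

    public ValueDisruptor(int ringBufferSize)
    {
        _ringBuffer = new ValueRingBuffer<T>(ringBufferSize);
    }

    public ValueRingBuffer<T> RingBuffer
    {
        get { return _ringBuffer; }
    }

    public void HandleEventsWith(IValueEventHandler<T> handler)
    {
        // Simplified: the real implementation wires up event processors and
        // sequence barriers, like the existing Disruptor<T> does.
        _handlers.Add(handler);
    }
}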

As you can see in the changes, the value event instances are always passed or returned by-ref to avoid slow copies and to support mutability.

The full code can be found in the ValueDisruptor branch in the GitHub repository.

Some performance results

Now is the time to run the performance benchmarks.

Here are the results for one of the throughput tests:

There is a clear performance gain. But what is really nice is not only the improvement in throughput (+14%) but also the fact that the batches are much smaller, which implies that the latency also improved significantly.

Because many performance improvements were already introduced in version 3.4, it might be a good time to step back and look at the performance results for version 3.3.7:

OK, those results seem awful. But this particular performance test was one of the weak points of the .NET Disruptor at that time, and one of the reasons behind multiple optimizations introduced in version 3.4. Thankfully, most of the other performance tests looked much better.

Also, for comparison, here are the results for the Java Disruptor:

Please note that I am a complete Java newbie and I might not be running the performance tests correctly. I am by no means claiming that the .NET Disruptor is faster than the Java version. Also, this is only the result of one of the many performance tests of the project.

Conclusion

So, the throughput and latency gains for the ValueDisruptor are quite clear in the tests. What is funny is that those tests use small ring buffer sizes and small events, so they are not subject to ring buffer memory fragmentation. What is measured here is purely the gain of using a struct instead of a class. You might get much better performance gains in your applications if you have event packing issues.

I also discovered that the memory layout of events can be quite bad for sizable ring buffers containing large events. You can use the HeapWalker tool to identify such issues in your programs. In some cases, the ValueDisruptor would be a good way to improve the memory layout. In other cases, when sub-objects are required for the event types, the memory layout can be improved by reducing the event size.

The development of the ValueDisruptor seems fairly justified by those results. It provides an easy way to improve Disruptor event packing without using a complex off-heap ring buffer. It should be available in the next NuGet release. I strongly recommend considering it for your future low latency projects.

Many thanks to @Lucas_Trz, @MendelMonteiro and @federicolois for the reviews.
