Improving .NET Disruptor performance — Part 2

Olivier Coanet
Apr 24, 2018


This is the second part of a series of posts on the .NET Disruptor performance:
- Part 1.
- Part 3.

General thoughts on interface method calls

If there is one aspect of the .NET Disruptor that reveals that it was ported from Java, it is its extensive use of interfaces. The library includes 29 interfaces for 54 classes. There are interface method calls all over the codebase. This is not a problem for the Java version, because interface method calls are fairly well optimized by the JVM. The CLR and the JIT are clearly not as efficient in this domain. .NET Core is making good progress regarding devirtualization, but both .NET runtimes are still far behind Java right now. If you are interested in how interface dispatch is implemented, you should read the Virtual Stub Dispatch article in the Book of the Runtime.

There is one situation, however, where interface calls are really efficient in .NET: when the invocation target type is a generic parameter instantiated with a struct. Because the JIT generates a specific version of your generic methods for each struct type argument, interface calls are replaced by direct calls, and can even be inlined.

Here is an example of a method invoked with a struct as a generic argument:
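
A minimal sketch of such a method is shown below. Only the name SumValues comes from this post; IValueSource, ArrayValueSource and Calculator are hypothetical names chosen for illustration, and whether a given call ends up direct or inlined depends on the runtime and JIT version.

```csharp
using System;

// Hypothetical interface with two members, so that SumValues makes two interface calls.
public interface IValueSource
{
    int Count { get; }
    int GetValue(int index);
}

// Struct implementation: generic code instantiated with this type gets its own machine code.
public readonly struct ArrayValueSource : IValueSource
{
    private readonly int[] _values;

    public ArrayValueSource(int[] values) => _values = values;

    public int Count => _values.Length;
    public int GetValue(int index) => _values[index];
}

public static class Calculator
{
    // The constraint lets the JIT bind Count and GetValue statically when T is a struct,
    // instead of going through interface dispatch.
    public static int SumValues<T>(T source) where T : IValueSource
    {
        var sum = 0;
        for (var i = 0; i < source.Count; i++)
            sum += source.GetValue(i);
        return sum;
    }
}
```

Passing the struct as T (and not as IValueSource) is what matters here: `Calculator.SumValues(new ArrayValueSource(values))` is specialized, whereas a method taking an IValueSource parameter would box the struct and use regular interface dispatch.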

And here is the corresponding machine code for SumValues:

You can see that the first interface call is performed using a direct call. The call target address is a constant in the assembly and not loaded from an intermediate pointer, as would have been the case for a standard dispatch stub. The second interface call got inlined into the method.

Removing interface calls from BatchEventProcessor: a first step

My first goal was to reduce the number of interface calls in the BatchEventProcessor. This class is responsible for running the main loop that consumes the RingBuffer events and sends them to an event handler.

It has a few notable traits:

  • It is a long-lived object which is generally instantiated on application startup, when setting up the Disruptor. Therefore it is not an issue to introduce an optimization that makes the creation of the BatchEventProcessor a bit slower.
  • It is instantiated by the Disruptor DSL in most use cases. For that reason, it is an acceptable option to create an entirely new version of BatchEventProcessor.
  • It is a public type, so it would be better to keep a compatible version of BatchEventProcessor to avoid breaking changes. However, it would not be a good design choice to have different event processor types that would have to be maintained in parallel.

Here is the pseudo code of the BatchEventProcessor main loop:
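
As a rough, runnable approximation of that loop, the sketch below processes a single batch and marks each interface call. The interfaces are simplified stand-ins for the ones named in this post (the real ones may have more members), and LoopSketch, ArrayRing and SummingHandler are hypothetical names that exist only to make the example executable.

```csharp
using System;

// Simplified stand-ins for the Disruptor interfaces named in this post.
public interface IDataProvider<T> { T this[long sequence] { get; } }
public interface ISequenceBarrier { long WaitFor(long sequence); }
public interface IEventHandler<T> { void OnEvent(T data, long sequence, bool endOfBatch); }
public interface IBatchStartAware { void OnBatchStart(long batchSize); }

public sealed class LoopSketch<T>
{
    private readonly IDataProvider<T> _dataProvider;
    private readonly ISequenceBarrier _sequenceBarrier;
    private readonly IEventHandler<T> _eventHandler;
    private readonly IBatchStartAware _batchStartAware; // null if the handler does not implement it

    public LoopSketch(IDataProvider<T> dataProvider, ISequenceBarrier sequenceBarrier, IEventHandler<T> eventHandler)
    {
        _dataProvider = dataProvider;
        _sequenceBarrier = sequenceBarrier;
        _eventHandler = eventHandler;
        _batchStartAware = eventHandler as IBatchStartAware;
    }

    // Processes one batch and returns the next sequence to process.
    // The real processor repeats this until it is halted.
    public long ProcessBatch(long nextSequence)
    {
        var availableSequence = _sequenceBarrier.WaitFor(nextSequence);          // interface call
        _batchStartAware?.OnBatchStart(availableSequence - nextSequence + 1);    // interface call

        while (nextSequence <= availableSequence)
        {
            var evt = _dataProvider[nextSequence];                               // interface call (indexer)
            _eventHandler.OnEvent(evt, nextSequence, nextSequence == availableSequence); // interface call
            nextSequence++;
        }
        return nextSequence;
    }
}

// Hypothetical in-memory implementations, only here to make the sketch runnable.
public sealed class ArrayRing : IDataProvider<int>, ISequenceBarrier
{
    private readonly int[] _items;
    public ArrayRing(int[] items) => _items = items;
    public int this[long sequence] => _items[sequence];
    public long WaitFor(long sequence) => _items.Length - 1; // everything is already published
}

public sealed class SummingHandler : IEventHandler<int>, IBatchStartAware
{
    public int Sum;
    public long LastBatchSize;
    public void OnEvent(int data, long sequence, bool endOfBatch) => Sum += data;
    public void OnBatchStart(long batchSize) => LastBatchSize = batchSize;
}
```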

There are clearly too many interface calls here. Even the indexer used to read events from the RingBuffer is invoked through an interface!

The interface used to read the RingBuffer events is IDataProvider<T>:

The instance of _dataProvider is most likely a RingBuffer. Therefore, my first idea was to add a generic parameter to BatchEventProcessor in order to create specialized instances of the event processors for the RingBuffer.

Of course, the specialized version could only be generated for a value type, so I added a proxy struct:

In addition, I introduced a factory method to create the BatchEventProcessor:
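
The combination of proxy struct, generic parameter and factory method can be sketched as follows. This is an illustration under assumptions, not the library's code: RingBuffer<T> is reduced to an array wrapper, and RingBufferDataProvider, InterfaceDataProvider, IEventProcessor, EventProcessor and EventProcessorFactory are hypothetical names.

```csharp
using System;

public interface IDataProvider<T> { T this[long sequence] { get; } }

// Stand-in for the real RingBuffer<T>: a sealed class with a non-virtual indexer.
public sealed class RingBuffer<T> : IDataProvider<T>
{
    private readonly T[] _entries;
    public RingBuffer(T[] entries) => _entries = entries;
    public T this[long sequence] => _entries[sequence % _entries.Length];
}

// Proxy struct: forwards to the concrete RingBuffer<T>, so the call made inside
// the struct is a regular call on a sealed class, not an interface call.
public readonly struct RingBufferDataProvider<T> : IDataProvider<T>
{
    private readonly RingBuffer<T> _ringBuffer;
    public RingBufferDataProvider(RingBuffer<T> ringBuffer) => _ringBuffer = ringBuffer;
    public T this[long sequence] => _ringBuffer[sequence];
}

// Fallback for any other data provider: it still pays for one interface call.
public readonly struct InterfaceDataProvider<T> : IDataProvider<T>
{
    private readonly IDataProvider<T> _target;
    public InterfaceDataProvider(IDataProvider<T> target) => _target = target;
    public T this[long sequence] => _target[sequence];
}

// Coarse-grained interface through which callers hold the specialized processor.
public interface IEventProcessor { void ProcessRange(long from, long to); }

public sealed class EventProcessor<T, TDataProvider> : IEventProcessor
    where TDataProvider : struct, IDataProvider<T>
{
    private TDataProvider _dataProvider; // not readonly, to avoid defensive copies of the struct
    private readonly Action<T> _onEvent;

    public EventProcessor(TDataProvider dataProvider, Action<T> onEvent)
    {
        _dataProvider = dataProvider;
        _onEvent = onEvent;
    }

    public void ProcessRange(long from, long to)
    {
        for (var sequence = from; sequence <= to; sequence++)
            _onEvent(_dataProvider[sequence]); // bound statically because TDataProvider is a struct
    }
}

public static class EventProcessorFactory
{
    // Picks the specialized instantiation once, at creation time.
    public static IEventProcessor Create<T>(IDataProvider<T> dataProvider, Action<T> onEvent)
    {
        if (dataProvider is RingBuffer<T> ringBuffer)
            return new EventProcessor<T, RingBufferDataProvider<T>>(new RingBufferDataProvider<T>(ringBuffer), onEvent);

        return new EventProcessor<T, InterfaceDataProvider<T>>(new InterfaceDataProvider<T>(dataProvider), onEvent);
    }
}
```

The type check happens only once, in the factory, which fits a long-lived object created at startup: the hot loop itself never tests which kind of data provider it has.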

This simple change had a significant impact on the performance tests:

Removing interface calls from BatchEventProcessor: the automated way

This specialization on IDataProvider<T> was a good start. The same technique could also be used to remove the interface calls to ISequenceBarrier. However, this technique had a few limitations:

  • It could not be used on IEventHandler<T>, because it is a user-specified instance.
  • It made BatchEventProcessor faster for the main use cases, but it did not support custom IDataProvider<T> or custom ISequenceBarrier.

To solve both of these issues, I decided that I would generate the struct proxies using System.Reflection.Emit. The code is quite simple because the generated types are only delegating the calls to a private instance. You can find it here.
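
To give an idea of what such generated code involves, here is a self-contained sketch that emits a delegating struct for a well-known BCL interface, IComparer<int>, instead of the Disruptor interfaces. It is not the library's generator: ComparerProxyFactory, IntComparerProxy and Sorter are hypothetical names, and the sketch assumes a runtime where System.Reflection.Emit is available.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;

public static class ComparerProxyFactory
{
    // The type is emitted once and then reused.
    private static readonly Lazy<Type> ProxyType = new Lazy<Type>(BuildProxyType);

    // Returns a boxed instance of the generated struct wrapping the target.
    public static object Wrap(IComparer<int> target) => Activator.CreateInstance(ProxyType.Value, target);

    // Emits the equivalent of:
    //   public struct IntComparerProxy : IComparer<int>
    //   {
    //       private IComparer<int> _target;
    //       public IntComparerProxy(IComparer<int> target) { _target = target; }
    //       public int Compare(int x, int y) => _target.Compare(x, y);
    //   }
    private static Type BuildProxyType()
    {
        var module = AssemblyBuilder
            .DefineDynamicAssembly(new AssemblyName("GeneratedProxies"), AssemblyBuilderAccess.Run)
            .DefineDynamicModule("GeneratedProxies");

        // Deriving from ValueType is what makes the generated type a struct.
        var typeBuilder = module.DefineType(
            "IntComparerProxy",
            TypeAttributes.Public | TypeAttributes.Sealed | TypeAttributes.SequentialLayout,
            typeof(ValueType));
        typeBuilder.AddInterfaceImplementation(typeof(IComparer<int>));

        var field = typeBuilder.DefineField("_target", typeof(IComparer<int>), FieldAttributes.Private);

        var ctor = typeBuilder.DefineConstructor(
            MethodAttributes.Public, CallingConventions.Standard, new[] { typeof(IComparer<int>) });
        var ctorIl = ctor.GetILGenerator();
        ctorIl.Emit(OpCodes.Ldarg_0);      // address of the struct being initialized
        ctorIl.Emit(OpCodes.Ldarg_1);      // target
        ctorIl.Emit(OpCodes.Stfld, field);
        ctorIl.Emit(OpCodes.Ret);

        var interfaceMethod = typeof(IComparer<int>).GetMethod("Compare");
        var method = typeBuilder.DefineMethod(
            "Compare",
            MethodAttributes.Public | MethodAttributes.Final | MethodAttributes.Virtual
                | MethodAttributes.HideBySig | MethodAttributes.NewSlot,
            typeof(int),
            new[] { typeof(int), typeof(int) });
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);          // address of the struct
        il.Emit(OpCodes.Ldfld, field);     // _target
        il.Emit(OpCodes.Ldarg_1);          // x
        il.Emit(OpCodes.Ldarg_2);          // y
        il.Emit(OpCodes.Callvirt, interfaceMethod);
        il.Emit(OpCodes.Ret);
        typeBuilder.DefineMethodOverride(method, interfaceMethod);

        return typeBuilder.CreateType();
    }
}

public static class Sorter
{
    // Specialized by the JIT for each struct TComparer.
    public static int Max<TComparer>(int[] values, TComparer comparer) where TComparer : IComparer<int>
    {
        var max = values[0];
        for (var i = 1; i < values.Length; i++)
            if (comparer.Compare(values[i], max) > 0) max = values[i];
        return max;
    }

    // The proxy type is only known at run time, so the generic method is closed through reflection.
    public static int MaxWithGeneratedProxy(int[] values, IComparer<int> comparer)
    {
        var proxy = ComparerProxyFactory.Wrap(comparer);
        var method = typeof(Sorter).GetMethod(nameof(Max)).MakeGenericMethod(proxy.GetType());
        return (int)method.Invoke(null, new object[] { values, proxy });
    }
}
```

In this sketch the reflection cost is paid on every call to MaxWithGeneratedProxy; for a long-lived object, the same reflection step would presumably run once at creation time, with the specialized instance then used directly.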

Then I added generic parameters to BatchEventProcessor for IDataProvider<T>, ISequenceBarrier, IEventHandler<T> and IBatchStartAware:

In addition, I updated the factory method to create specialized instances of BatchEventProcessor:

Note that there is a special case for IBatchStartAware: if the handler implements the interface, a proxy can be generated; otherwise, a NoopBatchStartAware struct is used. This way the null-conditional operator in _batchStartAware?.OnBatchStart can be removed. Because the code will be specialized for the specified IBatchStartAware type, the invocation of _batchStartAware.OnBatchStart will be inlined and thus removed for NoopBatchStartAware.
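
The no-op case can be illustrated with a hand-written sketch. NoopBatchStartAware and IBatchStartAware are named in this post, but their bodies here are assumptions; BatchStartAwareProxy, BatchNotifier and CountingHandler are hypothetical, and the proxy is written by hand where the post generates it at run time.

```csharp
using System;

// Simplified stand-in for the interface named in this post.
public interface IBatchStartAware { void OnBatchStart(long batchSize); }

// Empty implementation: once the call is inlined, nothing is left to execute.
public readonly struct NoopBatchStartAware : IBatchStartAware
{
    public void OnBatchStart(long batchSize) { }
}

// Hand-written equivalent of a generated proxy, used when the handler implements the interface.
public readonly struct BatchStartAwareProxy : IBatchStartAware
{
    private readonly IBatchStartAware _target;
    public BatchStartAwareProxy(IBatchStartAware target) => _target = target;
    public void OnBatchStart(long batchSize) => _target.OnBatchStart(batchSize);
}

public sealed class BatchNotifier<TBatchStartAware>
    where TBatchStartAware : struct, IBatchStartAware
{
    private TBatchStartAware _batchStartAware; // a struct, so it can never be null

    public BatchNotifier(TBatchStartAware batchStartAware) => _batchStartAware = batchStartAware;

    // No null-conditional operator: there is always an instance, possibly a no-op one.
    public void StartBatch(long batchSize) => _batchStartAware.OnBatchStart(batchSize);
}

// Hypothetical handler used to observe the call.
public sealed class CountingHandler : IBatchStartAware
{
    public long LastBatchSize;
    public void OnBatchStart(long batchSize) => LastBatchSize = batchSize;
}
```

With `new BatchNotifier<NoopBatchStartAware>(default)`, StartBatch has an empty body to inline, while `new BatchNotifier<BatchStartAwareProxy>(new BatchStartAwareProxy(handler))` forwards the batch size to the handler.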

Even though I doubt that any .NET Disruptor clients directly reference or instantiate the BatchEventProcessor, the type is public, so the new generic parameters are a breaking change. I decided to keep a non-generic version of the BatchEventProcessor for compatibility:

Of course, you do not get any performance improvement with this version.

Here are the performance test results for the final version:

Solution generalization and limitations

I used the same technique to remove interface calls from ProcessingSequenceBarrier. This is quite funny, because the ProcessingSequenceBarrier itself is used by the BatchEventProcessor, so I ended up with a BatchEventProcessor<Event, RingBuffer, ProcessingSequenceBarrier<Sequencer, WaitStrategy>, EventHandler, BatchStartAware>.

Yet the technique cannot be applied to the whole codebase. If you want to use your generated instances, you either need to add the same generic parameters to the types or methods that use them, or you have to reference your generated instances through an interface.

On the one hand, adding generic parameters to every type is not something that I want to do. I want users to still be able to manipulate a Disruptor<T> and not a Disruptor<T, TSequencer>, even though it might be a good idea performance-wise. On the other hand, if my generated instances are accessed through an interface, it will introduce new interface calls, which will defeat the purpose of generating specialized instances.

So this technique can only be used on long-lived instances that are accessed through a coarse-grained interface. If your instances are not already accessed through an interface, the performance gain must outweigh the extra interface calls introduced.

The next step

This struct generation technique yielded impressive performance improvements for most of the .NET Disruptor use cases. It was also the source of a nice latency reduction and stabilization. A few performance gains are still to come, but they will get harder and harder to find and to implement!

Many thanks to @Lucas_Trz, @romainverdier and @MendelMonteiro for the reviews.
