Led Zeppelin & Circular Buffers in F#

Still Dazed & Confused by circular buffers? Let’s Ramble On with some more aging rockers, and see if we can find a Stairway To Heaven

Chris White
The Startup
8 min read · May 2, 2020


“And her voice is sore from shouting
Cheering winners who are losing
And she worries if their days are few and soon they’ll have to go”
- Celebration Day

In our previous article, aging leather-skinned crooner Mick Jagger helped us to understand how efficient data structures might help us in our search for the optimal circular buffer. Now we turn our incantations to the high priests of rock, Led Zeppelin, to see if they can give our project a Whole Lotta Love.

1. Good Times, Bad Times

Before going much further, let’s do some proper benchmarking. We’ll be using the BenchmarkDotNet framework to do so. All the code is in the GitHub repo here.

First, we recap the 3 buffers we created in the previous article:

  1. Circular Buffer 1: A simple buffer using an array of fixed size which is equal to the size of buffer we require.
  2. Circular Buffer 2: A more complex buffer, using an array of twice the size of the buffer we require (allowing us always to return a contiguous slice).
  3. Circular Buffer 3: A complex buffer as for (2), but wrapped in a struct.

We’ll then measure the time it takes to create each buffer, the time it takes to push data to each buffer, and the time it takes to retrieve data from each buffer.

Time in nanoseconds to create a single buffer of length N
Time in nanoseconds to push a single value to a buffer N times in sequence
Time in nanoseconds to pull a slice of ticks N long from a buffer

The results are interesting and slightly disturbing. As expected, creating our complex buffers (buffers 2 & 3) takes around twice as long as creating the simple buffer (buffer 1), since their backing arrays are twice the length. As for pushing data, there is virtually no difference between them. For pulling data, however, the complex data structure is clearly far more efficient, both in execution time and in heap allocation.

However, pushing and pulling data with our complex buffer (buffer 2) versus our struct-wrapped complex buffer (buffer 3) also shows very little difference. Aren’t structs supposed to make us faster? Has there been a Communication Breakdown?

2. Gallows Pole

The answer, of course, lies in our float array. A struct containing only value types lives on the stack and is very fast to access. An array, however (even an array of value types held inside a struct), lives on the heap and is consequently much slower to reach. That is what is happening here: we are getting no uplift from our struct.
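To make that concrete, here is a minimal sketch (the type name is mine, not from the repo). The struct itself is a value type, but its array field holds only a reference; the elements themselves stay on the heap:

```fsharp
// A struct wrapping a float array. Marking the record [<Struct>] makes the
// wrapper a value type, but the Data field is still just a reference to a
// heap-allocated array, so element access pays the same cost as before.
[<Struct>]
type StructBuffer =
    { Data: float []   // heap-allocated, even inside a struct
      Head: int }
```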

Cryin’ won’t help you, cryin’ won’t do you no good

Is there a solution? Yes, and we’ll look at it below. Before that, we’re going to take a look at two new types introduced in 2018 across C# 7.2, F# 4.5 and .NET Core 2.1: the Span<T> type and the Memory<T> type, and see if they might help us.

3. Hammer of the Gods

Span<T> and Memory<T> are, for our purposes, high-speed array accessors and slicers. They allow low-level, type-safe access to regions of contiguous memory. Where we would normally take an array (say, a float array called myArray) and access it via indexers such as myArray.[10] or myArray.[10..14], we can gain access speed by creating a span over it (e.g. Span<float>(myArray)) and then accessing its contents via mySpan.[10] or mySpan.Slice(10, 5).ToArray() instead.
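The contrast looks roughly like this in F# (myArray and mySpan are illustrative names, not from the repo):

```fsharp
open System

let compareAccess () =
    let myArray = Array.init 20 float
    // Plain array access: the range slice below copies into a new array.
    let viaArray = myArray.[10..14]
    // Span access: wrap the array once, then slice without copying
    // until we explicitly call ToArray().
    let mySpan = Span<float>(myArray)
    let viaSpan = mySpan.Slice(10, 5).ToArray()
    viaArray = viaSpan
```

Note that the span has to live inside a function here; as we’ll see shortly, it cannot be stored in a module-level binding or a field.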

But like Bonzo on the drums, that speed and naked energy comes with the potential to cause some horrendous problems. This is low level programming which can horrifically backfire unless we are careful.

Span<T> and Memory<T> also come with certain restrictions, detailed here. The one that concerns us most here is the restriction on Span<T> that, as a stack-only type, we cannot run any sort of closure over it. Ideally, in our new Circular Buffer 4 (which uses Span<T> to access array contents) we would have something like this:
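A hedged sketch of that ideal shape (the names are mine, not the repo’s); the Span field is exactly what the compiler rejects:

```fsharp
// Does NOT compile: a byref-like type such as Span<'T> cannot be stored
// as a field of an ordinary (heap-allocatable) type.
type CircularBuffer4 =
    { Data: float []
      Span: Span<float>     // <- the offending field
      mutable Head: int }
```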

But doing so generates the compiler error “A type would store a byref typed value. This is not permitted by Common IL.”. As mentioned above we cannot do this, so we will need to create a new Span<T> each time we need to use it, thus:
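A compiling sketch of that pattern, assuming a double-length array as for buffer 2 (member names and layout are illustrative, not the repo’s exact code):

```fsharp
open System

type CircularBuffer4(size: int) =
    let data = Array.zeroCreate<float> (size * 2)
    let mutable head = 0

    member this.Push(value: float) =
        // A fresh Span<float> is created on every call...
        let span = Span<float>(data)
        span.[head] <- value
        span.[head + size] <- value   // mirror write keeps every window contiguous
        head <- (head + 1) % size

    member this.Pull(count: int) =
        // ...and again here, since we cannot keep one in a field.
        let span = Span<float>(data)
        span.Slice(head + size - count, count).ToArray()
```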

How much overhead will this create? We’ll see below.

Memory<T> is much more forgiving. It is not stack-only, and it plays better with asynchronous code. Against that, however, it is reported to be slower, though we will benchmark it properly below. We can create our Memory<T> up front, which should be a win, although Memory<T> has no indexer of its own, which forces us to obtain a Span<T> from the Memory<T> and do the element access and slicing from there…
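That shape, as a sketch (names illustrative): the Memory<T> can be created and stored up front, but reads go through its Span property.

```fsharp
open System

let demoMemory () =
    let data = Array.init 10 float
    // Memory<'T> is not byref-like, so it can be stored in fields or closures.
    let memory = Memory<float>(data)
    // No indexer on Memory<'T> itself: obtain a Span and slice from there.
    memory.Span.Slice(3, 4).ToArray()
```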

We’re going to create regular versions of the complex buffer using Span<T> and Memory<T>, and we’re going to create struct-wrapped versions as well.

“Sometimes all of our thoughts are misgiven”, as Robert Plant reminds us. But in this case, we might also be wondering whether that repeated Span<T> creation is really needed. Couldn’t we get a buffer to return a Span<T> over its array, which we could then pass back in every time we wanted to push or pull a value? Something like this:
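One possible shape for that idea, again under the double-length layout (a sketch with invented names, not the repo’s code):

```fsharp
open System

type CircularBuffer8(size: int) =
    let data = Array.zeroCreate<float> (size * 2)
    let mutable head = 0

    // The caller fetches a span over the backing array once per stack frame...
    member this.GetSpan() = Span<float>(data)

    // ...and hands it back in on every push and pull, so the buffer never
    // has to recreate it internally.
    member this.Push(span: Span<float>, value: float) =
        span.[head] <- value
        span.[head + size] <- value
        head <- (head + 1) % size

    member this.Pull(span: Span<float>, count: int) =
        span.Slice(head + size - count, count).ToArray()
```

The caller still cannot store the returned span anywhere long-lived, of course; it has to be re-fetched in each stack frame that uses it.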

Worth a shot. We’ll create another buffer like this, as well as a struct-wrapped version for good measure. To recap our now many buffers:

  1. Circular Buffer 1: A simple buffer using an array of fixed size which is equal to the size of buffer we require.
  2. Circular Buffer 2: A more complex buffer, using an array of twice the size of the buffer we require (allowing us always to return a contiguous slice).
3. Circular Buffer 3: A complex buffer as for (2), but wrapped in a struct.
  4. Circular Buffer 4: As for (2), but using Span<T> internally.
  5. Circular Buffer 5: As for (2) but using Memory<T> internally.
  6. Circular Buffer 6: As for (4) but struct wrapped.
  7. Circular Buffer 7: As for (5) but struct wrapped.
  8. Circular Buffer 8: As for (2), but offering an externally accessible Span<T>.
  9. Circular Buffer 9: As for (8) but struct wrapped.

Time for some benchmarking.

Measuring a summer’s day,
I only find it slips away to grey.
The hours they bring me pain.
- Tangerine

“You need coolin’, baby, I’m not foolin’”. Was Robert Plant singing perhaps about a laptop after a big benchmarking session? Quite likely.

We have some winners here. As observed above, creation time is really just a function of array length. Circular Buffer 1 wins out here because its array is only half the length of our more complex structures’. Every other buffer is much the same in this regard, as we would expect.

Pushing data is interesting. Span<T>, as we expected, causes a small increase in execution time, since we must instantiate a new span on every push and the cost is not offset by any access-speed uplift. The penalty is small, however.

Those big spikes in the push chart belong to buffers 5 and 7, our Memory<T> buffers. For all their possible advantages, they are not going to work out for us. Babe, I’m gonna leave you…

The big wins come in the data-pull benchmarks. We’re pulling from the middle of a 1,000-tick buffer here, which has been cycled through once. For pulls of 10 ticks, there is very little difference. Above that, however, access times halve for all of our new buffers versus our first three attempts. Our externally accessible Span<T> versions aren’t doing much more for us, suggesting that this line of optimization has run its course.

4. Cache-mir

Do we stop here? Mr Plant exhorts us to keep going: Squeeze me baby, ‘till the juice runs down my leg. Fairly sure he’s telling us to squeeze all performance gains from this.

We noted above that the struct-wrapped buffers don’t actually place their internal float arrays onto the stack. We like to think of the stack as fast, so can we force this somehow?

Microsoft.FSharp.NativeInterop gives us an unsafe solution in the form of NativePtr.stackalloc. Like its C# equivalent, stackalloc, this definitively places the array onto the stack. In order to access it, we use the same Span<T> paradigm as above:
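A sketch of that paradigm (function and variable names are mine; the #nowarn "9" directive acknowledges the compiler’s warning about unverifiable pointer code):

```fsharp
#nowarn "9"   // NativePtr code is unverifiable IL; suppress the warning

open System
open Microsoft.FSharp.NativeInterop

let sumOnStack (n: int) =
    // Allocate n floats directly in the current stack frame (no heap, no GC)...
    let ptr = NativePtr.stackalloc<float> n
    // ...then wrap the raw pointer in a Span<float> for safe indexed access.
    let span = Span<float>(NativePtr.toVoidPtr ptr, n)
    for i in 0 .. n - 1 do
        span.[i] <- float i
    let mutable total = 0.0
    for i in 0 .. n - 1 do
        total <- total + span.[i]
    total
```

The stack frame, not the GC, owns this memory, which is exactly why very large allocations blow up, as we’re about to see.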

So, how does this new buffer “stack” up?

The Song Remains (Largely) the Same. Pushing and pulling are largely a wash, perhaps slightly worse in some cases. Creation is much faster, however, as we would expect from something allocated on the stack. Except that the million and ten-million length versions ran into stack overflows and crashed.

Clearly, this doesn’t make much sense for us, except in the case where we anticipate creating hundreds of short-lived buffers, which is a very unlikely thing to ever want to do.

5. Celebration Day

We have driven our ships to new lands. From our original Circular Buffer 1, where we were typically pulling data at a rate of 1.5ns per tick, we have reduced that to 1.0ns per tick through smarter data structure architecture, and further to 0.5ns per tick through efficient memory access structures.

Buffers 4, 6, 8 & 9 are almost identical in terms of execution speed, so which one makes the final cut? For most purposes, Circular Buffer 4 (complex buffer using Span<T>) makes the grade.

A final caveat. We’ve had a singular focus on execution speed in these articles, which has come at the expense of memory management and memory safety. If we’re diligent, we can manage this, but don’t be like that famous lady who was sure that all that glitters is gold…
