Earlier in the same series:
Interestingly, the draft for this post was written a couple months ago, and it was relatively short. The gist of it was: “Go’s GC is clearly inferior to what .NET has, see the following posts: 1, 2, 3, 4 (note that some of them are fairly recent) for details.”
But… I couldn’t stop myself to somehow test this, so I’ve asked one of my friends — a Go expert — to help me with the benchmark. And we wrote GCBurn — a relatively simple garbage collection & memory allocation benchmark that currently supports Go and C#, though you are free to port it to any other language with GC.
Now, let’s get into the woods :)
What is Garbage Collection?
The post is pretty long, so please skip this section if you know what GC is.
Garbage Collection (GC, Garbage Collector) is a part of runtime responsible for reclaiming the memory used by “dead” objects. That’s how it works:
- “Alive” object is any object in the heap that is either used right now (~ a pointer to it is stored in one of CPU registers), or could be potentially used in future (~ there could be a program eventually acquiring a pointer to such an object). If you look at your heap as a graph of objects referencing each other, it’s easy to notice that if some object O is alive, then every object it references directly (O1, O2, … O_m) is alive as well: having a pointer to O, you can acquire a pointer to O1, O2, …O_m using a single CPU instruction. Same can be said about objects referenced by O1, O2, …O_m — each of these objects is alive as well. In other words, if object Q is reachable from some alive object A, Q is alive as well.
- “Dead” objects are all other objects in the heap. They are “dead” because there is no way for the code to somehow acquire a pointer to any of them in future. There is no way to find them, and thus no way to use them.
- A good real-world analogy of this is: you want to find out which cities (objects) are reachable by road network assuming you start your trip from any city with the airport (GC root).
This definition also explains the fundamental part of any garbage collection algorithm: it has to check what’s reachable (alive) from time to time, and delete everything else. That’s how it typically happens:
- Freeze every thread
- Mark all GC roots (objects referenced from CPU registers, locals / call stack frames, or static fields — i.e. everything that’s either used right now or is immediately available) as alive
- Mark every object that’s reachable from GC roots as alive as well; consider everything else dead.
- Make the memory allocated by dead objects available again — e.g. you can just mark it as “available for future allocations”, or compact the heap by moving all alive objects so that there are no gaps
- Finally, unfreeze all the threads.
What’s described here is typically called “Mark and Sweep” GC, and it’s the most straightforward implementation possible — but not the most efficient one. In particular, it implies we have to pause everything to perform GC, that’s why collectors having such pauses are also called Stop-the-World, or STW collectors — contrary to pauseless collectors.
Pauseless aren’t much different from STW collectors in terms of how they solve the problem — they just do nearly everything concurrently with your code. And obviously, this is tricky — if we return back to our real-world analogy with cities and roads, it’s like trying to map the cities reachable from the airports assuming that:
- You actually don’t have a map —but there is a fleet of vehicles you operate
- While these vehicles are driving, new cities and roads are constructed, and some roads are destroyed.
All of this makes the problem more complex — in particular, you can’t build the road as you used to in this case: you have to check if the fleet is currently running (i.e. GC is looking for alive objects), and whether it has passed through a city your newly constructed road starts from (i.e. GC already marked that city as alive). If this is true, you have to notify the fleet (GC) to get back there and find all the cities reachable via the new road.
Translated to our non-fictional case, it means that while GC is running, any pointer write operation requires a special check now (write barrier) — and it’s still going to slow down your code.
There is no black-and-white difference between pauseless and STW GCs — it’s just about the duration of STW pauses:
- How STW pause duration depends on different factors? E.g. is it ~ fixed, or proportional to the size of alive object set (O(alive_set_size))?
- If these pauses are fixed, what’s the actual duration? If it’s tiny good enough for your specific case, it’s ~ the same as totally pauseless GC.
- If these pauses aren’t fixed, can we make sure that they are never longer than the maximum we can afford?
Finally, note that different GC implementations may optimize for different factors:
- Overall slowdown caused by GC (or overall program throughput) — i.e. roughly, % of time spent in GC +all related expenses (e.g. write barrier checks in the example above).
- The distribution of STW pause duration — obviously, the shorter, the better (sorry, no puns here) + ideally you don’t want to have O(aliveSetSize) pauses.
- Overall % of memory spent solely on GC, or due to its specific implementation. The extra memory might be used by either an allocator or GC directly, be unavailable because of heap fragmentation, etc.
- Memory allocation throughput — GC is typically tightly coupled with memory allocator. In particular, allocator may trigger a pause in the current thread if GC is leaping behind, may do a part of GC job, or may use more expensive data structures to enable certain kind of GC.
- Etc. — many more factors are listed here.
The worst part is: you clearly can’t get all of that at once, i.e. different GC implementations have their own benefits and trade-offs. Which is why it’s hard to write a good GC benchmark :)
What is GCBurn?
GCBurn is a benchmark we crafted to measure the most important GC performance metrics directly — i.e. by actually measuring them rather than querying performance counters or APIs provided by the runtime.
Direct measurements provide a few benefits:
- Portability: it’s fairly easy to port our benchmark to nearly any runtime — it doesn’t matter at all whether it has the APIs allowing to query what we measure somehow or not. And you’re definitely welcome to do this — e.g. I am really curious to see how Java compares with Go and .NET Core.
- Less questionable validity: It’s also easier to validate that the data you get is actually valid: getting the same numbers from runtime would always raise questions like “how can you be sure that you’re getting correct numbers?”, and good answers to these questions imply you’ll probably spend way more time studying a specific GC implementation & way the metrics are collected there than writing a similar test.
Secondly, GCBurn is designed to compare apples to apples — i.e. all of its tests do almost exactly the same sequence of actions/allocations— relying on identical distributions, identical random number generators, etc., etc. This is why its results for different languages and frameworks are expected to be directly comparable.
There are two tests GCBurn performs:
Peak Allocation Throughput (“Speed Test”)
The intent here is to measure peak burst allocation rate — the rate of memory allocation assuming nothing else (in particular, GC) slows it down.
- Start T threads / goroutines, where each thread:
- Allocates 16-byte objects (objects having two int64 fields) as quickly as possible
- Do this in a loop for 1ms, keep track of total # of allocations
- Wait for all threads to complete and the allocation rate (objects per second).
- Repeat this ~ 30 times and print the max. measured allocation rate.
This is a more complex test measuring:
- Peak sustained allocation throughput (objects / second, bytes / second) — i.e. a throughput measured over a relatively long period of time, assuming we allocate, hold, and eventually release each object, and that sizes and hold durations for these objects are following distributions that are close to real life ones.
- Distribution of frequency and duration of thread pauses caused by GC — 50% percentile (p50), p95, p99, p99.9, p99.99 + min, max, and mean values.
- Distribution of frequency and duration of STW (global) pauses caused by GC
That’s how the test works:
- Allocate “static set” of desirable size (I’ll explain this further)
- Start T threads / goroutines, where each thread:
- Allocates objects (actually, arrays/slices of int64-s) following pre-generated size & lifetime distribution pattern. The pattern is, in fact, a list of tuples of 3 values: (size, ~floor(log10(duration)), str(duration) — ‘0’). Last two values encode “hold duration” — its exponent and the first digit in its decimal representation in microseconds. It’s an optimization allowing “release” operation be quite efficient & have O(1) time complexity per every allocated object — we trade a bit of precision in exchange for speed here.
- Once every 16 allocations, try to release the allocated objects whose hold duration have already expired.
- For every loop iteration, measure the time current iteration took. If it took more than 10 microseconds (typically the iteration should take less than 0.1 microsecond), assume there was a GC pause, so log its start and end time into a list for this thread.
- Keep track of the # of allocations and the total size of allocated objects. Stop after D seconds.
- Wait for all the threads to complete.
When the above part is completed, the following is known for each thread:
- The number of allocations it was able to perform, as well as their total size in bytes
- The list of pauses (pause intervals) it experienced.
Having these lists for each thread, it’s possible to compute the list of intervals of STW pauses (i.e. periods when every thread was paused)— by simply intersecting all these lists.
Knowing all of that, it’s easy to produce the statistics mentioned above.
Now, some important details:
- I’ve mentioned that the allocation sequence (pattern) is pre-generated. This is done mostly because we don’t want to spend CPU cycles on generating a set of random numbers for every allocation. The generated sequence is composed of~ 1M items of ~ (size, log(duration)). See BurnTester.TryInitialize (C# / Go) to see the actual implementation.
- Each GCBurn thread uses the same sequence but starts from a random point there. When it reaches the end, it continues from the beginning of the sequence.
- To make sure the pattern is absolutely identical for every language, we use custom random number generator (see StdRandom.cs / std_random.go) — in reality, it’s C++ 11’s minstd_rand implementation ported to C# and Go.
- And overall, we make sure that all the random values we use are identical across platforms — thread start points in the allocation sequence, sizes and durations in this sequence, etc.
- 99% of “typical” objects + 0.99% of “large” ones + 0.01% of “extra large”, where:
- “typical” size is following the normal distribution with mean = 32 bytes and stdDev= 64 bytes
- “large” size is following the log-normal distribution with underlying normal distribution’s mean = log(2 Kb) = 11 and stdDev = 1.
- “extra large” size is following the log-normal distribution with underlying normal distribution’s mean = log(64 Kb) = 16 and stdDev = 1.
- The size is truncated to fit into [32B .. 128KB] range, and later translated to array/slice size taking into account the reference size (8B) and array header size (24B) for C#, and slice size (24B) for Go.
- Similarly, it’s composed of 95% “method-level” + 4.9% of “request-level” + 0.1% of “long-living” hold durations, where:
- “method-level” hold duration is following absolute value of a normally distributed variable with mean = 0 microseconds, stdDev = 0.1 microsecond
- “request-level” hold duration is following similar distribution, but with stdDev = 100ms (milliseconds)
- “long-living” hold duration is following the normal distribution with mean = stdDev = 10 seconds.
- Hold duration is truncated to fit into [0 … 1000 seconds] range.
And finally, the static set is a set of objects following exactly the same size distribution that’s never released during the test — in other words, it’s our alive set. If your app caches or stores a lot of data in RAM (or has some memory leak), it’s going to be large. Similarly, it’s supposed to be small for simple stateless apps (e.g. simple web / API servers).
If you read till this point, you’re probably eager to see the results — and here they are:
Test-all runs the following tests:
- Peak Allocation Throughput test (“Speed Test”) — using 1, 25%, 50%, 75%, and 100% of max. # of threads the system can actually run in parallel. So e.g. for Core i7–8700K “100% threads” = 12 threads (6 cores * 2 threads per core w/ hyperthreading).
- GCBurn test — for static set size = 0MB, 1MB, 10%, 25%, 50%, and 75% of total amount of RAM on tested machine, and using 100% of max. # of threads. Each test runs for 2 minutes.
- GCBurn test — all the same settings as in the previous case, but using 75% of max. # of threads the system can actually run in parallel.
- Finally, it runs all these tests in 3 modes for .NET — Server GC +SustainedLowLatency (the mode you’ll likely use on your production servers), Server GC + Batch, and Workstation GC. We do the same for Go — the only relevant option there is GOGC, but we haven’t noticed any difference after settings it to 50%: it seems that Go anyway runs GC ~ continuously on this test.
So let’s start. You can also open Google Spreadsheet with all the data I used for charts here, as well as “results” folder on GitHub with raw test output (much more numbers are there).
Just a reminder, 1 operation on this test = an allocation of a 16-byte object.
.NET clearly beats Go on burst allocations:
- Not only it’s 3x (Ubuntu) … 5x (Windows) faster on a single threaded test, but it also scales better with the number of threads — widening the gap to 12x on 96-core monster, m5.24xlarge.
- Heap allocations are extremely cheap on .NET. If you look at the numbers, they are actually just ~ 3–4x more expensive there than stack allocations: you can make ~ 1B simple calls per second per thread — vs almost 0.3B allocations.
- It looks like .NET Core is a bit faster on Windows, and on contrary, Go is almost 2x slower in comparison to Linux.
An operation on this test is a single allocation following pseudo-real-life distribution described earlier; moreover, every object allocated on this test has a lifetime following another pseudo-real-life distribution.
.NET is still faster on this test, though the gap isn’t that big here — it’s 20 … 50% dependently on OS (smaller on Linux, larger on Windows) and static set size.
You may also notice that Go couldn’t pass “Static set size = 50% RAM / 75% RAM” tests — it failed with OOM (Out of Memory). Running the test on 75% of available CPU cores helped Go to pass “Static set = 50% RAM” test, but it still couldn’t make it on 75%:
Increasing the duration of the test (the findings presented here are for 2 min. duration) was causing Go to fail way more reliably on “Static set = 50% RAM” test —so seemingly, GC there simply can’t keep up with the allocation pace, if the alive set is large enough.
Besides that, there are no any significant changes between 100% and 75% CPU core test passes.
It’s also obvious that both allocators don’t scale well with the number of cores. This chart proves the point:
As you see, ~ 5x more cores translate to only 2.5x speedup here.
It doesn’t seem like memory bandwidth is the bottleneck: ~ 70 M ops/sec. translate to ~ 6.5 GB/sec., which is nearly 10% of available memory bandwidth on Core i7 machine.
Also interestingly, Go starts to beat .NET on “Static set = 50% RAM” case. Are you curious why?
Yes, that’s an absolutely shameful part for .NET:
- Go’s pauses are barely visible here — because they are tiny. The max. pause I was able to measure was ~ 1.3 seconds — for the comparison, .NET got 125 sec. STW pause on the same test case.
- Nearly all STW pauses in Go are really sub-millisecond ones. If you look more real-life test case (see e.g. this file), you’ll notice that 16GB static set on a ~ regular 16-core server implies your longest pause = 50ms (vs 5s for .NET), and 99.99% of pauses are shorter than 7ms (92ms for .NET)!
- STW pause time seems to scale linearly with the static set size both for .NET and for Go. But if we compare larger pauses, they are ~ 100x shorter for Go.
- Conclusion: Go’s incremental GC does work; .NET’s concurrent GC doesn’t.
Ok, probably I ’ve scared all .NET developers really well at that point — especially assuming that I’ve mentioned GCBurn is designed to be close to real life. So are you expected to get similar pauses on .NET? Yes and no:
- ~ 185GB (it’s ~ 2 billions of objects, and GC pause time is actually dependent on this number rather than working set size in GB) static set is way beyond what you might expect in real life. Likely, even 16GB static set is way beyond what you might see in any well-designed app.
- “Well-designed” actually means: “based on findings presented here, no developer in their right mind will be crafting a .NET app relying on multi-gigabyte static set”. There are many ways to overcome this limitation, but ultimately, all of them are forcing you to store your data either in huge managed arrays, or in unmanaged buffers. .NET Core 2.1 — more specifically, its ref structs, Span<T>, and Memory<T> simplify these efforts a lot.
- Besides that, “well-designed” also means “without memory leaks”. As you might notice, if you’re leaking references in .NET, you are expected to have longer and longer STW pauses. Obviously, your app will eventually crash — but note that before it happens, it might also become temporarily unresponsive — all because of growing STW pauses. And the more extra RAM you have, the worse it’s going to be.
- Tracking max. GC pause time & post-Gen2 GC working set size must be critical to making sure you don’t suffer from increasing p95-p99 because of memory leaks.
Being .NET developer at heart, I really wish .NET Core team addresses the max_STW_pause_time = O(static_set_size) issue sooner. Unless it’s addressed, .NET developers will have to rely on workarounds, which actually isn’t good at all. And finally, even the fact it exists will act as a showstopper for many potential .NET apps — think of IoT, robotics, and other control applications; high-frequency trading, games or game servers, etc.
As for Go, it’s amazing how well this issue is addressed there. And it worth to note that Go team was fighting with STW pauses very consistently starting from 2014 — and eventually, they’ve managed to kill all O(alive_set_size) pauses (as the team claims — seemingly the test doesn’t prove it, but maybe it’s just because GCBurn goes too far to expose this :) ). Anyway, if you are interested in details of how it happened there, this post is a good one to start from: https://blog.golang.org/ismmkeynote
I am still asking myself what out of these two options I would prefer — i.e. a .NET’s faster allocator or Go’s tiny GC pauses. And frankly, I’m leaning toward Go here — mostly, because the performance of its allocator still doesn’t look bad, but 100x shorter pauses on large heaps are quite attractive. As for OOMs on larger heaps (or need to have 2x more memory to avoid OOM)— well, memory is cheap. Though this might be more important if you run multiple Go apps on the same machine (think desktop apps & microservices).
In short, this situation with STW pauses made me envy what Go developers have — probably, for the first time.
Ok, there is one last topic to cover — namely, other GC modes on .NET (spoiler: they can’t save the day, but it still worth talking about):
Server GC in concurrent mode (SustainedLowLatency or Interactive) provides the highest peak throughput, though the difference with Batch is minimal.
Sustained throughput is also highest in Server GC + SLL mode. Server GC + Batch is very close as well, but Workstation GC simply doesn’t scale with the growing static set size.
And finally, pauses:
I had to add this table (from mentioned Google Spreadsheet — see the last sheet there) to show that:
- Workstation GC actually has smaller STW pauses only when static set size< 16 GB; going beyond 16 GB makes it less and less attractive from this perspective — and almost 3x less attractive on 48 GB case in comparison to Server GC + Batch mode.
- Interestingly, Server GC + Batch starts to beat Server GC + SLL on static set size ≥ 16 GB — i.e. batch mode GC actually has smaller pauses than concurrent GC on large heaps.
- And finally, Server GC + SLL and Server GC + Batch are actually quite comparable in terms of pause time. I.e. concurrent GC on .NET clearly doesn’t do much concurrently — even though it could actually be quite efficient on our specific case. We create the static set before the main test, so seemingly there is no need to relocate much — nearly all the job GC has to do there is to mark alive objects, and that’s exactly what concurrent GC is supposed to do concurrently. So why it produces nearly the same shamefully long pause as batch GC is a complete mystery.
- You might ask why we didn’t test Server GC + Interactive mode — actually, we did, but didn’t notice a significant difference with Server GC + SLL.
- Has O(alive_object_count) STW pause time on Gen2 collections — no matter what GC mode you choose. Obviously, these pauses can be arbitrary long — all depends on your alive set size. We’ve measured 125 sec. pause on ~ 200 GB heap.
- Much faster (3 … 12x) on allocation bursts — such allocations are really similar to stack allocations on .NET.
- Typically 20 … 50% faster on sustained throughput tests. “Static set size = 200 GB” was the only case when Go went ahead.
- You should never use Workstation GC on .NET Core servers — or at least you should precisely know that your working set is tiny enough for it to be beneficial.
- Server GC in concurrent mode (SustainedLowLatency or Interactive) seems to be a good default — even though it doesn’t differ much with Batch mode, which is actually quite surprising.
- Doesn’t have O(alive_object_count) STW pauses — more precisely, it seems it actually has O(alive_object_count) pauses, but they are still 100x shorter than for .NET.
- Nearly any pause there is shorter than 1ms; the longest one we saw was 1.3 sec. — on a huge~ 200 GB alive set.
- It’s slower than .NET on GCBurn tests. Windows + i7–8700K is where we’ve measured the biggest difference — i.e. seemingly, Go has some issues with its memory allocator on Windows.
- Go couldn’t handle “static set = 75% RAM” case — never. This test on Go was always triggering OOM. Similarly, it was reliably failing on “static set = 50% RAM” case, if you run this test long enough (2 min. = ~ 50% chance of failure, 10 min. — I remember just one case when it didn’t crash). Seemingly, GC simply can’t keep up with the allocation pace there, and things like “use only 75% of CPU cores for allocations” don’t help. Not sure if this might be important in real-life though: allocation is all GCBurn does, and most of the apps don’t do just this. On the other hand, sustained concurrent allocation throughput is typically lower than non-concurrent peak throughput, so a real-life app producing similar allocation load on a multi-core machine doesn’t look fictional.
- But even taking all of that into account, it’s actually quite arguable what’s better to do in case your GC can’t keep up with the allocation pace: to pause the app for a few minutes or to fail with OOM. I bet most developers would actually prefer the second option.
- Peak allocation speed scales ~ linearly with the core count on burst allocation tests.
- On the other hand, sustained concurrent allocation throughput is typically lower than non-concurrent peak throughput — i.e. sustained concurrent throughput doesn’t scale well, both for Go and for .NET. Seemingly, it’s not because of memory bandwidth: the aggregate allocation speed can be 10x lower than the available bandwidth.
Disclaimer & Epilogue
- GCBurn is designed to measure a few very specific metrics. We tried to make it close to real life in some aspects, but obviously, it doesn’t mean that the numbers it outputs are what you should expect in your real app. As any performance test, it’s designed to measure the extremum of what it’s supposed to measure — and to ignore nearly everything else. So please don’t expect more than that from it :)
- I understand the methodology is arguable — frankly, it’s hard to find anything that’s not arguable here. So leaving minor issues alone, if you have some ideas on why it might be terribly wrong to evaluate GC as we did, please leave your comments. I’ll definitely be happy to discuss this.
- I am sure there are ways to improve the test w/o significantly increasing the amount of work or code. If you know how to do this, please leave the comments as well, or simply contribute.
- Similarly, please do the same if you find some bugs there.
- I intentionally didn’t focus on GC implementation details (generations, compactions, etc.). These details are obviously important, but there are lots of posts on that, as well as on modern garbage collection in general. And unfortunately, there are almost no posts on actual GC and allocation performance. That’s the gap I wanted to close.
- If you are willing to translate the test to some other language (e.g. Java) and write a similar post, it would be simply amazing.
As for my “Go vs C#” series, the next post is going to be about the runtime and type system. And since I don’t see any need to write a few thousand LOC test for this, it shouldn’t take that long — so stay tuned!
P.S. If you like the post, please don’t forget to “clap” for it :)