Go vs C#, part 1: goroutines vs async-await
I am going to write a series of posts comparing some features of Go and C#. The core feature of Go, goroutines, is actually a very good point to start from. C#’s alternative for this is the Task Parallel Library (TPL) and async-await support.
The implementations of these features are quite different:
- Async-await in C# is implemented as a method body transform performed by the compiler, similar to what C# does for IEnumerable<T> / IEnumerator<T> methods. The compiler generates a method returning a state machine (an instance of a compiler-generated type) that is responsible for evaluating the asynchronous computation.
- Goroutines are, in fact, regular functions in Go. All the magic associated with them happens when you start them with “go” syntax: Go starts them concurrently in a lightweight thread, i.e. a thread that uses a very small stack (which can grow, though) and is capable of asynchronously awaiting a channel read operation by suspending itself and releasing the OS thread to another lightweight thread.
- There is no “await” concept in Go: instead, goroutines are supposed to use channels to communicate. Later I’ll explain why you largely don’t need it there.
- There are lots of other differences; I’ll mention some of them later. But overall, async-await in C# is something that’s built on top of the existing platform, i.e. it doesn’t require any changes in the .NET CLR. In contrast, goroutines are deeply integrated into the Go runtime.
In this post I’ll focus on a relatively simple test:
- Create N goroutines; each one awaits a number on its input channel, adds 1 to it, and sends it to the output.
- Goroutines and channels are chained together, so that the message sent to the first channel eventually makes it to the last one.
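The two steps above can be sketched in Go roughly as follows. This is a minimal sketch rather than the exact benchmark code: the names `stage` and `buildChain` are mine, there is no timing logic, and N is reduced from the 1M used in the post to keep memory use modest.

```go
package main

import "fmt"

// stage reads one number from in, adds 1, and sends it to out.
func stage(in <-chan int, out chan<- int) {
	v := <-in
	out <- v + 1
}

// buildChain starts n goroutines connected by unbuffered channels and
// returns the first (input) and the last (output) channel of the chain.
func buildChain(n int) (chan<- int, <-chan int) {
	first := make(chan int)
	in := first
	for i := 0; i < n; i++ {
		out := make(chan int)
		go stage(in, out)
		in = out
	}
	return first, in
}

func main() {
	const n = 100000 // assumption: smaller than the 1M used in the post
	first, last := buildChain(n)
	first <- 0
	fmt.Println(<-last) // every stage adds 1, so the result equals n
}
```

Each goroutine handles exactly one message and then exits, so the chain has to be rebuilt for every pass.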
Output for Go:
Output for C#:
Before we start discussing the results, some notes on the test itself:
- This test is “pre-crafted” for Go: in C# you normally never need channels for async tasks to communicate. Tasks there typically call each other and asynchronously await the result. Nevertheless, the idiomatic way for goroutines to communicate in Go is channels, so I’ve decided to design a test that uses them.
- C# doesn’t have an official implementation for channels yet. I’ve been using an implementation that’s going to be official soon: System.Threading.Tasks.Channels. Currently it’s available via NuGet, the package version is 0.1 at this moment.
- To make the comparison fairer, in addition to the channel-based test in C# I’ve implemented an extra one relying only on async tasks. In this implementation each task awaits its “input” task, adds 1 to its output, and returns the result.
- The C# code has “warmup” logic that runs the same test for 1 message before the actual run for 1M messages, though the Go code doesn’t. The reason is that .NET emits method code on invocation, i.e. the very first run of any “small” function takes much longer. The warmup logic ensures we don’t capture JIT compilation time.
Comparison of raw results:
- First run of this test takes almost exactly the same time both on Go and on C#
- Second run is much faster on Go: the speedup factor is ~ 4.3x. There is no second run in C# code, but there is nothing in C# that could make it faster.
- Task-based version is ~ 2.05x faster on C#, but it’s still ~ 2x slower than the second run on Go.
So why is the second run on Go so much faster? The explanation is simple: when you start a goroutine, Go needs to allocate an 8KB stack for it. These stacks are reused, i.e. Go doesn’t need to allocate them again on the second run. The proof:
Go allocates almost 9GB for 1M goroutines and channels. Assuming each goroutine consumes at least 8KB for its stack, 8GB are necessary just for these stacks.
If we increase the number of messages passed in this test to 2M, it already fails on my machine. With 3M messages, it will fail even if no other apps (except some background ones) are running.
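If you want to check the per-goroutine stack cost on your own machine, something like the sketch below should do. It is not the code used for the numbers above: the function name `stackBytesPerGoroutine` is mine, and it looks only at the `StackInuse` field of `runtime.MemStats`, so channels and heap objects are not counted.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackBytesPerGoroutine parks n goroutines on a channel read and reports
// how much runtime.MemStats.StackInuse grew per goroutine.
func stackBytesPerGoroutine(n int) float64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var started, finished sync.WaitGroup
	started.Add(n)
	finished.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer finished.Done()
			started.Done()
			<-stop // stay parked so the stack remains allocated
		}()
	}
	started.Wait() // every goroutine has started running
	runtime.ReadMemStats(&after)

	close(stop) // release all goroutines
	finished.Wait()
	return (float64(after.StackInuse) - float64(before.StackInuse)) / float64(n)
}

func main() {
	const n = 100000 // assumption: enough goroutines to get a stable average
	per := stackBytesPerGoroutine(n)
	fmt.Printf("~%.0f bytes of stack per goroutine\n", per)
	fmt.Printf("~%.1f GB of stack for 1M goroutines\n", per*1e6/(1<<30))
}
```

The figure it prints depends on the Go version and platform, so treat it as a measurement to compare against the 8KB assumption rather than a constant.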
So the difference is more or less clear. Let’s think about why C# is generally slower on these tests:
- System.Threading.Tasks.Channels is in preview state, i.e. its performance is probably far from perfect at this point. E.g. it’s clear that awaiting on a channel is ~ 2x more expensive than awaiting on a task.
- If we get rid of channels, the task-based version is still 2x slower. Note, though, that there is an extra “await Task.Yield()”. I had to add it because when it is absent, .NET tries to execute the continuation immediately on task return without returning to the main loop of the current background thread, and as a result quickly exhausts the call stack and dies with StackOverflowException. In real life it’s never a problem, since you aren’t supposed to have long recursive chains in async code; nevertheless, it probably slows down this code by 1.5x or so.
- Even though tasks in C# are relatively lightweight objects, they are still allocated on the heap. The state machine itself is a reference type as well. Heap allocations are relatively fast in all modern languages with GC, but they’re still probably 5–10x slower than a similar stack expansion + call.
Now, let’s modify the test a bit and decrease the number of passed messages to 20K, a number that’s much closer to the maximum we expect to have in real life (20K open sockets on servers, etc.):
As you can see, C# gets closer to Go here:
- It beats Go during the first pass
- Channels-based test in C# is 2.7x slower
- Task-based test in C# is ~ 1.5x slower
And finally, the same test on 5K messages:
We see here that task-based test in C# outperforms the test on Go, though channel-based test on C# is still ~ 2x slower than the second pass on Go.
Why does C# benefit from a smaller number of tasks?
- The 5K test on Go uses ~ 5MB of RAM, which is still less than the L3 cache size for Core i7, but much more than the L2 cache size; on the other hand, it’s not quite clear why performance isn’t as good as it should be on the second pass, since the CPU caches only the accessed subset of data anyway.
- The C# version, being probably 10x more memory efficient, uses ~ 500KB of RAM on this test, which is much closer to the L2 cache size for Core i7 (256KB per core).
Goroutines vs async-await: conclusions
Let’s highlight the most important differences:
- Goroutines are clearly faster. In real-life scenarios you can expect something like a 2x to 3x difference per await. On the other hand, both implementations are quite efficient: you can expect something like 1M “awaits” per second in C#, and maybe 2–3M in Go, which is actually a fairly large number. E.g. if you handle network messages, this probably translates to 100K messages per second on Core i7 in C#, and way more on a real server, so this isn’t expected to be a bottleneck anyway.
- Real-life performance of C# async-await should be close to that of goroutines: C# is more memory-savvy, and the performance of most production apps depends mostly on how large their working set is.
- “8KB stack per goroutine” also means there is a higher chance of getting OOM in Go in certain scenarios, e.g. if your server processes every message by starting a goroutine, but all the processors are simply stuck awaiting some external (or internal) service that is busy at the moment. If the request rate is very high, you literally need seconds to get OOM based on the tests above. All you need is to get 2–3M messages, and that’s on a 32GB machine.
- C# does way more for asynchronous calls by default, which is another reason why it’s slower. In particular, it passes ExecutionContext and SynchronizationContext through the async-await call chain (i.e. there are multiple dictionary lookups for the corresponding thread local variables per call).
- The C# model is more explicit / robust (though it’s arguable whether that’s good or not; read further): all async code is decorated with async-await; besides that, there are lots of built-in primitives, in particular a few schedulers (e.g. passing async calls back to the UI thread in UWP apps), support for cancellation, synchronization, etc. A good example of this is the channels library I used: it’s relatively easy to add support for channels in C#, but something similar to async-await in Go requires way more boilerplate code.
- The C# model is more extensible: in fact, you can change almost anything there by adding your own schedulers, awaiters, and even your own task types. So if you really care about the performance, you can write much more lightweight tasks, or tasks pre-tuned for certain scenarios (a good example of this is ValueTask<T>, which is now a part of .NET). Another upcoming feature is support for async sequences (async streams), which is also based on the same set of APIs (although it requires changes to the C# compiler).
- Goroutines are easier to learn. It seems there is almost nothing special you need to know to start using them: the “go” keyword + channel syntax is all you need. In contrast, async/await in C# is definitely not all you need to learn about async programming there. You need to know about Task / Task<T>, Task.Run and cancellation as a bare minimum. Real-life async programming implies you know about scheduling, .ConfigureAwait(false), how task builders work, how exceptions are handled, when to use ValueTask<T>, etc., i.e. that’s a lot more than in Go.
- Goroutines don’t suffer from the “async all the way” problem. Async-await implies that if you make a chain of function calls (A calls B, B calls C, … Y calls Z), and both A and Z are async functions, B … Y also have to be async functions, otherwise the model won’t work (non-async Y can’t await Z, non-async X can’t await Y, etc., which means they either have to “start and forget” the corresponding async functions, or wait for them synchronously, or become async as well). In contrast, there is no such constraint in Go: you can read from a channel in any function, and no matter what, it’s always an asynchronous operation. That’s actually a big advantage, since you don’t have to plan ahead of time what’s going to be asynchronous. In particular, you can write a query(…) method invoking some query provider to get the result, and this provider can do this either synchronously or asynchronously depending on the implementation, but you, as the author of the query(…) method, don’t have to think about this while you write it.
- Consequently, there is way less overhead associated with async code in Go: “async all the way” means any potentially asynchronous API must be asynchronous in .NET, i.e. you’re expected to have more async tasks created there, more heap allocations, etc.
- This also explains why there is no need for async-await in Go: since any function supports asynchronous wait (on a channel) and can be started concurrently (with “go” syntax, i.e. as a goroutine), any function returning a regular result can run some asynchronous logic inside. All it needs is to start another goroutine, pass it a channel, and await the result on this channel. That’s why most APIs in Go look synchronous, though in reality they are asynchronous. And frankly speaking, this is pretty amazing.
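To make the query(…) point above a bit more concrete, here is a small sketch. Everything in it is hypothetical and exists only for illustration: the `Provider` interface, the two provider types, and the sleep that stands in for slow I/O. The caller is identical for both providers, even though one of them starts a goroutine and waits on a channel internally.

```go
package main

import (
	"fmt"
	"time"
)

// Provider is a hypothetical query backend; its signature is fully synchronous.
type Provider interface {
	Run(q string) string
}

// syncProvider computes the result directly on the calling goroutine.
type syncProvider struct{}

func (syncProvider) Run(q string) string { return "result of " + q }

// asyncProvider does the work in another goroutine and waits on a channel.
type asyncProvider struct{}

func (asyncProvider) Run(q string) string {
	ch := make(chan string)
	go func() {
		time.Sleep(10 * time.Millisecond) // stand-in for slow I/O
		ch <- "result of " + q
	}()
	return <-ch // suspends only this goroutine until the result arrives
}

// query does not know, or care, which kind of provider it was given.
func query(p Provider, q string) string {
	return p.Run(q)
}

func main() {
	fmt.Println(query(syncProvider{}, "select 1"))
	fmt.Println(query(asyncProvider{}, "select 1"))
}
```

In C#, the second provider would typically force Run, query and every caller above them to return a Task<string>.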
Overall, the implementation differences are quite significant, as well as the implications.
There is a good chance I’ll write a more robust / real-life test for async-await vs goroutines some day and discuss it in another post. But since this is definitely an interesting topic, I feel I have to at least reference someone else’s benchmark here. Unfortunately, there aren’t many of these; here is the best micro-benchmark close to real-life scenarios I found so far:
Two year ago I've developed a data ingestion system and I'm planning to migrate it from Windows to Linux and Docker. (stefanprodan.com)
That’s a simple web server benchmark with a middleware deserializing JSON and shooting an HTTP request to an external service. The description is right there; the final result for .NET Core is in comments (check out the whole thread to understand why his original benchmark for .NET was incorrect): https://stefanprodan.com/2016/aspnetcore-vs-golang-data-ingestion-benchmark/#comment-3158140604:
- Go handles ~ 9K requests / second (concurrency level = 100, mean time per request = 15ms). There are multiple results for Go, so it’s actually unclear which one to take — I picked the best one I saw.
- .NET Core handles ~ 8.1K requests / second (concurrency level = 50, mean time per request = 6ms)
- I suspect this test is done on .NET Core 1.0 (based on the date of the post), and .NET Core 1.1 is noticeably faster. The author promised to update the results when .NET Core 2.0 is released.
- As you may find, the results are a bit weird: .NET Core reports shorter response time, though request rate there is lower, + there is a difference in concurrency level. So maybe you can run this test at home and share your findings :)
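One way to sanity-check these figures is Little’s law: for a load test that keeps a fixed number of requests in flight, throughput is roughly the concurrency level divided by the mean response time. The sketch below applies it to the numbers quoted above. Note the assumption: I take it that the load generator keeps exactly “concurrency” requests outstanding and that “mean time per request” is the per-request latency; the `throughput` helper is mine.

```go
package main

import "fmt"

// throughput estimates requests per second for a closed-loop load test
// using Little's law: requests in flight = throughput * mean response time.
func throughput(concurrency int, meanSeconds float64) float64 {
	return float64(concurrency) / meanSeconds
}

func main() {
	// Figures quoted above: concurrency level and mean time per request.
	fmt.Printf("Go:        %.0f req/s\n", throughput(100, 0.015))
	fmt.Printf(".NET Core: %.0f req/s\n", throughput(50, 0.006))
}
```

Under that assumption the .NET Core figures agree with each other (about 8.3K estimated vs ~ 8.1K reported), while the Go figures don’t (about 6.7K estimated vs ~ 9K reported). That may simply mean the Go request rate and mean time come from different runs, so shorter response times at a lower concurrency level are not contradictory by themselves.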
That’s all for today. Note that I am totally not an expert in Go: all the Go code shown here is probably 50% of all the Go code I’ve written so far. So if you’re from the Go camp, you’re definitely welcome to comment on this post; I’ll be happy to extend or edit it based on your feedback.