I had the chance to discuss coroutines with some folks at work and some friends online, talk about their merits, use cases, pros and cons vs other programming and execution models and schemes, and I thought it ‘d be nice to write about it here, least I have to repeat all that again when this topic comes up.
The tl;dr is, if you care for performance(throughput, latency) and your thread(s) are executing tasks/jobs that may be long-running and/or block, and if you appreciate having a programming model that makes sense, then you need to use co-routines.
For the sake of this discussion, let’s assume you are building a simple key-value “nosql” data store. Here are some benefits you could get from employing coroutines instead of other execution models.
Context switching is expensive. Potentially, very expensive. You also want to reduce lock contention to absolute minimum, and you also want to minimize sharing across OS threads. To that end, you want to make optimal use of your threads; keep them busy, try to reduce if not eliminate moving data across them via queues and other means.
Let’s say you have 1 or more threads(typically and optimally, as many as your CPU cores — consider pinning each thread to a single core) and each of them accepts network connections, parse requests and execute them to completion, and eventually responds back to the client.
Let’s then assume that 100 requests/second on average are coming in, and as you parse each of them, you execute them immediately. If the hypothetical dataset managed is small in terms of size and it can fit all in RAM, then those requests will likely not have to block because they will hit the kernel FS/block cache (i.e the OS schedule will not need to put the thread to sleep until a filesystem operation can be completed so that the scheduler can resume the thread again).
Even in this ideal scenario, a request can require a lot of processing time — maybe it needs to sort and merge and post-process data, or otherwise take more than, say, 100ms to execute. So if on average a request takes 100ms, a single thread can execute 10 requests/second. Now, let’s explain what this means.
- You don’t have predictive service and response times. If any single request takes, 500ms, even if all other take 50ms, then all other ‘ready’ requests waiting for execution will stall and, in turn, your application will become slow. Again, it only takes one ‘struggler’ to destroy performance.
- If any single request executes an IO operation that needs to block, the OS scheduler will put your OS thread to sleep until the operation completes. This can take a few microseconds, but can also take many milliseconds, or maybe over a second. You can’t know in advance. In the mean time, all your requests, you guessed it, stall and, again, your application(s) that depend on their response stall and become slow.
Here is how we would like this to work instead:
- No matter the frequency of incoming requests and however long each may take to process to completion, we want to be fair to all of them. If a single request takes 1 second, or 10 seconds, that’s fine — the client likely expects it to be slower given the kind of work it has to do, but don’t penalize every other request. We want to give all requests a chance to proceed with their execution. That way, every request will complete in time proportional to the the kind of work it has to do and how busy the thread is, not based on the aggregate processing time of all other requests that were dequeued before it, until it was given a chance to run.
- We want to be able to know if an I/O op (read, write, etc) is going to block, and if so, do what we can to move that operation off this thread, and yield control to another ready request, until that expensive I/O operation has been executed on a different thread, at which time, the paused request will get a chance to resume execution. That is, reduce the likelihood of a blocking operation stalling all requests.
In our hypothetical datastore, a thread may need to run some system task (“background task”) from time to time. It may need to compact some files, run a GC, generate some reports. Those are usually very long running tasks, and because they are not “online”, we don’t care very much about their running time. That is to say, our priority is processing the clients requests, and when we get the chance, use the thread to process any background tasks.
In other words, we need to support priorities based execution. Higher priorities requests/tasks/jobs should be executed before any lower priority ones. We don’t mind if our compaction task, which could take 10 minutes to execute to completion uninterrupted, takes 25 instead, if we get to process our ready client requests and none of them stall. That’s a great compromise.
If you choose not to use coroutines but want to approximate a similar programming model, then you need to use FSMs or chain tasks together, where each task maintains its own state and you somehow, manually and explicitly, maintain ‘progress’ pointers. In the case of chained tasks, you will need a different object/value state for each possible ‘continuation’ too. Also, supporting yielding from anywhere except the top-level (arbitrary call-stack depth) won’t work, unless you want to spend a lot of time building a very complex trampoline system that won’t work for all cases anyway. If this all sounds like something you wouldn’t want to do, it’s because its something you wouldn’t want to do.
Contrast that with a typical coroutines model. You can yield from any call-stack depth. You don’t need to maintain state, or partition execution into different objects that then you can chain together(one executes the other on completion — chained continuations). You don’t need a very fancy scheduler either. All you need is to yield explicitly when you want(cooperative multitasking). In fact, you can extend the basic yield semantics to support await (yield until another coroutine you created completes and provides you back with a result you need) and other more sophisticated functions.
Obviously, there are some downsides to using coroutines. But, for most kind of applications/services you shouldn’t be concerned with them, and for this kind of hypothetical service, you have no real choice anyway, considering what was discussed earlier — if you care for throughput, latency and performance, not to mention readability and productivity.
There is always going to be some overhead. Saving and restoring registers is cheap, but it’s not free. What’s perhaps more important is that you will likely get an increased L-x cache miss rate, because you will be jumping from one coroutine to another which will likely be accessing different memory regions. But you can minimize that with clever and responsible yielding and scheduling policies.
All told, the overhead will be negligible, but it won’t be zero. I personally think its very much worth it, given the benefits outlined earlier.
I have had the chance in the past to implement coroutines abstractions based on FSMs and tasks chaining. See High Performance Services using Coroutines and CloudDS — another datastore. Neither programming model felt right and the overhead from chaining is actually very high, all things considering. I always wanted to use coroutines instead, and so I had the chance to write a very light-weight coroutines/fibers framework this week (an alternative implementation to the makecontext/getcontext/swapcontext POSIX APIs) for x86–64 in assembly, where only the absolute minimum registers are saved and restored — libc’s implementation incurs the cost of two system calls for saving and restoring signals mask state and also track floating point registers that you probably don’t want to track.
I then built a simpler scheduler on top of it, and, putting it all together, made such a huge difference in a ‘toy’ service I wrote to compare it with other services I worked on in the past, based on other programming models.
Coroutines use in the wild
The lovely folks at DataStax are considering a similar execution model for an upcoming Cassandra release, which is great news for their users, as I am confident it will great improve both throughput and latency.
Tarantool, a very fast “nosql” datastore uses Lua, non-blocking I/O and fibers. It uses Marc Lehmann’s libcoro. See “implementations” for links. They built fibers and a scheduler on top of the coros. You can see how they did that in their codebase.
RethinkDB uses coroutines in order to avoid OS thread blocking. Read how this works and how they improved their coroutines implementation performance.
Microsoft’s Orleans framework, which makes it easy to build high performance and scalable computing applications, is based on actors. The runtime runs 1 OS thread per CPU core, and many actors are managed by each thread.
Actors yield when they start an I/O, which gives other ready actors a chance to run — execution resumes when I/O completes. No blocking calls are allowed. That’s how it achieves high throughput and fair scheduling. Information provided by Sergey Bykov and Orlean’s documentation.
.NET’s TPL with its async/await runtime support makes implementing coroutines very easy.
Naughty Dogs’s PS4 engine used for The Last of Us Remastered edition (and likely, all other newer games) is based on fibers and cooperative multitasking. Watch the presentation (“Parallelizing the Naughty Dog Engine Using Fibers”) by Christian Gyrling.
ScyllaDB is a NoSQL column store that implements Apache Cassandra’s semantics and interfaces. It is extremely fast — according to their claims. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. All operations are expressed as chained continuations, based on future and promises, and the scheduler of each OS thread can suspend and resume processing, yielding very high performance. Read more about it here.
Resumable Functions in C++17
If you are building services, or care for concurrent and parallel processing, consider coroutines. If you a C++ programmer, C++17 will introduce to http://blogs.msdn.com/b/vcblog/archive/2014/11/12/resumable-functions-in-c.aspx which will make so much easier to implement coroutines in C++.
Read the draft, where it describes the semantics, differences between stackless and staful coroutines, the conceptual model, the core resumable function object abstraction, and more.
I truly believe this will be a game-changer. Here are the design goals ( from the the linked draft):
- Highly scalable to billions of concurrent coroutines
- Highly efficient resume and suspend operations comparable in cost to a function call overhead
- Seamless interaction with existing facilities with no overhead
- Open ended coroutine machinery allowing library designers to develop coroutine libraries exposing various high-level semantics, such as generators, goroutines, tasks and more
- Usable in environments where exceptions are forbidden or not available
Well, it turns out, C++ heads adopted “coroutines” instead of “resumable functions” (I think that’s a good idea). You should watch James McNellis’s “An Introduction to C++ Coroutines” ( https://www.youtube.com/watch?v=YYtzQ355_Co ) for more information. Like they did with the design and implementation of C++ lambdas, it seems they are going to get coroutines right as well.
The most elegant and complete open source coroutines implementation I have come across (just 2 files, offering a very simple 3 functions long API) is Marc Lehmann’s libcoro. If you don’t want to roll your own and/or need to support many different OS flavors and platforms, you should definitely use this great library.
Facebook’s folly library includes support for fibers, based on boost context ( which is used for the lower level context switching).
Facebook’s HHVM supports asynchronous operations based on Folly’s coroutines implementation directly as part of the language and runtime.
See also my implementation on GitHub. It supports priorities scheduling and it is tuned for efficiency.
Futures, Promises and Continuations
Futures and promises ( and chained continuations based on them ) is an alternative to use of coroutines for concurrency and parallelism. It is a very simple but very expressive and efficient programming pattern, where value providers(promises) return futures, that will eventually provide a value back to the consumers(callees). The value may or may not be immediately available.
The callee invokes a get() or getValue() or other such similarly named method on the future to obtain the value. If the value is not readily available, then the thread will block until it is made available. Alternatively, a context switch to another ‘ready’ execution context(coroutine), replacing the blocked on the future value context, can be used. The Seastar Project futures implementation will do just that. If the value is not available, it will switch to a new coroutine. Other implementations may just block, or do other interesting/useful things.
What really makes futures shine - other the very simple but powerful provider/consumer model - is how they can be used for continuations. Most future implementations provide a `then` method, which accepts usually a lambda. That lambda is executed once the value has been made available, and also return a future. In turn, you can add another continuation using then on that returned future, and so on, so forth. See Futures for C++ at Facebook for how this works in practice, the documentation of futures and promises for the seastar project as well as CPPReference’s pages on futures and promises.
You can use STL’s std::promise and std::function already, and there are lots of STL classes and functions that use it, but it doesn’t yet provide a then method for continuations(it is coming). However, you can and should consider using Facebook’s or Seastar’s implementations.