High performance services using coroutines
I have long been concerned with figuring out the optimal (because there is no perfect way to do this) strategy to accepting requests and scheduling them in multiple OS threads, so that the latency is reduced to absolute minimum(that is, as close as it would take to execute a single request running sequentially on a dedicated to it, single thread, which wouldn’t block).
Before going any further, please read Jeff Darcy’s “Server Design” blog post. He has some great advice; he lists what he calls the ‘Four Horsemen of Poor Performance’:
- Data copies
- Context Switches
- Memory allocation
- Lock contention
The real problem is (2) and (4), because (1) and (3) are straight-forward to address; use reference counters(you may need auto-release pools for delaying free/delete, RCU or hazard pointers to avoid some edge cases with that), and pre-allocate(or allocate in bulk) and re-use for (3).
It’s all about minimizing context switches and lock contention. I have tried all kind of different strategies/designs+implementations in the past, and studied even more, and I am still not convinced any of them got it right. Or, that any of my implementations was even close to optimal, either. But it’s a goal worth pursuing, and this this here post is about my latest ideas and experiments.
One of our major infrastructure services is CloudDS, our distributed high-performance datastore. Each CloudDS process spawns a few threads, most of them being worker threads. There is a single network I/O thread, which is for accepting connections and multiplexing I/O.
Each incoming request is scheduled by a scheduler based on golang’s scheduler (heavily modified, etc). I have tried all kinds of different scheduling algorithms and this worked the best, although I am still working on a better alternative.
So, when a request arrives, a new task is created for that request and scheduled by the scheduler. It’s picked up by an idle worker thread, and, depending on the request type, it may spawn more tasks, one or more of them depending on other tasks (that is, once all dependent tasks complete, the task that depends on them is scheduled), and so on, so forth. This allows for arbitrary complex dependencies chains, and for great parallelism (for many requests, multiple tasks are created, each operating on a sub-set of the data required to fulfill the request). This is nice, and keeps all worker threads busy at all times, but comes at a heavy cost.
Scheduling is not free. For our scheduler, scheduling is lock-free on the fast-path, but it’s still expensive, and contention is an issue. There are per-thread queues (work-stealing queues) and so consumption(from worker-threads) is mostly contention-free. But, you guessed, it’s still expensive. By expensive, I mean it’s almost ‘free’ compared to other similar scheduling alternatives, but overhead is high enough to impact response times, sometimes even higher than the actual processing of the request itself. This is further compounded by the fact that responses to the client or requests/responses to other peers have to be routed via that single network I/O thread. Moving away from a single network I/O thread to many is trivial, and planned, but it still won’t be optimal.
No, the solution I propose and will be implemented is what I call ‘optimistic sequential execution’, based on my implementation of stackless co-routines. It works like this:
There are multiple threads (e.g as many as logical CPU cores), and each does its own network I/O and processing of the request, as long as the request _never_ blocks. Because it really comes down to that. Avoiding stalls.
If you just execute all incoming requests and you are dealing with disk I/O, then a read that can’t be satisfied from the kernel cache will block the thread until the data is paged in. Any delay, especially when blocked by the kernel, will impact all current requests processing and will delay processing network I/O on that thread. Imagine this happening for many requests/second. Crazy-high latency is just around the corner.
Most CloudDS requests, depending on their type and scope/breadth of data they access, are in sub-milisecond range(this includes coordinating with peers, because CloudDS is distributed and usually relies on quorum consistency semantics).
This is not high, but if you consider that over 40% of that time (on average) is due to context switches and scheduling overhead, then you can see why this is such a major issue for us. Not because we are greedy or we are realistically need responses that take less than say, 0.4ms, instead of 08. - 0.9ms, but because the reason they take that much longer is the ‘wrong’ kind of reason.
Here is a simplified description of how this would work.
Each thread accepts connections and manages network I/O from all of them. Each new request is encapsulated into a new coroutine (from a coroutines free-list, no memory allocation necessary).
The coroutine is scheduled in the thread-local coros scheduler. If while processing a request(coro), it needs to read() and therefore potentially block, it will instead determine if pages are resident (VMAs) using mincore(). This is very cheap. It takes about 100micros for a 32MBytes span, 50us for 16MBytes, and so on. If all that’s required to be read is in the kernel cache, all is well, request will proceed executing the read and use that data. If not, the coroutine will yield, and at the same time, be scheduled(migrated) to another set of threads responsible for the ‘slow’ requests(ones that need to block). Those will just schedule the coro in their own scheduler, and perform the read again and continue.
You can’t realistically avoid blocking, so you want to keep an accepted request/coroutine in its own thread, and provide all corourines a fair chance at completion(by yielding to the scheduler often enough), and move all slow ones to a set of threads dedicated to serving them. That’s the core idea of optimistic sequential execution.
By the way, it takes about 8250 microseconds to read 16MBs if data is resident in the kernel cache. If it’s not, it can take(depending on the medium, current i/o utilization, disk controller, etc, etc) 65,000us (from what I have found benchmarking this operation under various conditions and environments). Accessing that data via an mmap()ed region takes slightly less (12kus and 55kus on average, respectively). Imagine having a thread processing 1000 reqs/second and half of them requiring cold-data access. That’s the very definition of slow.
Alternatively, we could just yield the coro, waiting for a new special coroutine that would run in a background thread just executing readahead() and then re-scheduling(making runnable) the original coro, waiting for readahead. But that is slightly more expensive. Also, coros may often yield to other runnable coros, giving a chance to other coros to complete sooner. If you have a coro that requires a lot of processing to complete, yielding often so that other coroutines get a fair chance to complete sooner relative to processing effort requires is great. That’s also important in order to avoid latency spikes.
It would have been great if AIO on Linux wasn’t broken, but it is. The new readv2() syscall will help a lot, but it still hasn’t been accepted in the mainline kernel.
It turns out, there is no need to dispatch the final client response from the thread that accepted the connection; you can modify the epoll state and write() to the connection from another thread (see AeroSpike implementation). You need to be careful though about how you are going to deal with EAGAIN/EINTR errors before you can send the response. Its not a big deal, but needs to be considered.
This gives us almost perfect throughput and threads utilization. Coroutines are very cheap (a few bytes overhead) and very fast to schedule (also, thanks to free lists, there is no memory allocation required to create them). Because you can yield from them, or yield waiting for another coro to complete (as well as other niceties), it reduces the need for state objects; multiple structs/classes instances that may depend on other such instances, which is tricky and error prone as a programming model.
I built a prototype/service this week, a simple HTTP server that encapsulates new requests into coroutines and uses mincore() and readahead() on background threads to avoid blocking the network I/O+worker threads and I got some better than expected results. It’s also quite beautiful, decomposing requests and processing steps into coroutines, yielding and waiting for other coroutines seems like a natural way to solve this problem. It is almost as fast as our special-purpose httpsrv HTTP server,(about 1% slower due to coroutines scheduling and processing overhead, which is negligible), but without the potential for latency spikes from reading cold data.
More info and code coming soon.