Async IO in Rust (part III)

Paul Colomiets
13 min read · Jan 3, 2016


This is the third article in a series about the design of Rotor, a library for doing asynchronous IO in Rust. It has been about three months since my first article was published, and my first asynchronous code written in Rust is almost a year old. That code still works in production (in a non-mission-critical role). All of which is to say: it’s time to stop rewriting rotor on every occasion and start writing real production code with it. And here is how.

Rotor Core Library

The intention behind creating “rotor” was the desire to build middleware that allows composing multiple things into a single event loop. This is valuable because almost every application out there uses multiple protocols to communicate with the outside world. And you don’t want to implement the same protocol twice.

Rotor 1.0 will only contain the minimal things needed for composing libraries, and nothing more

If you have followed rotor before, you know it contained an abstraction for streams, which had an opinionated view of what buffering looks like. Many of these things are now in `rotor-stream`, which is just a utility library. You may ignore it, you may use multiple versions of it inside a single main loop, and so on. So the backwards-compatibility concerns are weaker for it.

So What Is in the Core Library?

Here is the minimum that the rotor crate contains:

  1. A base state machine that maps `mio::Handler` actions almost directly to state machine actions
  2. A way to compose two machines into a single one with the same interface (we have two ways: a macro and a generic, see below)
  3. A `mio::Handler` that holds a collection (specifically a Slab) of the state machines
  4. All the necessary boilerplate for communication between state machines (a Future/Port pair, see below)

That’s basically it. Rotor doesn’t even know anything about sockets. You create sockets directly (via mio). You ought to store them in a state machine, but that’s not enforced in any way. You can structure your code however you like.

Let’s reiterate the basics a little bit:

  1. To have multiple state machines in a single event loop, we must differentiate between them. So we use a Slab which stores the state machines, and we differentiate them by the mio token.
  2. Multiple state machine implementations are combined using an enumeration which has a variant for each implementation. On every action, the value is matched and the respective state machine is dispatched. The alternative would be to put trait objects into the Slab, which is possible too (but presumably less performant, so it’s not the default).
  3. You get a timeout event that is dispatched to the correct machine, but you can’t carry a value in a timeout. Carrying a value would be a headache when doing composition. The usual practice is to recheck the time anyway, so it’s not a problem.
  4. Similarly with messages: you get a wake-up event, and it’s your responsibility to check the appropriate Future object. But we have an object that guarantees the message comes to the right receiver.
  5. We have a way to pass a global context to all state machines, where every state machine picks only the traits it needs on the context. I have written about that before.
  6. At this level of abstraction, we must ensure that every error can be handled and no error can make our application panic, even resource exhaustion (like epoll being unable to add another socket). Except, of course, out of memory, which you can’t handle in Rust.

This is all that’s needed for 80% of use cases. A few things are going to be added too, but let’s review how rotor has evolved before discussing the todo list.

Rotor Changes

Here are the current decisions. If you think some of them are bad, please report them to the issue tracker. I’m going to stabilize the rotor core as quickly as possible, so we can build applications without fear that they will break tomorrow.

First, let’s look at the `rotor::Machine` trait:
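In sketch form (the exact signatures are approximate, reconstructed from the description below), it looks like this:

```rust
// Rough sketch only; see the rotor docs for the exact definition.
// Scope and Response come from rotor, EventSet from mio.
pub trait Machine: Sized {
    type Context;
    type Seed;

    // Create a machine from a Seed (e.g. a freshly accepted socket)
    fn create(seed: Self::Seed, scope: &mut Scope<Self::Context>)
        -> Result<Self, Box<Error>>;
    // A socket registered by this machine became ready
    fn ready(self, events: EventSet, scope: &mut Scope<Self::Context>)
        -> Response<Self, Self::Seed>;
    // The machine has just been inserted into the main loop
    fn spawned(self, scope: &mut Scope<Self::Context>)
        -> Response<Self, Self::Seed>;
    // Creating a machine from the Seed we returned has failed
    fn spawn_error(self, scope: &mut Scope<Self::Context>,
                   error: Box<Error>) -> Option<Self>;
    // A timeout registered through the Scope has elapsed
    fn timeout(self, scope: &mut Scope<Self::Context>)
        -> Response<Self, Self::Seed>;
    // A notification arrived; check your Future objects
    fn wakeup(self, scope: &mut Scope<Self::Context>)
        -> Response<Self, Self::Seed>;
}
```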

The methods `ready`, `timeout` and `wakeup` should be obvious from the text above.

The triple of `spawned`, `spawn_error` and `create` methods is in charge of creating new state machines and is needed for accepting sockets. Since there is an `Accept` abstraction in `rotor-stream`, most users don’t need to deal with these methods directly. Details will be covered in the documentation. The important thing is that we are trying to cover every possible failure mode here. This is important for a future-proof implementation, even if `rotor_stream::Accept` doesn’t handle all errors well.

Every event handler receives `self` by value and a mutable `Scope` object. And every event handler returns some complex value that contains a state machine and maybe some other data. More on return values later.

Scope Type

If you have read my previous articles, you might have noticed that Scope had been dropped in favor of a richer return value. After writing more code, it became clear that it’s impossible to handle everything through the return value. The crucial case is creating a “Future” object, which allows sending a message to this state machine from another one.

Technical details: to send a message to a state machine, we send a notification to the main loop (that is a functionality of mio). The notification contains a `mio::Token` that identifies the state machine. The Scope internally contains the token, but that’s just an implementation detail of the current mio event loop and rotor core. We may change it in the future.

Currently, the Scope object is a thing that:

  1. Is passed to every function call of a state machine as an argument
  2. Contains a mutable reference to the Context (see the first article for details about the context)
  3. Has methods to register a socket and/or a timeout within the event loop (the event is always delivered to this state machine)
  4. Allows creating a Future/Port pair, which is how messages are sent between state machines (`Scope::create_future()`, see the sketch below)
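For illustration, here is a hypothetical sketch of that flow (only `create_future` is named above; the `put` and `consume` methods and the field names are made up for the example):

```rust
// Hypothetical sketch; `put`, `consume` and the surrounding names are
// illustrative, not the real rotor API.
let (future, port) = scope.create_future::<String>();
self.reply = Some(future);   // keep the Future inside this machine
// ... hand `port` over to some other state machine, which later calls:
//     port.put(String::from("hello"));   // wakes this machine up
// and then, in this machine's wakeup() handler:
//     if let Some(msg) = self.reply.as_mut().and_then(|f| f.consume()) {
//         /* the message is guaranteed to arrive at this machine */
//     }
```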

To give you some feeling of the code, here is how `rotor_stream::Stream` is initialized:
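Roughly (a reconstruction, not the verbatim source), the interesting part is the registration call, using mio 0.5-era names:

```rust
// Register the socket once, edge-triggered, for both read and write
// readiness (a reconstruction; the real code does a bit more).
try!(scope.register(&socket,
                    EventSet::readable() | EventSet::writable(),
                    PollOpt::edge()));
```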

It registers the socket in edge-triggered mode to skip re-registering on every change of internal state. But since the same Scope is received in every action handler, it could register in level-triggered mode and modify the registration on every occasion.

And yes, unlike in the first versions of rotor, the Scope is now a single type (generic over a Context). Mapping the scope to a different object at each level of abstraction was a nightmare.

Async Result

In the previous article, there was also a generic `Async` object, which was going to represent the result of an asynchronous operation. It turns out that it’s either useless or just hard to figure out what it should look like. The core rotor has a similar Response object, which has three constructors:
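In sketch form (M is the state machine type, N its Seed):

```rust
// The three constructors, sketched:
Response::ok(machine)           // continue with this machine
Response::done()                // this machine is finished, remove it
Response::spawn(machine, seed)  // continue, and create another machine
```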

The `ok` constructor is similar to `Result::Ok` and returns a state machine to the main loop. The `done` constructor is similar to `Option::None`: it means this state machine has done its work and should be removed. And finally, the `spawn` constructor means “this state machine is in state A, and please create another one from B”.

The `rotor::Response` type is not an enum but has a private internal structure, solely for future compatibility. Probably some Async type will be invented in the future, or another state added. Although most likely it will be frozen and become a public enum.

The `rotor::Machine::Seed` type (the N in Response) is a type that may be used to create a state machine. While you could send a state machine to the loop (as in previous versions of rotor), it would get there uninitialized. So instead of adding another state to every machine, we have a separate “initializer” or “seed” type. The machine is created with the method `create(seed, scope)`, so when it is put into the slab it’s already initialized and registered in the main loop (using the scope).

It seems other protocols need different return values. Here are some types used in the rotor ecosystem so far (the enums are simplified a bit):
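These are reconstructed sketches; the exact variants may differ from the released crates:

```rust
// rotor-stream (simplified): every action returns the machine, the
// next condition to wake up on, and a deadline for that condition.
type Request<M> = Option<(M, Expectation, Deadline)>;

enum Expectation {
    Bytes(usize),                     // wake up when n input bytes arrive
    Delimiter(&'static [u8], usize),  // wake up on a delimiter (with a limit)
    Flush(usize),                     // wake up when output is flushed
    Sleep,                            // just wait for a timeout or wakeup
}

// rotor-http, server side (simplified): decided right after the
// request headers are parsed.
enum RecvMode {
    Buffered(usize),     // buffer the whole request body, up to a limit
    Progressive(usize),  // hand body chunks over as they arrive
}
```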

The stream protocol is a bit verbose. On each action you return the state machine itself, the next condition you expect to be called on, and the maximum time that condition may be waited for. This may look complex, but it helps keep protocol parsers super simple. Like this:
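The original snippet is lost, so here is a hypothetical fragment in the same spirit: an echo protocol handler that consumes one line, writes it back, and asks for the next one (signatures approximate, and `deadline_in` is a made-up helper):

```rust
// We previously asked for Expectation::Delimiter(b"\n", ...), so this
// fires once a whole line sits in the input buffer; `end` is the
// position of the delimiter.
fn bytes_read(self, transport: &mut Transport<TcpStream>, end: usize,
              scope: &mut Scope<Context>) -> Request<Self>
{
    let line = transport.input()[..end + 1].to_vec();  // copy the line out
    transport.output().extend(&line);                  // echo it back
    transport.input().consume(end + 1);                // drop it from input
    // Same machine; expect the next line (up to 4096 bytes) within 30 sec.
    Some((self, Expectation::Delimiter(b"\n", 4096), deadline_in(scope, 30)))
}
```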

The HTTP protocol outlined above shows that return values are sometimes different for different actions of the same protocol. When you receive request headers, you can already decide whether you are going to buffer the whole request in memory or read it chunk by chunk. Also, you can put a timeout on the whole request processing and then just handle each small operation. If the timeout fires and you didn’t anticipate that the request would be so large, you can adjust it by returning a new deadline value from the timeout handler.

In both cases, returning `None` means this state machine has nothing more to do. And in both cases, an Option return value is enough.

Mutable Arguments

Early in the development of rotor, we opted to treat state machines as immutable things passed by value. And it turned out to be a great decision. Still, you might notice mutable things on occasion.

The most used one is the `Scope`: it allows you to “mutate” the event loop by registering sockets and timeouts. Also, the Scope contains a mutable reference to a Context, which is our way to mutate global state if your application has any.

The rotor-stream crate has a `Transport` abstraction, which allows you to inspect the input and output buffers. You can write output and consume input in any event handler. While in theory a user can mess up the buffers, most invariants are handled by rotor_stream anyway. For example, if you leave something in the output buffer, it will eventually be flushed to the network. If you require some bytes of input and they are already in the input buffer, your handler will be dispatched immediately. If you don’t consume bytes and require them again, your handler will be called continuously, consuming 100% CPU, so you will notice it quickly and the fix will be easy.

On the other hand, a mutable view of the buffers simplifies the return value and allows many interesting optimizations. For example, you write to a contiguous buffer instead of returning data in small chunks (each allocated on the heap). As another example, HTTP keeps small chunks in the buffer without consuming them, to be able to push larger chunks to the upper abstraction layers without copying.

Another example is the `Response` object in rotor-http. It allows you to write the response progressively: setting the response status first, then adding headers, and finally writing the body, possibly in a few chunks too:
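Roughly like this (reconstructed from rotor-http’s hello-world example of the time, so the exact method names may differ slightly):

```rust
// `res` is the mutable Response object passed into the handler.
let body = b"Hello World!";
res.status(200, "OK");
res.add_header("Content-Type", "text/plain");
res.add_length(body.len() as u64);
res.done_headers();
res.write_body(body);
res.done();
```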

Because each event handler gets a mutable response object, you are free to call the specific methods at any time. The response object encapsulates a mutable view of an output buffer, and everything is written to the buffer directly instead of being kept in inefficient heap structures like a hash map of headers. Additionally, the Response object has an internal state machine that prevents errors like writing things out of order, and even guards the user against hard errors like sending Content-Length twice.

It turns out that combining state machines passed by value and carefully designed mutable views is what makes asynchronous programming simple and efficient.

Composition

Composition was the key point in the development of rotor.
There are two kinds of composition:

  1. Horizontal, where you put together multiple implementations on the same protocol level
  2. Vertical, where you stack protocols on top of each other

The vertical layers do not need any boilerplate for composition; they stack on each other by design.

The first kind of composition is only implemented for top-level state machines (`rotor::Machine`). There are two options: either the `rotor::Compose2` type (where 2 is the number of machines combined) or the `rotor_compose!()` macro. The macro gives you better names for each child state machine, for example:
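A hypothetical invocation (the exact macro syntax may differ; `HttpServer` and `DnsResolver` are placeholder machine types):

```rust
rotor_compose!(pub enum Fsm/Seed <Context> {
    Http(HttpServer),
    Dns(DnsResolver),
});
```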

This creates an enum with two variants, each embedding a state machine of some type (and also an enum for composing the Seeds of the machines). When using the `Compose2` type, you just get `A` and `B` variants.

Another interesting case that will require more investigation is collecting multiple child state machines at some middle layer. E.g. for HTTP/2 (and perhaps any other multiplexed protocol) you need to keep multiple request state machines for a single connection. At a glance, you just need a vector of state machines inside the connection handler.

The Todo List

It looks like we have the smallest possible core of rotor. But a few things are still missing to make it 1.0:

1. Global timeouts and messages
2. A way to quit the main loop
3. Interchangeable main loops

The distinguishing feature of a global action (either a timeout or a notification message) is that it can inspect multiple state machines. E.g. it can scan a pool of client connections and shut down inactive ones, or do similar maintenance tasks. Currently, every state machine can only access its own state and the Context (which doesn’t include the Slab of other state machines; that’s impossible because of the Rust borrow checker).

Hopefully, #1 and #2 will not change the already existing APIs. But #3 may change a lot. The very tempting idea is to run the same code both in the epoll-based event loop and in a userspace TCP stack without losing performance.

There is a lot more work to do in rotor-stream, but since you can combine multiple versions of rotor-stream in an application, it’s not so crucial. The most interesting things that should be implemented in rotor-stream are the `sendfile` optimization and using memory-mapped files for buffers.

The Benchmarks

Note that the micro-benchmarks below are totally unscientific. But people like benchmarks, so I post them here not to show that rotor-http is the fastest, but to show that there isn’t something totally wrong going on. All tests are run on a desktop-grade i7-4790K (4.00GHz), Ubuntu Trusty, kernel 3.16.0. Code is compiled with Rust 1.5.

Hyper, current master (a4230eb), with the default number of threads (10):

```
> wrk -c 400 -d 60 -t2 http://localhost:1337
Running 1m test @ http://localhost:1337
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.08us   63.19us   26.92ms   99.74%
    Req/Sec   150.89k   102.84k   275.90k    47.92%
  18017291 requests in 1.00m, 1.51GB read
Requests/sec: 300267.47
Transfer/sec:     25.77MB
```

Hyper with 50 threads:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:1337
Running 1m test @ http://localhost:1337
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    98.77us  340.63us   51.77ms   99.13%
    Req/Sec   203.51k    19.43k   246.10k    76.33%
  24300108 requests in 1.00m, 2.04GB read
Requests/sec: 404919.33
Transfer/sec:     34.75MB
```

Rotor, single-threaded (the hello_world_server example from rotor-http):

```
> wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.20ms  606.03us  198.18ms   90.77%
    Req/Sec    91.00k    12.71k   188.25k    88.16%
  10784514 requests in 1.00m, 524.53MB read
Requests/sec: 179722.03
Transfer/sec:      8.74MB
```

Rotor, 2 threads (“threaded” example from rotor-http, 6a55b79):

```
> wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.02ms  524.95us  202.31ms   91.49%
    Req/Sec   196.18k    24.94k   264.13k    64.11%
  23443997 requests in 1.00m, 1.11GB read
Requests/sec: 390084.90
Transfer/sec:     18.97MB
```

Rotor, 10 threads:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.85ms    2.33ms  205.30ms   95.56%
    Req/Sec   245.43k    30.84k   384.37k    80.10%
  29085176 requests in 1.00m, 1.38GB read
Requests/sec: 484700.17
Transfer/sec:     23.57MB
```

Here is nginx 1.4.6 on the same machine, with 2 workers:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.99ms   14.02ms  852.80ms   89.76%
    Req/Sec   191.65k    24.34k   265.37k    86.25%
  22882230 requests in 1.00m, 5.20GB read
Requests/sec: 381356.17
Transfer/sec:     88.72MB
```

Nginx with 10 workers:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   570.06us    1.63ms  200.91ms   98.12%
    Req/Sec   246.49k    46.93k   321.76k    58.58%
  29424576 requests in 1.00m, 6.69GB read
Requests/sec: 490386.32
Transfer/sec:    114.09MB
```

Note that nginx serves `empty_gif` instead of just hello world and sends more headers. That’s why it transfers many more bytes.

Here is a similar Go example, with Go 1.5 and GOMAXPROCS=1:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.98ms    1.21ms  202.91ms   77.23%
    Req/Sec    50.37k     2.02k    65.77k    92.83%
  6014951 requests in 1.00m, 734.25MB read
Requests/sec: 100242.23
Transfer/sec:     12.24MB
```

Go with GOMAXPROCS=2:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.49ms    1.14ms   33.77ms   80.48%
    Req/Sec    82.23k    12.61k    96.63k    77.92%
  9815339 requests in 1.00m, 1.17GB read
Requests/sec: 163500.80
Transfer/sec:     19.96MB
```

Go with GOMAXPROCS=10:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.01ms    2.44ms  204.44ms   89.92%
    Req/Sec   122.10k     8.51k   214.50k    74.56%
  14589585 requests in 1.00m, 1.74GB read
Requests/sec: 242809.56
Transfer/sec:     29.64MB
```

And I was curious enough to also test the fasthttp implementation for Go, which is advertised as a potential 10x speedup compared to “net/http”. Let’s see. GOMAXPROCS=1:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.92ms    1.65ms  201.72ms   99.78%
    Req/Sec   104.89k     7.03k   126.57k    89.33%
  12526574 requests in 1.00m, 1.73GB read
Requests/sec: 208770.50
Transfer/sec:     29.47MB
```

Go, fasthttp with GOMAXPROCS=2:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.13ms  353.32us    5.93ms   76.20%
    Req/Sec   176.42k    31.72k   228.52k    60.58%
  21063512 requests in 1.00m, 2.90GB read
Requests/sec: 351034.14
Transfer/sec:     49.55MB
```

With GOMAXPROCS=10:

```
> time ./wrk -c 400 -d 60 -t2 http://localhost:3000
Running 1m test @ http://localhost:3000
  2 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   537.22us  719.53us   44.44ms   96.91%
    Req/Sec   235.20k    36.37k   333.38k    72.50%
  28082763 requests in 1.00m, 3.87GB read
Requests/sec: 467981.18
Transfer/sec:     66.05MB
```

Note: in all tests where the target application has more than 2 threads, wrk runs on the same machine, so it competes for resources (there are only 4 cores/8 hyperthreads).

Sorry, no graphs here, because the tests aren’t too good. But the conclusion is that the asynchronous implementation looks faster than Go’s default HTTP implementation. I’m not sure which optimizations allow fasthttp to be faster on a single thread (perhaps buffer pooling and avoiding header parsing for a hello-world example), but even on two threads the optimizations don’t help. And nginx is slightly faster (even in a non-fair test), but I don’t think it’s significant.

Comparing rotor-http to hyper is hard. While it looks like rotor shows some performance improvement, the test is too sensitive to the number of threads used, and that number is probably only meaningful for this specific microbenchmark.

Note that I have not profiled rotor yet, so I may be missing some important performance regressions. Also, the more interesting benchmarks are about memory and CPU usage with a large number of keep-alive connections or partially received responses. But I have not tested those more complex scenarios yet.

Wrapping Up

At the time of writing, “rotor” has not been touched for about two weeks, while most of the work on rotor-stream and rotor-http happened during this time. This probably means that the basics are pretty okay, and that you can try sufficiently different approaches for building on top of rotor.

Additionally, there is an almost full implementation of the HTTP protocol, and we have verified that there are neither performance issues nor important obstacles to building applications. There are carefully chosen trade-offs, though, which you can avoid if you don’t build on top of rotor-stream.

Now it’s time to write more documentation and tests for the implementation, as well as to make a few more protocols, in particular DNS and an HTTP client. Perhaps they offer different challenges compared to the server implementation.

Update: see the discussions on reddit (/r/golang) and Hacker News.
