Rust concurrency: the archetype of a message-passing bug.

11 min readJun 17, 2020

Whenever I have a chance, I extol the virtues of message-passing and event-loops in structuring concurrent workflows in Rust. However, in my wide-eyed enthusiasm, I will make it sound almost as if it cannot go wrong.

So today, let’s go into some details about the archetype of buggy message-passing code.

I think this bug applies to pretty-much all message-passing code(the exception might be using synchronous bounded channels).
Even though I think native threads and Crossbeam channels are easier to work with than their equivalent from the async/futures ecosystem(“what do you mean theimpl Futuretrait bound is not met? It’s the channel from the !^@#% futures crate!” 15 mins later: “ah, ok…”), I think the power of message-passing, and event-loops, is really in how they logically abstract over the underlying concurrency mechanism.
What you need is “something representing a channel/source or events”, and “something representing a thread of execution”, that’s it.
And this bug is essentially about “how to break those”.

This story is based on actual events. In certain cases incidents, characters and timelines have been changed for dramatic purposes. Certain characters may be composites, or entirely fictitious.

The pledge

Ok so first of all, we’re using Rust. So that means we can be totally Fearless™ in our approach to concurrency, like:

By leveraging ownership and type checking, many concurrency errors are compile-time errors in Rust rather than runtime errors.(source: https://doc.rust-lang.org/book/ch16-00-concurrency.html)

While this is true, in the sense that you can’t accidentally share some variable between threads, this doesn’t prevent you from writing buggy concurrent business logic.

So, the Rust compiler “has your back” when the it comes to a certain classes of mistakes, but cannot actually show you how to ensure your logic is something you can reason about, and that shows deterministic outcomes(up to a point).

So then, what can? Well, you can try using locks, although that will probably require pairing them with a condvar, or, if you can, go the “easier” route and use a combination of:

components running an “event-loop”,
channels, for sending and receiving messages(aka events).

The two are intrinsically linked: a component “running an event-loop”, is simply a “thread of execution” that happens to be driven by a loop where messages are received sequentially on one, or several (via a select), channels.

One example of a “thread of execution” that is not a native thread is Chromium’s sequence. I think that one really has a great name since it accurately conveys the concept.

The magic consists of having multiple of such “event-loops” communicate with each other, enabling you to build complicated(yet hopefully as simple as possible) concurrent workflows with deterministic characteristics.

What is this determinism based on? I think it comes down to two things:

Multiple sends from the same “thread of execution” are received in order.
While an “event-loop” is handling a given message, it only operates on data it owns uniquely.

Re 2, it should be noted that combining shared-state with event-loops is usually a recipe for subtle disaster. The reason for that is that while shared-state can be effectively “protected” by a lock, the lock itself doesn’t offer any kind of synchronization with the message-passing. So while it may appear to work, your logic is likely to be fundamentally broken by faulty assumptions about (a non-existent) synchronization between the messages, and the lock(s).
The fix is usually to, for a given workflow, either only use messages, or only use locks(and then pair them with a condvar for signalling).
Finally, when I write “for a given workflow”, this still allows for workflows to be nested to each other, where each “level” can be using different models. For example you could have a top-level workflow involving communicating event-loops and messages, and then one of these components could still internally own nested workflows, and those could then involve say some locks and condvars(example here).

And when does this determinism breakdown? Let’s take a look in the next section.

The turn

Let’s start with an actual real-world example, and then try to turn it into something more abstract.

So, as I was working on this PR in Servo, I was surprised by some intermittent failures of a test that showed up after a change(not the final change in the PR by the way, let’s say it showed up in between commits).

The background for this is the following:

There is a workflow crossing components running independently(previously described here).
The two components are named script, and net.
The goal is for script to stream chunks of data to net, and then shutdown the workflow once done, or if there is an error.
The workflow is “pull-based”, where it is net that requests the next chunk, and then script tries to read it from a source and forward it on.
On the side of script, it involves two threads: one for receiving IPC messages, known as the “IPC router thread”, and another one for running code in a runtime, where “reading data” from the source involves running code in this runtime. Let’s call it the “script thread”.
When reading data from the source, three things can happen: we get a chunk of data, we get a “done” flag, or we get an error.

So before the PR, the “IPC router thread” would simply receive messages, in a kind of “event-loop”, from the net component asking for a chunk of data. The “script thread” would then receive a message from the router thread, and then run some user-supplied code in order to “try to read a chunk”.

When the result of the operation became available, the “script thread” would then directly send the result to net , which would either be a chunk of data, or a “stream is done” signal.

Furthermore, in the “done” case, the “script thread” would, to trigger the “shutdown” of the workflow, first drop it’s sender to net, and then also send a message to the “IPC router thread”, telling it to drop it’s sender to net as well. The idea was that this would then disconnect the corresponding receiver in net, which would then result in dropping the sole sender used to request chunks of data, and this would then disconnect the receiver on the “IPC router thread”.

Wow, what a shutdown workflow.

So this was actually working fine(with some exception), however the problem arose when I added some code to make a distinction between the “done” and the “error” case.

Basically, both “done” and “error” should result in shutdown of the workflow, however the “error” case required first doing something to propagate the error elsewhere in net.

And this is where the “archetype message-passing bug” showed up.

My initial take on propagating the error would see the “script thread” follow the following sequence in the “error” case:

Propagate the error via a message to net.
Shutdown the workflow like before, via a message to the “IPC router thread”, which would then propagate the shutdown of this particular workflow to net.

And the bug manifested itself through a test, the one asserting that the error would propagate, intermittently failing.

This actually took a few println! to figure out, and soon the problem became clear: I was expecting two messages from two different threads to magically synchronize their ordering.

Remember what I wrote above:

Multiple sends from the same “thread of execution” are received in order.

Yep, it doesn't work when you send multiple messages, from multiple threads.

This is very clear when you read it like that, but it can creep-up in business logic undetected until a test starts failing intermittently.

So this is what the “script thread” and the “IPC router thread” were doing:

script sends the “error” message to net, expecting it to be handled and the error to propagate.
script sends the “done” message to the “IPC router thread”, expecting that thread to then send a message to shutdown the workflow to net.

Result: two messages, one “error” and one “shutdown” are sent to net, from two different threads, yet the logic expects “error” to be received, and handled, by net first.

Yes so if using synchronous bounded channels, you wouldn’t run into this problem. However that could also dramatically reduce the throughput of the various components. Topic for another post…
On another note, as was pointed out by matklad, the “shutdown last” part of the problem could be prevented by not sending an explicit “shutdown” message, instead relying on dropping senders to signal shutdown. In this case, regardless of how threads are scheduled, so long as the “script thread” is holding on to it’s sender, and/or has already sent the “error” message, then that message would be guaranteed to be handled at some point, before the “disconnect” signal would travel to the receiver.
However, this is about the “archetype” of message-passing bug, and there are a uncountable situations where you might need to perform operation 1 followed by 2, where 2 is not “shutdown”, and the bug might show-up. When 2 is “shutdown”, you could prevent this class of errors by relying on dropping senders for signalling.
In this particular case, I had to include an explicit shutdown message.

That was a sneaky one, and the reason is that the “script thread” would actually send the messages “in order”, however not to the same thread, and intermittently the “shutdown” message, sent by the “IPC router thread”, would be received before the “error” message sent by the “script thread”.

Even though the “IPC router thread” would only send the “shutdown” message AFTER having received the second message sent by the “script thread”, but “second” message here is meaningless because these are not messages to the same thread, hence no ordering can be expected.

Yep, I know, this is somewhat of a flippant way to write about some code. But almost all message-passing bug can be trace-back to something like that. What makes them sneaky is the complexity of the business logic, which can hide the various fictitious ordering assumptions that are built into it.

Well, glad there are tests to catch these things, however you have to be lucky to have these tests to begin with. If this can sneak into business logic, it also means it can not be covered by tests. Servo is lucky to be able to rely on the shared Web Platform Tests for enabling a broad coverage of various logic…

Next, let’s see how we can bring back some order to this chaos.

The prestige

So the ordering breaks-down, because two messages are sent to two different “thread of execution” running two different “event-loops”.

Actually the HTML spec has a nice concept abstracting over this, the parallel queue:

So this “parallel queue” is essentially the prototype of a “thread of execution” running an “event-loop”, defined in this case as continuously running algorithms steps that have been enqueued to it(similarly to sending a message on a channel, really).

This is useful because it makes it very clear that if you enqueue two sets of steps, to two different parallel queue, they will be run independently and therefore not synchronize their ordering in any way(this is like sending two messages on two channels to two different threads).

Also, if you have two different in-parallel algorithms both enqueuing steps to the same parallel queue, you also cannot know which set of steps will run “first”(this is like having two threads send two messages on the same channel to the same third thread).

You only get ordering guarantees when queuing steps to the same parallel queue, from the same algorithm.

Once again:

Multiple sends from the same “thread of execution” are received in order.

So in our case described earlier, we want to send two messages:

Error(if there is a one),
Stop/Shutdown(always when the workflow is finished, although potential errors should be handled first).

Those two messages need to be handled in that order by net. Yet they are sent by script from two different threads, the “script thread” sends the potential “error”, then it sends the “shutdown” to the second “IPC router thread”, which then, after shutting down it’s local part of the workflow, sends it also to the net component.

But we always want the net component to first handle any potential “error” message.

So, using “parallel queues”, we need to ensure the “error” and “done” steps are enqueued in order, to the same queue, from a single algorithm.

Using Rust channels and some sort of “thread of execution” running an “event-loop”, we need to send two messages on the same channel, from the same thread.

And off-course, that thread can then send further messages to another thread.

So the solution is to use one “thread of execution” running an “event-loop”, ok let’s just call it a “parallel queue”, to serialize the “error” and “stop” operations, and then from that parallel queue we can enqueue further steps onto another parallel queue(the one in net).

Simply stated: we need both messages to go through the IPC router thread, which then would communicate with net. In other words we cannot have the “script thread” send two messages, one to net, and the other to the IPC router thread, where it then sends another message to net, and expect the “script thread -> net” to be handled before the “IPC router thread -> net” message.

So we restructure the operation, where the “script thread” simply either sends an “error” or a “done” message to the “IPC router thread”, and it’s that thread that is then responsible to either:

A. First send the “error” to net, if there was an error, and then shutdown the workflow, or

B. Immediately shutdown the workflow when receiving the “done” message.

It shows up by introducing this enum, modelling A or B:

https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR64

and in the “IPC router thread”, when receiving either the “error” or “done” message from the “script thread”, the appropriate sequence of action is taken like:

https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR136

the stop_reading operation on the IPC router thread will then simply either send the “error” or “done” message, at:

https://github.com/servo/servo/pull/26906/files#diff-ae0ff1fd98b06dfb13baf427cdffc28aR205

And finally, in net, the messages will be handled in a way that propagates a potential error:

https://github.com/servo/servo/pull/26906/files#diff-aa469beb5619907dbccd88364264b9b8R544

So, for the few who haven’t navigated away by now, let’s recap:

We had a bug that was caused by having:

One parallel queue enqueue two set of algorithmic steps onto two different parallel queues, all the while expecting some sort of ordering to be preserved.

and this was fixed by instead having:

One parallel queue enqueue a single set of algorithmic steps onto a single other parallel queue, and having these steps then result in a local state change when run, followed by enqueuing another set of steps onto yet another single parallel queue(which when handled would also result in local state changes).

The result is actual deterministic behavior, where any error will propagate before the shutdown of the streaming workflow.

The end

Ok, I think this article had the highest ratio of lines in English/lines of code, this was really zooming-in into something that could easily be waved away as an irrelevant small bug, and yet I think this actually lies at the basis for almost every bug you encounter with message-passing code.

Discuss on Hacker news.

For the follow-up, where this problem is modelled using TLA+, see the below:

Modelling the archetype of a message-passing bug with TLA+

Two years ago, I tried to describe “the archetype of a message-passing bug”, using both a real-world example in Rust…

medium.com