In my last article, I wrote among others things about the benefits of using native threads versus async tasks.
This article will zoom-in a little further on the topic, highlighting two observations to using async/await in a system whose goals include achieving some parallelism.
So, if you make the conscious choice of going for concurrency without any parallelism, say, by using a single-threaded async runtime, then the below article will not apply to your system. Many systems written in Rust and using async/await do however try to achieve parallelism.
Problem 1: concurrency is not parallelism(and “awaiting” is not “ensuring throughput”).
Let’s take a piece of code:
Now, what I’m about to say is going to sound obvious:
Any of those calls to
await blocks(or rather “yields from”) the current task, until the result of the call becomes available.
There is no parallelism to this!
Yes, while the task “logically blocks”, the executor might be able to run another task on the thread, which doesn’t block. But that’s assuming you wrote the code in such a way that there is something to run to begin with.
Ok, so now you’re thinking: “who told you there would be any parallelism?”, and the answer is: no one.
However, since I have been reading various calls to “use async/await as the default” and predictions that “one day no-one will be using threads anymore”, it bears pointing out:
async/await doesn’t give you, at least not magically out of the box, one thing that usually matters a lot when using multiple threads(whether directly, or indirectly via a threaded runtime): parallelism.
Now, if your parallel system can be modeled as a set of independent tasks, each of which is going to frequently yield back to the executor, then maybe you can get some system-wide parallelism in the sense that there will always be a task doing some work on each of the native thread underpinning the runtime.
And here’s the problem: unless you’re designing a very specific kind of system, you will be hard-pressed to build a large set (hundreds, or thousands) of independent tasks that frequently yield.
An easier, and more general, way to model a parallel system, is as a set of heterogeneous coarse-grained components, who remain largely black-boxes to each other, and internally own, and operate on, some data.
The goal of such coarse-grain component-based system design is to group data together, in a coarse-grain way, so as to avoid concurrency as much as possible, yet attain parallelism for specific workflows that cross component barriers.
I think it’s a pretty general approach, that can fit many different type of workflows. The “yield often” model of async tasks, the one where something like async/await really shines, on the other hand, is pretty specific in my opinion.
And why is a notation like
await not so much of a game-changer, and perhaps even a distraction, when writing code for a system that doesn’t fit well in the “yield often” paradigm?
Let’s look at an example.
An example: Fetch
Let’s look at a very specific, and real-world, example, of parallelism.
Are you familiar with Fetch? It allows web-developers to fetch resources over the internet, asynchronously. Yes, that is “asynchronously”, via a promise that eventually resolves.
So surely, since the JS API is async, it must be a really good idea to use async/await in Rust to implement it?
Nothing could be further from the truth, and it actually highlights the difference between parallelism and concurrency.
Ok, here comes the actual code.
First, let’s look at one part of how fetch is implemented in Servo:
Now, let’s take a look at a naive attempt to use the “async/await secret sauce”:
So this time, instead of doing all this complicated stuff with the router, we just call
Now this doesn’t do what we want: the
How would the non-blocking version, still using async/await, look like?
Something along the lines of the below:
Now, this is basically the same as the first version, just using a different set of API. In order to get some parallelism to our async code, we had to spawn another task(and assume a threaded runtime).
Now, surely, spawning a task is more efficient than using this router thread, as Servo currently does(let’s leave the whole IPC stuff out of the picture for now)?
Well, perhaps, but by how much? For a given script process, there is a single thread backing all the calls to
ROUTER.add_route. It doesn’t get much more efficient than that(in terms of threading I mean, sure, perhaps a few allocations could be shaved off of the call to
What about the dreaded “cost of context switching”?
Surely, if we use tasks, than the task spawned in example 3, could (eventually) run on the same native thread as the current task, or otherwise be multiplexed onto another native thread together with other tasks?
I’m going to link again to my favorite article on concurrency online: https://www.chromium.org/developers/lock-and-condition-variable
If you scroll down about 2/3, there is a paragraph entitled “The fallacy of thread context-switch cost”, I will save my ink here and refer to the arguments made there.
So what I’m trying to convey here is that “concurrency is not parallelism” goes a little further, in the sense of “calling
await is not going to ensure a high throughput for your code”. Achieving a high throughput for a given component(and the opportunities for parallelism that will follow from it) will require hard work in terms of writing non-blocking logic, whether using threads or tasks.
Problem 2: I don’t (want to) care about scheduling
There is another, more subtle, problem with the
await notation: It makes your code dimly aware of the scheduling of tasks. That’s a problem because it is literally the last thing I want to concern myself with when trying to reason about parallel code.
So when calling
await, the current task blocks, or rather, yields, and the executor is then able to schedule another task to run on the current thread.
I couldn’t care less!
I do, however, really care about the fact that the
await call is going to yield from the current task. Why? Because one thing I can do is try to write code that doesn’t logically block. That’s how one can get (mostly task-)parallelism, almost for free.
I simply cannot write parallel code thinking about a runtime that “somewhere in the background” is going to schedule out of the current task and into another, while another task might already run in parallel(or next be scheduled on the current thread?).
In other words, I find it really hard to reason about throughput across components. It’s hard to think something along the lines of “if I yield now, then this other thing can run in the meantime, so even though the current task blocks on the
await, another task can run on the current thread”.
On the other hand, I find it relatively easy to reason about throughput inside a given component. That doesn’t require reasoning about the scheduling of different components. It just requires writing logic that doesn’t block, except at the beginning of the component-specific event-loop on a (set of) channel(s).
Another way to look at this: I don’t want to write code thinking about the task executor, or the OS thread scheduler, I just want to write sequential logic based on an event-loop that is private to a given component, and handles one message at a time.
Now, if all of the components in your system are written so that efforts are made that they do not block internally(unless necessary), what do you think will be the systemic outcome?
And yes, this doesn’t necessary have to be done with native threads. I’m perfectly happy to write non-blocking event-loops, running in parallel of each other, using tasks. Actually I don’t care that much whether it’s a native thread, or something else(For example, I’ve read about the
Sequenceconcept in Chromium’s scheduling approach, and I think it’s great. That’s basically the same as what you can do with native threads and channels in Rust, if somebody could abstract the native thread from that, awesome!).
Another challenging part of trying to write logic that often yields is that it probably means breaking up components into many finer units, and that probably means you’re going to end-up with shared-state, due to your inability to isolate state in sufficiently fine-grained units. And then the shared-state is likely to make parallelism harder.
It’s simply much easier to have a coarse-grained component, owning a bunch of coarse-grained data, and operating on it locally and sequentially(while running in parallel to other components), one message/event at a time.
What you end-up with is a bunch of event-loops, running in parallel, operating on local data, and communicating with each other.
You can do it with tasks too, and counter-intuitively, you are then likely to get more parallelism when avoiding peppering your code with
await statements, and instead focusing on having each task run an event-loop(where the only place your task yields is a the “top of the loop”, when awaiting on a channel or selecting on a set of futures/streams).
Whether you use threads, or tasks, how those are scheduled should be the last thing on your mind.