Programming Servo: three years, 100 commits.
So, after a bit more than three years, and having reached the arbitrary number 100 in commit count, I think it’s time to take some time to survey the experience so far.
Also, even though I liked to read books about “patterns”, and I considered my Django apps “pretty well structured”, I really had no idea what “programming in the large” was.
After those initial commits, I spent a few more months doing other fairly easy things (although they were hard enough at the time), and it was only in February 2017 that I had my first taste of an issue labelled “Hard”: “Implement structured clone callbacks” https://github.com/servo/servo/pull/15519.
As a self-taught engineer who once had to independently figure out what a “variable” and a “class” were, I remember very well that working on that structured clone stuff felt about as hard as learning to program all over again (by the way, since that stuff was pretty much all unsafe, it might have been a better idea to pick a different “Hard” issue at the time, but hey, I made it through).
It was also weirdly captivating and addictive, and I have been hooked ever since. Over time, I figured out what I like the most, which I would describe as “working with event-loops”, especially coordinating among them, and using those to build fairly convoluted concurrent and/or multi-process workflows.
Here are some things I’ve noticed along the way:
Why did I go from a few easy commits, to becoming a reviewer on the project three years later?
The answer is community.
If you look at the Servo contributor graph, you see that over the years not only has the project attracted more than 1000 contributors, it also takes a full 22 commits to get in the top 100. To me, this means that the project’s engagement with the community has been both broad and deep. People tend to stick around for more once they get started (for comparison: it takes 102 commits to reach the top 100 for Rust itself, so there is still some way to go for Servo to reach that level of engagement).
What’s so great about Servo’s community? I think it’s hard to put your finger on it, but one thing I’ve noticed is that the “reviewers”, while they sometimes really do seem to know everything, never make it feel like they know everything. In other words, there is plenty of room to have discussions, and people will take your arguments seriously.
I also should mention the help I got from Paul Rouget, who initially introduced me to the project and provided invaluable support along the way. Coming from web development, I was initially skeptical I could tackle working on the underlying engine, but he convinced me otherwise. Thanks Paul!
Incremental progress works
How can you continuously improve your skills as a programmer? It’s simple: always work on something that is slightly harder than the last thing you worked on.
“CV time” doesn’t count. You can work for 10 years and yet make little progress. It’s not the amount of time you spend programming that counts, it’s how much progress you make by doing things that are incrementally harder.
So, every time I picked up an issue in Servo, I made a point to choose one that appeared harder than the last.
It goes a little like this: at the start of almost every issue, there is an initial period where you feel lost and have no idea how to go about it. Then, as you read the spec, survey existing code, or perhaps look into how something was done in Gecko, you slowly build a mental model of the solution, and start writing some code incrementally towards it (of course your perception of the problem then further changes along the way).
After you’ve finished one such issue, you might go back and do some smaller stuff still related to it, such as fixing a few bugs that perhaps have come up since. That can be fairly relaxing and probably necessary as some sort of recovery phase. But don’t fool yourself, you’re not making progress until you actually pick up the next project and get that feeling of being at a loss again.
Obviously, the above is nothing new. See for example https://norvig.com/21-days.html
The native thread is dead, long live the native thread!
Here comes my favorite rant.
I’m always surprised when I read somewhere the prediction that “one day, nobody will directly use threads anymore”. Or the advice that one should “use async/await by default” as an approach to concurrency. Let’s say my experience on Servo has been very different, and I’ll try to explain why.
In my opinion, there is one reason Servo is still tractable despite its size, and it is the use of a simple approach to concurrency used to model individual components: a native thread, running its own event-loop, and handling one message at a time.
And “simple” doesn’t necessarily mean “contains little code”. In fact, script is an example of a large component in Servo (a cargo check alone will take 5 mins). Yet the answer to the question “how does script run?” can be summed up as “one task at a time”.
“Simple” also doesn’t mean that all your component can do is run an event-loop on a thread. The way to think about it is that this “event-loop” represents the “main-thread” of the component, and other “background threads” can be spawned off of it. So for example script is running an event-loop, yet it also has any number of web-worker threads and layout-threads.
Another example is the net component, which also runs an event-loop, and then owns a thread-pool on which parallel instances of the “fetch” algorithm are spawned. Those parallel fetches further share a Tokio runtime on which they spawn the actual networking part of the algorithm. And when the time comes to communicate the results, or progress, of a fetch, back to script, it’s done by sending a message to script. Since script and net run in different processes, that message is not directly handled as a task on the event-loop of script. Instead, it’s received by an IPC router thread in the script process, from which a task is then queued (again by sending a message) on the script event-loop (for an actual example, see here).
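The router hop described above can be sketched roughly as follows. This is not Servo's code: the `FetchMsg` and `ScriptTask` types and both functions are hypothetical, and a plain in-process `std::sync::mpsc` channel stands in for the real IPC channel between processes.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Hypothetical message sent by the networking side about a fetch.
#[derive(Debug)]
enum FetchMsg {
    Chunk(Vec<u8>),
    Done,
}

/// Hypothetical task queued on the script event-loop.
#[derive(Debug, PartialEq)]
enum ScriptTask {
    FetchChunk(usize),
    FetchDone,
}

/// Router thread: receives fetch messages and re-sends each one
/// as a task on the script event-loop.
fn spawn_router(
    from_net: Receiver<FetchMsg>,
    to_script: Sender<ScriptTask>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // The loop ends once every sender on the net side has been dropped.
        for msg in from_net {
            let task = match msg {
                FetchMsg::Chunk(bytes) => ScriptTask::FetchChunk(bytes.len()),
                FetchMsg::Done => ScriptTask::FetchDone,
            };
            if to_script.send(task).is_err() {
                break; // The script event-loop is gone.
            }
        }
    })
}

/// Script event-loop: handles tasks one at a time until the fetch is done,
/// and returns the tasks in the order they were handled.
fn run_script(tasks: Receiver<ScriptTask>) -> Vec<ScriptTask> {
    let mut handled = Vec::new();
    while let Ok(task) = tasks.recv() {
        let done = task == ScriptTask::FetchDone;
        handled.push(task);
        if done {
            break;
        }
    }
    handled
}
```

The point of the extra hop is that the script event-loop only ever sees tasks arriving on its own channel, no matter where the work originally happened.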
I assure you, there is a kind of madness to the logic.
So, despite the huge amount of code that potentially runs as part of “one task”, if you want to understand “how the (main event-loop of the) component runs”, you just need to add one println! here. Doing so will tell you exactly what “events” are being handled, one at a time.
The component, like most others in Servo, runs something like the below algorithm:
1. Block on receiving on a channel, or on a select of channels (thank you Crossbeam),
2. Handle the message received, (almost) without any concurrency coming into play (except non-blocking sends on channels),
3. Go back to 1.
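The three steps above can be sketched in a few lines. This is a minimal illustration, not Servo's code: it assumes a single `std::sync::mpsc` channel instead of a Crossbeam select, and the `Msg` type and function names are made up for the example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical messages a component might receive.
#[derive(Debug)]
enum Msg {
    Add(u32),
    Report(Sender<u32>),
    Exit,
}

/// Runs the component's event-loop until `Exit` arrives
/// or every sender has been dropped.
fn run_event_loop(receiver: Receiver<Msg>) -> u32 {
    let mut total = 0;
    // Step 1: block until a message arrives. This is the only place
    // where this thread waits.
    while let Ok(msg) = receiver.recv() {
        // Step 2: handle the message sequentially, from start to finish.
        match msg {
            Msg::Add(n) => total += n,
            Msg::Report(reply) => {
                // Sending on an unbounded channel does not block.
                let _ = reply.send(total);
            }
            Msg::Exit => break,
        }
        // Step 3: go back to step 1.
    }
    total
}

/// Spawns the component on its own native thread and returns
/// the sender used to talk to it.
fn spawn_component() -> (Sender<Msg>, JoinHandle<u32>) {
    let (sender, receiver) = channel();
    let handle = thread::spawn(move || run_event_loop(receiver));
    (sender, handle)
}
```

Everything inside the `match` runs to completion before the next message is even looked at, which is the property the rest of this section relies on.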
So, while step 2 can get incredibly complicated, in trying to understand what goes on you will have one enormous benefit: the code is single-threaded/sequential in nature.
So what’s the problem with async and tasks? The problem is that using those breaks down that simple model.
Perhaps some example code is due.
Let’s first take a look at the event-loop of script in Servo:
As you can see, there is a single “yield” point, where the thread might block if no message is available. The actual event handling that follows the receiving of a message is purely sequential.
Ok, ok, I admit there are a few more points where script might block, as can be seen for example below:
This is referred to as “blocking the event-loop”, and avoided if possible.
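For illustration only, blocking the event-loop typically looks like a synchronous round-trip to another component in the middle of handling a message. The sketch below is not taken from Servo: the `GetCookie` request, the storage component and the returned value are all hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical request sent to another component, carrying a reply channel.
#[derive(Debug)]
struct GetCookie {
    reply: Sender<String>,
}

/// Handles one message on the event-loop, with a synchronous round-trip.
fn handle_message(to_storage: &Sender<GetCookie>) -> String {
    let (reply, response) = channel();
    to_storage
        .send(GetCookie { reply })
        .expect("storage component is gone");
    // The event-loop thread blocks here: no other message is handled
    // until the other component has replied.
    response.recv().expect("storage dropped the reply channel")
}

/// Hypothetical component answering requests on its own thread.
fn spawn_storage() -> Sender<GetCookie> {
    let (sender, receiver): (Sender<GetCookie>, Receiver<GetCookie>) = channel();
    thread::spawn(move || {
        for request in receiver {
            let _ = request.reply.send(String::from("session=abc"));
        }
    });
    sender
}
```

The second blocking point is explicit and local: it is a plain `recv` on a reply channel, easy to spot when reading the handler.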
Now let’s take a look at an async example, this time from Facebook’s Libra:
We can further look into one of those async method calls, for example process_proposal_msg, where we can find further yield points:
So, while there is a resemblance to the select used in the script event-loop, the similarities end there.
The big difference is what happens after the select wakes up. In Servo, the handling of the message received from the select is sequential. You’re talking single-threaded code running one statement after the other, without yielding (and that’s hard enough, believe me).
In the Libra example, the code inside the select is itself async, which is another way of saying it is concurrent (even if you’re using a single-threaded runtime). Can you describe “how this component runs” using a few simple steps, like I’ve done above for Servo? Let’s try:
1. Await on the select (so far so good).
2. Handle the result of the select, awaiting a number of nested async computations.
3. Back to 1.
In theory it’s fine, the “nested async computations” at step 2 will execute sequentially, but what happens when something goes wrong? Trying to debug the async code at step 2 is going to be a lot harder than the equivalent sequential code in Servo. Why? I think we can assume that sequential code is easier to understand and debug than concurrent code (even when there is no parallelism and it all runs in a single thread).
Note: I understand that in the Libra example above, only one “event” received on the select is handled at a time. And it might even be possible that lots of those await calls do not actually await any work happening in another task (they are “sync” calls with an “async” signature). My point is that the task handling one event at a time doesn’t do so entirely sequentially. Each await potentially yields control back to the executor. So your task is running “one event at a time, with a bunch of yield points nested in between making progress on handling that one event”. So yes, the outcome is “sequential”, it’s not like your task is broken up into many parallel tasks, but it’s still more complicated than actual “sequential” code that doesn’t have any yield points (except at the top of the event-loop).
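To make the “yield point” idea concrete, here is a toy sketch that uses only the standard library. Nothing in it comes from Libra or Servo: `YieldNow`, `handle_event` and the round-robin executor `run_all` are made up for the illustration, and a real runtime schedules tasks in more sophisticated ways. It only shows that, on a single thread, an await in the middle of a handler is a place where the executor may run something else.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that returns `Pending` once, handing control back to the executor.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// The toy executor polls in a loop, so its waker has nothing to do.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

type Log = Rc<RefCell<Vec<String>>>;
type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Handles one "event" in two steps, with a yield point in between.
async fn handle_event(name: &'static str, log: Log) {
    log.borrow_mut().push(format!("{name}: start"));
    YieldNow(false).await; // The executor may run another task here.
    log.borrow_mut().push(format!("{name}: end"));
}

/// Toy round-robin executor: polls every task in turn, on the current
/// thread, until all of them have completed.
fn run_all(mut tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
}

/// Runs two handlers on the toy executor and returns the order of their steps.
fn interleaving() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let a: Task = Box::pin(handle_event("a", log.clone()));
    let b: Task = Box::pin(handle_event("b", log.clone()));
    run_all(vec![a, b]);
    let entries = log.borrow().clone();
    entries
}
```

With this executor, `interleaving()` records both “start” entries before either “end” entry, even though everything ran on one thread. In the sequential event-loop model, a handler's “start” and “end” can never be separated like that.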
In Servo’s approach, each component is internally sequential (at least the main event-loop of a component is, while other parallel computations can be spawned by code running on that event-loop, see for example Fetch). The component will communicate with other components running in parallel using message-passing (preferably without blocking). Those message-passing workflows can indeed be somewhat hard to debug, but at least you can rely on the internal logic of (the main event-loop of) each component being single-threaded.
Looking at https://github.com/libra/libra/issues/2152, and https://github.com/libra/libra/issues/1399, it seems like the Libra devs are also moving to something more like a “single thread/single event-loop per component” model. It also looks like that code is not as async as it looks.
Does the Servo approach simply fit what Servo is trying to do, and could other types of system be modeled fully using async/await?
I think that there isn’t one particular thing that “Servo is trying to do”. There is really a bit of everything, from networking, to graphics, to running code in a VM. And that’s the challenge of a large system: it’s going to consist of various parts, you’re going to have to keep them isolated from each other, and they all will have different runtime “needs” (and the needs of each component are different from the needs of the system as a whole).
So while a given component, let’s say your networking component, might internally own an async runtime and spawn internal async computations, as part of a larger system, I would still model the component with a thread as the outer layer (I would probably argue for a single-threaded async runtime, running inside that thread).
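As a rough sketch of that layering, the example below puts a native thread with a message loop on the outside, and runs async code only on the inside. It is an illustration under stated assumptions: the hand-rolled `block_on` stands in for a real single-threaded runtime, and `NetMsg`, `resolve` and `spawn_net` are hypothetical names.

```rust
use std::future::Future;
use std::sync::mpsc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the component thread by unparking it.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Toy single-threaded "runtime": drives one future to completion
/// on the current thread, parking while the future is pending.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Stand-in for the async networking internals of the component.
async fn resolve(host: &str) -> usize {
    host.len()
}

/// Hypothetical messages understood by the component.
#[derive(Debug)]
enum NetMsg {
    Fetch(String, mpsc::Sender<usize>),
    Exit,
}

/// Outer layer: a native thread with its own message loop.
/// The async code never leaks out of this thread.
fn spawn_net() -> (mpsc::Sender<NetMsg>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        for msg in rx {
            match msg {
                NetMsg::Fetch(host, reply) => {
                    let result = block_on(resolve(&host));
                    let _ = reply.send(result);
                }
                NetMsg::Exit => break,
            }
        }
    });
    (tx, handle)
}
```

From the outside, the rest of the system only sees a channel and a thread. Whether the component uses async internally is an implementation detail it can change without affecting anyone else.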
I would not try to model components of a larger system individually as tasks (spawning other tasks?), run multiple components on the same async runtime, or try to communicate between them using futures. Why? Because that would force upon each component an async model of computation, which is unlikely to be a good fit for each, and it would also represent a loss of isolation in terms of the runtime characteristics of each component (despite the fact that most “async runtimes” come with a flavor of “spawn a long-running computation” API, that’s not the same thing as spawning a thread representing a component that “usually doesn’t block, but sometimes must”).
Actually one further complication in Servo is the existence of, and the need for, process boundaries. Those are partly required as mitigation for Spectre, partly for increased robustness of the system (when a tab crashes, the browser as a whole doesn’t). If anything, I think these will become increasingly prevalent in other types of system too.
For other relevant discussions of things like “the (fallacy of the) cost of context-switching”, and a broad overview of concurrency applied to large programs, I refer to the excellent: https://www.chromium.org/developers/lock-and-condition-variable (the title doesn’t do it justice, scroll down about half-way to “Alternatives to mutexes” for a few real gems of paragraphs, while the mutex/condvar part might actually be the best practical intro into the topic on the internet).
Thanks for reading, and here’s to another, or your first, 100 commits in Servo, happy new year!