Programming Servo: Zen and the art of removing blocks from your system
It all started with a simple issue, a “quick fix”, or so I thought.
I didn’t communicate that I wanted to take it on, or even bother to assign it to myself, as I just assumed the PR would be up for review and merged before that could ever matter…
110 conversations, 4 commits, and 16 files changed later, and almost two months after opening the PR, we could finally close the issue.
(Ok, those 110 conversations also include the project bot, but it does look impressive, doesn’t it?)
What follows is the story of a journey, one of introspection, patience, perseverance, and maybe even a brief insight into the meaning of it all while we’re at it.
Zen, and the art of removing blocks from your system
A brief Intro
For those good folks who just tuned in, a brief intro into the topic.
Servo is a prototype implementation of the Web written in Rust. It aims to leverage Rust’s features and achieve “better parallelism, security, modularity, and performance” compared to existing web engines. That’s a tough goal to set yourself, since those “existing web engines” are built by some of the world’s largest companies with probably the biggest budget to build “secure, parallel, modular, and performant” anythings, but we’re hoping Rust can help, of course…
Having myself written my first line of Rust as part of Servo, and coming from a non-system web-development background, I think it’s simply the best way to learn Rust.
Also, it’s not just about Rust, as you soon realize implementing the Web is a fantastic endeavor in itself regardless of language or tools. Why? Because the Web is the perfect example of how large and “complex” systems should be built: by a distributed community of people working in parallel on competing implementations, and with constant feedback from each other and the end-users of the system.
You don’t “design” complex systems, you evolve them. It shows in the approach to maintaining an HTML Living Standard, versus a “spec”, with the aim of “continuously maintaining the specification rather than freezing it in a state with known problems, and adding new features as needed to evolve the platform”.
So when you contribute to Servo, you also soon start getting involved with the actual standard, as well as with the test-suite, and realize that both change at the speed of merging PRs (which can be as slow as it needs to be).
A not-so-brief digression into “open” forms of debate
From a personal perspective, I’m always amazed at how open the discussions are, both in Servo and in related projects like the living standard. If you show up with a half-decent argument, people will actually engage with it. In general people tend to disagree with each other more than they agree, which doesn’t prevent decisions from being made; instead, it seems to ensure that better decisions are made.
As an aside, I was reading Ed Thorp’s book the other day, which contains a brief discussion of the concept of the “wisdom of crowds”.
This “wisdom of crowds” works great in certain situations. Examples are polling people to estimate the number of beans in a barrel, or the weight of a pumpkin. In these cases, taking the average of all answers gives you something more accurate than most individual guesses. You can speculate that this might be the origin of the expression: “Your guess is as good as mine”.
On the other hand, in other situations, usually logical “true/false” type of questions, the wisdom of crowds turns into what Thorp refers to as the “lunacy of lemmings”. An example mentioned in the book is the Madoff fraud. For decades, the “crowd” was asked the question “Madoff: genius or fraud?”, and most people kept answering “genius”. The average answer turned out to be disastrously wrong, and the few who warned of impending disaster saw their argument ignored and swept under the rug by a cheery consensus.
In an open-source project, you’re hopefully faced with hard technical questions on a regular basis, and if the wrong decisions are made, you will usually only realize it when it’s too late, except maybe for a massive re-write. And unlike when trying to guess the weight of a pumpkin, when faced with hard technical questions, you can’t rely on the wisdom of crowds.
It doesn’t matter if 9 out of 10 think we should go for this or that design and that it’s fine: if there is 1 dissenter, it pays to carefully consider that person’s arguments. You can’t disregard the 1 dissenter simply because the 9 others are in agreement, and you need to be careful not to put too much weight on your own “knowledge”.
And now, let’s discuss how to remove blocking logic from parallel systems.
Frankly, I think the above is more interesting, but since you’ve made it this far, I’ll oblige.
Servo consists of multiple concurrent components, potentially running in various processes. At the scale of the system as a whole, those components communicate with each other via (ipc-)channels and run their internal logic as event-loops.
The way to think about it is that a given “tab” in your browser consists of a given “script event-loop” component running a web-page (or several), in a process. Then, in a different process, there is a central hub called the “constellation” that owns senders to all the “script event-loops” that are currently running (and shares a sender to itself with each of them). In another thread (but in the same process, at least last time I checked), the “embedder” is running. The embedder is where the actual browser UI would be running.
So, when a “script event-loop” wants to navigate one of the webpages it manages (since a tab can have all sorts of frames and so on, and actually several tabs could run in the same “script event-loop”, but that’s for another day to discuss), the “script event-loop” will send a message to the “constellation” to ask it to navigate that “page”.
The reason why script needs to send a message to the constellation is that a “navigation” could require manipulating a “script event-loop” that might not be the same as the one where the navigation was initiated, or spinning up an entirely new “script event-loop” if the target of the navigation should have one of its own. In other words, “navigating” involves some coordination outside of the boundary of a given “script event-loop”, hence it’s the constellation, managing all “script event-loops”, that is in charge of that.
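To make that topology a bit more concrete, here is a minimal sketch. It is not Servo’s actual code: it uses threads and `std::sync::mpsc` as stand-ins for processes and ipc-channels, and every name in it (`ScriptId`, `ConstellationMsg`, `ScriptMsg`, `spawn_script`, `run_demo`) is made up for illustration. It only shows the shape of things: the constellation owns a sender to each script event-loop, each script holds a sender back to the constellation, and a navigation requested by one script can end up being handled by another.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical identifier for a script event-loop.
type ScriptId = u32;

/// Messages a script event-loop sends to the constellation (illustrative).
enum ConstellationMsg {
    Navigate { target: ScriptId, url: String },
}

/// Messages the constellation sends to a script event-loop (illustrative).
enum ScriptMsg {
    Load(String),
    Exit,
}

/// Spawn a script event-loop. It keeps a sender to the constellation and,
/// optionally, starts by asking the constellation to navigate some target.
fn spawn_script(
    to_constellation: Sender<ConstellationMsg>,
    request: Option<(ScriptId, String)>,
) -> (Sender<ScriptMsg>, JoinHandle<Vec<String>>) {
    let (tx, rx) = channel::<ScriptMsg>();
    let handle = thread::spawn(move || {
        if let Some((target, url)) = request {
            to_constellation
                .send(ConstellationMsg::Navigate { target, url })
                .unwrap();
        }
        // The event-loop proper: handle one message at a time.
        let mut loaded = Vec::new();
        for msg in rx {
            match msg {
                ScriptMsg::Load(url) => loaded.push(url),
                ScriptMsg::Exit => break,
            }
        }
        loaded
    });
    (tx, handle)
}

/// Script 1 asks to navigate a page that lives in script 2; the
/// constellation routes the request. Returns what script 2 loaded.
fn run_demo() -> Vec<String> {
    let (to_constellation, constellation_rx) = channel();
    // The constellation owns a sender to every running script event-loop.
    let mut scripts: HashMap<ScriptId, Sender<ScriptMsg>> = HashMap::new();
    let (tx1, h1) = spawn_script(
        to_constellation.clone(),
        Some((2, "https://example.org/".to_string())),
    );
    let (tx2, h2) = spawn_script(to_constellation.clone(), None);
    scripts.insert(1, tx1);
    scripts.insert(2, tx2);

    // Handle a single message on the constellation's event-loop.
    let ConstellationMsg::Navigate { target, url } = constellation_rx.recv().unwrap();
    scripts[&target].send(ScriptMsg::Load(url)).unwrap();

    for tx in scripts.values() {
        tx.send(ScriptMsg::Exit).unwrap();
    }
    h1.join().unwrap();
    h2.join().unwrap()
}

fn main() {
    println!("script 2 loaded: {:?}", run_demo());
}
```

The point of the sketch is only that script 1 never touches script 2 directly: the coordination goes through the one component that knows about all of them.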
When the constellation receives such a “navigate” message from a “script event-loop”, the first thing to do is to give the embedder a chance to handle this in some other way, and let the constellation know whether to proceed or not with the navigation.
Previously, the constellation would send a message to the embedder, with the message itself containing a sender, and then block on receiving the reply. (Frankly, I’m not sure why that was done with an ipc-channel, by the way, since the embedder is in-process anyway…)
The problem there was that the constellation blocks waiting on the reply. It means that while it’s waiting, it cannot continue handling other messages, and it might even deadlock, if the embedder itself is blocking waiting for a message from the constellation.
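The blocking pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the code that was in Servo: it uses `std::sync::mpsc` instead of ipc-channels, and `EmbedderMsg`, `embedder_allows`, `spawn_embedder` and `navigate_blocking` are hypothetical names, as is the toy “only allow https” policy.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical message from the constellation to the embedder.
enum EmbedderMsg {
    /// "May I navigate to this URL?", carrying a sender for the reply.
    AllowNavigation(String, Sender<bool>),
}

/// Toy embedder policy, purely for illustration.
fn embedder_allows(url: &str) -> bool {
    url.starts_with("https://")
}

/// Run a minimal embedder loop on its own thread.
fn spawn_embedder() -> Sender<EmbedderMsg> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for msg in rx {
            match msg {
                EmbedderMsg::AllowNavigation(url, reply) => {
                    // Ignore the error if the asker has gone away.
                    let _ = reply.send(embedder_allows(&url));
                }
            }
        }
    });
    tx
}

/// The "ACK channel" pattern: send a request that contains a sender,
/// then block on the matching receiver until the reply arrives.
fn navigate_blocking(embedder: &Sender<EmbedderMsg>, url: &str) -> bool {
    let (reply_tx, reply_rx) = channel();
    embedder
        .send(EmbedderMsg::AllowNavigation(url.to_string(), reply_tx))
        .unwrap();
    // Nothing else can be handled by the caller while it waits here.
    reply_rx.recv().unwrap()
}

fn main() {
    let embedder = spawn_embedder();
    let allowed = navigate_blocking(&embedder, "https://servo.org/");
    println!("navigation allowed: {}", allowed);
}
```

If the embedder’s loop were itself blocked on a reply from the caller at the moment `navigate_blocking` is called, neither side would ever make progress, which is the deadlock scenario mentioned above.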
In other words, this “ACK channel” pattern is handy, but often it’s something you’d want to replace at some point with the async version of the same logic.
As usual in my articles, while this is arguably “system-level” code, we’re discussing high-level logic. We’re not going to discuss things like thread-affinity, the (relevant or not) cost of context-switching, the time spent on the CPU waiting or not, the intricacies of the scheduler, and so on. (For this, I recommend this very interesting talk from a Google engineer: the cutting edge over there doesn’t seem to be “how many light-weight tasks can you multiplex on a native thread”, but rather “how many native threads can you manually schedule on a core without ever context-switching”: https://www.youtube.com/watch?v=KXuZi9aeGTw)
We’re just going to be looking at this from the point of view of the logic of the code.
Why would you start optimizing lower-level constructs, when your code itself simply has “BLOCKING” written all over it? A first step towards more parallelism would seem to be making the logic in your code non-blocking. Once that is done, you might want to look at the actual parallelism happening from a lower-level perspective and optimize for it in various ways.
So how can we implement the equivalent logic, yet in a non-blocking way? Simple: when the constellation receives a “navigate” message, instead of blocking while asking the embedder whether to handle it, we’ll restructure the flow of communication so that a reply from the embedder would later arrive as just another message on the event-loop of the constellation (the constellation runs what I like to refer to as a “non-event-loop”, essentially running one message at a time as received on (a set of) channel(s)).
So instead of:
- When a “navigate” message from script comes in,
- Ask the embedder for permission, block waiting on yes or no,
- If yes, perform the navigation
We will do:
- When a “navigate” message from script comes in,
- “schedule a potential navigation”, by storing the state required for the navigation,
- and then send a message to the embedder to “ask” if we should actually navigate (this is a non-blocking send on an unbounded channel).
- Continue handling other unrelated messages, while we “don’t wait” for the embedder’s reply.
- At some point, one of those messages will in fact be the reply from the embedder. If the reply is yes, take the state stored in step 2 and do the actual navigation.
So this time, when the constellation sends a message to the embedder, a Sender is nowhere to be seen; instead, the message is just sent and “forgotten about”.
But can we really just “forget” about the message? What about this navigation that we might have to do if the answer is “yes”?
In order for the “reply” from the embedder to be properly handled, the constellation is going to have to store some state in order to be able to “resume” where it left off, and handle the actual navigation work.
Simple enough, that’s all the state we need to store to be able to perform the actual navigation when the reply comes in.
Tada! We’ve just transformed what was a blocking operation, into a non-blocking one. Note that while this required adding a piece of state to the constellation, it’s just thread-local state, and doesn’t require any locking to manipulate.
The constellation doesn’t share state with the embedder, or with any “script event-loop” or any other component, instead, it shares senders to itself, and runs an internal event-loop using the corresponding receivers.
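Here is a rough sketch of that non-blocking flow, under the same caveats as before: `std::sync::mpsc` stands in for ipc-channels, and all the names (`NavId`, `EmbedderMsg`, `ConstellationMsg`, the `pending` map, `run_demo`) are hypothetical, not what Servo uses. The request to the embedder carries an id instead of a sender, the state needed to resume is stored in a plain map owned by the constellation, and the reply comes back as just another message.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical id used to match a reply to a pending navigation.
type NavId = u32;

/// Constellation -> embedder. Note: no reply sender in the message.
enum EmbedderMsg {
    AllowNavigation(NavId, String),
}

/// Everything the constellation can receive on its event-loop.
enum ConstellationMsg {
    Navigate(String),                     // from a script event-loop
    AllowNavigationResponse(NavId, bool), // from the embedder
    Other(&'static str),                  // any unrelated message
    Exit,
}

struct Constellation {
    rx: Receiver<ConstellationMsg>,
    embedder: Sender<EmbedderMsg>,
    /// State needed to resume a navigation once the embedder answers.
    pending: HashMap<NavId, String>,
    next_id: NavId,
    log: Vec<String>,
}

impl Constellation {
    /// Handle one message at a time; never block on a reply.
    fn run(mut self) -> Vec<String> {
        while let Ok(msg) = self.rx.recv() {
            match msg {
                ConstellationMsg::Navigate(url) => {
                    let id = self.next_id;
                    self.next_id += 1;
                    // Store the state, ask the embedder, move on.
                    self.pending.insert(id, url.clone());
                    self.embedder
                        .send(EmbedderMsg::AllowNavigation(id, url))
                        .unwrap();
                }
                ConstellationMsg::AllowNavigationResponse(id, allowed) => {
                    if let Some(url) = self.pending.remove(&id) {
                        if allowed {
                            self.log.push(format!("navigated to {}", url));
                        }
                    }
                }
                ConstellationMsg::Other(what) => {
                    self.log.push(format!("handled {}", what));
                }
                ConstellationMsg::Exit => break,
            }
        }
        self.log
    }
}

fn run_demo() -> Vec<String> {
    let (to_constellation, rx) = channel();
    let (to_embedder, embedder_rx) = channel();

    // Toy embedder: allow https only, then ask the constellation to exit.
    let reply_to = to_constellation.clone();
    let embedder = thread::spawn(move || {
        for msg in embedder_rx {
            let EmbedderMsg::AllowNavigation(id, url) = msg;
            let allowed = url.starts_with("https://");
            reply_to
                .send(ConstellationMsg::AllowNavigationResponse(id, allowed))
                .unwrap();
            reply_to.send(ConstellationMsg::Exit).unwrap();
        }
    });

    // Queue a navigation and an unrelated message before the loop starts.
    to_constellation
        .send(ConstellationMsg::Navigate("https://servo.org/".to_string()))
        .unwrap();
    to_constellation.send(ConstellationMsg::Other("resize")).unwrap();

    let constellation = Constellation {
        rx,
        embedder: to_embedder,
        pending: HashMap::new(),
        next_id: 0,
        log: Vec::new(),
    };
    let log = constellation.run();
    embedder.join().unwrap();
    log
}

fn main() {
    for line in run_demo() {
        println!("{}", line);
    }
}
```

In this sketch the unrelated “resize” message is handled before the navigation actually happens, because it was already in the queue when the embedder’s reply arrived. Keep that ordering in mind for what comes next.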
So that’s it, we’re done, right?
Not so fast
And that’s when it got harder…
Nope, looks like we have to dig deeper…
Yep, there we go down the rabbit hole…
Lesson number one of blocking
If any part of your otherwise parallel system blocks (logically), you will end up with other logic implicitly relying on that particular block, even though there is no reason for it.
People write code based on the current behavior of the system, and so a block suddenly becomes a feature that additional code will rely on.
In this particular case, the constellation could now keep running and handle more incoming messages, while the navigation was “pending”. Previously the constellation would block waiting for the embedder’s answer, and then immediately do the actual navigation.
So other code basically required the constellation to not handle any other messages after the “navigate” message, until the “navigation” actually had started. But this “you got to navigate immediately while handling this initial navigate message” was not documented anywhere, and it was not even intended. It just happened that that was the behavior, and so other code had been written that implicitly relied on it.
And so we ended up with three additional seemingly unrelated commits.
However, the magic of the web is that this actually made us realize we hadn’t implemented parts of the spec. Instead of properly implementing the spec, our “false compliance” with it had been based on an unintended hack which relied on a blocking operation in an unrelated place. Removing that block exposed this unintended reliance on it, itself exposing the parts of the spec that required implementation in order not to have to rely on that block.
And thus it suddenly all made sense, and all those present felt a deep sense of calm and purpose.
The end of this article, and the beginning of your own journey?
Ok, I think it’s time to put an end to this. Thanks for reading, and go enjoy what’s left of that Sunday…