Programming Servo: just hanging around
“Concurrency is hard” is a common refrain in our industry, and to that could be added “Stopping concurrency is harder”, which could further be supplemented with “Stopping other people’s code running concurrently is even harder”, or something like that.
Servo, being a web engine, is all about running other people’s code. And while that code perhaps isn’t concurrent in and of itself, it is run by the engine concurrently with other parts of the engine.
When your own concurrent code hangs for some reason, for example in the case of a deadlock, you can usually debug the problem, fix it, and move on.
However, when running other people’s code, by definition that code can pretty much do whatever it wants, and you can’t fix it. The hanging could also be intentional, as in the case of malicious software trying to run some CPU intensive code in a loop on your machine.
So Web engines have to deal with the problem in some way, and since Servo is still a prototype of a Web engine, this was actually done only quite recently, and of course in a more basic way than what the full-featured engines are doing.
Let’s take a look at how this was done.
Full gore over at https://github.com/servo/servo/pull/27016
The usual suspects
Let’s first take a look at all the “actors” involved in the workflow we are about to dissect:
The above is essentially a simplification of, with some additions to, the bird’s-eye overview of Servo.
We have two processes, the “main process” containing the constellation, and a single “content process”, where the “other people’s code” would be running.
In both processes, we see a few more components, each running an event-loop.
Also note that this “other people’s code” can spawn further concurrent code, in the form of “dedicated workers”, which can in turn spawn more of their own, forming a kind of graph.
So here’s the problem encountered in Servo: when some piece of “other people’s code” was hanging (essentially, running in a loop without yielding back to the engine), clicking the “close” button on the Servo window did nothing. Basically, you couldn’t shut down the browser when a web-page wouldn’t let you.
This appeared a bit strange to me at first, since we’re talking about different threads, and even different processes, between the code that handles the “close” button click on the window (in the main process), and the code running “other people’s code” (in the content process). I soon found out it was because, when shutting down, the constellation would actually wait to receive an “exited” message back from each content process.
So my first reaction was to simply remove this waiting (after a timeout was hit), and then I realized that wasn’t actually fixing the underlying problem. We had to find a way to stop the hanging code…
Anatomy of a hang
How exactly was Servo hanging on code running in a web-page?
So note the “script-thread” and the “dedicated worker” in the image above. See how they each contain a “Spidermonkey”? That is actually an engine embedded within an engine, one that focuses on running JavaScript and WebAssembly.
Now, while Spidermonkey does sometimes use some background threads for certain things, when one of Servo’s threads, like the “script-thread”, calls into Spidermonkey to run some code, that happens right there and then on that “script-thread” itself.
So that means the “script-thread” will be “busy” for as long as that code is running in Spidermonkey. That in turn means it will not be able to handle other things, like, for example, an “exit message” coming from the constellation.
How it should work
So how do we expect the “exit workflow” to work? A bit like the below:
This is meant to show that:
- The constellation sends an “exit message”
- The script-thread, when receiving it, sends a similar message to any dedicated workers it has spawned.
- Each worker sends a similar message to each nested worker it might have spawned.
- In reverse, each one sends back an “exited message” when it has exited, or is about to.
How it goes wrong
It goes wrong when any of the threads running Spidermonkey is busy running code, and hence cannot handle the “exit message”.
It could be the “top-level” script-thread:
Or it could be any of the workers, or nested workers.
Spidermonkey to the rescue
Fortunately, Spidermonkey offers us an API to remedy this problem:
We can add an “interrupt callback” via JS_AddInterruptCallback, and we can “have it called” by using JS_RequestInterruptCallback.
The only question is: where, as part of this workflow, are we going to call JS_RequestInterruptCallback?
Enter: the Background Hang Monitor
The BHM is a component of Servo that was previously introduced here. And while its initial purpose was actively monitoring, and/or sampling, hanging threads, it has now also received an additional responsibility: ensuring that hanging threads running “other people’s code” properly exit when the constellation wants them to.
This is the gist of the matter:
So first of all, yes, the layering of the components went wrong in this diagram, as you can see; however, I hope the point comes across.
Also, I’ve removed the “send exited message” arrows, to keep it somewhat readable, since we’re now focusing on the other part involving the BHM.
Now, in addition to the “exit message”, the constellation also sends an “interrupt” message to the BHM, which results in interrupting Spidermonkey on the script-thread. The script-thread then does something similar for its workers, which each do the same for their own workers, ensuring a clean recursive shutdown of all these threads.
Note that this might look messy, as in: “which message is received first?” There is actually some additional code that ensures the interrupt is always handled by the BHM before the exit messages are. What if the threads are not hanging? Then the interrupt is simply a no-op. The only thing that does matter is that you can’t interrupt Spidermonkey after it has already exited.
How is this actually implemented, in Rust, you ask? Well, it’s actually kind of interesting, since it uses cutting-edge features of Rust such as a “boxed trait”. Since we don’t want the BHM to know anything about “how to interrupt Spidermonkey” (or for it to depend on the “script” crate), a trait is used to give the BHM a generic “signal exit” capability with regards to each component registered with it.
It looks like this:
As you can see, the Servo code-base is really pushing the envelope on the generic side of things.
And how does the script-thread implement this trait?
Quite simply like this:
It’s worth looking a bit further into this ContextForRequestInterrupt, which features another bleeding-edge use of the Rust type-system, known as “implementing Send”.
So this is kind of interesting, because it’s not actually safe to just send a “context” to other threads and perform all sorts of operations with it; literally the only thing that is safe to do from another thread (as far as I currently know, so that will probably change next week) is calling JS_RequestInterruptCallback.
So we wrap the pointer, which by the way would have been obtained from Spidermonkey, inside a struct that is Send, and that can only perform that very operation on it.
For good measure, here is what things look like from the BHM’s perspective:
And that’s basically it.
There is some more code dealing with a bunch of details, but essentially it works: no more hanging threads preventing Servo from shutting down.
Obviously there is a bunch of other stuff that could be done on top of this infrastructure, such as implementing a kind of “kill slow script” workflow, and so on. That is left as an exercise for the reader.
I’ve got to delete a bunch of screenshots from my desktop now…