Programming Servo: A background-hang-monitor.
Or, how I learned to stop worrying and love suspending, profiling, and resuming threads…
On hanging threads, signal handlers, and backtraces.
When one of those components seemingly hangs on something, how can you find out what it is hanging on? Maybe a backtrace of what that component is doing at that time would be useful?
That’s easy, for that we have the backtrace-rs crate, right?
Well, there’s a catch: how do we call Backtrace::new() from a thread that is hanging?
Isn’t there something called a “signal handler”? Maybe we can just send a signal to the hanging thread, get a backtrace, and either print it right then and there, or otherwise make it available to another thread in the system for further processing?
So, all that code in Gecko is really necessary after all? For a moment I really thought I could replace it with just a
Next step: let’s find a way to suspend, sample, and resume threads, using Rust.
The full-frontal is available at https://github.com/servo/servo/pull/21673 (still in progress as of writing this, actually)
But first, let’s digress into my favorite topic
See that? It’s what I like to call a “non event-loop”. The “non” stands for “it has nothing to do with async I/O”.
And this is how you start it:
What’s so great about a glorified for-loop? Well, exactly the fact that it is “just a loop”, which means that it consists of simple sequential steps, with the multi-threading part restricted to receiving messages, or events one might say, on the channel.
There is no lock contention, and no having to worry about the ordering of your locking. Also, any state mutation happens in purely single-threaded fashion: there is no “how, and when, did that thing inside an Arc get into that state?”.
There is only: “for each message you receive, do 1, and then 2, and then maybe 3 if this or that, and so on…”.
If, in those steps, you end up mutating “something”, well, any bug will be pretty easy to diagnose: just put a print statement with that “something” at the top of the loop and another one at the bottom. If you see anything that changed in the “something” that you don’t like, it’s time to take a closer look at the code executed in between those prints, and that code emphatically is not multi-threaded in nature…
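To make the pattern concrete, here is a minimal sketch of such a “non event-loop”, and of how one might start it. It is not the Servo code: the Worker struct, the Msg enum and the use of std::sync::mpsc are assumptions made purely for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical messages; a real component would define its own.
pub enum Msg {
    Add(u64),
    Exit,
}

/// The "non event-loop": it owns its state, and only hears from the
/// rest of the world through the receiving end of a channel.
pub struct Worker {
    port: Receiver<Msg>,
    total: u64,
}

impl Worker {
    /// Start the loop on its own thread. Returns the sender used to
    /// talk to it, and a handle yielding the final state on exit.
    pub fn init() -> (Sender<Msg>, JoinHandle<u64>) {
        let (sender, port) = channel();
        let handle = thread::spawn(move || {
            let mut worker = Worker { port, total: 0 };
            worker.run();
            worker.total
        });
        (sender, handle)
    }

    /// Just a loop: block on the channel, handle one message, repeat.
    /// All mutation of `total` happens here, on a single thread.
    fn run(&mut self) {
        while let Ok(msg) = self.port.recv() {
            match msg {
                Msg::Add(n) => self.total += n,
                Msg::Exit => break,
            }
        }
    }
}
```

The only multi-threaded part is the channel itself; everything inside run is plain sequential code working on a &mut self.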
Here is what the background hang monitor looks like in terms of state:
As you can see, the state of the monitor consists mainly of a bunch of MonitoredComponent, and mutating those does not require any locking. A simple &mut will do, since it all happens in the same thread and in a sequential fashion. Below is the “hang monitor checkpoint” performed at each iteration of the “non event-loop”, after incoming messages have been handled:
So maybe this message handling requires complicated locking? See for yourself:
Isn’t it great when you can spend a good chunk of a blog post discussing the absence of something?
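For a feel of how little ceremony is involved, here is a rough sketch of monitor state, message handling and a checkpoint, all behind a simple &mut self. Only the MonitoredComponent name comes from the text above; the fields, the MonitorMsg variants and the timeout logic are assumptions, not the actual Servo implementation.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical identifier for a monitored component.
pub type ComponentId = u32;

/// Hypothetical per-component state.
pub struct MonitoredComponent {
    last_activity: Instant,
    is_waiting: bool,
    hang_timeout: Duration,
    sent_alert: bool,
}

/// Hypothetical messages received by the monitor.
pub enum MonitorMsg {
    Register(ComponentId, Duration),
    NotifyActivity(ComponentId),
    NotifyWait(ComponentId),
}

#[derive(Default)]
pub struct Monitor {
    components: HashMap<ComponentId, MonitoredComponent>,
}

impl Monitor {
    /// Handle one incoming message: plain mutation, no locks.
    pub fn handle_msg(&mut self, msg: MonitorMsg, now: Instant) {
        match msg {
            MonitorMsg::Register(id, hang_timeout) => {
                let component = MonitoredComponent {
                    last_activity: now,
                    is_waiting: true,
                    hang_timeout,
                    sent_alert: false,
                };
                self.components.insert(id, component);
            }
            MonitorMsg::NotifyActivity(id) => {
                if let Some(c) = self.components.get_mut(&id) {
                    c.last_activity = now;
                    c.is_waiting = false;
                    c.sent_alert = false;
                }
            }
            MonitorMsg::NotifyWait(id) => {
                if let Some(c) = self.components.get_mut(&id) {
                    c.last_activity = now;
                    c.is_waiting = true;
                }
            }
        }
    }

    /// The checkpoint: return the components that newly appear to hang,
    /// that is, busy for longer than their timeout and not yet reported.
    pub fn checkpoint(&mut self, now: Instant) -> Vec<ComponentId> {
        let mut hanging = Vec::new();
        for (id, c) in self.components.iter_mut() {
            if c.is_waiting || c.sent_alert {
                continue;
            }
            if now.duration_since(c.last_activity) > c.hang_timeout {
                c.sent_alert = true;
                hanging.push(*id);
            }
        }
        hanging
    }
}
```

Passing now in explicitly is a choice made for this sketch only: it keeps the logic easy to test without sleeping.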
I love channels, but it might be even better to hide them…
Noticed how that init function above didn’t return a Sender, but rather a
In an early iteration of the background monitor, a bunch of raw Sender were shared around the system, used for three things:
- Registering components for monitoring.
- Notifying the monitor of the start of an activity by a registered component.
- Notifying the monitor of a registered component going into “waiting” mode.
Now, we instead have two traits.
One dealing with registering of components:
And a second one, incidentally returned by the “register” method of the first one, and dealing with sending notifications to the monitor from a component already registered for monitoring.
The benefits of this are twofold:
- The whole background hang monitor, and the methods of communication with it, remains hidden from the rest of the system. Currently, those traits are actually implemented by wrappers around Sender, yet if this were to change, the rest of the system wouldn’t notice.
- We can put those traits in a minimal crate used by the rest of the system, whereas the background hang monitor can live in its own crate that only a few other crates in the system rely on. This is mainly nice because if you make a change to the monitor crate, you don’t need to recompile all the crates using the traits…
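As an illustration of the idea, here is a minimal sketch of two such traits, both implemented by thin wrappers around a Sender. The trait names, method names and message enum are hypothetical and chosen for this example only. Also, this init hands back the Receiver so that the example can be inspected, whereas a real monitor would keep it to itself on its own thread.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical identifier for a monitored component.
pub type ComponentId = u32;

/// Hypothetical messages travelling over the hidden channel.
#[derive(Debug, PartialEq)]
pub enum MonitorMsg {
    Register(ComponentId),
    NotifyActivity(ComponentId),
    NotifyWait(ComponentId),
}

/// First trait: registering components for monitoring.
pub trait HangMonitorRegister {
    fn register_component(&self, id: ComponentId) -> Box<dyn HangMonitor>;
}

/// Second trait: notifications from an already registered component.
pub trait HangMonitor {
    fn notify_activity(&self);
    fn notify_wait(&self);
}

/// Private wrappers: the rest of the system never sees the Sender.
struct RegisterChan {
    sender: Sender<MonitorMsg>,
}

struct ComponentChan {
    id: ComponentId,
    sender: Sender<MonitorMsg>,
}

impl HangMonitorRegister for RegisterChan {
    fn register_component(&self, id: ComponentId) -> Box<dyn HangMonitor> {
        // Send errors are ignored here: the monitor may already be gone.
        let _ = self.sender.send(MonitorMsg::Register(id));
        Box::new(ComponentChan { id, sender: self.sender.clone() })
    }
}

impl HangMonitor for ComponentChan {
    fn notify_activity(&self) {
        let _ = self.sender.send(MonitorMsg::NotifyActivity(self.id));
    }

    fn notify_wait(&self) {
        let _ = self.sender.send(MonitorMsg::NotifyWait(self.id));
    }
}

/// Callers only ever get a trait object back, never a raw Sender.
pub fn init() -> (Box<dyn HangMonitorRegister>, Receiver<MonitorMsg>) {
    let (sender, port) = channel();
    (Box::new(RegisterChan { sender }), port)
}
```

Swapping the channel for something else would then only touch the two private wrapper structs.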
And now, let’s finally get to the thread sampling.
In order to sample a hanging thread, there are a few things you need to understand first. Or, if not completely understand, at least understand enough to be able to write Rust code dealing with the matter that will compile (and not crash at runtime, since most of it is unsafe)…
Here’s what needs to be done:
1. Suspend the thread. Mac OS and Windows have dedicated APIs for this, while on Linux it can be done by registering a signal handler for the SIGPROF signal (yes, the “PROF” part stands for “profile”, or at least that’s what I think), and then sending that signal to the thread from another one (the monitoring/sampler thread).
2. Inspect the “registers” of the thread, which again come in a different structure on each platform. On some platforms it’s called the “context”, on others the “thread state”. All variants will contain two things that are important: (a) the “instruction pointer”, which points to the code currently executing in the topmost frame, and (b) the “frame pointer”, which leads to the saved instruction pointer of the previous frame (actually, the frame pointer will not always be present, see https://github.com/rust-lang/rust/issues/48785).
3. Store the instruction pointer, and use the frame pointer to “walk up the stack”, storing each instruction pointer as you walk up, giving you a list of pointers that essentially gives you access to the call stack.
4. Resume the suspended thread.
5. Resolve the list of pointers collected under 3 to a list of symbols which can be used to print out a proper backtrace. And this can actually be done using
Note that 2 and 3 represent a “critical section”, in which you can’t acquire locks since the suspended thread might have acquired them previously. Trying to acquire a lock that is held by a currently suspended thread is apparently the surest way to achieve deadlock. For more info: https://dxr.mozilla.org/mozilla-central/rev/b0b856065d5b7ad2996f707e6e797d0d72afd803/tools/profiler/core/platform-linux-android.cpp#339
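The walk up the stack in step 3 can be illustrated without any unsafe code, by simulating the stack as a slice of words. This is only a sketch under stated assumptions: the Registers struct, the walk_stack function and the layout (the word at the frame pointer holds the caller’s frame pointer, the next word holds the return address) are illustrative, and real sampling code reads raw memory of the suspended thread instead. The fixed-size output buffer is a deliberate choice, since allocating inside the critical section could mean waiting on a lock held by the suspended thread.

```rust
/// Maximum number of frames recorded; a fixed size means the walk
/// never needs to allocate.
pub const MAX_FRAMES: usize = 8;

/// Hypothetical stand-in for the platform "context" or "thread state".
pub struct Registers {
    pub instruction_pointer: usize,
    /// Index into the simulated stack; 0 marks the end of the chain.
    pub frame_pointer: usize,
}

/// Walk a simulated stack in which `stack[fp]` holds the caller's frame
/// pointer and `stack[fp + 1]` holds the return address into the caller.
/// Writes the collected instruction pointers into `out` and returns how
/// many were stored.
pub fn walk_stack(
    stack: &[usize],
    regs: &Registers,
    out: &mut [usize; MAX_FRAMES],
) -> usize {
    let mut count = 0;
    // The current instruction pointer is the first entry.
    out[count] = regs.instruction_pointer;
    count += 1;

    let mut fp = regs.frame_pointer;
    while fp != 0 && fp + 1 < stack.len() && count < MAX_FRAMES {
        out[count] = stack[fp + 1];
        count += 1;
        let next = stack[fp];
        // Frames must move strictly towards the base of the stack;
        // anything else is a corrupt chain, so stop walking.
        if next != 0 && next <= fp {
            break;
        }
        fp = next;
    }
    count
}
```

Resolving the collected addresses to symbols is left for after the thread has been resumed, in line with steps 4 and 5.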
And now, a call for help
The reader who got through the banter so far is about to find out this entire article was but a barely disguised bait to get your help.
Here’s the catch: only Mac OS is (pretty much) done, and Servo now needs the help of one, or several, rustaceans with access to, and the motivation to do the work on, these platforms:
Thanks for reading, and happy contributions!
A discussion of the Background-hang-monitor in the light of the “non-event-loop” concurrency pattern: