Fixing Servo’s Event Loop

Gregory Terzian
Jun 28, 2024


Servo is a program, written in Rust, implementing the Web platform — forming the core of a potential user agent — through a host of specifications. The beating heart of an application running on the Web might well be the HTML event-loop, and in particular the window event-loop. This is a story about efforts, past and ongoing, to reduce the brokenness of Servo’s event-loop implementation.

Up to the task

An HTML event-loop consists essentially of three things: task queues, tasks, and their processing model. This processing itself consists, again essentially, of three steps: dequeue a task from a task queue, run the task’s steps, and finally, perform a microtask checkpoint (which runs all tasks enqueued on the microtask queue up to that point). For clarity: whenever a piece of Javascript runs on a web page, it happens from within a task, either at the second (run the steps of a task) or third stage (perform a microtask checkpoint) of this processing model.
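As a mental model, these three steps can be sketched as a small loop over boxed closures. This is a simplified sketch with made-up names, not Servo’s actual code:

```rust
use std::collections::VecDeque;

// Hypothetical sketch of the processing model; all names are made up
// for illustration and do not match Servo's internals.
type Task = Box<dyn FnOnce(&mut EventLoop)>;

#[derive(Default)]
struct EventLoop {
    task_queue: VecDeque<Task>,
    microtask_queue: VecDeque<Task>,
    log: Vec<&'static str>,
}

impl EventLoop {
    fn queue_task(&mut self, task: Task) {
        self.task_queue.push_back(task);
    }

    fn queue_microtask(&mut self, task: Task) {
        self.microtask_queue.push_back(task);
    }

    // One turn of the loop: dequeue a task, run its steps, then
    // perform a microtask checkpoint draining the microtask queue.
    fn run_one_task(&mut self) {
        if let Some(task) = self.task_queue.pop_front() {
            task(self);
        }
        while let Some(microtask) = self.microtask_queue.pop_front() {
            microtask(self);
        }
    }
}
```

Note that a task queued from within the running task’s steps only gets a chance to queue microtasks once it runs itself; microtasks queued by the current task, by contrast, all drain at the checkpoint ending this very turn.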

But “where do those tasks come from?” and “where are they going?” are probably questions on your mind right now. Tasks can be queued from within the steps of another task — this is for example how, and only how, tasks are enqueued on the microtask queue — and they can also be queued from elsewhere. This “elsewhere” is defined as steps running in-parallel to the event-loop, on a kind of simplified event-loop: parallel steps also run one set of steps at a time, and use queues to synchronize among themselves. Their main difference from the HTML event-loop is that they do not have access to the contents of a web page; instead, to influence a web page they must queue a task on the HTML event-loop.

An illustrative example is that of making a fetch call: when doing so, your JS code gets a promise, and the HTML event-loop launches steps in-parallel to fetch the resource. Those parallel steps will then queue a task on the HTML event-loop in order to resolve the promise, which will then queue a microtask to call into the handler of the promise.

Event-loops and painters

The purpose of the window event-loop is not only to run Javascript, it is also to synchronize its content with what is presented to the user. For this, a window event-loop comes with dedicated plumbing: in-parallel, whenever a rendering opportunity is noted, a somewhat-special task must be queued to update the rendering. This task consists of a long list of steps — for example, step 8 runs the resize steps, and step 14 runs the animation frame callbacks — culminating in the updating of the user interface.

What is important to note is that all these “update the rendering” steps are performed in a given order, and from within the same task. This means that, for example, if a resize event is fired as per the resize steps, and the attached event listener then does something to resolve a promise (which queues a microtask), then the handler of the promise will only be called after the entire “update the rendering” task has run (at the next microtask checkpoint), and not, for example, before any animation frame callback is called (in the same task at a later step).
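That ordering guarantee can be sketched as a toy function (a hypothetical illustration with made-up step names, not Servo code):

```rust
// Hypothetical sketch of the ordering guarantee: the steps of one
// "update the rendering" task run in order within that single task, and
// microtasks queued by an early step only run at the checkpoint after
// the whole task has finished.
fn update_the_rendering_task() -> Vec<&'static str> {
    let mut log = Vec::new();
    let mut microtasks: Vec<&'static str> = Vec::new();

    // Step 8: the resize steps fire a resize event; suppose the
    // attached listener resolves a promise, which queues a microtask.
    log.push("resize event listener");
    microtasks.push("promise handler");

    // Step 14: run the animation frame callbacks, still in the same task.
    log.push("animation frame callback");

    // Microtask checkpoint: only now does the promise handler run.
    log.extend(microtasks);
    log
}
```

Running the steps in separate tasks would instead interleave a microtask checkpoint between the resize steps and the animation frame callbacks, letting the promise handler jump the queue.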

This brings us to the broken aspect of Servo’s implementation: the different steps of the “update the rendering” task were, essentially — and here we are ignoring some details — run in separate tasks.

But before we dive into the confusing details of the broken implementation and of the still-ongoing fix, perhaps we should first clarify the problem further? So far we have used logic in the form of prose, introducing concepts such as event-loops and tasks, and have probably relied on assumptions about concurrency primitives. Would there be a way to, instead, rely on a more abstract and yet more precise form of logic, and have this logic checked by a computer? Yes: enter TLA+ and the TLC model checker.

Modelling the Processing Model

Let’s start with the simplest possible representation of our event-loop: a set of tasks, a set of steps, each paired with a counter to keep track of how often they have been run, and a single action that runs a task and a step.

https://gist.github.com/gterzian/da5ad2b0e969848cea62de05a06760d4
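The gist contains the full spec; its shape is roughly the following (a sketch with approximate names, using the `step_state` variable referenced later in this article):

```tla
---- MODULE EventLoop ----
EXTENDS Naturals

CONSTANTS Tasks, Steps

VARIABLES task_state, step_state

Init == /\ task_state = [t \in Tasks |-> 0]
        /\ step_state = [s \in Steps |-> 0]

\* Running a task increments that task's counter,
\* and the counter of a single step.
RunTask(t, s) == /\ task_state' = [task_state EXCEPT ![t] = @ + 1]
                 /\ step_state' = [step_state EXCEPT ![s] = @ + 1]

Next == \E t \in Tasks, s \in Steps : RunTask(t, s)

\* All step counters must remain in lockstep.
StepAtomicUpdateInvariant ==
    \A a, b \in Steps : step_state[a] = step_state[b]
====
```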

We have added an invariant named StepAtomicUpdateInvariant, which asserts that the counters for all steps always remain in lockstep. The idea is that these are the steps of our “update the rendering” task, which is a long list of steps that should run in the same task (increment in lockstep in our model).

The invariant is obviously not respected, because RunTask(t, s) runs a task and only a single step, making steps increment out of sync. Here we are modelling a broken event-loop, where the “update the rendering” steps are run as part of different tasks.

We can now model a fixed event-loop: one in which a dedicated “Update” task will run all steps, and all “Other” tasks will do nothing (we are not interested in their steps).

https://gist.github.com/gterzian/da5ad2b0e969848cea62de05a06760d4

We now have two actions: RunUpdateTask(t), which models our “update the rendering” task and runs the relevant steps, and RunOtherTask(t), which models any other task.

Our updated invariant now also asserts that the “Update” task counter increments in lockstep with the two steps. It is respected by this spec, since all step counters are incremented together with that of the “Update” task as part of a single action.
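The fixed spec, sketched in the same approximate notation as before, might look like this:

```tla
\* Sketch of the fixed spec (approximate names): only the "Update"
\* task runs steps, and it runs all of them in a single action.
RunUpdateTask(t) == /\ t = "Update"
                    /\ task_state' = [task_state EXCEPT ![t] = @ + 1]
                    /\ step_state' = [s \in Steps |-> step_state[s] + 1]

\* Any other task leaves the steps untouched.
RunOtherTask(t) == /\ t # "Update"
                   /\ task_state' = [task_state EXCEPT ![t] = @ + 1]
                   /\ UNCHANGED step_state

\* Steps stay in lockstep with each other and with the "Update" task.
StepAtomicUpdateInvariant ==
    \A a, b \in Steps :
        /\ step_state[a] = step_state[b]
        /\ step_state[a] = task_state["Update"]
```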

We can now ask ourselves: is our spec fine-grained enough to be realistic? Does our program run a task, and all its steps, as one big atomic chunk? The answer is probably no: while the event-loop does run one task at a time — and the state of the event-loop should only transition one task at a time — implementations can do all sorts of things as part of running this task, such as acquiring locks or sending on channels. Also, by modelling our task as a single atomic action, we cannot model the ordering between steps (the conjuncts of a TLA+ action have no concept of “ordering”).

So we can move to a final specification, which will model running a task as a series of steps. We call it FGEventLoop, as it is more fine-grained than EventLoop.

The difference with EventLoop is a running boolean per task, indicating whether it is running, and “running a task” now consisting of multiple actions: it starts with an action that sets the task’s running boolean to true (only if no other task is already running), and ends with another action setting this boolean back to false. In between, and only in the case of the “Update” task, several other actions can run, in our case one for each step.

This new setup means we also need to update our invariant, here renamed FGStepUpdateInvariant and consisting of three conjunctions:

  1. When the update task is not running, the steps and the update task must all have the same counter. This covers the requirement that steps must be updated from within, and only from within, the “Update” task.
  2. When the “B” step counter equals the “Update” task counter, the “A” and “B” step counters must be equal. This covers the relative ordering (here alphabetical) of the steps.
  3. The last one asserts that only one task is running at a time.
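In the same approximate notation, the three conjuncts might be written as:

```tla
\* Sketch of the fine-grained invariant (approximate names): "A" and
\* "B" are the two steps, running is the per-task boolean.
FGStepUpdateInvariant ==
    \* 1. Outside the "Update" task, everything is in lockstep.
    /\ ~running["Update"] =>
           \A s \in Steps : step_state[s] = task_state["Update"]
    \* 2. "B" only catches up with the task counter after "A" has.
    /\ step_state["B"] = task_state["Update"] =>
           step_state["A"] = step_state["B"]
    \* 3. Only one task runs at a time.
    /\ \A t1, t2 \in Tasks : running[t1] /\ running[t2] => t1 = t2
```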

Finally, we also check that our fine-grained spec implements our coarse-grained one, by way of a refinement mapping, which you can check by adding BarSpec not as an invariant, but as a property to check in the TLC model (the purpose of this article is in fact not to explain the event-loop, it is to show off the use of a refinement mapping).

It is interesting to note that both the first conjunct of FGStepUpdateInvariant and the first part of the refinement mapping illustrate that “running a task” acts as a critical section: while inside it, certain invariants, like step_state[a] = step_state[b], can be temporarily invalidated, but they must hold again by the time the task is not running anymore.

Observant readers may have noticed the absence of a microtask checkpoint in our model. It could indeed be interesting to add: for example, one could add multiple steps draining the microtask queue after all the steps of a task have finished running, and an invariant that the microtask queue must be empty when no task is running. We will leave this as an exercise to the reader, and instead, armed with our newfound clarity of mind regarding the properties of our event-loop, move on to the Rust implementation.

Finally, some code.

Servo’s implementation of the HTML event-loop is found within a module at components/script/script_thread.rs, and it consists of, well, a loop, running inside a thread. This loop selects on a set of channels, and then runs a kind of sub-loop handling one message at a time (with some additional stuff that can be seen as batching and prioritization).

Some of these messages, but not all, are called “tasks”, but that distinction is probably broken and needs some work. So for now we will assume the event-loop runs one task, or one message, at a time, and that messages and tasks are the same thing.
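In outline, the script-thread has the following shape (a simplified sketch over a single channel with made-up names; the real code selects over several channels and adds batching and prioritization):

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical message type; the real script-thread has several.
enum Msg {
    Task(&'static str),
    Exit,
}

// A loop inside a thread, handling one message (task) at a time.
// Returns the handled task names, for illustration.
fn script_thread(port: Receiver<Msg>) -> Vec<&'static str> {
    let mut handled = Vec::new();
    loop {
        match port.recv() {
            Ok(Msg::Task(name)) => handled.push(name), // run the task's steps
            Ok(Msg::Exit) | Err(_) => break,           // shutdown, or all senders gone
        }
    }
    handled
}

// Spawn the loop in its own thread, returning a sender to queue tasks on it.
fn spawn_script_thread() -> (Sender<Msg>, std::thread::JoinHandle<Vec<&'static str>>) {
    let (chan, port) = channel();
    (chan, std::thread::spawn(move || script_thread(port)))
}
```

The channel doubles as the task queue here: steps running in-parallel influence the event-loop only by sending a message on it.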

As an example of our previous, broken, implementation, let’s look at how animation frame callbacks were handled:

https://github.com/servo/servo/blob/168d43f24a18184afadc6aa8646537289c85860a/components/script/script_thread.rs#L2047
https://github.com/servo/servo/blob/168d43f24a18184afadc6aa8646537289c85860a/components/script/script_thread.rs#L2998

As you can see, when the script-thread would handle a TickAllAnimations message containing an AnimationTickType::REQUEST_ANIMATION_FRAME, the animation frame callbacks would be run (for that particular document). This was wrong, because those should run not as a single task, but rather as step 14 of an “update the rendering” task. The fix was actually more complicated, but necessary:

https://github.com/servo/servo/blob/1d66ea2b2795cb7afcac787be1014f28dc7ad029/components/script/script_thread.rs#L1869

Here we do two things:

  1. “Note a rendering opportunity”, and
  2. “Note a pending animation tick”.

The second part essentially stores the contents of the message for later processing as part of an “update the rendering” task, which is queued (if one hasn’t been queued yet) by the call to rendering_opportunity. The code inside that task then looks like the below.

https://github.com/servo/servo/blob/1d66ea2b2795cb7afcac787be1014f28dc7ad029/components/script/script_thread.rs#L1707

As you can see, the initial fix still left more TODOs than actual code, which is why we need your help (the resize observer steps have since been implemented).

An example of a TODO is moving the full-screen steps to the “update the rendering” task. The trick to do this is similar to that for the animation frame callbacks: whenever the need to run those steps is noted, instead of queuing a dedicated task, store the required state somewhere it can be found later, note a rendering opportunity, and then move the actual steps to inside the “update the rendering” task (where the state stored above is retrieved). The issue provides further details, and you are also welcome to attend one of our triage meetings to discuss it further.
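The pattern can be sketched as follows (a hypothetical illustration; every name here is made up and none of this is Servo’s actual code):

```rust
// Hypothetical sketch of the pattern: instead of queuing a dedicated
// task, store pending state, note a rendering opportunity, and drain
// the pending state from inside the "update the rendering" task.
#[derive(Default)]
struct ScriptThread {
    // Pending state noted for later, e.g. elements entering full-screen.
    pending_fullscreen: Vec<&'static str>,
    // Whether an "update the rendering" task has been queued already.
    update_queued: bool,
}

impl ScriptThread {
    // Called when the need to run the full-screen steps is noted.
    fn note_fullscreen(&mut self, element: &'static str) {
        self.pending_fullscreen.push(element);
        self.note_rendering_opportunity();
    }

    // Queue an "update the rendering" task, if one hasn't been queued yet.
    fn note_rendering_opportunity(&mut self) {
        self.update_queued = true;
    }

    // The "update the rendering" task: retrieve the stored state and
    // run the corresponding steps, here returned for illustration.
    fn update_the_rendering(&mut self) -> Vec<&'static str> {
        self.update_queued = false;
        std::mem::take(&mut self.pending_fullscreen)
    }
}
```

The point of the shape is that however many times state is noted in between, all of it is processed in order, at the right step, within a single “update the rendering” task.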

At the end of the article are other relevant issues for your consideration.

Besides these TODOs, there is also an opportunity to further simplify the interaction between script and the rest of the pipeline: besides running the “update the rendering” task, script uses layout to compute the data structure used to update what is presented to the user, and sends it to the compositor and WebRender (with the compositor probably being the main source of rendering opportunities noted in parallel). There is currently a lot of back and forth, and “noting of rendering opportunities” in many places, which seems like a good candidate for consolidation. This is still at the discussion stage.

And that’s it, thank you for reading.

