Using our tools to fix our own bugs
How Undo uses time travel debugging as part of the CI (Continuous Integration) test and development loop.
This is the story of how we quickly found and investigated a flaky failure in our CI system using the power of time travel debugging. It’s based on what we actually did when the failure was identified and captured earlier this week (with a few minor simplifications).
This post includes screenshots of our work-in-progress LiveRecorder Observatory integration. We’re already using this to solve problems but a few unresolved text-formatting quirks are visible in this post.
Flaky failures strike (again)
Recently, we made some changes that affected our detach code (which is used to remove our instrumentation from a recorded process). The code behaved correctly when we tested and merged those changes… Job done, right?
Unfortunately, software development is rarely so straightforward. Shortly afterwards, we saw flaky test failures in CI tests of our detach code.
Our first line of defence against such issues is having a team who are alert and ready to flag these failures when they occur:
In other words, despite testing thoroughly before the merge, a flaky issue slipped through the net and was now disrupting our CI. This kind of failure is a drag on development and makes it harder to manage the quality of the product.
Investigating the logs
As the exchange above shows, Bert had already kicked off a soak test, repeatedly running the flaky test within CI until it failed again. He used our test system’s “self-recording” mode to capture the failure as it happened.
In other words, we are using LiveRecorder to record itself.
The link supplied above takes us to the failure output in Jenkins:
Various diagnostics are captured below, including a backtrace generated by libunwind. But we can do better than that. At the bottom of the failure report, we have a link to our internal deployment of LiveRecorder Observatory:
LiveRecorder Observatory
LiveRecorder gives us the power of time travel debugging in the web browser — our new integration with VS Code in the browser (built on top of OpenVSCode Server) is called LiveRecorder Observatory.
By clicking the link in our CI system we can jump right into Observatory, giving us a full-featured debug session with a deterministic recording of the failure, plus source code and symbols.
This link takes us to a list of artifacts captured during the test failure. Notice that the two bottom entries feature a “Launch UDB” button (UDB is the replay and debug component of LiveRecorder). In this instance, we want to launch the nanny-4140.undo file for debugging — the Nanny module contains the code that crashed.
Once the recording is loaded, we have a VS Code debug environment in our browser, positioned at the end of recorded history (note the call stack in the bottom left — the recorded process is deep inside our panic-handler, as we’d expect given the failure above):
The first thing we’ll do is Reverse Step Out until we reach the source of the failure.
This operation actually un-calls functions by traveling back in time to their call sites, unwinding changes to global state that have happened since the original error.
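UDB is built on GDB, so for readers following along on the command line the closest standard equivalent of Reverse Step Out is GDB’s reverse-execution command for returning to a caller. A hypothetical session (the function name is illustrative) might look like:

```gdb
(gdb) reverse-finish      # "un-call" the current function, returning to its call site
Run back to call of #0  s_wait_termination (...)
```

Each reverse-finish takes us one frame up the stack and one step back in time, with all memory and registers restored to their values at the call site.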
We originally saw a panic during s_wait_termination() but, for this write-up, we can safely skip the internals of this function. We found our first real clue to the bug at the function’s call site:
Let’s inspect some variables at this point to understand what’s going on.
As an Emacs user, I’m a fan of hybrid interfaces that give me access to a command line. I’m going to print the values I’m interested in using a debugger terminal we’ve added to the VS Code interface:
Here we can see the TID of the rchild (short for “record child”) thread we’ve just detached from. Let’s double-check what the PID of the recorded child actually was:
We were attempting to detach the main thread (also known as the thread group leader). In our system, this must only be done after all the other threads in the process have been detached.
The rtid_states_count() function queries how many threads are known to our engine. We can call that function from the debugger to evaluate it at the current point in history, telling us how many threads were known at this time:
Uh-oh — this is a bug!
We’ve tried to detach the main thread too early, while other threads were still attached. Now we just need to know why that’s happened.
At this point, we can use Reverse Step Over to walk back in history, line-by-line:
This takes us to the code that was meant to skip over the main thread until it was the only one left. The exact details of the surrounding code aren’t important, but the intention is that the if statement’s body is only executed if we’re looking at the main thread and no other threads are running.
The problem, as we’ll see, is with the equality test to the left of the &&. Here we’re identifying the main thread by comparing its TID to the PID of the process we recorded — which sounds pretty reasonable. Let’s see what the values are here:
OK, this is as we expected. How about the value of record_child()->pid.raw?
We can’t call record_child() directly as it’s a macro in this build:
#define record_child() \
record_child_fn(__FILE__, __LINE__, __FUNCTION__)
We can still call the underlying function; we just need to give dummy values for the arguments that would usually be expanded from __FILE__, __LINE__ and __FUNCTION__. Let’s do that:
Here’s our problem. Some part of the code, presumably related to the detaching process, has set this PID value to zero. We can find out how that happened using the last command:
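In plain GDB with reverse execution, the equivalent move is to set a watchpoint on the value and run backwards to the write that changed it; UDB’s last command packages this up into one step. A hypothetical session (the expression is illustrative):

```gdb
(gdb) watch -l record_child_state.pid.raw   # location watchpoint on the value
(gdb) reverse-continue                      # run back to the write that zeroed it
```

Execution stops at the exact instruction, earlier in history, that last modified the value.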
This line — an earlier part of our detach routine — had actually zeroed a value that we’d used later to identify the main thread.
Ultimately, this led to us detaching the threads in the wrong order. With the cause identified, the fix was a simple one-liner. While Greg applied the fix, I wrote up our findings on GitHub.
I added links directly into the recording’s history within LiveRecorder Observatory for future reference. Clicking on these jumps straight to the point in history that I’m describing:
A happy ending
Within a few hours of detecting a flaky failure (and with only minutes of manual debugging / coding expended) we’d merged a fix to the problem:
And the code base is safe once more … until the next installment …