Using our tools to fix our own bugs
How Undo uses time travel debugging as part of the CI (Continuous Integration) test and development loop.
This is the story of how we quickly found and investigated a flaky failure in our CI system using the power of time travel debugging. It’s based on what we actually did when the failure was identified and captured earlier this week (with a few minor simplifications).
This post includes screenshots of our work-in-progress LiveRecorder Observatory integration. We’re already using this to solve problems but a few unresolved text-formatting quirks are visible in this post.
Flaky failures strike (again)
Recently, we made some changes that affected our detach code (which is used to remove our instrumentation from a recorded process). The code behaved correctly when we tested and merged those changes… Job done, right?
Unfortunately, software development is rarely so straightforward. Shortly afterwards, we saw flaky test failures in CI tests of our detach code.
Our first line of defence against such issues is having a team who are alert and ready to flag these failures when they occur:
In other words, despite testing thoroughly before the merge, a flaky issue slipped through the net and was now disrupting our CI. This kind of failure is a drag on development and makes it harder to manage the quality of the product.
Investigating the logs
As the exchange above shows, Bert had already kicked off a soak test, repeatedly running the flaky test within CI until it failed again. He used our test system’s “self-recording” mode to capture the failure as it happened.
In other words, we are using LiveRecorder to record itself.
The link supplied above takes us to the failure output in Jenkins:
Various diagnostics are captured below, including a backtrace generated by libunwind. But we can do better than that. At the bottom of the failure report, we have a link to our internal deployment of LiveRecorder Observatory:
LiveRecorder Observatory
LiveRecorder gives us the power of time travel debugging in the web browser — our new integration with VS Code in the browser (built on top of OpenVSCode Server) is called LiveRecorder Observatory.
By clicking the link in our CI system we can jump right into Observatory, giving us a full-featured debug session with a deterministic recording of the failure, plus source code and symbols.
This link takes us to a list of artifacts captured during the test failure. Notice that the two bottom entries feature a “Launch UDB” button (UDB is the replay and debug component of LiveRecorder). In this instance, we want to launch the nanny-4140.undo file for debugging — the Nanny module contains the code that crashed.
Once the recording is loaded, we have a VS Code debug environment in our browser, positioned at the end of recorded history (note the call stack in the bottom left — the recorded process is deep inside our panic-handler, as we’d expect given the failure above):
The first thing we’ll do is Reverse Step Out until we reach the source of the failure.
This operation actually un-calls functions by traveling back in time to their call sites, unwinding changes to global state that have happened since the original error.
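UDB is built on GDB, so for readers following along on the command line the closest standard equivalent of Reverse Step Out is GDB’s reverse-execution command for returning to a caller. A hypothetical session (the function name is illustrative) might look like:

```gdb
(gdb) reverse-finish      # "un-call" the current function, returning to its call site
Run back to call of #0  s_wait_termination (...)
```

Each reverse-finish takes us one frame up the stack and one step back in time, with all memory and registers restored to their values at the call site.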
We originally saw a panic during s_wait_termination() but, for this write-up, we can safely skip the internals of this function. We found our first real clue to the bug at the function’s call site:
Let’s inspect some variables at this point to understand what’s going on.
As an Emacs user, I’m a fan of hybrid interfaces that give me access to a command line. I’m going to print the values I’m interested in using a debugger terminal we’ve added to the VS Code interface:
Here we can see the TID of the rchild (short for “record child”) thread we’ve just detached from. Let’s double-check what the PID of the recorded child actually was:
We were attempting to detach the main thread (also known as the thread group leader). In our system, this must only be done after all the other threads in the process have been detached.
The rtid_states_count() function queries how many threads are known to our engine. We can call that function from the debugger to evaluate it at the current point in history, telling us how many threads were known at this time:
Uh-oh — this is a bug!
We’ve tried to detach the main thread too early, while other threads were still attached. Now we just need to know why that’s happened.
At this point, we can use Reverse Step Over to walk back in history, line-by-line:
This takes us to the code that was meant to skip over the main thread until it was the only one left. The exact details of the surrounding code aren’t important, but the intention is that the if statement’s body is only executed if we’re looking at the main thread and no other threads are running.
The problem, as we’ll see, is with the equality test to the left of the &&. Here we’re identifying the main thread by comparing its TID to the PID of the process we recorded — which sounds pretty reasonable. Let’s see what the values are here:
OK, this is as we expected. How about the value of record_child()->pid.raw?
We can’t call record_child() directly as it’s a macro in this build:
#define record_child() \
record_child_fn(__FILE__, __LINE__, __FUNCTION__)
We can still call the underlying function; we just need to give dummy values for the arguments that would usually be expanded from __FILE__, __LINE__ and __FUNCTION__. Let’s do that:
Here’s our problem. Some part of the code, presumably related to the detaching process, has set this PID value to zero. We can find out how that happened using the last command:
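In plain GDB with reverse execution, the equivalent move is to set a watchpoint on the value and run backwards to the write that changed it; UDB’s last command packages this up into one step. A hypothetical session (the expression is illustrative):

```gdb
(gdb) watch -l record_child_state.pid.raw   # location watchpoint on the value
(gdb) reverse-continue                      # run back to the write that zeroed it
```

Execution stops at the exact instruction, earlier in history, that last modified the value.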
This line — an earlier part of our detach routine — had actually zeroed a value that we’d used later to identify the main thread.
Ultimately, this led to us detaching the threads in the wrong order. With the cause identified, the fix was a simple one-liner. While Greg applied the fix, I wrote up our findings on GitHub.
I added links directly into the recording’s history within LiveRecorder Observatory for future reference. Clicking on these jumps straight to the point in history that I’m describing:
A happy ending
Within a few hours of detecting a flaky failure (and with only minutes of manual debugging / coding expended) we’d merged a fix to the problem:
And the code base is safe once more … until the next installment …