Scientific Debugging, Part 2: Difficult Observations

By Talin, published in Machine Words, Jan 11, 2019

This is part two of a three-part series on scientific debugging: using the scientific method to debug software problems. Part one is here; part three is here.

In the first part we covered some fairly simple examples of how to locate problems through the process of hypothesis and experiment. Now we’ll look at some more complex examples, where the experimental results are not so easy to observe.

Intermittent Errors

What if an error only happens some of the time? What if the occurrences are both infrequent and random? In this case, we can’t depend on the output of a single experiment, because our observation might happen just at a time when the bug decides not to manifest. Any logical conclusions we drew from a single data point would be invalid, or at least highly questionable.

Instead we need to do what any good scientist does when confronted by noisy data: we need to convert an uncertain probability into a statistical certainty. This is accomplished by sampling: running the experiment many times and looking at the overall result.

In our previous scenario, we envisioned a bug in a web service where names got displayed erroneously in uppercase. Now, let’s imagine that the names appear in uppercase only very rarely: less than one time in a thousand on average. How do we track this down?

You are going to have to check thousands of output values, but it’s not practical to observe them manually; even if you had the patience, eventually your eyes are going to start to glaze over and you will become inattentive. You’ll likely miss the error when it finally shows up.

Instead, what you can do is write special code to detect the error when it happens, and to signal you with an alert. You can then run the experiment many times, and count the number of alerts that have been produced. Even if the bug is low probability, if you run the suspect code enough times then the chance that the bug will appear approaches certainty (assuming that it’s truly random and not merely dependent on input data or some other hidden condition).
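How many runs are "enough"? As a rough sketch, assuming each run is independent and the bug fires with a fixed probability p per run (an idealization; real bugs often depend on hidden state), the chance of seeing it at least once in n runs is 1 - (1 - p)^n. The function names below are illustrative, not part of any library.

```python
import math

def chance_of_seeing_bug(p: float, runs: int) -> float:
    """Probability that a bug with per-run probability p appears at least once."""
    return 1.0 - (1.0 - p) ** runs

def runs_needed(p: float, confidence: float) -> int:
    """Smallest number of independent runs that reaches the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

if __name__ == "__main__":
    p = 1 / 1000  # the one-in-a-thousand bug from the example above
    for runs in (1_000, 5_000, 10_000):
        print(f"{runs:>6} runs: {chance_of_seeing_bug(p, runs):.3f}")
    print("runs for 99% confidence:", runs_needed(p, 0.99))
```

Under these assumptions, a thousand runs give only about a 63% chance of catching a one-in-a-thousand bug even once; it takes roughly 4,600 runs to reach 99%. The practical consequence is that a clean result from a few hundred runs tells you very little.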

For example, you could print out on the browser debug console whenever the JavaScript code sees a string that is all uppercase; you could then check those strings against the known values in the database to see if they are the same. (You have to be careful to avoid false positives, since it’s possible that there may be some names in the database that are intentionally uppercase.)

You can do this same experiment on the server side as well, by adding appropriate debugging code: a conditional test and print statement. Run the program enough times and you’ll have a good idea as to whether the uppercase conversion is happening before, or after, the point where you are sampling the value. By doing this repeatedly in different parts of the code, eventually you can narrow down exactly where (and when) the data is being modified.
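Here is a minimal sketch of that kind of probe, written in Python. Everything in it is hypothetical: the two pipeline stages, the simulated one-in-a-thousand bug, and the `KNOWN_NAMES` list standing in for the database. The point is only to show the shape of the technique: a conditional test and a print at each sampling point, plus a counter so you can compare checkpoints after many runs.

```python
import random
from collections import Counter

# Hypothetical stand-in for the names stored in the database.
# "IBM" is intentionally uppercase and must not raise an alert.
KNOWN_NAMES = ["Ada Lovelace", "Grace Hopper", "IBM"]

def probe(alerts: Counter, checkpoint: str, value: str) -> str:
    """Conditional test plus print: flag all-uppercase values not in the database."""
    if value.isupper() and value not in KNOWN_NAMES:
        alerts[checkpoint] += 1
        print(f"ALERT at {checkpoint}: unexpected uppercase {value!r}")
    return value

def load_name(rng: random.Random) -> str:
    """Hypothetical first stage: fetch a name (here, picked at random)."""
    return rng.choice(KNOWN_NAMES)

def format_name(name: str, rng: random.Random) -> str:
    """Hypothetical second stage containing a simulated one-in-a-thousand bug."""
    return name.upper() if rng.random() < 0.001 else name

def run_experiment(runs: int, seed: int = 0) -> Counter:
    """Run the two-stage pipeline many times and count alerts per checkpoint."""
    rng = random.Random(seed)
    alerts = Counter()
    for _ in range(runs):
        name = probe(alerts, "after_load", load_name(rng))
        probe(alerts, "after_format", format_name(name, rng))
    return alerts

if __name__ == "__main__":
    print(dict(run_experiment(20_000)))
```

If the "after_load" checkpoint never fires but "after_format" does, the corruption is happening between the two probes. Moving the probes closer together and repeating the run narrows the range further.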

Inscrutable Data

What if the data is of a form that is not easy to inspect? What if you can’t tell whether the data is correct even on close inspection?

Let’s say you are writing a computer game, and you are in the process of coding a cool particle system effect that generates millions of animated particles on the screen. And let’s imagine you notice that a small number of particles are the wrong color; perhaps they are too bright or are insufficiently saturated.

Unfortunately, the RGB values for your particles are represented as floating-point numbers, and it’s difficult to eyeball a floating-point RGB triplet and intuitively tell if the color saturation is high enough, especially when looking through thousands of such triplets. There’s no way you can simply inspect the data and tell if it’s right or wrong; it’s too complex.

This calls for another kind of experiment, where you perturb the data in the system and see what happens. In the case of an animated particle system, there are typically multiple processing stages: each particle goes through a generate, animate, and render stage. What we can do is add code to one of those stages (say, the render stage) that sets the color to a known value, such as neutral gray or bright red. Then run the program and see what the result looks like. If all of the particles are now a uniform color, then chances are that the error occurred in a processing step prior to your modification; if you still see small numbers of particles that vary in color or saturation, then there’s a strong argument that the bug is in a later processing stage.
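To make the experiment concrete, here is a toy version in Python. It is a sketch under stated assumptions, not a real particle engine: the three stage functions, the `force_gray_before` switch, and the simulated bug in `animate` are all invented for illustration. The override is applied at the entry to a chosen stage, and the number of distinct colors in the output stands in for "what you see on screen".

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Color = Tuple[float, float, float]
ORANGE: Color = (1.0, 0.5, 0.0)  # intended particle color (assumed)
GRAY: Color = (0.5, 0.5, 0.5)    # known value injected by the experiment

@dataclass
class Particle:
    color: Color

def generate(count: int) -> List[Particle]:
    """Create particles, all with the intended color."""
    return [Particle(ORANGE) for _ in range(count)]

def animate(particles: List[Particle], rng: random.Random) -> None:
    """Simulated bug: about 1% of particles get pushed toward white."""
    for p in particles:
        if rng.random() < 0.01:
            p.color = tuple(0.5 * c + 0.5 for c in p.color)

def render(particles: List[Particle]) -> List[Tuple[int, ...]]:
    """Convert float colors to 8-bit output values."""
    return [tuple(round(c * 255) for c in p.color) for p in particles]

def force_gray(particles: List[Particle]) -> None:
    """The perturbation: overwrite every color with a known value."""
    for p in particles:
        p.color = GRAY

def run_pipeline(count: int, seed: int,
                 force_gray_before: Optional[str] = None) -> List[Tuple[int, ...]]:
    """Run all stages, optionally overriding colors just before one stage."""
    rng = random.Random(seed)
    particles = generate(count)
    if force_gray_before == "animate":
        force_gray(particles)
    animate(particles, rng)
    if force_gray_before == "render":
        force_gray(particles)
    return render(particles)

if __name__ == "__main__":
    for stage in (None, "render", "animate"):
        distinct = len(set(run_pipeline(10_000, seed=0, force_gray_before=stage)))
        print(f"override before {stage}: {distinct} distinct output colors")
```

Overriding just before `render` yields a single uniform color, so the odd particles must come from an earlier stage. Overriding before `animate` still leaves a few off-color particles, which points at `animate` or something after it. Together the two runs bracket the bug.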

In many software systems, there are ways that you can subtly tweak or perturb the data being processed in a way that doesn’t break the overall operation of the system, but does have a noticeable effect on the output. If you can come up with a data modification that interacts with the bug that you are trying to locate, you can use this to devise experiments that will help characterize and narrow the possible scope of the bug.

Experimenting with Version Control: Bisection

If a bug was introduced relatively recently, there is another kind of experiment that may be possible: revert the code back to a previous state and see if the bug still happens.

However, if your project is large and you have many contributors, you might have to check dozens, if not hundreds, of committed versions. You can speed up this process by doing a binary search over the version history, a technique commonly called bisection.

For example, you could start by going back 100 versions in the commit history. If the bug is still present, jump back further. If it is gone, the bug was introduced somewhere in those 100 versions, so test the version 50 back; depending on the result, the next test falls at 25 or 75 back, and so on. Each test halves the range of versions in which the bug could have been introduced, until finally you know the exact set of changes that must contain it. Your job is not done at that point, since you still have to pinpoint which of those changed files and source lines contains the error, but the search should be far easier now that the problem is drastically smaller.
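The search itself is an ordinary binary search. The sketch below assumes a list of version identifiers ordered oldest to newest, a known-good first entry, a known-bad last entry, and a bug that stays present once introduced. The `has_bug` callback is a placeholder for whatever check you use (running a test, or building and inspecting by hand).

```python
from typing import Callable, Sequence

def first_bad_version(versions: Sequence[str],
                      has_bug: Callable[[str], bool]) -> str:
    """Return the earliest version that shows the bug.

    Assumes versions[0] is good, versions[-1] is bad, and the bug
    persists in every version after the one that introduced it.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if has_bug(versions[mid]):
            bad = mid    # bug already present: look earlier
        else:
            good = mid   # bug not yet present: look later
    return versions[bad]

if __name__ == "__main__":
    # Hypothetical history of 200 versions where v137 introduced the bug.
    history = [f"v{i}" for i in range(200)]
    tested = []

    def has_bug(version: str) -> bool:
        tested.append(version)
        return int(version[1:]) >= 137

    print(first_bad_version(history, has_bug), "found after", len(tested), "tests")
```

With 200 versions this needs at most eight checks rather than up to 198. If your project uses Git, the `git bisect` command automates the same procedure against the real commit history.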

Unfortunately, it may not always be possible to revert the code very far back in history, because the rest of the world has moved on; the old code may no longer be runnable in the current environment.

This concludes part two. In part three of this series, I’ll talk about cases where simple experiments won’t work, and you’ll need to devise a more complex strategy.
