Debugging: From Panic to Logic

David Galley
GAMMA — Part of BCG X
12 min read · Aug 1, 2019


by David Galley and Constance Deperrois

Debugging is an unpleasant word, but it’s a normal, unavoidable part of coding. Perhaps you’ve felt a bit helpless when facing bugs, even though you followed all the best coding practices (e.g. starting with those listed in Appendix I). But, hey, it happens — even to the best of us.

In this post, we propose a simple framework to structure your strategy for fixing persistent bugs. It starts with some fundamental questions to ask yourself, then follows with a five-step approach to identify the sources of bugs and their underlying root causes, and finally, offers suggestions to resolve the issues for good. This framework has been inspired by years of experience fixing machines on a glass factory shop floor. Despite being developed in a completely different setting, it has proven to be a surprisingly effective way to deal with the panic associated with last-minute project bugs.

Our inspiration: Field-proven expertise

When I [David Galley] worked on the shop floor as a young engineer, I was impressed by the wisdom of the plant workers. To deal with machine breakdowns, they had developed an incredibly simple, yet effective approach based on common sense and experience.

Picture this scenario: The employees are working on the very hot shop floor of a glass manufacturing plant when one of the high-speed glass-forming machines breaks down.

Watch a video of a glass bottle-making machine in action

The first problem they face is that they cannot stop the flow of molten glass. They can redirect the flow to avoid clogging the machine or harming other workers, but they’ll lose a lot of good glass during the time spent fixing the machine.

Since the machine was working perfectly well before, the glass workers first asked themselves what happened or, more specifically, what changed? The root cause of that change is exactly what they need to identify to find the source of the breakdown — and fix it for good. Here are the process steps they always followed to get the machine working again:

  • Step 1: If the command or movement was not triggered correctly, it was an electrical issue.
  • Step 2: If there was no energy to sustain the requested command or movement, it was a pneumatic issue (the machine uses air pressure to enable cyclical movements).
  • Step 3: If the actual movement the machine was making was not within its parameters, or if it had changed for any reason, it was a mechanical issue (for example, a part had just broken and needed to be replaced).

By systematically guiding the search where it is relevant, this simple framework (electrical/pneumatic/mechanical) enabled the workers to respond calmly and effectively during an otherwise stressful situation. We believe we can apply a similar approach to debugging.

Our approach: Bug search is about being systematic and structured

You may think debugging is mainly a matter of intuition, experience, and knowledge of the code base. While that may be true for simple bugs, it can be difficult to tackle the most complex ones this way.

Typically, a data scientist or engineer begins the process by digging into the code. After a grueling couple of hours, a lot of coffee, and some swearing, he or she manages to fix the issue. It may work, but it can lead to messy patching and high technical debt (hindering further development).

A more systematic approach, one we call the “scientific method of debugging,” has helped our projects in many ways, because it:

  • Enables us to more efficiently divide and conquer
  • Helps us work better as a team because we now share the same debugging mindset
  • Makes it easier for us to identify weak spots in our model and find ways to make our debugging more robust
  • Helps us develop a thorough quality assessment process for future developments

To begin our work, we ask ourselves three fundamental questions:

What has changed since the last known working state? We propose an approach inspired by the glass workers, except that where they tracked the process steps to allow for proper movement, we will track the process steps of our overall code base — from ingestion to outputs.

How did the identified changes lead to a bug? Once we’ve decided where to look first, we dive deep into the area of interest. We’ll get our hands dirty — but do it while being systematic and structured. Several methods are available in the literature (see Appendix II). All have in common what we call a “reductionist approach”:

  1. Reproduce the bug in a controlled environment
  2. Formulate hypotheses on the potential root causes
  3. Change one thing at a time to prove or discard hypotheses
  4. Repeat the steps until the scope of potential root causes is narrowed down to one

How can we keep this issue from happening again? Now that we have identified the offending root cause, it’s time to eradicate it. Our goal is not just to correct the bug, but to prevent it, or a similar one, from popping up again.

5 common sources of error that lead to a buggy model

The following are some of the sources of error we’ve experienced first-hand, structured along the process steps of our code base:

1. Data: Is this a data problem? How has the data changed since the last working state?

2. Context: Is this a context problem? Did we use equivalent or identical configuration versus the last working state?

3. Algorithm: Is it an algorithm problem? How has the algorithm and model/data transformation logic changed since the last working state?

4. Consumption: Is it a consumption problem? Has the way that the model outputs are processed into consumption products changed?

5. Infrastructure: Or is it an infrastructure problem? Has the infrastructure significantly changed or reached its limits?

Source 1: Data

Is the raw data the same? Typically, this covers database updates, with new data added and old data removed or updated. Although changes may go unnoticed at the aggregated level, the underlying granular inputs may vary. Here you may also run into unexpected edge cases or data-quality issues. For example, when reviewing data on stock in a factory, you’d implicitly expect the stock data to always be positive or null. A stock of bolts with a value of -10 can make the model crash.

  • Underlying root cause: This is a data-quality issue.
  • Potential solution: To avoid this type of bug coming back to haunt our projects, we implement a daily data-quality monitoring script that scans the refreshed databases for suspicious items (sketched below).
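
As an illustration, here is a minimal sketch of such a monitoring check, assuming a pandas DataFrame of stock data with a `stock_quantity` column (the column name and rules are illustrative, not a prescription):

```python
import pandas as pd

def scan_stock_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that violate basic data-quality expectations.

    Assumes a 'stock_quantity' column; adapt the rules to your own schema.
    """
    return df[
        df["stock_quantity"].isna()        # missing values
        | (df["stock_quantity"] < 0)       # stock should never be negative
    ]

# Run after each database refresh and alert on any violation.
stock = pd.DataFrame({"item": ["bolt", "nut"], "stock_quantity": [-10, 42]})
suspicious = scan_stock_table(stock)
if not suspicious.empty:
    print(f"{len(suspicious)} suspicious row(s) found:\n{suspicious}")
```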

Did something change with our fetch methods? Typically, this covers queries to databases and ETLs. This is about code, and versioning tools are the best way to identify potential culprits. Perhaps a refactoring was done without specific edge cases in mind.

  • Underlying root cause: Acceptance tests were not run as part of the code review, so the code change was merged without being fully tested on this specific configuration.
  • Potential solution: For our projects, we’ve implemented a large set of end-to-end integration tests to cover multiple configurations (see the sketch below). Each code change has to run on all of these tests and provide acceptable results.
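
A minimal pytest sketch of that idea, assuming a `run_pipeline` entry point and a handful of configuration files (all names here are illustrative, not our actual code):

```python
import pytest

from my_project.pipeline import run_pipeline  # hypothetical entry point

# Illustrative configurations covering the setups that must keep working.
CONFIGS = ["config_site_a.yaml", "config_site_b.yaml", "config_edge_cases.yaml"]

@pytest.mark.parametrize("config_path", CONFIGS)
def test_pipeline_gives_acceptable_results(config_path):
    """Every code change must run against every known configuration."""
    results = run_pipeline(config_path)
    assert results is not None
    assert results["error_rate"] < 0.05  # illustrative acceptance threshold
```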

Is the data loaded by the model the same? This is slightly different from the first data issue. If the raw data is identical (or identically acceptable given your model’s specification) and if the fetch methods are the same, the problem may be that your model is not using the correct data inputs. Typically, there could be errors with I/O methods or cache issues, in which old cached data is silently overriding new queries.

  • Underlying root cause: I/O reliability and transparency issue.
  • Potential solution: Implement an explicit switch for cache functionality (sketched below).
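
For example, a minimal sketch of such a switch, assuming the inputs are fetched with a hypothetical `fetch_from_database` function and cached as Parquet files:

```python
import hashlib
from pathlib import Path

import pandas as pd

CACHE_DIR = Path(".cache")

def load_inputs(query: str, use_cache: bool = True) -> pd.DataFrame:
    """Load model inputs, with an explicit switch to bypass a stale cache."""
    key = hashlib.md5(query.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.parquet"
    if use_cache and cache_file.exists():
        return pd.read_parquet(cache_file)   # may silently return stale data
    df = fetch_from_database(query)          # hypothetical fetch function
    CACHE_DIR.mkdir(exist_ok=True)
    df.to_parquet(cache_file)
    return df

# When debugging, call load_inputs(query, use_cache=False) to rule out cache issues.
```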

Is it a new scope? Perhaps you’re now running the model on a new geography, site, or setup, or with new data. Perhaps the model fails because you’ve run into an edge case — something that wasn’t properly planned for. Perhaps that new scope was meant to be very similar to the working one, when in reality the signal in the data is different or the feature engineering needs adjustments.

  • Underlying root cause: The problem may be the underlying assumptions of the model.
  • Potential solution: Clarify the requirements for the current toolchain to work: What do the requirements implicitly assume about the data? What is required to extend this? One option is to turn those assumptions into explicit checks, as sketched below.
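
A minimal sketch of making such assumptions explicit and failing fast, with illustrative column names:

```python
import pandas as pd

def validate_scope_assumptions(df: pd.DataFrame) -> None:
    """Make the model's implicit data assumptions explicit before running it."""
    required = {"site_id", "date", "stock_quantity"}          # illustrative columns
    missing = required - set(df.columns)
    assert not missing, f"Missing columns: {sorted(missing)}"
    assert (df["stock_quantity"] >= 0).all(), "Stock must be non-negative"
    assert df["date"].notna().all(), "Dates must not be missing"
```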

Source 2: Context and Configuration

Are the model parameters the same? A “context and configuration” issue refers to all the key parameters the model needs in order to run. Typically, these can be hyperparameters for a machine-learning model, or the index for a particular country. Often found in the form of a YAML file, context and configuration parameters can be hard to manage and track. These files are prone to error because they rely on manual user input.

  • Underlying root cause: The problem may come from lack of configuration traceability and model user discipline.
  • Potential solution: On our projects, we tackled this by setting up specific templates for users, limiting the risk of errors, and including as much as we could in our versioning tool (see the sketch below).
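
As an illustration, a minimal sketch of loading a user-edited YAML configuration and failing fast on missing keys (the key names are made up; it assumes PyYAML is installed):

```python
import yaml

REQUIRED_KEYS = {"country", "horizon_days", "learning_rate"}  # illustrative keys

def load_config(path: str) -> dict:
    """Load a user-edited configuration and fail fast with a clear message."""
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"{path} is missing required keys: {sorted(missing)}")
    return config
```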

Source 3: Algorithm

Is the model behaving as expected? This is typically where we look first. The algorithm is often subdivided into different tasks, and each of them should be checked, starting with the spot where the problem seems to appear, and then working backwards.

  • Underlying root cause: If you run into this bug, the underlying root cause is probably a logic flaw in the code; if you use acceptance tests, they are probably not exhaustive enough to catch the break in logic.
  • Potential solution: A common pitfall with this issue is to just fix the bug, rather than moving to a test-driven development mindset and designing unit tests to ensure this, or something very similar, does not occur again (see the sketch below). Remember: The most effective, longest-lasting solution is to eradicate the root cause! In our team, we implemented end-to-end tests and required each owner of a pull request to make sure their code passed the tests before pushing it. The last thing you want to do is add code that breaks the work of others. Test and, if necessary, fix your code before merging it.
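
For instance, once the flawed logic is found, it can be captured in a regression test so it cannot silently come back; a minimal sketch around a hypothetical `compute_utilization` function:

```python
from my_project.features import compute_utilization  # hypothetical function under test

def test_zero_capacity_does_not_crash():
    """Illustrative regression test: zero capacity used to raise ZeroDivisionError."""
    assert compute_utilization(stock=0, capacity=0) == 0.0

def test_utilization_stays_between_zero_and_one():
    assert 0.0 <= compute_utilization(stock=5, capacity=10) <= 1.0
```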

Are the data contracts between different parts of the model still respected? As mentioned above, the algorithm will generally be structured along several blocks or consecutive tasks. Although each of them may still perform correctly, the interdependencies between them can be broken.

  • Underlying root cause: This is an integration issue.
  • Potential solution: Improve the communication between the different tasks about what they expect as inputs and outputs. Tests can help, as can a less-siloed approach to the different parts of the model (a lightweight “contract” check is sketched below). Put simply, communication is the keyword, but improving it can be harder than it seems.
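
One lightweight way to make those expectations explicit is a small contract check between blocks; a sketch with illustrative column names and dtypes:

```python
import pandas as pd

# Illustrative contract: what the downstream block expects from the upstream one.
FORECAST_CONTRACT = {"site_id": "int64", "date": "datetime64[ns]", "forecast": "float64"}

def check_contract(df: pd.DataFrame, contract: dict) -> None:
    """Fail loudly when one block's output no longer matches what the next expects."""
    missing = set(contract) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    for column, expected_dtype in contract.items():
        actual = str(df[column].dtype)
        if actual != expected_dtype:
            raise TypeError(f"Column {column!r} is {actual}, expected {expected_dtype}")
```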

Do you run into the limitations of your underlying libraries? Odds are that while creating your model you chose not to reinvent the wheel. Instead, you’re probably leveraging an existing programming language and several great libraries… and perhaps a couple of more exotic, if slightly less-maintained, ones. Your model may have run into one of the numerous open issues on these libraries. The root cause here is usually well-documented within the online communities.

  • Underlying root cause: This is most likely a public library issue, but you should also beware of the limitations of data types and representations, typically date/time types and floats, which create well-known issues. These issues once led to a rocket crash and several casualties. On a less-dramatic level, they can cause changing results or unexpected crashes.
  • Potential solutions: Find a way to get around the issue. In our case, we rewrote part of the code using another library that was less readable but more stable (the floating-point pitfall is illustrated below).
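
The floating-point pitfall in particular is easy to reproduce; this short snippet is not from our code base, just a reminder of why exact comparisons on floats are fragile:

```python
import math
from decimal import Decimal

total = sum([0.1] * 10)
print(total == 1.0)              # False: binary floats cannot represent 0.1 exactly
print(math.isclose(total, 1.0))  # True: compare with a tolerance instead

exact = sum(Decimal("0.1") for _ in range(10))
print(exact == 1)                # True: Decimal keeps exact decimal arithmetic
```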

Source 4: Consumption

Did the post-processing of results change? Usually, when we want to assess the results and outcome of a model, we cannot review each line of the result database or log. Instead, we focus on KPIs, plots, samples, and aggregated views. These are often developed by business users and various team members. In our case, we would sometimes spend hours trying to figure out why some KPIs changed, when actually, it was a consumption-view change. Since these are typically outside of versioning systems, changes are harder to monitor.

  • Underlying root cause: The problem stems from the absence of change monitoring and versioning of non-code parts of the model, as well as lack of end-user discipline and communication.
  • Potential solution: Align with end users on product specifications and versioning. Set up end-to-end tests that include these final products (see the sketch below).
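
A minimal sketch of such a test, comparing final KPIs against a versioned reference file (the file paths and tolerance are illustrative):

```python
import json
import math

def test_final_kpis_match_reference():
    """End-to-end check that covers the consumption products, not just model outputs."""
    with open("outputs/kpis.json") as f:            # produced by the full pipeline
        kpis = json.load(f)
    with open("tests/reference_kpis.json") as f:    # versioned alongside the code
        reference = json.load(f)
    for name, expected in reference.items():
        assert math.isclose(kpis[name], expected, rel_tol=0.01), name
```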

Did integration of results change? In the same way you think about input data issues, keep in mind the outgoing pipeline. Usually consumption products query model outputs, and perhaps some other data from various sources. Perhaps the bug appears because of changes in this query process, or because cache issues lead to outdated model results being displayed. It could also be that the outgoing pipeline has not been properly adapted to reflect model changes or updates.

  • Underlying root cause: The problem lies in the absence of monitoring and versioning of external dependencies.
  • Potential solution: If you run into this bug, you may want to check your outgoing infrastructure to make sure it queries results from the right places and expects the right format.

Source 5: Infrastructure

Is there anything new about the environment that could explain the difference? The obvious culprits here are environment variables, which may lead, for instance, to faulty read/write operations. You can recognize this issue when supporting systems or software crash, leading to connection errors, memory errors, blue screens — and a trail of sweet, lovely error messages.

  • Underlying root cause: The underlying systems may lack robustness or are not suited to the purpose.
  • Potential solution: Change or update the systems; the experienced engineer will know how to patch things or increase resilience to unpredictable behavior. In the long run, it is useful to take a step back and consider the broader environment (a minimal start-up check is sketched below).
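
A fail-fast environment check at start-up can surface many of these issues before they turn into cryptic crashes; a minimal sketch with illustrative variable names:

```python
import os
import shutil

REQUIRED_ENV_VARS = ["DATA_DIR", "DB_CONNECTION_STRING"]  # illustrative names

def check_environment(min_free_gb: float = 5.0) -> None:
    """Fail fast with a clear message instead of a cryptic read/write error later."""
    missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    free_gb = shutil.disk_usage(os.environ["DATA_DIR"]).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(f"Only {free_gb:.1f} GB free in {os.environ['DATA_DIR']}")
```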

Could it be hardware? It sounds sneaky and unlikely, but hardware is what makes everything work together — and can make everything fall apart. One of the most famous examples dates from 2000, when J. Dean and S. Ghemawat identified the source of the Google search engine’s inability to return timely hits on queries. Short of such extreme examples, more mundane issues like multithreading or parallelization can have unintended side effects.

  • Underlying root cause: The troublemaker may be infrastructure that is not adapted to your problem.
  • Potential solutions: Quick workarounds may exist, but to prevent this type of (often persistent) bug from recurring, think about your overall setup, your memory usage, and your I/O efficiency — among other infrastructure considerations.

Implement a generalizable — if not necessarily universal — framework

Of course, unique situations require unique responses. But regardless, there are two general processes that we have found to be applicable in most cases:

  1. Scan the different process steps where the offending issue could come from (data, context, algorithm, consumption, infrastructure). Conduct a structured, hypothesis-driven, and systematic search that progressively narrows down the potential root causes.
  2. Focus on eradicating the root causes, instead of simply patching the symptoms. Hastily patched bugs invariably strike again and again. Don’t be tempted by quick-fixes… For a long-lasting solution, you need to get to the bottom of things!

Our debugging approach is very similar to the Continuous Improvement Process, or Kaizen. This makes sense, since manufacturing industrial goods to high quality standards and delivering production-grade code are comparable processes (see Appendix III).

You May Know It, But Do You Practice It?

Everything we’ve laid out in the recommended framework might seem obvious. The real challenge is to act on it. As we said at the beginning of this article, many coders think of debugging as a process guided by experience and intuition. Those are important elements when debugging, but they should not be the only approach.

The next time a bug rears its ugly head, let’s not get carried away, confused or overwhelmed by the infinite possibilities that may have caused it. Instead, let’s do our teams a favor by taking a deep breath, and then implementing a rational and scientific approach, such as the one we have described above.

Acknowledgements

We would like to thank all our team members for their reviews to improve this post, with special thanks to Cloves Almeida, Niels Freier, Bertrand Bordage, and Amine Bouamama.

Appendix I:

The literature on best practices for preventing bugs and mitigating their impact is prolific. A non-exhaustive list:

  • Adopt proven coding design patterns to set up and structure the code base. E.g., https://www.tutorialspoint.com/python_design_patterns.
  • Set up a good architecture to ease debugging.
  • Set up your “error handling” or “exception handling” properly.
  • Ensure transparency with clear names and documentation.
  • Implement unit and integration tests.
  • Leverage version control tools.

Appendix II:

For more information on testing and debugging, see:

  1. The Art of Software Testing, by G. J. Myers, C. Sandler, T. Badgett, and T. M. Thomas
  2. Debugging, by D. J. Agans

Appendix III:
