TDD and legacy code: creating a snapshot with approval tests

Zeger Hendrikse
Published in NS-Techblog · Jan 29, 2024
Images in this post come from Pixabay.

A question that I am often asked is: how do I start with test-driven development (TDD) on my legacy code base? This immediately raises another question: when do we categorize code as legacy code? Isn’t all existing code legacy code by definition?

To me, legacy code is simply code without tests — Michael Feathers

But you may argue: sometimes I encounter code without (sufficient) tests that reveals its intention so well that I am not afraid to modify it. So perhaps an even better definition is:

Legacy Code is valuable code you’re afraid to change — Nicolas Carlo

In such a code base, any change may easily (and inadvertently) break existing functionality. So how do I apply my change, fix, or modification in such cases? There are three possible approaches:

  1. The naïve approach: edit and pray.
    Just edit the code, test manually, and hope for the best. This is both risky and stressful.
  2. The ideal approach: write the tests first.
    Reverse engineer the specifications from the code, write the automated tests, refactor the code, and make your modification(s). This is an expensive and time-consuming process.
  3. The pragmatic approach: approval tests.
    Generate output that you can snapshot, use test coverage to find all relevant input combinations, use mutations to verify your snapshots, and then make your change(s).

In this article, we take a look at the last of these approaches. It is one of the techniques explained in Michael Feathers’ book “Working Effectively with Legacy Code”.

Approval Testing

When using approval testing, new test results are first (manually) approved, before being added to the approved set of test results.

Approval tests are often also referred to as characterization tests, golden master tests, snapshot tests, locking tests, and sometimes even regression tests.

Let’s first see what approval tests are and how they work. Consider a simple calculator with just one method that adds up two integers:
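The original listing is not reproduced here; a minimal version might look like this (class and method names assumed):

    class Calculator:
        """A trivially simple calculator with a single operation."""

        def add(self, first: int, second: int) -> int:
            return first + second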

We can test this function using the approval test library. This library is available for almost all languages and its usage is largely self-explanatory:
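A sketch of such a test using the Python approvaltests package, with names chosen to match the snapshot file names mentioned below:

    import unittest

    from approvaltests import verify


    class CalculatorTest(unittest.TestCase):
        def test_add_simple(self):
            calculator = Calculator()      # arrange
            result = calculator.add(1, 3)  # act
            verify(str(result))            # assert, against the approved snapshot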

Note the generic AAA structure of a unit test still applies: arrange, act, and assert.

When running the test (using pytest), the test fails and two files are created: CalculatorTest.test_add_simple.received.txt and CalculatorTest.test_add_simple.approved.txt. The idea is to approve the result(s) by copying the contents of the received version over to the approved version. In most IDEs, a diff window comparing these two files even pops up automatically when you run an approval test!
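On the command line, approving simply means copying one file over the other:

    cp CalculatorTest.test_add_simple.received.txt CalculatorTest.test_add_simple.approved.txt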

After approving and re-running the test once more, the test passes. So far, so good, as nothing substantially different happens compared to running a regular unit test.

However, approval tests shine once you realize that you can specify sets of different input parameters, and that the approval test library will automatically generate and test all combinations:
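With the Python library, this is done with the verify_all_combinations() function, which takes one list of candidate values per parameter of the function under test. A sketch, continuing the test class above:

    from approvaltests.combination_approvals import verify_all_combinations


    class CalculatorTest(unittest.TestCase):
        def test_add_combinatoric(self):
            # One list of values per parameter of add(): 2 x 2 = 4 combinations
            verify_all_combinations(Calculator().add, [[1, 2], [3, 4]])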

Indeed, in the CalculatorTest.test_add_combinatoric.received.txt we see that all combinations are tested:
args: (1,4) => 5
args: (1,3) => 4
args: (2,4) => 6
args: (2,3) => 5

This makes approval testing a very powerful tool for creating snapshots of existing (legacy) code bases: with relative ease, we can create a suite of snapshots that comprehensively covers all existing functionality.

Let’s now see how we can apply this concept of approval testing to a legacy code base by looking at two coding katas in somewhat more detail: the Gilded Rose and the Bugs Zero katas.

The Gilded Rose

Aged brie is one of the items sold by the Gilded Rose Store.

The Gilded Rose kata consists of an implementation of a shop where goods degrade in quality as they approach their sell-by date. It is one of the five coding exercises to practice refactoring legacy code. The provided (legacy) code is a real horror: one big, incomprehensible bowl of spaghetti conditionals. Worse, there are no tests!

The ultimate goal of this kata is to add a new category of items, namely “conjured” items, which degrade in quality twice as fast as normal items. Before touching any code, we want to take a snapshot, to avoid unintentionally changing any existing behavior.

Creation of a golden master

Approval tests may be used to create a snapshot of legacy code.

We are looking for a way to capture the output so that we can create a snapshot. From the code, we can see that the update_quality() method of the GildedRose class updates all the items used to initialize that class.

What we can do is create a helper method do_update_quality(name, sell_in, quality) in our test class that creates a GildedRose instance with just one item, updates it, and returns the updated item. We can then invoke this helper method using the verify_all_combinations() function offered by the approval testing library. It takes as arguments the method to be invoked, in this case our helper method, and the lists of parameter values to combine.
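A sketch of what this could look like (the gilded_rose module with its GildedRose and Item classes comes with the kata; the initial input values here are minimal placeholders):

    import unittest

    from approvaltests.combination_approvals import verify_all_combinations

    from gilded_rose import GildedRose, Item


    class GildedRoseTest(unittest.TestCase):
        def do_update_quality(self, name, sell_in, quality):
            # Create a shop with a single item, update it, and return that item;
            # its string representation ends up in the snapshot files
            items = [Item(name, sell_in, quality)]
            GildedRose(items).update_quality()
            return items[0]

        def test_update_combinatoric(self):
            verify_all_combinations(
                self.do_update_quality,
                [
                    ["foo"],  # item names: start small, extend to increase coverage
                    [1],      # sell_in values
                    [1],      # quality values
                ],
            )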

The approval test framework uses the string representation of an Item to generate the output that is written to the snapshot files.

After running pytest (which fails because we haven’t approved the result yet), we obtain the GildedRoseTest.test_update_combinatoric.received.txt and GildedRoseTest.test_update_combinatoric.approved.txt files. After approval, the test passes.

Next, we use test coverage to come up with additional input values for the three parameter lists, so that any as-yet uncovered lines are executed as well. We can generate a coverage report using the pytest-cov plug-in or, even better, your favorite IDE.
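For example, on the command line (assuming the production code lives in gilded_rose.py):

    pytest --cov=gilded_rose --cov-report=term-missing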

After a run with just one item type, we have obtained a coverage of 41%. The coverage report shows which parts are fully covered (green), which parts are only partially covered (yellow), and which parts aren’t covered at all (red). This report was generated using pytest-cov.

The goal is to reach a coverage of 100% by adding new values to our sets of input parameters. Obvious candidates are the item names found in the if-statements, as well as additional sell_in and quality values.

Once we’ve achieved a code coverage of 100%, we’re still not done, because we can’t be sure that all edge cases are covered. We can now apply mutation testing by hand: we temporarily change one of the numerical values in the production code and check whether the approval test fails. If it doesn’t, this (numerical) edge case is not yet covered, so we need to extend our set of input parameters further.
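As an illustration, a hand-made mutation of a boundary in the production code could look like this (the exact statement varies per code base):

    # Original production code:
    if item.quality > 0:
        item.quality = item.quality - 1

    # Temporary mutation: if the approval test still passes after this change,
    # the boundary around quality == 0 is not yet covered by our input values
    if item.quality > 1:
        item.quality = item.quality - 1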

Since the scope of this post is limited to approval testing as a way to make snapshots of legacy code, we won’t work out this kata any further. We refer those who are interested to the references section at the end of this post.

By the way, the subsequent refactoring phase features quite a few interesting refactoring techniques, such as lift-up conditional and replace conditional with polymorphism, so it is definitely worth investing your time in as well!

The Bugs Zero Kata

The Bugs Zero kata consists of a buggy implementation of a trivia game and is designed to practice refactoring code to ward off bugs. It is also mentioned in the five coding exercises to practice refactoring legacy code. Before we fix any bugs, we want to make sure that we don’t accidentally introduce new bugs while fixing existing ones!

As the existing code only writes output to the console and uses a random number generator to simulate the roll of a die, creating a golden master is a little more interesting than in the previous case of the Gilded Rose.

Creation of a golden master

The general approach is to first create a game runner class. We use the seed() function of Python’s random module to force the same series of random numbers on every run.

The game runner class runs a trivia game, which results in output printed to the console. As we have fixed the seed of the random number generator, the console output is identical on every run.
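A sketch of such a runner; the Game class and its methods come with the kata, while the module name and player names are taken from the original main loop of the Python port:

    import random

    from trivia import Game


    class GameRunner:
        @staticmethod
        def play(seed):
            random.seed(seed)  # identical sequence of "die rolls" on every run
            game = Game()
            game.add("Chet")
            game.add("Pat")
            game.add("Sue")
            while True:
                game.roll(random.randrange(5) + 1)
                if random.randrange(9) == 7:
                    not_a_winner = game.wrong_answer()
                else:
                    not_a_winner = game.was_correctly_answered()
                if not not_a_winner:
                    break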

Luckily, Python offers some easy ways to capture console output, one of which is shown in the snippet below:
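One such mechanism is contextlib.redirect_stdout from the standard library; a sketch:

    import io
    from contextlib import redirect_stdout

    from approvaltests import verify


    def test_trivia_game():
        captured = io.StringIO()
        with redirect_stdout(captured):  # everything print()-ed ends up in captured
            GameRunner.play(seed=1)
        verify(captured.getvalue())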

The console output is captured and used as input for the approval tests.

After approval of our first test run, we already obtain a very solid coverage:

Our code coverage report after just one run.

Discussing the steps needed to reach 100% coverage is beyond the scope of this article, but we again refer the interested reader to the references.

Conclusion

Approval tests can be very useful in several cases, one of which is the creation of a golden master. In addition, approval testing can be useful for:

  • APIs that return JSON or XML
  • Functions that return complex objects
  • Assertions of strings longer than one line

Note that in our case, where we use approval tests to create snapshots of legacy code, we should eventually replace these approval tests with meaningful unit tests, since:

  • Existing behavior is captured, including bugs
  • It is hard to find out why an approval test has failed (long feedback loop!)
  • People will get used to mindlessly approving the received output without a thorough investigation of whether the change is as expected or not
  • Approval tests don’t explain what the code does

References and credits

  • Understand legacy code blog by Nicolas Carlo.
  • The Gilded Rose README and references therein. Both the Python and JavaScript versions contain instructions on how to obtain 100% coverage, including mutation testing.
  • The trivia game README and references therein.
