Mutation Testing with Python

Test the tests — automatically, by applying common mistakes

Martin Thoma
Aug 10 · 7 min read

We need to kill the mutants — no, I’m not a villain from the X-Men comics. I’m a software engineer who wants to improve unit tests.

In this article you will learn what mutation testing is and how it can help you to write better tests. The examples are for Python, but the concepts hold in general and in the end I have a list of tools in other languages.

Why do we need mutation testing?

Unit tests have the issue that it’s unclear when your tests are good enough. Do you cover the important edge cases? How do you test the quality of your unit tests?

Typical mistakes are slight confusions: accessing list[i] instead of list[i-1], letting a loop run for i < n instead of i <= n, initializing a variable with None instead of the empty string. There are a lot of those slight changes, usually just called “typos” or “off-by-one” mistakes. When I make them, it is often because I didn’t think that part of the code through thoroughly enough.
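As a hypothetical illustration (the function and its name are invented for this article), both kinds of mistakes can lurk in a loop as simple as this:

```python
def sum_first_n(numbers, n):
    """Sum the first n elements of numbers."""
    total = 0
    for i in range(n):       # off-by-one risk: range(n) vs. range(n + 1)
        total += numbers[i]  # index-confusion risk: numbers[i] vs. numbers[i - 1]
    return total
```

Either slip changes the result for some inputs while leaving the code perfectly runnable, which is exactly why such bugs survive casual testing.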

Mutation testing tests your unit tests. The key idea is to apply those minor changes automatically and check whether the unit tests fail. If a unit test fails, the mutant was killed, which is what we want: it shows that this kind of off-by-one mistake cannot slip past our test suite. Of course, we assume that the unit tests themselves are correct or at worst incomplete. Hence you can see mutation testing as a complement to test coverage. In contrast to test coverage, a mutation testing toolkit can directly show you the places and types of mistakes your current tests would not catch.

Which mutation testing tools are there?

There are a couple of tools like cosmic-ray, but Anders Hovmöller did a pretty amazing job by creating mutmut. As of August 2020, mutmut is the best library for Python to do mutation testing.

To run the examples in this article, you have to install mutmut:

pip install mutmut

In other languages, you might want to try these:

Why isn’t branch and line coverage enough?

It is pretty easy to get to a high line coverage by creating bad tests. For example, take this code:

def fibonacci(n: int) -> int:
    """Get the n-th Fibonacci number, starting with 0 and 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return b  # BUG! Should be a!


def test_fibonacci():
    fibonacci(10)  # smoke test: only checks that nothing crashes
This smoke test already adds some value, as it makes sure that things are not crashing for a single input. However, it would not find any logic bug; there is an assert statement missing. This pattern can quickly drive the line coverage up to 100%, but you are then still lacking good tests.

A mutation test cannot be fooled as easily. It would mutate the code and, for example, initialize b with 0 instead of 1:

- a, b = 0, 1
+ a, b = 0, 0

The test would still succeed, and thus the mutant would survive, which means the mutation testing framework would complain that this line was not properly tested. In other words:

Mutation testing provides a more rigorous kind of line coverage. It still cannot guarantee that a tested line is correct, but it can show you potential bugs that your current test suite would not detect.
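To make the surviving mutant concrete, here is the mutated function (renamed fibonacci_mutant for this sketch) run against the assert-free smoke test:

```python
def fibonacci_mutant(n: int) -> int:
    a, b = 0, 0  # the mutation: b initialized with 0 instead of 1
    for _ in range(n):
        a, b = b, a + b
    return b

def test_fibonacci_smoke():
    fibonacci_mutant(10)  # returns 0, but without an assert the test passes
```

Both a and b stay 0 forever, so every call returns 0; the smoke test cannot tell the difference.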

Create the mutants!

As always, I use my small mpu library as an example. At the moment, it has a 99% branch and 99% line coverage.

$ mutmut run

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Results are stored in .mutmut-cache.
Print found mutants with `mutmut results`.
Legend for output:
🎉 Killed mutants. The goal is for everything to end up in this bucket.
⏰ Timeout. Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious. Tests took a long time, but not long enough to be fatal.
🙁 Survived. This means your tests needs to be expanded.
🔇 Skipped. Skipped.
1. Running tests without mutations
⠧ Running...Done
2. Checking mutants
⠸ 1818/1818 🎉 1303 ⏰ 1 🤔 6 🙁 508 🔇 0

This takes over 1.5 hours for mpu. mpu is a small project, with only about 2000 lines of code:

Language     files          blank        comment           code
Python          22            681           1399           2046

One pytest run of the mpu example project takes roughly 9 seconds and the slowest 3 tests are:

1.03s call     tests/
0.80s call     tests/
0.41s call     tests/

In the end, you will see how many mutants were successfully killed (🎉), how many ran into a timeout (⏰), and which ones survived (🙁). Especially the timeout ones are annoying, as they make the mutmut runs slower even though the code and the tests might still be fine.

Which mutations are applied?

mutmut 2.0 creates the following mutants (source):

  • Operator mutations: about 30 different patterns like replacing + by -, * by **, and similar, but also > by >=.
  • Keyword mutations: replacing True by False, in by not in, and similar.
  • Number mutations: you can write things like 0b100, which is the same as 4, 0o100, which is 64, 0x100, which is 256, or .12, which is 0.12. The number mutations try to capture mistakes in this area; mutmut simply adds 1 to the number.
  • Name mutations: the name mutations capture copy vs. deepcopy and "" vs. None.
  • Argument mutations: replaces keyword arguments one by one, from dict(a=b) to dict(aXXX=b).
  • or_test and and_test: swapping and with or and vice versa.
  • String mutation: adding XX to the string.
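Those literal spellings really are equal, which plain Python can confirm; the last lines show what mutmut's add-1 mutation does to the binary literal:

```python
# Alternative spellings of the same numbers:
assert 0b100 == 4    # binary
assert 0o100 == 64   # octal
assert 0x100 == 256  # hexadecimal
assert .12 == 0.12   # leading zero omitted

# mutmut's number mutation adds 1 to the literal: 0b100 becomes 0b101.
assert 0b101 == 5    # a test pinning the exact value 4 would kill this mutant
```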

Those can be grouped into three very different kinds of mutations: value mutations (string mutation, number mutation), decision mutations (switch if-else blocks, e.g. the or_test / and_test and the keyword mutations) and statement mutations (removing or changing a line of code).

The value mutations are most often false-positive for me. I’m not certain if I could write my code or my tests in another way to fix this. I’ve briefly discussed it with the library author, but apparently he does not have the same issue. If you’re interested in that discussion, see issue #175.
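As a sketch of such a false positive (the greet function and its test are invented for illustration): tests normally assert on return values, not on log output, so a string mutant inside a logging call changes nothing the test can observe.

```python
import logging

logger = logging.getLogger(__name__)

def greet(name: str) -> str:
    # A string mutant would turn "Greeting %s" into "XXGreeting %s".
    # No test below inspects the log output, so that mutant survives.
    logger.debug("Greeting %s", name)
    return f"Hello, {name}!"

def test_greet():
    assert greet("Ada") == "Hello, Ada!"  # passes with or without the mutant
```

One could kill such mutants with pytest's caplog fixture, but asserting on log wording makes tests brittle, which is why these often stay flagged as survivors.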

How can I get a HTML report with mutmut?

$ mutmut html

gives you

Index page of the mutmut HTML report. Image by Martin Thoma.
The complete report. Image by Martin Thoma.

As you can see, the index claims that 108 mutants survived and the HTML report only shows one. That one is also a false-positive as a change in the logging message does not cause any issue.

Alternatively, you can use the junit XML to generate a report:

$ pip install junit2html
$ mutmut junitxml > mutmut-results.xml
$ junit2html mutmut-results.xml mutmut-report.html

The report shows this index page:

Test report generated from JUnit XML. Image by Martin Thoma

Clicking on one mutant, you get this:

Mutant #3 was killed, but mutant #4 survived. I did not use the global variable “countries” anywhere in the tests. Image by Martin Thoma.

The issue with this generated HTML report is that it shows many results for a single line of code and offers no grouping. If the results were grouped by file, and if the lines with surviving mutants were highlighted in the code, the report would be far more useful.

Mutation Testing for Machine Learning Systems

I’ve searched for cool applications of machine learning to generate mutants in code, but I’ve only found “Machine Learning Approach in Mutation Testing” from 2012 (12 citations).

I was hoping to find data-based code mutant generation techniques. For example, one could search for git commits which are bug fixes by examining the commit message. If the fix is rather short, this is a kind of mutation one could test for. Instead of generating all possible mutants, one could sample from the mutants in a way to first take the most promising ones; the ones that are most likely not perceived as a false-positive.

Other work was more focused on making machine learning systems more robust (DeepMutation, DeepGauge, an Evaluation). I don’t know this stream of work well enough to write about it. But it sounds similar to techniques I know:

  • To overcome scarcity in training data, various data augmentation techniques such as rotations, flips, or color adjustments are applied. You can actually see those as mutations.
  • Also, in the GAN setting where you have a generator and a discriminator, you could argue that the generator produces mutants and the discriminator should tell them apart.
  • In order to force the network to learn more robust features, a technique called dropout (TensorFlow, Lasagne) is commonly used. You could say that a part of the input or of the internal representation is randomly mutated by setting it to zero.
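The dropout analogy can be sketched in a few lines of plain Python (a simplified illustration, not how TensorFlow implements it):

```python
import random

def dropout(values, rate=0.5, seed=0):
    """Zero each element with probability `rate`; scale survivors
    by 1 / (1 - rate) so the expected sum stays unchanged."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in values]
```

Each application randomly "mutates" part of the representation, and the network has to keep working despite the mutation, much like a test suite should keep catching bugs despite small code changes.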

Want to know more about unit testing?

In this series, we already had:

In future articles, I will present:

  • Static Code Analysis: Linters, Type Checking, and Code Complexity

Let me know if you’re interested in other topics around testing with Python.

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem

Martin Thoma

Written by

I’m a Software Engineer with focus on Data Science, Machine Learning. I have over 10 years of experience with Python.
