Who watches the watchers? Mutation Testing
Have you ever thought about the quality of your tests? Beyond code coverage and the number of test suites, how can we measure whether our tests are good enough? How many changes can we apply to the production code before our automated tests detect a bug? If you find these questions interesting, keep reading: this blog post is for you.
Why should I care about Mutation Testing?
Whenever we develop a program, we, as engineers, should guarantee that the software works as expected and that any future change, whether it adds a new feature or updates existing behaviour, doesn't break it. To do this, we rely on automated testing. Automated tests are our safety net and our guarantee that the program behaves correctly. Obviously, all of them run on CI, and they follow many different strategies! Let's take a look at the list of automated testing strategies we use at GoodNotes:
- Classic unit tests, including parametric ones.
- Integration tests for storage and network usage.
- HTTP stubbing.
- Contract tests for the API.
- Screenshot tests.
- UI tests mocking API integration.
- Property-based tests.
- End-to-end tests.
- And more.
Up to this point, we may think that having all these tests and using so many different testing strategies is more than enough to ensure the quality of the project. However, how can we measure the quality of the tests themselves? What tools can we use to check that our safety net is safe enough, and that any future change breaking the app will be detected automatically? It's time to use mutation testing!
Originally proposed by Richard Lipton et al. in the late 1970s, Mutation Analysis (also known as Mutation Testing) is a testing strategy, automated or not, that we can use to check whether our tests will help us identify bugs as our software evolves over time. It can be a more pragmatic, complex, and interesting approach than just using code coverage or the number of tests as a quality metric. It can also be used as a forensic tool for our codebase.
The idea is simple. We first get a test suite passing, then we apply some small random changes (mutations) to our code, ensuring they are still valid at build time, and last but not least, we run our tests again. If we don't see our tests failing… we fail with them!
Mutations that are valid at build time should make our tests fail, unless they are semantically equivalent to the original code, obviously! If we change our production code and the behaviour changes, a test should let us know there is something to check. Maybe it's not a bug but a behaviour we deliberately updated; in that case, we should update our tests accordingly to ensure the new behaviour is covered.
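To make this concrete, here is a minimal sketch in TypeScript (the function and file names are ours, invented for illustration): a tiny pricing function, a mutant a tool could generate from it, and a Jest-style test that kills that mutant.

```typescript
// totalPrice.ts: production code.
export function totalPrice(unitPrice: number, quantity: number): number {
  return unitPrice * quantity;
}

// A mutation-testing tool could generate this mutant, which still compiles:
//   return unitPrice + quantity;

// totalPrice.test.ts: a Jest-style test that kills the mutant.
import { totalPrice } from "./totalPrice";

test("charges the unit price times the quantity", () => {
  // The original returns 8 * 3 = 24; the `+` mutant would return 11 and fail.
  expect(totalPrice(8, 3)).toBe(24);
});
```

Note that the choice of values matters: a test asserting `totalPrice(2, 2)` is `4` would pass against both versions, since `2 * 2 === 2 + 2`, and the mutant would survive.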
Up to this point, you may be thinking this is too much for your app. And maybe you're right, but now think about payment software. How would you feel if someone randomly changed the code that computes your payslip and the tests still passed? Critical paths like payments or health-related software are great candidates for mutation testing, but this strategy may matter less in areas where tolerating a fault is cheaper.
Now that we know how this tool works, it's time to review the list of mutations we can apply before executing our tests. Stryker, a well-known mutation-testing tool, has a list here.
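For illustration, here are a few of the classic operator families, sketched in TypeScript; the exact set and the naming vary from tool to tool:

```typescript
// Common mutation operator families (names and details vary per tool):
//
//   Arithmetic operator replacement:  a * b        ->  a / b, a + b, ...
//   Conditional boundary:             age >= 18    ->  age > 18
//   Negated conditional:              if (ok)      ->  if (!ok)
//   Logical operator replacement:     a && b       ->  a || b
//   String literal replacement:       "Completed"  ->  ""
//   Block removal:                    { ...body }  ->  {}

// Example: a conditional-boundary mutant. A suite that only checks ages
// 17 and 21 passes against both versions; only a test at exactly 18
// tells the original and the mutant apart.
export function canSignUp(age: number): boolean {
  return age >= 18; // mutant: `age > 18`
}
```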
Time to get your hands dirty!
Would you like to practice mutation testing? We've got you covered! We created an exercise for you: clone this repository and follow along. In it, you'll find a TypeScript module that already implements a simplified version of a kata named "KataPotter". Only the production code is provided, so your goal is to write all the tests you want and then check whether those tests are able to detect mutations. The `checkout` function applies the kata's pricing rules to compute the cart price.
These are the steps to follow to complete the exercise:
- `git clone https://github.com/pedrovgs/typescriptkatas`
- `cd typescriptkatas`
- `git checkout practice-mutation-testing`
- `yarn install`
- `yarn test`
- Write as many tests as you want and check your code coverage: `yarn test && open coverage/index.html`
- Run the mutation tests and open the report: `yarn test:mutate && open reports/mutation/mutation.html` (a sketch of the Stryker config behind this script follows this list)
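If you're curious about what the `yarn test:mutate` script drives, a Stryker setup usually boils down to a small config file. The following is a minimal sketch assuming a Jest test runner; the actual file in the repository may differ:

```javascript
// stryker.conf.js: a minimal sketch, not necessarily the repo's exact config.
/** @type {import('@stryker-mutator/api/core').PartialStrykerOptions} */
module.exports = {
  mutate: ["src/**/*.ts"], // which files Stryker will mutate
  testRunner: "jest", // re-run the existing Jest suite against every mutant
  reporters: ["html", "clear-text", "progress"], // "html" produces mutation.html
  coverageAnalysis: "perTest", // only run the tests that cover each mutant
};
```

Stryker also accepts a `--mutate` flag on the command line, which comes in handy if you only want to mutate a subset of files, such as those touched by a pull request.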
Once you get to this point, you'll find that you have quite high coverage (probably 100%). However, when you evaluate your tests using mutation testing, you'll find that some mutations survive! In my case, I got 4 surviving mutations, 1 timeout, and 58 killed mutations!
This report from the mutation testing tool is just great. Stryker (the tool we are using) has found that we can replace a “*” operator with a “/” operator in our code and nobody would notice!! How is this possible? Because there was a test scenario we didn't include!
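To see how a `*` to `/` mutant can survive, consider a hypothetical pricing snippet (ours, not the kata's actual code). If every existing test exercises it with a quantity of 1, multiplication and division are indistinguishable:

```typescript
export function linePrice(unitPrice: number, quantity: number): number {
  return unitPrice * quantity; // surviving mutant: unitPrice / quantity
}

// This Jest-style test passes against BOTH versions, because 8 * 1 === 8 / 1,
// so the mutant survives even though the line is 100% covered.
test("prices a single book", () => {
  expect(linePrice(8, 1)).toBe(8);
});

// Adding the missing scenario kills the mutant: 8 * 2 is 16, but 8 / 2 is 4.
test("prices two copies of the same book", () => {
  expect(linePrice(8, 2)).toBe(16);
});
```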
What about other programming languages?
Would you like to use mutation testing in your projects tomorrow? Here is a list of mutation testing tools for different languages:
- JavaScript / TypeScript, C#, Scala: https://stryker-mutator.io
- Swift: https://github.com/muter-mutation-testing/muter
- Python: https://mutatest.readthedocs.io
- Rust: https://github.com/llogiq/mutagen
- Java & Kotlin: https://pitest.org
Conclusions
As you can see, mutation testing is able to identify changes in your code that your tests don't detect, even when your code coverage is 100%. After running this very same exercise with the GoodNotes engineering team, we reached some conclusions we'd like to share with you:
- Coverage by itself without assertions doesn’t help.
- It's not the fastest tool, but that doesn't make it less interesting; it's simply not designed to be used with slow test suites.
- Evaluating mutations only on files changed for every PR would be a nice integration point for CI.
- A more meaningful quality signal than code coverage.
- When you find a mutation surviving, it’s a chance to include a test or refactor your code.
- It’s complementary to other testing strategies and coverage (obviously).
- Can help to identify coupling sometimes.
- Not practical for slow test suites.
- Worst naming ever.
- Some mutations would never be applied by a human, or are semantically equivalent to the original code, which means a bad signal-to-noise ratio for us…
- This strategy demonstrates how type safety and some FP features reduce risk when coding.
- A good starting point could be to include it for incremental changes in your PRs.
- Way more useful in 1979!
This article was written by: Pedro Gómez & Tomás Ruiz-López, Senior Software Engineers at GoodNotes.