Mutation Testing in PHP: quality measurement for code coverage
How do you evaluate how good your tests are? Many people rely on the most popular score, the one everyone knows — code coverage. But this is a quantitative, rather than a qualitative metric. It shows how much of your code is covered by the tests, but not how well these tests are written.
One of the ways of addressing this is mutation testing. This is a tool which makes small changes to the source code and re-runs the tests against it, allowing you to identify ineffective tests and low-quality coverage.
In this article, I show you how to organise mutation testing for PHP code and what issues you might encounter. Enjoy!
What is mutation testing?
To show you what I mean, I’ll give you a few examples. They are simple, a bit exaggerated in places, and might at first glance seem obvious (although real-life examples are often quite complicated and difficult to see for yourself).
Let’s consider this situation: we have a basic function, which confirms that a person is an adult, and there is a test to verify it. The dataProvider for the test will test two cases: age 17 and age 19. I’m sure it’s obvious to many of you that isAdult has 100% coverage. One single line. It passes the test. Everything is wonderful.
But on closer inspection it becomes clear that our provider is badly written and omits the boundary condition: age 18 is never tested. You can change the > sign to >= and the test will not pick up the modification.
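The example might be sketched like this (the function and the values follow the description above; the test is reduced to bare assertions for brevity):

```php
<?php
// The function under test: one line, trivially 100% covered.
function isAdult(int $age): bool
{
    return $age > 18;
}

// The badly written "data provider": it checks 17 and 19,
// but skips the boundary value 18.
$cases = [
    [17, false],
    [19, true],
];
foreach ($cases as [$age, $expected]) {
    assert(isAdult($age) === $expected);
}

// Mutating `>` to `>=` only changes the result for age 18,
// which no case exercises, so the mutant escapes.
```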
Here’s another, rather more complicated example. There is a function which builds a simple object containing getters and setters. We have three fields, which we set up, and there is a test which checks that the buildPromoBlock function actually returns the object we expect.
If we look carefully, we also have setSomething, which sets any property to true. However, we do not test for this assertion. So, we can delete this line from buildPromoBlock and our test will not pick up this modification. When doing this, we have 100% coverage for the buildPromoBlock function, because all three lines were executed during the test.
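A sketch of this second example (the class shape and field names are invented for illustration):

```php
<?php
// A simple object with getters/setters reduced to public fields.
class PromoBlock
{
    public $title;
    public $url;
    public $label;
    public $something = false;

    public function setSomething(bool $v): void
    {
        $this->something = $v;
    }
}

function buildPromoBlock(): PromoBlock
{
    $block = new PromoBlock();
    $block->title = 'Promo';
    $block->url   = '/promo';
    $block->label = 'New';
    // Deleting the next line is an escaped mutant: the test below
    // never asserts on `something`, yet line coverage stays at 100%.
    $block->setSomething(true);
    return $block;
}

// The test checks the three fields but not `something`.
$block = buildPromoBlock();
assert($block->title === 'Promo');
assert($block->url === '/promo');
assert($block->label === 'New');
```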
These two examples lead us to define what mutation testing is.
Before we take a look at the algorithm, let me give you a brief definition. Mutation testing is a mechanism which, by making small modifications to the code, imitates the actions of a mischief-maker or junior developer who has come along to deliberately break it, changing > to <, == to !=, and so on. For each of these changes, made in good faith, we run the tests which cover the line that has been changed.
If the tests show nothing, that is, if they do not fail, then they probably aren't effective enough: perhaps they don't test boundary conditions or are missing assertions, and they may need improving. If the tests fail, great: they are protected from these modifications and, as a consequence, our code is harder to break.
Now let’s take a look at the algorithm. It’s fairly simple. The first thing we do to implement mutation testing is to take the source code. Then we get the code coverage, to find out which tests to carry out on which lines. Next, we go over the source code and generate what are called ‘mutants’.
A mutant is a single change to the code. That is, we take a function which has a > sign in the condition of an if statement, change the sign to >=, and we get a mutant. Next, we run the tests. Here is our first mutation (we changed > to >=).
However, these mutations are not done at random: they follow particular rules, so the result of mutation testing is deterministic. No matter how many times you run mutation testing on the same code, you will get the same results.
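The generation step can be sketched in a few lines. This is a naive, hypothetical mutator, not Infection's actual implementation (real tools work on an AST, e.g. via nikic/php-parser): it tokenises the source with PHP's built-in token_get_all() and emits one mutant per > occurrence, with that occurrence replaced by >=.

```php
<?php
// Naive mutant generator (illustrative only): one mutant per `>` token,
// with that single occurrence replaced by `>=`.
function generateMutants(string $source): array
{
    $tokens  = token_get_all($source);
    $mutants = [];
    foreach ($tokens as $i => $token) {
        if ($token === '>') {
            $mutated     = $tokens;
            $mutated[$i] = '>=';
            // Reassemble the token stream back into source code.
            $code = '';
            foreach ($mutated as $t) {
                $code .= is_array($t) ? $t[1] : $t;
            }
            $mutants[] = $code;
        }
    }
    return $mutants;
}

$source  = '<?php function isAdult(int $age): bool { return $age > 18; }';
$mutants = generateMutants($source);
echo count($mutants); // one mutant for this snippet
```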
The last thing we do is run the tests which cover the mutated lines. We extract this from the coverage. There are sub-optimal tools which run all tests. But a good tool will only run those tests which are necessary.
After that we can evaluate the results. If the tests failed, that means it’s all good. If they do not fail, it means they are not very effective.
Metrics
What metrics does mutation testing give us? It adds another three to code coverage, and we’ll talk about them now.
To start with, let’s get to grips with the terminology.
There are killed mutants: these are mutants which our tests “beat up” (that is, they caught them).
There are escaped mutants. These are mutants which managed to escape retribution (that is, the tests did not catch them).
And there is the covered mutant — a mutant which is covered by tests — and its opposite an uncovered mutant, which is not covered by any test (that is, we have code which contains business logic, and we could change it, but none of our tests check these changes).
The main score that mutation testing gives us is the MSI (mutation score indicator), the ratio of killed mutants to the total number.
The second score is mutation code coverage. This is a qualitative rather than a quantitative measure, because it indicates how much of the business logic that can be systematically broken is exercised by our tests at all.
The final score is covered MSI, which is a less stringent MSI. In this case we only calculate the MSI for those mutants which were covered by the tests.
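As a worked example, with made-up numbers rather than real project data, the three scores are computed like this:

```php
<?php
// Illustrative numbers, not real project data.
$totalMutants   = 100; // all mutants generated
$killedMutants  = 70;  // caught by a failing test
$coveredMutants = 80;  // mutants on lines executed by at least one test

$msi              = $killedMutants / $totalMutants;   // 0.70
$mutationCoverage = $coveredMutants / $totalMutants;  // 0.80
$coveredMsi       = $killedMutants / $coveredMutants; // 0.875
```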
Problems with mutation testing
Why is it that fewer than half of all programmers have heard of this tool? Why is it not used everywhere?
Slow speed
The first problem (and one of the main ones) is the speed of mutation testing itself. With dozens of mutation operators, even the simplest classes can generate hundreds of mutants, and you need to run tests for each mutant. If we have, say, 5,000 unit tests which take ten minutes to run, mutation testing will take hours.
What can we do to offset this? Run the tests in parallel, in several threads. Spread the threads across several machines. It works.
The second method is incremental processing. There is no point recalculating the mutation scores for every branch from scratch each time: we can use the branch diff. With a diff against master, you can run mutation testing solely on the files which have changed, then compare the results with what is currently in master and analyse the difference.
The next thing you can do is mutation tuning. Since you can change mutation operators, you can set up rules for their operation, so certain mutations can be prevented from running, if they are known to cause problems.
Important point: mutation testing is only suitable for unit tests. Although you can also run it for integration tests, that is a particularly bad idea, because integration (and also end-to-end) tests run a lot more slowly and involve a lot more code; you'll just never get to the results. Basically, this tool was conceived and developed solely for unit testing.
Infinite mutants
The second problem which can crop up with mutation testing is what are called infinite mutants. The simple example code here is a basic for loop:
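A sketch of such a loop, with the dangerous mutant shown in a comment:

```php
<?php
// The original loop terminates: $i++ moves $i towards the bound.
$iterations = 0;
for ($i = 0; $i < 10; $i++) {
    $iterations++;
}
assert($iterations === 10);

// The mutant replaces $i++ with $i--, so the condition $i < 10
// stays true forever and the loop never terminates:
//   for ($i = 0; $i < 10; $i--) { ... }
```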
If you change i++ to i--, then the loop becomes infinite: your code will loop forever. And mutation testing often generates similar mutants.
The first thing you can do is mutation tuning. It's obvious that changing i++ to i-- in a for loop is a very bad idea: in 99% of cases you will end up in an infinite loop. Therefore, we forbid the tool to do this.
The second and main protection against this problem is a run-time timeout. PHPUnit, for example, has an option to terminate a test on timeout regardless of where it is run: it executes the test through a PCNTL callback and tracks the time itself. If the test does not complete within a certain period, PHPUnit simply terminates it. This case is considered a killed mutant, because the mutated code was definitely checked by the test, and the test definitely identified a problem, demonstrated by the fact that the code stopped working.
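In PHPUnit this behaviour is driven by configuration. A minimal sketch of the relevant phpunit.xml attributes (the time-limit values here are illustrative; enforcing limits requires the pcntl extension and the phpunit/php-invoker package):

```xml
<!-- phpunit.xml: terminate runaway (e.g. mutated-into-infinite) tests -->
<phpunit enforceTimeLimit="true"
         defaultTimeLimit="10"
         timeoutForLargeTests="60">
    <!-- ... test suites, coverage configuration, etc. ... -->
</phpunit>
```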
Identical mutants
In theory, this is a problem with mutation testing (such mutants are usually called equivalent mutants). In practice you don't encounter it very often, but you should be aware of it.
Let's take a look at a classic example which illustrates it. We have a variable A, which we can either multiply by -1 or divide by -1. These operations give the same result: both change the sign of A. Accordingly, a mutation which swaps one operation for the other does not break program logic, so the tests should not, and cannot, catch it or fail because of it. Mutants like these can cause certain complications.
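The classic example can be written down directly (the function name is invented for illustration):

```php
<?php
// Equivalent-mutant example: multiplying by -1 and dividing by -1
// both just flip the sign, so a mutation that swaps `*` for `/`
// here changes the source but not the observable behaviour.
function negate(float $a): float
{
    return $a * -1; // mutant: `return $a / -1;` is observably identical
}

assert(negate(5.0) === -5.0);
assert(negate(5.0) === 5.0 / -1); // the "mutated" expression agrees
```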
There is no universal solution; each case requires individual attention. It is possible that some sort of mutant registration system would help. Here at Bumble we are thinking about something like this at the moment: we intend to register such mutants and mute them.
That’s the theory. What’s in PHP?
There are two well-known mutation testing tools: Humbug and Infection. When I was writing this article, I wanted to tell you which of them was better and came to the conclusion that it was Infection.
But when I visited Humbug's page, it said: "This package is deprecated, check out Infection instead". So, part of my article turned out to be pointless. Anyway, Infection is a really good tool. We should thank Maks Rafalko from Minsk, who created it; he has done a great job. You can take it out of the box, install it via Composer and run it.
We really liked Infection and we wanted to use it. But we couldn't, for two reasons. The first is that Infection uses code coverage to run exactly the right tests for each mutant. That gives us two possible routes. We could calculate coverage directly at runtime (but we have 100,000 unit tests). Or we could calculate it for the current master (but that takes an hour and a half even on our cloud of ten very powerful multi-threaded machines). If we needed to do this for each mutation run, the tool clearly would not work for us.
There is a ready-made option: PHPUnit's coverage format. But it's a whole lot of XML files which, alongside the valuable information, carry a load of structures, properties and the like. I calculated that our code coverage would take up around 30 GB in this format, and we would need to spread it across all our cloud machines and constantly read it from disk. Basically, not a great idea.
And the second problem with Infection turned out to be even more of an issue. We have an excellent library, SoftMocks. It allows us to wrestle with poorly-tested legacy code and write successful tests for it. We use it actively and won't be able to do without it any time soon, even though our new code is written to preclude the need for SoftMocks. Unfortunately, this library is incompatible with Infection, because they both use almost identical approaches to substituting modified code.
How do SoftMocks work? They intercept file includes and substitute a modified file: instead of executing the original class A, SoftMocks creates a rewritten copy of class A in a different place and plugs that copy into the include. Infection works almost exactly the same way, but uses stream_wrapper_register(), which does the same thing at a system level. As a result, we can use either SoftMocks or Infection, but not both. SoftMocks is essential for our tests, and integrating the two tools would be very difficult: no doubt it's possible, but it would require such invasive changes to Infection that the point would be completely lost.
To overcome these difficulties, we have written our own little tool. We borrowed the mutation operators from Infection (they are brilliantly written and very easy to reuse). Instead of applying the mutations through stream_wrapper_register(), we apply them through SoftMocks, which we already use out of the box. Our tool also works with our internal code coverage service: it can get the coverage for files or for individual lines on demand, without running all the tests, so it's very fast. On top of that, it's simple. Infection has a whole lot of features and options (for example, running in multiple threads), but our tool doesn't do any of that. Instead, we compensate with our internal infrastructure: for example, to run tests in multiple threads we farm them out to our cloud.
How to use it?
First of all, run it manually. That’s the first thing you need to do. You need to check every test you write manually with mutation testing. It looks something like this:
I ran mutation testing on a random file. The result: 16 mutants, of which 15 were killed by the tests, and one failed with an error. I should mention that a mutation can easily generate a fatal error, for example by making a function return a value of an invalid type. This is fine, and it counts as a killed mutant, because our test ran and failed.
Nevertheless, Infection splits these mutants out into a separate category, because sometimes you need to pay special attention to the errors. Sometimes, there’s something strange happening, and the mutant cannot really be classed as killed.
The second thing we use is a master report. Once a day, overnight, when our development infrastructure is idle, we generate a code coverage report. Then we create the same report for mutation testing. It looks like this:
If you've ever looked at the PHPUnit code coverage report, you'll probably notice that the interface looks familiar, because we made our tool similar. It simply calculates all the key scores for every file in a directory. We have also set targets for each metric. In truth, we plucked them out of thin air and do not yet meet them, since we haven't yet decided which targets are worth aiming for, but now that they exist it will be easy to build reports around them in the future.
And the last thing, the most important, follows on from the other two. Programmers are lazy. I'm lazy: I like it when everything works without my having to expend any extra energy. So we have made it so that when a developer pushes a branch, the scores for that branch and for the master branch are automatically and incrementally calculated.
For example, I changed two files and got this result. In the first file there were 548 mutants in master, with 400 killed; in the other, 147 against 63. The number of mutants increased in both cases in my branch, but in the first file the new mutant was killed, while in the second it escaped, so the MSI score fell. This makes it possible even for those who don't want to spend time running mutation testing manually to see what they have made worse and give it some attention (just as reviewers do in the code review process).
Results
It's difficult to quantify the results just yet: we didn't have any of these scores before, and now that they exist we have nothing to compare them with.
I can say, though, that mutation testing has a psychological effect. If you start to run your tests through mutation testing, you inadvertently start writing better quality tests, and writing better quality tests inevitably leads to changes in how you write code — you not only start to think about how to cover all the cases which might break it, but also to structure it better, and make it easier to test.
This is just a subjective opinion. But several of my colleagues have given more or less the same feedback: when they started using mutation testing consistently in their work, they started to write better tests, and many also said that they started to write better code.
Conclusions
Code coverage is an important metric, and you need to track it. But the score alone is no guarantee: it does not tell you that your code is safe.
Mutation testing helps make your unit tests better and makes your code coverage tracking easier to understand. There is already a tool for PHP, so if you have a small, uncomplicated project, you can just grab it and try it today.
Start running mutation testing, even if it’s manually. Just take the first step and see what it gets you. I’m sure you’ll like it.