Mutation Testing on Scala with Stryker4s

Aurimas Degutis
Wix Engineering
Oct 2, 2020
Mutant in the Wild by hosmer23 on DeviantArt

Over the years, Test Driven Development has made a huge impact on the software industry: it helps eliminate the fear of change and refactoring, guides systems toward decoupled and flexible design, and to some degree acts as documentation and requirement specification. But writing tests is not enough. Experience has shown that the same software issues can persist even after TDD is introduced to a project, because the core disciplines of TDD were not followed or because test code was held to different quality standards than production code. This created a demand for test quality measurement tools that help a developer decide whether the results of the existing test suite can be trusted during an upcoming refactoring or feature implementation. That's why tools like Code Coverage have been introduced to software projects.

In this article we will explore a new approach to test quality measurement that has recently started to gain traction, called Mutation Testing, and its usage in Scala.

Code Coverage — a metric of test quality or a punisher of the team?

If Code Coverage is a tool to monitor the quality of tests, does that mean that, to ensure their quality is not decreasing from build to build, we must fail the build when coverage is low? Of course not. Everyone on the team should know the coverage numbers, but they shouldn't fail the build. But shouldn't the team be held accountable to a high standard? Well, not necessarily. We can get coverage numbers high without actually testing the code: we just leave the assertions out. That means we can create several tests that execute the majority of the production code without verifying or asserting anything, keeping code coverage high.

When deadlines are near, pressure is high and coverage is low, the temptation to write tests without any assertions can be pretty irresistible. Code coverage is a good tool for improving the code, but it's a bad idea to use it as a management tool, because it encourages pretty bad behaviour.

Even if you did fail the build on code coverage, what coverage number would you use? The only justifiable number is 100%, but it's practically impossible to reach, so it does not make sense to fail the build based on code coverage.
Code coverage is a measurement of what to improve, not a tool to punish the team.

If assert statements are removed from tests, the coverage tool reports high coverage and the tests pass. High code coverage with passing unit tests does not imply that the code is tested; it only proves that the code is executed.
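
To make this concrete, here is a minimal, hypothetical sketch (the Discounts object and the ScalaTest suite below are invented for illustration): the test executes every branch, so a coverage tool reports the lines as covered, yet it can never fail.

import org.scalatest.funsuite.AnyFunSuite

// Hypothetical production code, used only for this illustration.
object Discounts {
  def discountFor(age: Int): Int =
    if (age >= 65) 20 else 0
}

class DiscountsCoverageOnlySpec extends AnyFunSuite {
  test("executes the code but verifies nothing") {
    // Both branches run, so a coverage tool reports these lines as covered,
    // but without an assertion this test can never fail.
    Discounts.discountFor(70)
    Discounts.discountFor(30)
  }
}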

Code Coverage reports executed code without assertions as being covered

To solve this issue, parts of the code could be removed or some of its statements changed, expecting some test to fail and thereby proving that the modified part of the code is actually tested. This is achieved by changing one statement at a time and running the whole test suite to find which test fails. Replacing multiple statements at once would make it hard to know whether a test failed because of one of the statements or all of them. When all tests still pass despite the changed statement, a test has to be written that not only executes the code where the statement was modified, but also asserts on it.

Modifying parts of the code in order to see which test fails

For a simple pirate greeter application like this one it might be enough, but on a larger codebase the amount of modifications needed is many times greater. That's why this modification effort can be automated by a set of tools called mutation testing tools. What these tools do is run your test suite multiple times, each time making a semantic change to your code, and remember which changes made the tests pass and which caused them to fail.

Mutation Testing on Scala with Stryker4s

Professor X: For someone who hates mutants… you certainly keep some strange company.

William Stryker: Oh, they serve their purpose… as long as they can be controlled.

How to use

Each language has its own mutation testing tools; we are going to explore one such tool from Scala's ecosystem, called Stryker4s.

Stryker4s is distributed as a plugin for the build tool used by the project; it is available for sbt and Maven.

The sbt plugin can be used by adding this line to project/plugins.sbt:

addSbtPlugin("io.stryker-mutator" % "sbt-stryker4s" % stryker4sVersion)

The Maven plugin can be used by adding the following to pom.xml under <plugins>:

<plugin>
  <groupId>io.stryker-mutator</groupId>
  <artifactId>stryker4s-maven-plugin</artifactId>
  <version>${stryker4s.version}</version>
</plugin>

To run Stryker4s with sbt, run the command "sbt stryker"; with Maven, run "mvn stryker4s:run".

Once started, Stryker4s will look for a config file named stryker4s.conf at the root of the project. If found, it will use the configured values. All configuration values are optional, including the file itself. The configuration lets you define which files to mutate, which tests to run, mutations to exclude, thresholds, timeouts and so on. The full configuration reference can be found here.
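
For illustration, a stryker4s.conf could look roughly like the sketch below. Treat the keys as an assumption to be checked against the documentation for your Stryker4s version rather than an authoritative reference:

stryker4s {
  # Glob patterns selecting the sources to mutate
  mutate = ["src/main/scala/**/*.scala"]
  # Where to report the results
  reporters = ["console", "html"]
  # Mutators to skip
  excluded-mutations = ["BooleanLiteral"]
  # Score thresholds for the report and for breaking the build
  thresholds { high = 80, low = 60, break = 0 }
}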

We are going to use our previously shown pirate greeter sample along with its unit test suite:
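
The original gist is not reproduced here; based on the mutations reported later in the article, the greeter and its deliberately weak test suite look roughly like this (a hypothetical reconstruction, not the author's exact code):

import org.scalatest.funsuite.AnyFunSuite

object Greeter {
  def greet(name: String, likeAPirate: Boolean): String =
    if (likeAPirate) "Ahoy " + name
    else "Hello " + name
}

class GreeterSpec extends AnyFunSuite {
  test("greets like a pirate") {
    assert(Greeter.greet("Jack", likeAPirate = true) == "Ahoy Jack")
  }

  test("greets politely") {
    // Weak assertion: it checks the name but never the "Hello" prefix.
    assert(Greeter.greet("Jack", likeAPirate = false).contains("Jack"))
  }
}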

We execute the mutation tests by running the following command from the project root on sbt:

sbt stryker

Stryker4s will perform the mutation testing by applying a change to the code, just like we did manually in the previous example. The application with the change is called a mutant, and it's up to the tests to detect that mutant. When the tests fail, the mutant is killed; when the tests pass, the mutant has survived, and we all know that surviving mutants will wreak havoc. Mutants can also end up in a no coverage state, because the part of the code with the mutation was never executed. A mutant can be in a timeout state when the mutated code resulted in a timeout, for example a mutation that produced an infinite loop. To kill the surviving mutants for good, we write tests that assert that the mutated values match the expected ones. After you have written tests that kill all the mutants, you can be sure that the refactoring that follows will not introduce a semantic change to the module.

Results are reported in the console and in an HTML file, but it is also possible to output the results to JSON or even a dashboard by defining the output format in the config file.

Mutation Test report in console

The report shows that there were 4 mutants in total, 2 detected and 2 undetected. The Detected metric counts mutants that were killed plus those that resulted in a timeout. The Undetected metric counts surviving mutants along with those that went undetected because of no coverage.

Applied mutations

The first number is the index of the mutation, meaning that the mutants with indexes 1 and 2 were killed and those with 0 and 3 survived. The second column shows the state of the mutant, in our case Survived, but since this section lists all undetected mutants, in some cases you can expect the state to be no coverage.

The third column shows the mutator that was applied and went undetected by any test, allowing the mutant to survive. In our case the ConditionalExpression and StringLiteral mutators were applied, which is a fancy way of saying that a condition was replaced with a boolean literal (ConditionalExpression) and a string was changed (StringLiteral). The report shows, for each file, the line numbers where the mutation occurred and what kind of value was modified. In our case Stryker changed the code the same way we did earlier in the manual testing scenario: it changed the likeAPirate value in the if statement to true and the hardcoded string value of "Hello" to a different one, causing all tests to pass and the mutants to survive, just like in our manual case. Stryker supports many other mutators, like EqualityOperator, MethodExpression, LogicalOperator and more, though there are still quite a few unsupported mutators at the moment. The full list of supported mutators can be found here.
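
On the reconstructed greeter from above, the four mutants would look roughly like this (a sketch; the exact replacements can depend on the Stryker4s version):

object Greeter {
  // Original method from the reconstructed greeter
  def greet(name: String, likeAPirate: Boolean): String =
    if (likeAPirate) "Ahoy " + name else "Hello " + name

  // ConditionalExpression mutants: the condition becomes a boolean literal
  //   if (true)  "Ahoy " + name else "Hello " + name
  //   if (false) "Ahoy " + name else "Hello " + name

  // StringLiteral mutants: a string literal becomes the empty string
  //   if (likeAPirate) "" + name else "Hello " + name
  //   if (likeAPirate) "Ahoy " + name else "" + name
}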

The report also generates an HTML file with more information about the mutation test run.

HTML report

The HTML report displays the same information as the console, but in a more visually appealing way: killed mutations are marked green and survived mutations are marked red. By clicking 'Expand all' you can see the exact mutation value applied during the test run.

HTML report with mutation values

Here both surviving mutant values are displayed: one is a variable replaced with the boolean value true, the other is the "Hello" string replaced with an empty string. At the top of the report you can see the Mutation score, which is the total percentage of mutants killed, along with the same metrics as in the console version and more. Right after that there is a count of mutants killed, survived, timed out and in other states. The full list of these metrics and mutant states can be found here.
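
Assuming the reconstructed Greeter from earlier, a test along these lines would kill both survivors (again a sketch, not the author's code):

import org.scalatest.funsuite.AnyFunSuite

class GreeterStrengthenedSpec extends AnyFunSuite {
  test("greets politely with the exact text") {
    // Asserting the full expected string kills both survivors:
    // the "Hello " -> "" StringLiteral mutant and the condition -> true mutant.
    assert(Greeter.greet("Jack", likeAPirate = false) == "Hello Jack")
  }
}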

Let's add more statements to our code in order to see more mutators in action.
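
The extended code is also not reproduced here; judging by the mutators listed in the report below, it looks something like this hypothetical variant with an optional age:

object Greeter {
  def greet(name: String, likeAPirate: Boolean, age: Option[Int]): String = {
    val greeting =
      if (likeAPirate) "Ahoy " + name
      else "Hello " + name

    // age.nonEmpty, && and >= each attract their own mutators
    if (age.nonEmpty && age.get >= 18) greeting
    else greeting + ", young one"
  }
}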

By running Stryker4s on this code, we get the following HTML report:

We can see more mutators being used along with their values. The LogicalOperator mutator replaced && with ||, the MethodExpression mutator replaced a call to age.nonEmpty with age.isEmpty, EqualityOperator replaced >= with <, > and ==, and so on.

How Stryker4s applies mutations

Internally, Stryker4s takes a different approach than its Java counterpart PITest, which manipulates bytecode for each mutation. That approach would not work for Scala, even though Scala source code is also compiled to bytecode, because the bytecode produced by the Scala compiler contains a lot of scaffolding code that does not link back to the logic written by the developer, making mutations in such code useless. Even if you managed to identify the scaffolding code and avoid mutating it, there is no guarantee it would keep working, since the bytecode can differ significantly between compiler versions.

Mutating the source code directly is also not a great solution, since each mutation would require recompilation, resulting in poor testing performance.

Imagine that you have code like this:
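
The original snippet is not shown here; as a stand-in, imagine something like this hypothetical method, which has three obvious mutation points (the >= comparison, the && operator and the call to nonEmpty):

object Eligibility {
  def canVote(age: Int, registrations: List[String]): Boolean =
    age >= 18 && registrations.nonEmpty
}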

We can have three mutations here. If we applied each mutation one by one, we would need to recompile the code each time, which gets even worse on a larger codebase. That's why Stryker4s uses a technique called Mutation Switching, following these steps:

  1. All mutants are identified for the whole codebase.
  2. All mutants are applied to the codebase at the same time using a Scala Pattern match.
  3. All mutants are tested one by one, with only one mutant active at a time, using an environment variable.

Let’s take a look at the previous code example, but with all mutations applied through pattern match:

The source code is compiled only once, and a specific mutation is enabled or disabled using an identifier; the default case is used when none of the mutants are enabled.

Semantic Stability

What is a semantic change? It's not the same as behaviour, because I can change the behaviour of a module without changing its meaning, or semantics. I could replace a linear search with a binary search: that would not change the meaning of the module, but it would change its behaviour.
Refactoring does not change the semantics of the code, it does not change its meaning, but it can change the detailed behaviour of the code.
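
For example (a minimal sketch with hypothetical names), both functions below mean the same thing, "is x present in the sorted sequence?", but they behave differently; a semantically stable test suite passes for either one, so swapping them is a safe refactoring:

object Search {
  // Linear search: same semantics, O(n) behaviour
  def containsLinear(xs: Vector[Int], x: Int): Boolean =
    xs.exists(_ == x)

  // Binary search over a sorted Vector: same semantics, O(log n) behaviour
  def containsBinary(xs: Vector[Int], x: Int): Boolean = {
    @annotation.tailrec
    def loop(lo: Int, hi: Int): Boolean =
      if (lo > hi) false
      else {
        val mid = lo + (hi - lo) / 2
        if (xs(mid) == x) true
        else if (xs(mid) < x) loop(mid + 1, hi)
        else loop(lo, mid - 1)
      }
    loop(0, xs.length - 1)
  }
}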

The goal of TDD is to eliminate the fear of changing the code. If the test suite is semantically stable, then it is safe to clean the code, because if there is a semantic change to the code, a test will fail. If you follow the three laws of TDD, you get a semantically stable test suite, and that eliminates the fear of change. This means that by following the TDD cycle you can be sure that test quality is high, since the only production code written is the code needed to make a test pass. In that case no coverage measurement is needed: you can be sure no behaviour is untested and confidently refactor the code without fearing that some of it is not covered.

Who Tests the Tests?

If test code ensures the correct behaviour of production code, who ensures the correct behaviour of test code? To answer this, let's look at possible test quality metrics.

One possible metric is the fear a developer feels before changing the production code. It tends to correlate with poor test quality: the higher the confidence in the tests, the lesser the fear. When we write tests after the fact, we are never sure that all the cases are covered, so there is a reasonable level of fear that some existing behaviour in the production code could change without any test failing.

Another metric for test quality is the level of TDD discipline applied during development. By following the three laws of TDD, you are forced to write no more and no less production code than the tests require, and there is no test that covers more than the production code requires. This means the test code is being tested by the production code: after writing a failing test in the Red phase of TDD, you write production code to make it pass in the Green phase, and the switch you see from a failing test to a passing test proves that the production code you wrote is what makes that test flip. By witnessing the switch from red to green ourselves, we know that the test is doing its job, reacting to a change in that particular line of production code. Since a production code change triggered the switch from red to green, we can say that the production code tested the test: a permanent entanglement between production code and test code exists, where one tests the quality of the other.

This is very similar to what Mutation Testing does, and it gives us another metric for test quality. In the TDD cycle we confirm for ourselves that a production code change makes a test switch; a mutation testing tool does the same, changing lines of code and expecting to see a switch in some test case. The differences are that Mutation Testing does it automatically rather than manually, and that it makes many changes to the production code, running the whole test suite after each one, while the TDD switch happens only once in the lifetime of a block of production code.

Since test code should be held to the same high quality standards as production code, these metrics let us make sure that the test code still covers the business cases by testing the test code with the production code, either manually through regular TDD cycles or automatically through Mutation Testing.

Conclusion

It is possible to reach the same test suite quality without following the TDD cycle. By using mutation tests you can measure the test quality level and, using this information, cover the missing cases with tests, improving quality until you reach the level of confidence needed for safe refactoring. This is an especially great tool for legacy systems that have an insufficient test suite or no test suite at all. There are still some issues with this approach: it's entirely up to luck whether a system written without tests is flexible enough to be covered by them, because if a system was written without tests in mind, its design may not be flexible and decoupled enough to be divided into units that can be tested properly.

Program testing can be used to show the presence of bugs, but never to show their absence! -Edsger W. Dijkstra (1970) “Notes On Structured Programming”

While this statement has stood the test of time, tools like mutation testing make it much harder for a bug (or mutant) to remain hidden once all the mutants have been killed.
