Mutants Against Bugs: Implementing mutation testing in a niche language

Ronen Lahat
AT&T Israel Tech Blog
9 min read · Aug 5, 2020
Untitled, Computer Assisted Drawing, 1975, Paul Brown, written with FORTRAN punched cards on an ICL 1903A mainframe, plotted on a Calcomp Drum Plotter

Unit Tests and Coverage

Unit tests are the best way to determine the reliability of code. Tests assert that specific code behaves as intended by running it in isolation. This ensures we notice breakage as the code evolves and grows.

Coverage checks can determine which parts of the code are run by unit tests and which aren’t. This is done “under the hood” by pre-processing the source code to add “sensors” to each statement (when measuring line coverage) or to each scope (for branch coverage).

This sensor is an inline function that increments a counter when triggered, mapped to metadata such as the file name and line number. After the test suite completes, the results are reduced to a coverage report. This report is an integral part of a good CI/CD pipeline and provides an objective measure of overall code reliability.
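To make this concrete, here’s a rough sketch of the kind of sensor a coverage tool injects (the names are illustrative, not the output of any particular tool):

// A toy version of the "sensor" that coverage tools inject per statement.
const hits: Record<string, number> = {};

function sensor(file: string, line: number): void {
  const key = `${file}:${line}`;
  hits[key] = (hits[key] || 0) + 1; // increment the counter for this line
}

// Original: function add(a, b) { return a + b; }
// After instrumentation, the statement reports itself before running:
function add(a: number, b: number): number {
  sensor('math.ts', 2);
  return a + b;
}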

However, this report can be misleading. The fact that a unit test triggered certain lines of code doesn’t mean the test proved their correctness. It only proves that the code ran. In other words, one could write tests without any assertions and still score highly. This can give a false sense of confidence in our code.

Mutation Testing

The best way to check the efficacy of an existing unit test is by purposely making it fail. Change a condition in the source code, an operator, or a Boolean from “true” to “false” and expect a failure in the test suite. Even if the code has high coverage, if no test fails as a result of this change we can be sure there is a gap in our test suite. Automate this process and you’ve invented Mutation Testing. Tools like Stryker or PIT change (or mutate) certain tokens of code and run the associated tests. A failing unit test “kills” the mutation; otherwise the mutation “survives.”
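A minimal illustration (not from our codebase):

// Original production code:
function isAdult(age: number): boolean {
  return age >= 18;
}

// One possible mutant: the ">=" operator replaced with ">".
function isAdultMutant(age: number): boolean {
  return age > 18;
}

// A test asserting isAdult(18) === true kills this mutant;
// without a boundary test, the mutant survives unnoticed.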

I’ve watched teams implement mutation testing for their projects, easily integrating Stryker for JavaScript, TypeScript, C#, and Scala, and PIT for Java. However, nothing like it had been tried for Roku channel development.

I took on the challenge of integrating mutation testing into our Roku development pipeline. Roku uses a proprietary language called BrightScript, and its only runtime lives on a physical device running the proprietary, closed-source RokuOS. This posed a twofold challenge for us: there were no over-the-counter mutation testing frameworks for BrightScript, and all unit tests need to be deployed to a device.

This is how we did it, and what we learned along the way. I’ll be using JavaScript for the examples.

Resources

I turned to Stryker for its extensibility with plugins, leaning on the resources and documentation on their website and on helpful chats with the community on Gitter (now a Slack group). The two things I had to do were:

  1. Write a mutator plugin to create mutations on BrightScript. I was directed to the javascript-mutator as the easiest example to base mine on. For this I had to understand how the code is parsed and mutated with an abstract syntax tree and the visitor pattern.
  2. Write a test-runner plugin, which Stryker calls to run the unit tests on the designated platform and framework, and then waits on to receive the test results.
Walk-Through-Raster, 1966, Frieder Nake

The AST

Mutations are introduced into the source code (or into the bytecode in JVM languages, which is how PIT works). To introduce mutations, we need to read the file into memory and parse its tokens into an abstract syntax tree (AST).

Each branch of code, such as an if-else, causes the syntax tree to branch, as does each new stack such as a function call. The nodes of this tree are the language tokens: variable names, operators, and all of the language keywords such as control flow, loops, and assignment operators. Each node is an object with metadata, including the file name, line number, and cursor position it belongs to, as well as its type (Boolean, space, assignment operator, etc.). The tree is “abstract” in the sense that it does not include every syntactic detail needed at runtime, only an overview of the structure.

This is how tools such as Babel parse modern JavaScript and rewrite it into backwards-compatible syntax that runs in browsers. Stryker uses Babel to parse JavaScript and generate an AST before triggering mutations on it. Other tools such as Acorn, ESLint (through Espree), Chevrotain, and TypeScript work in the same manner. Some of them have adopted a standard AST format for JavaScript, called ESTree.

Here’s a trivial example in JavaScript using AST explorer:

if (true) {
  const hello = "world";
}

Parsed into a JSON AST with acorn 7.3.1, it looks like this:

{
  "type": "Program",
  "start": 0,
  "end": 218,
  "body": [
    {
      "type": "IfStatement",
      "start": 181,
      "end": 218,
      "test": {
        "type": "Literal",
        "start": 185,
        "end": 189,
        "value": true,
        "raw": "true"
      },
      "consequent": {
        "type": "BlockStatement",
        "start": 191,
        "end": 218,
        "body": [
          {
            "type": "VariableDeclaration",
            "start": 194,
            "end": 216,
            "declarations": [
              {
                "type": "VariableDeclarator",
                "start": 200,
                "end": 215,
                "id": {
                  "type": "Identifier",
                  "start": 200,
                  "end": 205,
                  "name": "hello"
                },
                "init": {
                  "type": "Literal",
                  "start": 208,
                  "end": 215,
                  "value": "world",
                  "raw": "\"world\""
                }
              }
            ],
            "kind": "const"
          }
        ]
      },
      "alternate": null
    }
  ],
  "sourceType": "module"
}

Luckily for us, RokuRoad built Bright, an AST parser for BrightScript that returns an ESTree-like AST using a customized Chevrotain engine. This is the engine behind eslint-plugin-roku, which provides syntax highlighting for the BrightScript language.

Remember we mentioned that coverage testing works under the hood by placing a “sensor” function on each line? For this to be syntactically correct, an AST is necessary to recognize scopes and the relevant lines we want to track. The georgejecook/rooibos testing framework uses the sjbarag/brs interpreter for this.

If you like tinkering with metaprogramming, ASTs give you wings.

Zyklus Dürer, 1984–92, Vera Molnar

The Visitor Pattern

Once we have this tree in memory, we can walk through it using the “visitor pattern.” A visitor is an object that defines an interface function, in this case “mutate,” and “visits” (is passed as a parameter to) each of the nodes in the tree.

We can define, for instance, a “mutator” object for Boolean substitution. It visits each node, and if the node is a literal “true,” it clones it, changes it to “false,” and pushes it into an array of mutated nodes with all its corresponding metadata. After the entire tree has been traversed, we can move on to another mutator object, such as one for unary operators, which changes a ++ into a --.

Boolean Substitution Mutator, based on javascript-mutator
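A minimal sketch of what such a mutator can look like, walking an ESTree-like tree (the names here are illustrative, not the actual Stryker plugin API):

// A simplified ESTree-like node: a type plus location metadata and children.
interface AstNode {
  type: string;
  value?: unknown;
  raw?: string;
  [child: string]: unknown;
}

// The mutator visits a single node and returns zero or more mutated clones.
const booleanSubstitutionMutator = {
  name: 'BooleanSubstitution',
  mutate(node: AstNode): AstNode[] {
    if (node.type === 'Literal' && node.value === true) {
      // Clone the node and flip the value; the spread keeps its metadata.
      return [{ ...node, value: false, raw: 'false' }];
    }
    return [];
  },
};

// Depth-first walk: send the visitor to every node, collecting its mutants.
function collectMutants(node: AstNode, out: AstNode[] = []): AstNode[] {
  out.push(...booleanSubstitutionMutator.mutate(node));
  for (const value of Object.values(node)) {
    for (const child of Array.isArray(value) ? value : [value]) {
      if (child && typeof child === 'object') {
        collectMutants(child as AstNode, out);
      }
    }
  }
  return out;
}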

Other interesting mutators are:

  • Arrays: replacing arrays with empty ones.
  • Equality: replacing >= with >, as well as all the other permutations of equality operators.
  • String literals: replacing strings with "Stryker was here!"
  • Void functions: removing calls to void-returning functions that exist for their side effects (remember to ignore all loggers).
  • Et cetera (be creative!)

When we’re done with our array of mutator objects, we pass the array of mutated nodes to the mutant transpiler. The mutant transpiler substitutes each token in memory according to its location and writes the files into a sandbox folder. This is the mutated source code the unit tests will run against.
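In essence, the substitution is a splice over the original source text (a sketch with assumed field names, not the actual Stryker internals):

// Assumed shape of a collected mutant: its source range plus a replacement.
interface Mutant {
  fileName: string;
  start: number;       // character offset into the original source
  end: number;
  replacement: string; // e.g. 'false'
}

// Splice the replacement into the original source text.
function applyMutant(source: string, mutant: Mutant): string {
  return source.slice(0, mutant.start) + mutant.replacement + source.slice(mutant.end);
}

// Each result is written to its own sandbox copy of the project,
// and the test suite runs once per sandbox.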

Running the Tests

Stryker has no knowledge of how to run tests for each framework, language, or device. It achieves this through plugins implementing the “TestRunner” interface. The interface specifies a “run” function that returns a Promise with a “RunResult.”

Unfortunately, all Roku BrightScript code needs to run on a physical device, and unit tests are no exception. As of 2020, there’s still no emulator for the Roku streamer like there is for tvOS or Fire TV, and the proprietary runtime for the BrightScript language lives inside the device. We need to copy the unit test framework into the code, deploy the code to the device, trigger the test suite, and retrieve the results via Telnet logs.

Once parsed, the results are accumulated by Stryker into a reporter singleton instance. After all the tests have finished, we can format the report through specialized reporter plugins, which we can get over the counter.

This has to be repeated for every mutation, and mutations can run into the hundreds or thousands, depending on your source code and test suites. Mutation tests usually take a long time, and for a runtime requiring deployment to physical devices they can go on for hours.

For running unit tests we use the amazing georgejecook/rooibos mocha-inspired test framework. The device outputs its results over Telnet, so we built a Node.js module in TypeScript, using the native ‘net’ package, to retrieve them as a stream and pass or fail the unit tests on the relevant log line.
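A stripped-down sketch of that module (the device IP and the log patterns are placeholders; rooibos’ real output format differs):

import * as net from 'net';

// Connect to the Roku debug console over Telnet (port 8085).
const socket = net.connect(8085, '192.168.1.42' /* device IP: placeholder */);

let pending = '';
socket.on('data', (chunk: Buffer) => {
  pending += chunk.toString();
  const lines = pending.split('\n');
  pending = lines.pop() ?? ''; // keep a partial trailing line for the next chunk
  for (const line of lines) {
    if (/PASS/.test(line)) {
      // record a passing test
    } else if (/FAIL/.test(line)) {
      // record a failure and its message for the report
    }
  }
});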

If things start getting fancy with this stream, I might switch to reactive programming. That way I can subscribe to the logger multiple times with only one Telnet port connection:

import { Subject } from 'rxjs';

const logger$ = new Subject<string>();

// create connection...
socket.on('data', (chunk: Buffer) => {
  const str = chunk.toString();
  logger$.next(str);
});

// subscribe to logger$

This forms the core of our test-runner plugin, using the TestRunner interface from the Stryker API. In it we define a run() function which returns a Promise with a RunResult. We added a function which formats the above results to conform to the interface below:

interface RunResult {
  tests: TestResult[];
  errorMessages?: string[];
  status: RunStatus;
  coverage?: CoverageCollection | CoveragePerTestResult;
}

interface TestResult {
  name: string;
  status: TestStatus;
  timeSpentMs: number;
  failureMessages?: string[];
}

declare enum TestStatus {
  Success = 0,
  Failed = 1,
  Skipped = 2
}

declare enum RunStatus {
  Complete = 0,
  Error = 1,
  Timeout = 2
}
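A condensed sketch of such a run(), where deployToDevice, collectTelnetResults, and sandboxDir stand in for our own helpers (they are not part of the Stryker API):

// Stand-ins for our own helpers, declared here so the sketch type-checks.
declare const sandboxDir: string;
declare function deployToDevice(dir: string): Promise<void>;
declare function collectTelnetResults(): Promise<
  Array<{ name: string; passed: boolean; durationMs: number; message: string }>
>;

async function run(): Promise<RunResult> {
  await deployToDevice(sandboxDir);                // side-load the mutated channel
  const rawResults = await collectTelnetResults(); // parsed rooibos log lines

  const tests: TestResult[] = rawResults.map((r) => ({
    name: r.name,
    status: r.passed ? TestStatus.Success : TestStatus.Failed,
    timeSpentMs: r.durationMs,
    failureMessages: r.passed ? undefined : [r.message],
  }));

  return {
    tests,
    status: tests.length > 0 ? RunStatus.Complete : RunStatus.Error,
  };
}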

Putting it all together and running Stryker, we begin seeing logs:

Ongoing mutation testing in Roku
3078 mutations, estimated 19 hours for completion, or until my Roku device melts

39 mutants survived in the first few minutes. This means that functionality changed in the code and no test failed as a result.

Optimizations

A common optimization for mutation testing is to add coverage reports, associating each line of code with its corresponding tests. This filters the unit tests and runs only those relevant to the mutated lines of code, which dramatically reduces the running time. This, however, requires a large refactor of the unit test framework, since we need to add IDs to each test and test result.
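The idea, sketched with an assumed index structure:

// Hypothetical index built from per-test coverage: "file:line" -> test IDs.
const coverageIndex = new Map<string, Set<string>>();

// Only the tests that actually execute the mutated line need to run.
function testsCovering(fileName: string, line: number): Set<string> {
  return coverageIndex.get(`${fileName}:${line}`) ?? new Set();
}

// An empty set means no test touches the mutant: it survives trivially,
// and the deployment can be skipped altogether.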

The second optimization involves multi-threading, which in our case means deploying and testing on multiple Roku devices at once. This requires managing a pool of devices and deploying in turns, along the lines of the sketch below. Also, I’m a bit wary of straining one small device for 19 hours straight, so balancing the load across 5 or more devices would be a good idea.
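An illustrative sketch of such a pool (not something we have built yet):

// A minimal device pool: acquire a free Roku, deploy and run, release it.
class DevicePool {
  private free: string[];
  private waiting: Array<(ip: string) => void> = [];

  constructor(deviceIps: string[]) {
    this.free = [...deviceIps];
  }

  acquire(): Promise<string> {
    const ip = this.free.pop();
    return ip !== undefined
      ? Promise.resolve(ip)
      : new Promise((resolve) => this.waiting.push(resolve));
  }

  release(ip: string): void {
    const next = this.waiting.shift();
    if (next) next(ip);
    else this.free.push(ip);
  }
}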

With Roku devices I ran into Telnet connectivity issues, since the previous instance of tests hadn’t disconnected in time for the next one. I might switch to reactive programming and a singleton instance for the connection. Some of the mutations crashed the test suite, and some of them rebooted the device, triggering timeouts in Stryker.

The string literal mutator had to be disabled, since a simple string concatenation such as "hello " + name + "!" resulted in two separate mutations, with two separate sandboxes and two full runs of the test suite, and neither mutation was killed: there were no unit tests to catch such a trivial change. I’m looking for a different way to mutate strings.

Conclusion

We eventually want to complete these plugins and release them publicly. The current implementation is still a proof of concept; it uses only one Roku device, which resulted in many connection timeouts.

Once the report was in, it was easy to see files that had high test coverage but a low mutant kill score. Once we had a score for our test suite, we knew where work was needed, while also having a concrete, objective measure of improvement. However, it’s still important to write good tests, and not just for the sake of increasing a score.

“Of course your test suite should fail if you make a semantic change to your production code. Does anybody realistically doubt that?” -Uncle Bob, Mutation Testing

Ideally, a test should fail for every semantic change in the code. This shouldn’t be considered a nuisance: it increases the importance of each line of code and helps keep in mind the greater repercussions of small changes. Once tests react to changes, we can effectively refactor for housekeeping and improvements without fearing unintended repercussions.


Ronen Lahat
AT&T Israel Tech Blog

Hi, I'm a full-stack dev, data engineer (♥️ Spark), and lecturer at the AT&T R&D Center in Israel. I produce music as a hobby and enjoy Linux, coffee, and cinema.