A mutation testing tool for Hive SQL
At Hotels.com (an Expedia Group brand) we are big users of Apache Hive and therefore SQL. Although we love SQL, when it comes to trying to test it the relationship tends towards love-hate. Since we use a lot of queries for analysis, reporting and ETL it’s important to know that they are working as we expect. Unfortunately in the past we’ve often ended up with large monolithic queries which can be very challenging to test to ensure correctness.
In this post I will describe some techniques we devised to help with breaking up large queries. I’ll also introduce a tool that we developed called Mutant Swarm which helps us ensure we have sufficient test coverage of our Hive SQL codebases.
As much as we try, SQL will never die
SQL recently celebrated its 45th birthday, causing speculation that it’s no longer relevant. But as you can see from the diagram below, after creation of many big data frameworks there has at some point been an adoption of SQL and a subsequent release for it. This demonstrates to us that there is still a big demand for SQL and, try as we might, we can’t ignore it.
Despite its wide use there is a lack of testing and coverage tools available for SQL, and there doesn’t seem to be a standard for how testing should be carried out for large monolithic queries.
In the Big Data Platform team at Hotels.com we are Java developers, so we are very aware of a number of great tools which are available for testing and code coverage such as Cobertura, Jacoco, and PIT. These tools provide some really great insights with fast visual feedback, motivating developers to write more tests to get the visuals as green and positive as possible.
So we wondered — could we also have these same reporting tools for SQL?
Unit Testing for Hive
It turns out that there already is a great unit testing framework for Hive queries called HiveRunner, built by Klarna. Writing HiveRunner tests does require developers to have some knowledge of Java, however, once you have this the framework is simple to use and provides great visual feedback to developers.
In a previous blog post I discussed the benefits of the modularisation of large nested SQL queries by implementing the methods of chained tables and views and comparing them with a monolithic query. The benefits of breaking up nested queries include:
- Makes them easier to maintain,
- Each part can be tested independently and the results checked,
- Modularisation doesn’t affect execution performance.
So we have some solutions for modularisation and testing of Hive SQL but we were unable to find any tools which provide useful insights into the code coverage of the SQL being tested. We looked further into some of the existing Java tools with the idea of potentially building our own tool for SQL. In doing this, we found that the tool PIT uses mutation testing to mutate existing Java test classes. Through research into mutation testing with SQL we discovered a web tool created by the university of Oviedo which generates a list of mutations for a given query. The simple web tool confirmed that this idea can be applied to SQL, and gave us the green light to go ahead with building a mutation testing tool.
What is mutation testing?
Mutation testing is a form of testing which measures the effectiveness of your test suite by making small changes to your code — or in this case query — referred to as a ‘mutant’. This mutant is then run against your existing test suite. Since the underlying functionality of the code has changed, we would expect that the test suite should have a number of failures. If, however, the tests in the suite still pass, it is an indication that they are not sensitive enough and that not all lines of the code or query are being covered by the test suite.
If the test fails it means that the changed section (mutation) is being covered by the test suite, and the mutant is referred to as having been ‘killed’. If the test survives despite being executed on the mutated code the mutant is considered to have ‘survived’. The effectiveness of the test suite is measured by the percentage of mutants that are killed — so, failing is actually a good thing! More mutants can be killed by adding more tests and more discriminating test data.
The example below should help to explain this concept a little better.
Shown is some example test data and a simple query that selects hotels in a certain price range, as well as the result from running the query on the test data. At the bottom of the diagram is the test suite for the query.
Mutation testing will take this source query under test and will identify the pieces of it which can be mutated — these parts have been highlighted in yellow in the diagram below. We refer to this as “sequencing the genes”, where the genes are the parts of the query which can be mutated. The genes are grouped into types, and each type has a set of mutations which they can be replaced with. For example,
lower can be mutated to
< can be mutated to
The original query under test is then taken and a number of copies are produced. Each of these copies is then injected with a single mutation (highlighted in yellow in the diagram below), turning them into mutants. These mutants are then executed against the test suite. In the example below, a red cross indicates the test has failed (again, this is a good thing) and a green check mark indicates the test has passed (not good).
The example shows that there was one mutant (top right) which survived. You can see the mutation is on the
price column, so this survivor indicates that this column is not covered by the test suite. In some cases, for whatever reason you may actually want the value for this column to be
null, so to cover this situation you would need to add some more discriminating test data.
So now you know what mutation testing is, you’re ready to hear about Mutant Swarm.
How it works
Mutant Swarm initially uses HiveRunner to run the test case once and generate the standard unit test reports. After this Mutant Swarm will then take the SQL under test and sequence the genes inside it — again, this is a fancy way of saying that it identifies the parts which can be changed. The mutants of the initial query are then generated, with a single mutation injected into each, and are then run against the test suite one by one.
After all of the mutants have executed, a mutation report will be generated indicating which mutations were killed and which survived, thus determining the effectiveness of the test suite. Below is an example of a generated coverage report.
The mutation report displays the original query, highlighting each of the genes. If a gene is highlighted green it means every mutation of this gene has been successfully killed and this gene has full coverage. However if the gene is red it means that one or more of the mutations has survived, so more tests are needed to cover these case. Clicking on a gene provides the information of what it has been mutated to, whether the mutation survived or was killed, and the test name for the case where it survives. The line number also provides a summary of the mutation information.
So by looking at the mutation report you can clearly see which parts of the query aren’t being covered by the test suite. This provides some guidance for where additional testing and more discriminatory data is needed.
If you have skimmed the post up to this part, this section is for you.
- We have built a cool mutation testing tool (with an even cooler name) for determining code coverage of Hive SQL codebases.
- Mutant Swarm is open source on the Hotels.com GitHub page.
- You can use HiveRunner to unit test your queries.
- Checking code coverage highlights issues/risks in your code, allowing you to focus your testing efforts.
- By using Mutant Swarm, you can remove the guess work as to whether your queries are a) actually behaving as you think they should, and b) appropriately covered by your test suite.
- You should use Mutant Swarm 🐝