A smarter way to QA: introducing generative testing

Following on from our post about how we built a whole new platform to make building great integrations faster: Jon, our Test Software Engineer, explains how we now use generative testing to ensure it works!

One of the key things we do at Geckoboard is make it easy to get data from wherever it exists onto a dashboard. This means having to integrate with scores of third party APIs. Within each service there’s also a huge variety in what a customer might want, from different types of visualisation to different metrics, filters and time periods. We want to be able to cater for them all.

Before we delve into how we’re now testing it, we’ll begin with a short explanation about how our new platform actually works.

Imagine we wanted to build a new Intercom integration. Before we had our new platform, we had to research all the metrics and filters someone might want, look at the service’s API docs, then manually work out which queries we’d need to make for all the most common data requests. We’d then code these up with a custom built UI.

Our new approach is far less manual. Now, when we want to build a new integration, we plug the API’s characteristics into a metadata file and a query planner works out what requests it needs to make. Even the UI for users to select what data they want is built dynamically from metadata.

However, the problem we quickly encountered with our new approach was that seemingly small changes to the query plan could cause totally unexpected regressions in performance. Given the huge variety in potential queries due to filtering and complexity of the query planner, how were we going to effectively test this new platform?

We knew we’d never get sufficient coverage through exploratory testing or manually hardcoding a set of test cases. We needed to find a thorough testing solution that didn’t involve an army of people trying every possible query combination.

Enter generative testing

When we built our query planner we drew inspiration from how relational databases work, so a natural first step in working out how to test it was to research how relational databases themselves are tested.

It turned out that most rely on hard-coded benchmark tests that evaluate example queries against execution time. Unfortunately we weren’t able to use this approach because we were querying third party APIs. We wouldn’t be able to run enough tests before falling foul of rate limits. We also wouldn’t have any idea of what was going on on the servers: slowness might have nothing to do with the queries we were running.

Next we decided to look into generative testing. This was something we knew of conceptually but had never applied in a real situation. Generative testing is all about writing code that generates test cases for you. You tell it what a query can look like and what kinds of parameters are possible. It creates the tests. What’s great about this approach is the scale. Rather than having tens or hundreds of test cases that have to be written manually, it can generate millions.
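To make the idea concrete, here is a minimal sketch of a test case generator in Go. It is an illustration only, not the real generator: the Query type, its fields and the lists of allowed values are all made up for the example. You describe what a query can contain, and the code produces as many random queries as you ask for.

```go
package main

import "math/rand"

// Query is a hypothetical, simplified description of a data request.
type Query struct {
	Metric  string
	Filters []string
	Period  string
}

// Made-up values that a query is allowed to contain.
var (
	metrics = []string{"conversations", "users", "tickets"}
	filters = []string{"status=open", "assignee=me", "tag=vip", "priority=high"}
	periods = []string{"today", "last_7_days", "last_30_days"}
)

// GenQuery builds one random query from the allowed values.
func GenQuery(r *rand.Rand) Query {
	q := Query{
		Metric: metrics[r.Intn(len(metrics))],
		Period: periods[r.Intn(len(periods))],
	}
	// Include each filter independently with probability 1/2.
	for _, f := range filters {
		if r.Intn(2) == 0 {
			q.Filters = append(q.Filters, f)
		}
	}
	return q
}

// GenQueries returns n random queries. Reusing a seed reproduces the same
// queries, which helps when a failing run needs to be replayed.
func GenQueries(seed int64, n int) []Query {
	r := rand.New(rand.NewSource(seed))
	out := make([]Query, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, GenQuery(r))
	}
	return out
}
```

Calling GenQueries with a fresh seed on every run gives different test cases each time, while logging the seed keeps any failure repeatable.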

Another huge benefit is that it can randomly generate different test cases each time it’s run, ensuring broad coverage without the need to run every permutation every time. What’s more, if you add new functionality, all you need to do is update the logic that generates the test cases rather than rewrite hundreds of test cases by hand.

But how do you evaluate whether the tests should pass or fail when the test cases are all generated randomly? You can’t just use the code you’re testing (or can you?! More on that later!). That’s where property tests come in.

Let’s take the classic string reversal example. Instead of hardcoding that reversing “backwards” should return “sdrawkcab,” you might have a set of property tests like: the length of the input and output string should be the same, the first character of the original string should be the same as the last of the reversed string, and so on.

The test doesn’t necessarily need to know the actual value returned, just what conditions it must satisfy to be true.
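The string reversal properties can be sketched in Go as follows. This example uses the standard library's testing/quick package rather than Gopter, purely to keep it free of dependencies. The function names are made up for the example, and the third property (reversing twice returns the original) is an extra one commonly used for this exercise.

```go
package main

import (
	"testing/quick"
	"unicode/utf8"
)

// Reverse returns s with its characters (runes) in reverse order.
func Reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// reverseProperties reports whether every property holds for one input.
// Note that it never needs to know the expected output string.
func reverseProperties(s string) bool {
	out := Reverse(s)
	// Property 1: input and output have the same number of characters.
	if utf8.RuneCountInString(s) != utf8.RuneCountInString(out) {
		return false
	}
	// Property 2: the first character of the input is the last of the output.
	if s != "" {
		first, _ := utf8.DecodeRuneInString(s)
		last, _ := utf8.DecodeLastRuneInString(out)
		if first != last {
			return false
		}
	}
	// Property 3: reversing twice gives back the original string.
	return Reverse(out) == s
}

// CheckReverse runs the properties against n randomly generated strings.
// It returns nil if every case passed, or an error describing a failing input.
func CheckReverse(n int) error {
	return quick.Check(reverseProperties, &quick.Config{MaxCount: n})
}
```

In a real test file, CheckReverse would typically be called from a Test function that fails if the returned error is not nil.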

Property testing and generative testing usually go together. The Haskell programming language has a popular library called QuickCheck. We weren’t sure if we’d find a suitable Go equivalent, given that Go isn’t a functional programming language, but we came across Gopter and it turned out to be exactly what we needed. To prove the concept, we started by writing tests to evaluate one particular type of query we call “fast counts.” We did this because they were easy to evaluate: they should all produce the same type of query plan.

With paginated APIs, when you send a request for all the tickets, agents and so on that match a filter, the response includes not only the individual items but also metadata, such as the total number of matching items. We can use this metadata to answer the query rather than having to count the items ourselves.
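As a rough sketch of the fast count idea, the Go snippet below reads the total from the metadata of the first page instead of walking every page. The response shape and the field names (items, total_count, next_page) are assumptions made for the example; every real API names these differently.

```go
package main

import "encoding/json"

// page is a hypothetical paginated API response: one page of items plus
// metadata giving the total number of matching items across all pages.
type page struct {
	Items      []json.RawMessage `json:"items"`
	TotalCount int               `json:"total_count"`
	NextPage   string            `json:"next_page"`
}

// FastCount answers "how many items match?" from the first page alone,
// by reading the total from the metadata rather than counting the items.
func FastCount(firstPage []byte) (int, error) {
	var p page
	if err := json.Unmarshal(firstPage, &p); err != nil {
		return 0, err
	}
	return p.TotalCount, nil
}
```

The saving is that one request replaces a request per page, which matters both for speed and for staying within rate limits.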

To evaluate our generative tests all we needed to do was check that certain types of query produced the fast count plans. From there we expanded out the tests to the full range of parameters our query planner would ever have to deal with.

If we were going to use property-based testing, though, we still had to work out what property we wanted to test. What we cared about was making sure that queries didn’t get slower from one release to the next. Our breakthrough was when we realised that to evaluate the tests we could run them twice each time: once against the query planner currently in production and once against the new version. Any difference in the query plans generated would be immediately flagged and investigated by a developer to decide whether it was a regression or an improvement. Most often it’s an inadvertent change.
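The "run it twice and compare" idea can be sketched in Go like this. Everything here is a stand-in: the Query and Plan types, the two toy planners and the generator are invented for the example, and plannerV2 contains a deliberate change so that there is something to detect.

```go
package main

import (
	"math/rand"
	"reflect"
)

// Query is a hypothetical, simplified stand-in for a real query.
type Query struct {
	Metric  string
	Filters []string
}

// Plan is the ordered list of API requests a planner decides to make.
type Plan []string

// Planner turns a query into a plan.
type Planner func(Query) Plan

// Diff records one query for which the two planners produced different plans.
type Diff struct {
	Query    Query
	Old, New Plan
}

// plannerV1 plays the role of the planner currently in production.
func plannerV1(q Query) Plan {
	return Plan{"GET /" + q.Metric + "/count"}
}

// plannerV2 plays the role of the new version. It contains a deliberate,
// made-up change: with three or more filters it adds a second request.
func plannerV2(q Query) Plan {
	p := plannerV1(q)
	if len(q.Filters) >= 3 {
		p = append(p, "GET /"+q.Metric+"/list")
	}
	return p
}

// genQuery builds a random query from a small set of made-up values.
func genQuery(r *rand.Rand) Query {
	metrics := []string{"conversations", "users", "tickets"}
	filters := []string{"status=open", "assignee=me", "tag=vip", "priority=high"}
	q := Query{Metric: metrics[r.Intn(len(metrics))]}
	for _, f := range filters {
		if r.Intn(2) == 0 {
			q.Filters = append(q.Filters, f)
		}
	}
	return q
}

// ComparePlanners generates n random queries, runs each one through both
// planners and returns every query whose plans differ.
func ComparePlanners(oldP, newP Planner, seed int64, n int) []Diff {
	r := rand.New(rand.NewSource(seed))
	var diffs []Diff
	for i := 0; i < n; i++ {
		q := genQuery(r)
		a, b := oldP(q), newP(q)
		if !reflect.DeepEqual(a, b) {
			diffs = append(diffs, Diff{Query: q, Old: a, New: b})
		}
	}
	return diffs
}
```

The property being tested is simply "both versions produce the same plan", so no expected plans ever have to be written by hand. Each returned Diff is something for a developer to judge as a regression or an improvement.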

Another trick we were able to apply was to use the same validator we use in the product (to test whether the user’s query was valid) to check the queries produced by the generator before running them. That way we knew the tests weren’t failing because of bad queries. To begin with, only around half the generated queries were valid, but as we tightened up the logic that creates them, we got the proportion of valid queries up to almost 95%.
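A minimal sketch of that filtering step in Go might look like the following. The Validate function stands in for the product's real validator, and its single rule is invented purely for illustration, as are the Query type and the generator.

```go
package main

import "math/rand"

// Query is a hypothetical, simplified query.
type Query struct {
	Metric  string
	Filters []string
}

// Validate stands in for the product's real validator. The rule here is
// invented for illustration: "users" cannot be filtered by ticket status.
func Validate(q Query) bool {
	if q.Metric != "users" {
		return true
	}
	for _, f := range q.Filters {
		if f == "status=open" {
			return false
		}
	}
	return true
}

// genQuery builds a random query from a small set of made-up values.
func genQuery(r *rand.Rand) Query {
	metrics := []string{"conversations", "users", "tickets"}
	filters := []string{"status=open", "assignee=me", "tag=vip"}
	q := Query{Metric: metrics[r.Intn(len(metrics))]}
	for _, f := range filters {
		if r.Intn(2) == 0 {
			q.Filters = append(q.Filters, f)
		}
	}
	return q
}

// GenValid generates n candidate queries and keeps only those the validator
// accepts. It also returns the fraction of candidates that were valid.
func GenValid(seed int64, n int) ([]Query, float64) {
	r := rand.New(rand.NewSource(seed))
	var valid []Query
	for i := 0; i < n; i++ {
		if q := genQuery(r); Validate(q) {
			valid = append(valid, q)
		}
	}
	if n == 0 {
		return valid, 0
	}
	return valid, float64(len(valid)) / float64(n)
}
```

Tracking the valid fraction is useful in its own right: a low number means most of the generator's work is being thrown away, which is a hint that the generator itself should encode more of the rules.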

As well as speeding up development and giving us peace of mind over our changes, the sudden increase in the variety of test cases helped to highlight a few problems with our query planner. We found one particular query that managed to generate 97K possible plans before the planner crashed!

The future is shrinking

In terms of where we take it next, we’ve started looking into shrinking. Shrinking aims to make it easier for developers to track down why their commit is generating a different query plan.

When a generated query produces a different query plan, the test returns the problematic query. The difficulty is identifying why. One essential piece of information is which part of the query causes the plans to deviate. As the queries are often long and complicated, tracking this down can be a time-consuming endeavour.

Shrinking helps automate the process by taking the problem query and modifying or removing one parameter at a time. If, when re-run, the tests no longer show a difference then the problem is likely to be related to the parameter just changed, flagging where the issue is. The end goal is to come up with the simplest possible test case that triggers the problem.
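One simple way to shrink, sketched below in Go, is a greedy loop that tries dropping each filter in turn and keeps the smaller query whenever the problem still shows up. This is an illustration of the general technique under simplifying assumptions, not a description of how Gopter shrinks: the query is reduced to a list of filter strings, and stillFails is a hypothetical callback that re-runs the comparison.

```go
package main

// Shrink greedily removes one filter at a time, keeping a removal only if the
// smaller query still fails (that is, the two planners still disagree).
// It repeats until no single filter can be removed, and does not modify the
// slice passed in.
func Shrink(filters []string, stillFails func([]string) bool) []string {
	cur := append([]string(nil), filters...)
	for changed := true; changed; {
		changed = false
		for i := 0; i < len(cur); {
			// Candidate: the current query without filter i.
			cand := make([]string, 0, len(cur)-1)
			cand = append(cand, cur[:i]...)
			cand = append(cand, cur[i+1:]...)
			if stillFails(cand) {
				// Filter i was irrelevant: drop it and retry the same index.
				cur = cand
				changed = true
			} else {
				// Removing filter i hides the problem, so it is part of the cause.
				i++
			}
		}
	}
	return cur
}
```

Whatever survives is the set of filters that each need to be present for the plans to differ, which is usually a much shorter query for a developer to reason about.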

We’re now totally sold on the benefits of generative testing. We’ve proved that, when you have potentially millions of queries that can be run, generative testing can quickly and efficiently cover a far greater range of values for much less human effort.