Spec failing? Better luck next time.

https://phys.org/news/2016-11-maths-roulette.html

Background

Many of us rely on our test suites to catch problems with our code. However, what happens when we can’t trust our own tests?

Tests are Unreliable (fact)

At Greenhouse, our main application has become something of a monolith. Our developers write tests for every new feature and bug fix, so you can imagine what that means for the size of our test suite. It currently sits at over 2000 spec files, which are run every time someone creates or pushes to a PR.

The current state of our test suite
On average it takes between 16–19 minutes for our CI to run the full test suite

What’s the problem with that? Isn’t wider coverage better? It’s not a ‘yes or no’ answer. Here’s why: a lot of tests don’t bring much lasting value in identifying bugs, simply because some functionality is more prone to issues than others. Another issue is that there are so many tests that they are hard to maintain and update as our code evolves. Most of all, many tests are prone to random failures (for a variety of reasons). We call these tests finicky.

While all developers felt the pain of rerunning the test suite on their branches because of these finicky specs, it was the release managers who had it the worst. It takes around 15 minutes to finish running the test suite in our CI/CD pipeline. This meant that if even 1 of the suite’s 15000+ examples was finicky, the release manager would have to rerun the pipeline, spending another 15 minutes each time. At times, releases would get backed up to such an extent that developers would have to branch off of existing branches, something we have always tried to avoid.

The fail/pass rate of a typical release branch

Another problem with finicky tests is that, at a certain point, you as a developer begin to lose confidence in the test suite. You are no longer able to rely on its results to expose bugs. When you do see failing tests, you assume they are unrelated to your branch and ignore them, until you realize that rerunning the test suite six times isn’t making the build green. In some cases, tests that look finicky on your own branch are in fact indicators of actual bugs.

The fail/pass rate of a typical developer branch

How Can Faith in the Test Suite Be Restored?

The first step to restoring confidence in the test suite was to figure out what was finicky and what was broken. By identifying the specs that were actually finicky, we kept developers from blindly rerunning the suite all the time and gave ourselves a backlog of finicky specs to prioritize.

And thus, Autometrics was born!

To begin gathering this data, we decided to build a small Rails app, which we called Autometrics (automation metrics). On our CI server, we added a JSON RSpec formatter and sent its test run data to Autometrics once the test suite completed. This gave us an abundance of information we could use to analyze the history of all our test runs across all branches. With this new information, we knew we could build something to check failure rates and failures across all feature branches.
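Here is a minimal sketch of that idea, not the actual Autometrics code. It assumes the suite was run with RSpec’s built-in JSON formatter and reads the fields that formatter emits (examples, status, file_path, run_time, summary). The endpoint URL, the payload shape, and the build_payload and send_to_autometrics helpers are hypothetical names chosen for illustration.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical endpoint: the real Autometrics URL is not given in this post.
AUTOMETRICS_URL = "https://autometrics.example.com/test_runs"

# Turn the output of RSpec's JSON formatter into a compact payload.
# The payload shape is an assumption made for this sketch.
def build_payload(rspec_json, branch:, build_id:)
  report = JSON.parse(rspec_json)

  examples = report.fetch("examples", []).map do |example|
    {
      file_path: example["file_path"],
      description: example["full_description"],
      status: example["status"], # "passed", "failed" or "pending"
      run_time: example["run_time"]
    }
  end

  {
    branch: branch,
    build_id: build_id,
    duration: report.dig("summary", "duration"),
    examples: examples
  }
end

# Post the payload as JSON; meant to be called once the suite has finished.
def send_to_autometrics(payload, url: AUTOMETRICS_URL)
  Net::HTTP.post(URI(url), JSON.generate(payload), "Content-Type" => "application/json")
end
```

On a CI server, something like this could run as a step after rspec --format json --out rspec.json, reading that file and passing in the branch name and build number from the CI environment.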

Now, how did we transform this data into something useful? We settled on a few questions whose answers would prove useful and be simple enough to deduce:

  1. What are the most finicky specs across master and/or release branches? This quickly tells us which specs are the worst finicky offenders and are causing the most trouble for our releases.
  2. What is the failure rate of a spec across master and/or integration branches? This tells us how finicky a spec is (since it’s failing on a master/release branch, it’s not an actual failure of the feature branch).
  3. What is the failure history of a spec across feature branches? This can tell us when a finicky spec was introduced and whether the behavior is consistent across all these feature branches.
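The three questions above boil down to simple aggregations over stored example runs. The sketch below shows one way to compute them in plain Ruby. The Run struct, the TRUNK_BRANCHES list, and the method names are assumptions made for illustration, and the real app would presumably query its database rather than an in-memory array.

```ruby
# One stored example result. Field names are illustrative.
Run = Struct.new(:file_path, :branch, :status, keyword_init: true)

# Branches where a failure is assumed not to come from feature work.
TRUNK_BRANCHES = %w[master integration].freeze

# Question 2: share of failed runs for one spec file on trunk branches.
def failure_rate(runs, file_path)
  trunk = runs.select { |r| r.file_path == file_path && TRUNK_BRANCHES.include?(r.branch) }
  return 0.0 if trunk.empty?

  trunk.count { |r| r.status == "failed" }.fdiv(trunk.size)
end

# Question 1: spec files with the highest trunk failure rate, worst first.
def finicky_list(runs, limit: 10)
  runs.map(&:file_path).uniq
      .map { |path| [path, failure_rate(runs, path)] }
      .select { |_path, rate| rate > 0 }
      .sort_by { |path, rate| [-rate, path] }
      .first(limit)
end

# Question 3: number of failures per feature branch for one spec file.
def failures_by_branch(runs, file_path)
  runs.select { |r| r.file_path == file_path && r.status == "failed" }
      .reject { |r| TRUNK_BRANCHES.include?(r.branch) }
      .group_by(&:branch)
      .transform_values(&:size)
end
```

Sorting by failure rate rather than raw failure count keeps a rarely run spec that fails half the time from hiding behind a frequently run one. Either ordering could be reasonable depending on what hurts releases most.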

Next, we planned an MVP that would get this information in front of developers and release managers. What would introduce the least overhead, but still provide useful data? We decided to develop a CLI tool on top of Autometrics. It is based on the data we send over from our CI/CD tool and offers a few commands.

Below are the CLI commands available:

am help spec
Commands:
autometrics spec finicky FILE_PATH # Check if a spec is finicky against master/integration
autometrics spec finicky_list # Check for X top list of finicky specs against master/integration
autometrics spec help [COMMAND] # Describe subcommands or one specific subcommand
autometrics spec history FILE_PATH # List of example runs for spec

To see the specs with the most failures on integration or master branches, you’d run finicky_list.

In addition, the CLI proved useful for test automation and QA to determine where finicky specs or bugs were introduced by looking across the history of all branches. For instance, if there was no history of a particular test failure at all until “Branch A”, or if there was an abnormal increase in failures after that branch, it would strongly suggest that the branch introduced a finicky test.
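That kind of reasoning is easy to sketch in code. The example below assumes a spec’s history is available as a list of runs ordered oldest first. HistoryEntry and the method names are hypothetical, and the history command itself may present the data differently.

```ruby
# One run of a spec, oldest first in the history list. Field names are illustrative.
HistoryEntry = Struct.new(:branch, :status, keyword_init: true)

# Share of failed runs in a list of entries (0.0 for an empty list).
def failed_share(entries)
  return 0.0 if entries.empty?

  entries.count { |e| e.status == "failed" }.fdiv(entries.size)
end

# Branch of the earliest failing run, or nil if the spec never failed.
def first_failing_branch(history)
  first_failure = history.find { |entry| entry.status == "failed" }
  first_failure && first_failure.branch
end

# Failure share before and from the first run on the given branch.
# Returns nil if the branch never appears in the history.
def failure_shares_around(history, branch)
  index = history.index { |entry| entry.branch == branch }
  return nil if index.nil?

  [failed_share(history.first(index)), failed_share(history.drop(index))]
end
```

A jump from a near-zero share before the branch to a clearly higher one afterwards is a hint worth investigating, not proof: an unrelated infrastructure change around the same time could produce the same pattern.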

While a few developers and release managers used the CLI we created, we didn’t see the level of adoption we were hoping for. We had to figure out how to expose all of this information better. What’s simpler than a small CLI? A webpage! We thought about what would be easiest to digest quickly and accurately. Graphics and charts are more digestible than rows and rows of data to sift through.

How Have We Benefited?

Now, how has this all helped us? The tool has led to us identifying and fixing 70 finicky specs so far this year!

Also, we see far fewer of those oh-so-sad Jenkins failure icons.

Instead of the bottom image, we see much more of the top set!

What’s more, we’ve seen an increase in the frequency of releases! We’ve improved from 1.75 to 1.86 releases per day. While that difference may not sound too large, over the course of a year that could amount to an extra 25 releases! Beyond the numbers, the introduction of Autometrics led to a change in the general attitude towards finicky tests. Developers and QA are more empowered to identify and report tests that are finicky.

While it would be nice to dream, the world of development will never be rid of those pesky finicky tests. However, we here at Greenhouse are at least better equipped to handle them as they come.

Pssst — we’re hiring!