10 changes that improved our tests
This is the second in a series of articles on how we built a productive C/C++ development environment at Undo.
When I started at Undo, a lot of effort had already gone into building strong test coverage alongside the product itself. There was a culture of adding a new test every time a failure was fixed. But we had problems with our tests in three main areas:
- Reliability. There were too many reasons that tests could fail intermittently. We ran many tests (now on the order of a million tests per day), so any one-in-a-million race condition left in our code would show up often enough to make a sizeable dent in our productivity.
- Efficiency. When we rolled out our first comprehensive merge-gating test suite, it took 50 minutes to run, which was too long for an efficient development cycle. With subsequent work we got this down to less than 20 minutes, and improving inefficient tests was an important part of that.
- Actionability. When tests failed, it wasn’t clear whether there was a real product problem that needed to be fixed. When developers were busy with competing deadlines and priorities, test failures that could be explained away as “not a product bug” were often ignored.
When we investigated these issues, we found that they mainly stemmed from a small set of anti-patterns that had crept into our tests, so in an attempt to shift our testing culture I started reviewing every new test added to our code base. After some time I wrote up a document with suggestions on how to write reliable, efficient and actionable tests and began socialising it within the team. I’m no longer regularly reviewing tests because the principles appear to have stuck: our tests are now substantially more reliable than they once were.
I’ve extracted what I think are the ten most important points from that document and listed them here. Some of them may seem obvious and others may be controversial — let me know what you think.
- Tests should only sleep when waiting for a specified condition. This is the big one. Our tests often contained sleeps where we expected something to have happened by the time the sleep expired. Cloud test machines can occasionally do nothing for several seconds at a time, so when the sleep expired nothing had progressed, breaking the assumption of progress and causing the test to fail. We fixed this by identifying the condition we needed to wait for and synchronising on that. If that’s not possible (perhaps because adding a print would be too slow in performance-sensitive code), we found it useful to identify something that would indicate the condition is not yet met, and loop until that indicator goes away. A sketch of this kind of wait loop, covering this and the next two points, appears after the list.
- Where sleeps are necessary, ensure they are short. When waiting for a condition to become true rather than just waiting for some time to pass, it’s usually sufficient to sleep only for a short period of time between checks of the relevant condition. In the context of testing our UDB and LiveRecorder products, it is rarely necessary to sleep for longer than 100 milliseconds in each loop iteration if the condition being waited for is clearly defined.
- Ensure that loops always terminate. Our test system runs all tests in the same process, so there is no easy way to add per-test timeouts. This means that tests with loops that can iterate forever would sometimes cause the entire test run to time out. To guard against this, we asked developers to always check against a timer in each loop iteration and fail if the timeout is exceeded.
- Ensure all invocations of external processes have timeouts. External processes that we spawned (for example, to check out code from external git repositories) had a tendency to hang. As with the previous item, this would cause entire test runs to time out, so we added a default timeout to the function our tests used to spawn processes; there is a sketch of this after the list.
- Where timeouts are required, make them longer than you think they need to be. Sometimes we tried to tailor timeouts to specific tests, machines or scenarios as a basic form of performance testing, before we had proper performance-testing infrastructure in place. This typically led to intermittent failures when we ran those tests on new platforms with different performance characteristics. Our default timeout was five minutes. Timeouts shorter than that were unnecessary in the common case where the test passed, and they increased the risk of sporadic failures; when they were hit, they raised difficult questions about whether the timeout was too short or whether there was a genuine problem to resolve. We found it preferable to avoid the question entirely and just use a long timeout.
- Always wait after asynchronous harness operations. Our tests would typically submit instructions to the product under test and immediately wait for a response. In some situations we needed these requests to be asynchronous, and sometimes we failed to wait for the response. Adding tracking in our test harness to ensure that only a single asynchronous operation could be in flight at once helped to prevent this (a sketch of the idea appears after the list).
- Assume tests will be run in parallel with themselves and other tests. It was common to find tests written with hard-coded filenames or that relied on other singleton machine resources. Those tests passed when run sequentially, or in a reasonably deterministic parallel schedule, but when we parallelised them more aggressively we started to see intermittent failures (a sketch of the fix appears after the list). Where tests genuinely require singleton resources, we tagged them accordingly and ran them sequentially in a separate test run.
- Test against requirements. We asked test authors to explain briefly in the documentation of their tests what the product is required to do and why. Including references to bug tracker issues, requirements documents or specifications as appropriate means that future readers can understand the technical context and are less likely to remove important coverage; an illustrative docstring appears after the list.
- Distinguish checking from testing. Checks are deterministic and should reproduce failures on their first iteration, which makes them suitable for the inner continuous integration loop. Our more exploratory tests looped through multiple iterations to achieve “probable” coverage, and we had to move them outside our main merge-gating and release-gating continuous integration loops to avoid intermittent, unactionable failures.
- Add thresholds to performance tests *and* plot historical results. If a test failed due to an exceeded threshold but no performance history was available, the failure could too easily be dismissed as an infrastructural blip. Tests with performance graphs but no thresholds were a little more useful, but by the time a regression was noticed it was much harder to figure out why it had happened. We had to delay major releases on a couple of occasions when we discovered that performance had regressed a week earlier. A sketch combining both is at the end of the list below.
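A few of these points are easier to see in code. The sketches below are in Python and are purely illustrative: our real harness is different, and names such as `wait_until` and `TestFailure` are invented for the examples. This first sketch combines points one to three: poll a clearly defined condition, sleep only briefly between checks, and bound the loop with a generous timeout so it can never spin forever.

```python
import time

class TestFailure(Exception):
    """Illustrative failure type; a real harness would have its own."""

def wait_until(condition, timeout=300.0, interval=0.1, description="condition"):
    """Poll `condition` until it returns True, sleeping briefly between checks.

    Fails if `timeout` seconds pass without the condition becoming true, so
    the loop always terminates even if the product never makes progress.
    """
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() > deadline:
            raise TestFailure(f"timed out after {timeout}s waiting for {description}")
        time.sleep(interval)  # short sleep: the test resumes soon after the condition holds

# Usage (hypothetical): wait for the product under test to write its ready marker.
# wait_until(lambda: os.path.exists(marker_path), description="ready marker")
```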
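For point four, a thin wrapper around Python’s `subprocess` module shows the shape of the change we made: every external invocation goes through one helper with a default timeout. The helper name and the default value here are illustrative, not our actual code.

```python
import subprocess

DEFAULT_TIMEOUT = 300  # seconds; deliberately generous, per the advice on timeouts above

def run_external(cmd, timeout=DEFAULT_TIMEOUT, **kwargs):
    """Run an external command without ever letting it hang the whole test run."""
    return subprocess.run(
        cmd,
        timeout=timeout,      # raises subprocess.TimeoutExpired if the command hangs
        check=True,           # raises CalledProcessError on a non-zero exit status
        capture_output=True,
        text=True,
        **kwargs,
    )

# Usage (hypothetical):
# result = run_external(["git", "clone", "--depth=1", repo_url, checkout_dir])
```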
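For point six, the tracking we added can be sketched as a wrapper that refuses to start a second asynchronous operation while one is still outstanding. The `Harness` class and its `send`/`receive` connection methods are invented for the example; the idea is simply that forgetting to wait becomes an immediate, loud failure rather than an intermittent one.

```python
class Harness:
    """Illustrative fragment of a test-harness wrapper, not our real API."""

    def __init__(self, connection):
        self._connection = connection
        self._pending = None  # the request currently in flight, if any

    def send_async(self, request):
        """Submit a request without waiting for its response."""
        if self._pending is not None:
            raise RuntimeError(
                f"request {self._pending!r} is still in flight; "
                "call wait_for_response() before sending another"
            )
        self._pending = request
        self._connection.send(request)

    def wait_for_response(self, timeout=300.0):
        """Collect the response to the outstanding request."""
        if self._pending is None:
            raise RuntimeError("no asynchronous request is in flight")
        response = self._connection.receive(timeout=timeout)
        self._pending = None
        return response
```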
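For point seven, the usual fix for hard-coded filenames is to give every test invocation its own scratch directory, so the test can safely run in parallel with itself. The test name and output filename below are made up; only the pattern matters.

```python
import os
import tempfile

def test_round_trip():
    """Illustrative test body: every path it touches is unique to this invocation."""
    # A hard-coded path such as /tmp/output.undo collides as soon as the test
    # runs in parallel with itself; a per-invocation temporary directory never does.
    with tempfile.TemporaryDirectory(prefix="round_trip_test_") as workdir:
        output_path = os.path.join(workdir, "output.undo")
        # ... exercise the product under test, directing all output into workdir ...
        # ... then assert on the contents of output_path ...
```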
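Point eight is about documentation rather than code, but the shape of what we asked for fits in a docstring. Everything below is a made-up placeholder showing the level of context we wanted; a real test would cite its actual bug tracker issue or requirements section.

```python
def test_attach_to_running_process():
    """Check that we can attach to a process that is already running.

    Why: attaching to a live process is a documented requirement; see the
    bug tracker issue and requirements section for this feature (real
    references would go here). Removing this test would leave that
    requirement uncovered.
    """
    ...
```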
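Finally, for point ten, a minimal performance check can both record its measurement and enforce a threshold. The file format, threshold value and helper name here are illustrative; the point is that a single run produces an actionable failure and a data point that can be plotted later.

```python
import csv
import time
from datetime import datetime, timezone

THRESHOLD_SECONDS = 30.0           # generous ceiling that still catches gross regressions
HISTORY_FILE = "perf_history.csv"  # plotted separately to spot gradual drift

def record_and_check(test_name, run_workload):
    """Time a workload, append the result to the history file, then apply the threshold."""
    start = time.monotonic()
    run_workload()
    elapsed = time.monotonic() - start

    # Record first, so the data point is kept even when the threshold check fails.
    with open(HISTORY_FILE, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), test_name, f"{elapsed:.3f}"]
        )

    assert elapsed <= THRESHOLD_SECONDS, (
        f"{test_name} took {elapsed:.1f}s, over the {THRESHOLD_SECONDS:.0f}s threshold"
    )
```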
I hope these suggestions are useful — perhaps they chime with your own experiences in testing complex software? In our case, we now see on the order of one or two intermittent test failures per day out of a million tests being run, which is a substantial improvement from where we were before.
Keep an eye out for the next article in the series.