Implementing System Tests

By Mike Milner, CTO

System tests, also referred to as end-to-end tests or “enormous” tests at Google [1], assert that an application works as a whole. These tests exercise many components in the application stack from the user interface to the backend, and may even use external services. This post shares some of the hard lessons learned while implementing system tests.


The setup of components in the application stack must be fully automated before implementing system tests. In some cases, the components can be configured with Docker as detailed in my Rails Integration Testing with Docker blog post. In other cases, a configuration management framework like Ansible, Chef or Puppet might be preferable. In all cases, the process should be as similar as possible to the one DevOps uses to deploy to a staging or production environment, and to the one developers use to configure a development environment.

Note that the setup of some components might be more problematic to automate than others. Some internal services might be complicated to configure and require a prohibitive amount of work to fully automate. Some external services might require using a single account for the company. Thus, it’s possible that system tests might have to reuse some components.

Continuous Integration

Once the setup is fully automated, system tests can be run in a Continuous Integration (CI) testing framework like Jenkins or Travis; this can even be done before actual system tests are implemented to expedite the setup process. At this point, it might be tempting to configure CI for every pull request and then for every merge to master. However, experience demonstrates that this can be a recipe for disaster [2].

The problem is that system tests can take a long time to run. There are stories of system tests that take weeks to run [3] whereas Google kills enormous test targets after an hour. Either way, this is more time than a developer should have to wait for code to be tested and merged.

The solution is to run system tests periodically and report failures where people are already looking. The length of the period matters because each failure then needs to be correlated with the changes made across components since the last test run: the shorter the period, the fewer changes there are to examine.
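To make the correlation step concrete, here is a minimal sketch in Python. It assumes each merged change is recorded with its component, commit and merge time; the `Change` record and the `suspects_since` helper are hypothetical names for illustration, not part of any CI tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """Hypothetical record of one merged change."""
    component: str
    commit: str
    merged_at: datetime


def suspects_since(changes, last_green, failed_at):
    """Group commits merged after the last passing run, up to the failing run, by component."""
    by_component = defaultdict(list)
    for change in changes:
        if last_green < change.merged_at <= failed_at:
            by_component[change.component].append(change.commit)
    return dict(by_component)


# Illustrative data only: four merges around two periodic runs.
changes = [
    Change("frontend", "a1", datetime(2015, 8, 30, 9, 0)),
    Change("backend", "b2", datetime(2015, 8, 30, 13, 0)),
    Change("backend", "b3", datetime(2015, 8, 30, 15, 0)),
    Change("agent", "c4", datetime(2015, 8, 30, 19, 0)),
]
suspects = suspects_since(
    changes,
    last_green=datetime(2015, 8, 30, 12, 0),
    failed_at=datetime(2015, 8, 30, 18, 0),
)
print(suspects)  # {'backend': ['b2', 'b3']}
```

With a six-hour period only two commits in one component need examining; a longer period would widen the window and the list of suspects with it.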


When CI reports failures, someone needs to take ownership. One approach is to place that onus on the team writing the component that a system test exercises the most. The issue is that system tests exercise many components, so it’s not always clear that one component is used more than another. Another approach is to have a dedicated team, like a Quality Assurance (QA) team, take ownership of all system tests. The issue here is that this team might not be familiar with the implementation details of each component.

An alternative approach is to have the QA team collaborate with the teams implementing each component. First, the QA team would be responsible for triaging failures by correlating them with one of the components. Then, the team for that component would be responsible for fixing the regression or updating the system test accordingly. This implies that system tests should be maintainable by everyone.


The failures encountered by system tests are not always related to a component. As the number of components tested increases, so too does the number of difficult-to-predict interactions. As a result, system tests tend to become flaky and eventually fail for unexpected reasons. For example, some tests might rely on an external service to download dependencies, and that service might be temporarily unavailable.

A common way to work around flaky tests is to simply rerun the test suite. The problem is that some of those failures might be caused by an actual bug in a component; race conditions, for instance, are notoriously difficult to recognize and can look exactly like flakiness. It is essential to never accept flaky tests and to consistently strive for tests to pass on the first try.
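If reruns are used at all, they can at least avoid hiding the problem. The following is a minimal sketch, assuming a test is a plain Python callable that raises on failure; `run_with_retries` is a hypothetical helper, not a feature of any particular test framework. It reports a test that only passed after a retry as "flaky" rather than as a pass, and keeps the errors from the failed attempts.

```python
def run_with_retries(test, attempts=3):
    """Run `test` up to `attempts` times and classify the outcome.

    Returns a (status, errors) pair where status is:
      "passed" if the first attempt succeeded,
      "flaky"  if a later attempt succeeded after earlier failures,
      "failed" if every attempt raised.
    """
    errors = []
    for _ in range(attempts):
        try:
            test()
        except Exception as exc:  # includes AssertionError
            errors.append(exc)
        else:
            return ("flaky" if errors else "passed"), errors
    return "failed", errors
```

A "flaky" result can then be reported alongside real failures, so that a race condition hiding behind a retry still gets triaged.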

Another solution is to minimize the time needed to triage failures by improving error reporting. This doesn’t preclude improving the reliability of a flaky test, but it can be an effective temporary solution. Furthermore, the improved error reporting might be useful to DevOps and developers when they run the same components.
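One possible shape for better error reporting, sketched in Python: wrap each step of a system test so that a failure names the component and the step involved. The `step` context manager and the `StepFailure` exception are hypothetical names introduced for this example.

```python
from contextlib import contextmanager


class StepFailure(Exception):
    """Raised when a named step of a system test fails."""


@contextmanager
def step(component, description):
    """Label any failure inside the block with the component and step."""
    try:
        yield
    except StepFailure:
        # Already labeled by an inner step: keep the most specific label.
        raise
    except Exception as exc:
        message = f"[{component}] {description}: {type(exc).__name__}: {exc}"
        raise StepFailure(message) from exc
```

A test would then wrap each interaction, for example `with step("backend", "create user"):`, and a failure inside the block surfaces as a `StepFailure` whose message starts with `[backend] create user:` and whose `__cause__` is the original exception.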

Why bother?

Implementing system tests is a lot of work, so why bother? It depends on whether knowing that an application works as a whole is worth the effort of testing it.

Recommended Reading

  1. How Google Tests Software
  2. Continuous Integration: Improving Software Quality and Reducing Risk
  3. Building Microservices

Originally published on August 31, 2015.
