Elements of Successful Massive Scale Automated Testing

Maciej Adwent
Walmart Global Tech Blog
6 min read · Aug 18, 2016


The mere mention of the term “front end automated testing” is enough to make a seasoned web developer shudder. Driving web browsers remotely is an inexact science. It’s expensive, slow, and fraught with deal-breaking instability. Meanwhile, we still want to verify that our apps and websites work before deploying them to production or the app store. In fact, forget deployment — deployment is far too late. We want to be able to fully verify our code before we merge it, so that any breakage we introduce has as little impact as possible.

Verifying before merging should be familiar to anyone who works on a project whose master or development branch is protected by unit tests. Developers like unit tests because they:

  • execute quickly,
  • are perfectly isolated, affected only by the changes in a commit, and
  • are run before code is merged.

Most test suites built on tools like Selenium and Appium do not exhibit these traits. They often target server-deployed code, are subject to numerous random environmental factors, run slowly, and are triggered long after code has been merged.

Automating UI tests just plain sucks. We get very low informational value at an immense cost in time and computational resources. The potential for high value is there, but the typical process is broken. How do we fix this?

The Cost of Procrastination: Sailing The Sea of Red

When we first started investigating this problem at WalmartLabs, we were running thousands of customer flows through Selenium, but far too late in the process, often resulting in a test matrix full of failures. We lovingly called this the “sea of red”.

By the time regressions were discovered, they were hours-to-days old, mixing with other bugs, and causing numerous failures. The lifecycle of bugs then followed a torturous path of re-testing, investigation by quality engineers, ticketing in Jira, and then strained attempts by developers to reproduce and pin down the problem.

It seems obvious, but software development teams often overlook it: the longer the gap between writing code and testing it, the more information is lost, and the more expensive it is to recover from bad code.

Teams that appreciate the utility of unit tests paradoxically treat automated UI testing as an “integration test” activity, left to be done long after code is merged. Putting off integration testing until late in the game is a profoundly expensive mistake, since customer-focused testing (testing that performs real customer actions) reveals an entire class of regressions that unit testing does not.

The only way to get around this is to run tests — including automated UI tests — before merging code.

The Realtime Testing Holy Grail: Fast Realtime Verify

Elements of Fast Realtime Verify

Today, the WalmartLabs test automation process relies on three fundamental assumptions:

  • Massive Parallelism: Tests are run with hundreds or even thousands of workers (virtual machines, browsers, or devices) in parallel.
  • “Realtime” Testing: Tests run as soon as possible and as often as possible, before code is merged, and must be green before code is merged. In practice, this means triggering tests on every commit to a pull request and not allowing code to be merged unless that commit is green.
  • Deterministic Behavior: The entire testing stack is deterministic. This means fully mocked servers, fully predictable database entries, total control over endpoint behavior, and fine control over randomized application behavior (e.g., promotional fly-outs); a sketch of what this can look like follows this list.
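The determinism requirement is the one teams have the fewest off-the-shelf answers for. As a loose illustration (a minimal sketch, not WalmartLabs’ actual mock layer), here is what a deterministic mock back end can look like in TypeScript with Express; the endpoint paths, fixture data, and the showFlyout flag are all hypothetical:

```typescript
import express from "express";

const app = express();

// Fixture data stands in for "fully predictable database entries".
const PRODUCT_FIXTURE = {
  id: "SKU-12345",
  name: "Test Product",
  price: 19.99,
};

// Every test run sees exactly the same product response.
app.get("/api/products/:id", (_req, res) => {
  res.json(PRODUCT_FIXTURE);
});

// Randomized application behavior (a promotional fly-out here) is pinned
// behind an explicit flag, so it becomes a controlled input to the test.
app.get("/api/promotions", (req, res) => {
  const showFlyout = req.query.showFlyout === "true";
  res.json({ flyout: showFlyout ? { campaign: "TEST-CAMPAIGN" } : null });
});

app.listen(4000, () => console.log("Mock back end listening on :4000"));
```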

The result is Fast Realtime Verify, depicted in the graphic above as the ideal combination of our fundamental assumptions. On every pull request, and on every subsequent commit to it, we run massively parallel automated UI tests against a fully deterministic, mocked application back end, multiplied across a range of browsers and devices.
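To make the combination concrete (hypothetical test names and environments, not our actual scheduler), the work a single commit generates can be thought of as a cross product of tests and customer environments, fanned out across workers:

```typescript
// Hypothetical sketch: each (test, environment) pair is one unit of work
// that can be handed to its own worker (VM, browser, or device).
type Environment = { browser: string; version: string };

const environments: Environment[] = [
  { browser: "chrome", version: "latest" },
  { browser: "firefox", version: "latest" },
  { browser: "iphone_safari", version: "9.3" },
];

const testFiles = ["checkout.spec.js", "search.spec.js", "cart.spec.js"];

// The full matrix this commit has to turn green before it can be merged.
const workUnits = testFiles.flatMap((test) =>
  environments.map((env) => ({ test, env }))
);

// With enough workers, wall-clock time approaches the duration of the
// slowest single test rather than the sum of all of them.
console.log(`${workUnits.length} units of work for this commit`);
```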

This is pretty ambitious. Are there any easier ways to achieve the same effect? As it happens, compromising on any of the three fundamental assumptions yields processes that can look useful at first but fail in deal-breaking ways.

Chaotic Verify: a high-performance code verification process mired in uncertainty

Chaotic Verify

In our experience, the part of Fast Realtime Verify that teams find the most laborious to replicate is determinism. Retrofitting an existing application with a convincing mock back end that can run in isolation is not easy. However, the cost of not doing so is also high: even with massive parallelism, running tests in PRs yields unpredictable results that can make life miserable for developers. A change to an external service, or to a module owned by another team, means a developer can see tests fail that have nothing to do with their own update. The Chaotic Verify pattern erodes trust in CI infrastructure and in the testing tools themselves.

Late Verify: a high-performance verification strategy executed too long after regressions have crept into code

Late Verify

While many development teams are comfortable with the idea of running massive automation suites on every PR once they see the benefits of the workflow, some are not. These teams prefer to run UI tests in a late-stage “integration” phase. This compromise allows regressions to creep into the codebase and remain undiscovered for a very long time. By the time they are discovered, the cost of recovering from them has increased significantly. It’s easy to see why we call this form of verification “Late Verify”.

Slow Verify

Leaving out massive parallelism gives us the very reliable, but also glacially paced, Slow Verify. If we have a small application, or UI tests that run unusually fast, or a very small number of supported customer environments (browsers, versions, devices, etc.), this may be an acceptable compromise. If your scaling budget is small, it may be your only option. On the other hand, automated UI testing on even a mid-sized application can take many hours, especially if you’re testing across many customer environments. When verifying a PR takes a long time, developers start to ignore the result and merge code without waiting. If verification is perceived as taking “forever”, developers will begin to campaign against automated testing. Slow Verify highlights why usability and convenience are critical elements of massive-scale test automation: developer acceptance matters in choosing a workflow that survives the test of time.

Green is Cheap, Red is Expensive

For automated UI testing to deliver useful information to developers, it has to run before code is merged, has to be fully deterministic, and has to run very fast. Removing any of these puzzle pieces results in a process that mostly frustrates development teams and wastes time and money.

Massive-scale automated testing only works well if it is never allowed to descend into a “sea of red”. When a test suite is allowed to slip into this state, spending a lot of money on automation infrastructure suddenly looks very foolish. An automation infrastructure dominated by failing tests is extremely expensive, while the same infrastructure running mostly passing tests is a comparatively smart investment. This guiding principle informs how WalmartLabs organizes its testing infrastructure, and also how we write automation tools that make intelligent use of time and hardware.

Tools

Thus far we’ve omitted any mention of specific tools. The workflow pattern described in this article is far more important in creating an effective test automation process than the particular software chosen.

With that said, we’ve been working hard on a toolset that enables our in-house implementation of Fast Realtime Verify, which employs a mixture of GitHub, Jenkins, Docker, and our TestArmada suite, including Magellan.
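As one example of how the pieces can connect (a sketch under assumptions, not a description of our actual pipeline), a CI job that finishes a parallel run can report a commit status back to GitHub so that branch protection only allows the pull request to merge when the check is green. The repository name and status context below are hypothetical:

```typescript
// Sketch: report the outcome of a parallel UI test run as a GitHub commit
// status, so branch protection can block merging until the check is green.
// Requires Node 18+ (global fetch) and a GITHUB_TOKEN with repo access.
async function reportStatus(sha: string, passed: boolean): Promise<void> {
  const repo = "example-org/example-app"; // hypothetical repository
  await fetch(`https://api.github.com/repos/${repo}/statuses/${sha}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      Accept: "application/vnd.github+json",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      state: passed ? "success" : "failure",
      context: "ui-tests/fast-realtime-verify",
      description: passed ? "All UI tests green" : "UI test regressions found",
    }),
  });
}
```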

Conclusion

We no longer shudder at the prospect of a large set of automated front end tests, and neither should you. In fact, we make sure our front end test suite grows in depth and comprehensiveness. This provides our developers with insight into trouble spots and gives them increased confidence when they deploy.

Stay tuned for a series of deep-dives about how we’ve implemented these critical elements at @WalmartLabs, and how it all fits into our Continuous Delivery pipeline.
