Zombies and Soup

Why End-to-End testing sucks
(and why it doesn’t have to)


This is an adaptation of a talk I gave at JSConf 2015.

Your README is Trolling Me

If you’ve been hanging around the Internet long enough, you’ve probably learned how to draw a horse from Van Oktop. But here’s a quick refresher:

Credit: Van Oktop http://oktop.tumblr.com/post/15352780846

A few essential details were left out between steps 4 and 5, and they’re not particularly obvious. Do I need some image editing software? Do I need to know how to use an airbrush? Do I need years of artistic experience?

Sometimes, when trying out an Open Source project, I get the same feeling that some key information is left out of the “Getting Started” section of the README.

Sure, OSS maintainers are working for free, and it’s certainly not reasonable to expect top-notch documentation for somebody’s side-project. But I’m not just talking about missing docs. I’m talking about the Troll README.

Oftentimes, a project will enumerate a series of steps, promising a quick jump start to getting the project up and running. But there are clear and obvious pitfalls that a new user is likely to hit, which go unacknowledged.

Sometimes you’ll see evidence of this in a project’s issues list, with scores of users “+1”ing the same problem they all experienced right out of the gate. But equally likely is confused silence.

Giving Up Quietly

So what do those users do? Maybe they’ll throw up their hands in frustration. Maybe they’ll feel stupid, thinking there must be something obvious that everyone else figured out on their own. But the reality is that there will always be a sizable portion of users who won’t try to hunt down some obscure error. They’ll just move on to something else.

These are tragic missed opportunities: users exploring a new project who could have evolved into full-fledged evangelists and contributors, but who gave up because something that should have been simple just didn’t work.

Simple acknowledgements of likely pitfalls really go a long way in keeping users sane and productive. Unfortunately, they are often omitted, leaving users to scratch their heads, shrug, and give up.

Hitting Close to Home

When I joined WalmartLabs in early 2014, this phenomenon of unacknowledged pitfalls, wasted time, and abandonment of hope had already occurred in one particular area: End-to-End Testing (e2e testing). I just didn’t know it yet.

For anyone not familiar with e2e testing, it refers to using automated tools to launch real web browsers that follow a script, interacting with a webpage the same way a real user would. Nearly all e2e testing solutions are extensions of a browser automation library called Selenium.
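
To make that concrete, here is a minimal sketch of what such a script looks like, using the selenium-webdriver package for Node. The URL and selectors are invented for illustration; a real test would target your own application.

    // Minimal e2e sketch with the selenium-webdriver npm package.
    // Requires a local ChromeDriver (or a remote Selenium server).
    const { Builder, By, until } = require('selenium-webdriver');

    (async function addSoupToCart() {
      const driver = await new Builder().forBrowser('chrome').build();
      try {
        await driver.get('https://www.example.com/');              // open the page
        await driver.findElement(By.css('#add-to-cart')).click();  // click like a user
        await driver.wait(until.elementLocated(By.css('#cart-count')), 5000);
        const count = await driver.findElement(By.css('#cart-count')).getText();
        console.log('cart count is', count);                       // check what the user would see
      } finally {
        await driver.quit();                                       // always tear the browser down
      }
    })();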

Anyone who has to support a long list of browsers (especially old IE) thirsts for e2e testing. It promises to take the tedium and time out of certifying every browser by hand. But just as quickly as people become enamored with this technology, they usually grow to despise it.

Our own QA team had been desperately trying to get e2e testing up and running for a long time, but had mostly given up on it. We weren’t alone. Even Google’s own Testing Blog (http://googletesting.blogspot.com/2015/04/just-say-no-to-more-end-to-end-tests.html) described a similar experience, comparing e2e testing to “a movie that you and your friends all wanted to see, and…all regretted watching afterwards.”

This blog points out the three biggest problems with e2e testing:

  1. It’s Unreliable: False-positives waste everyone’s time
  2. It’s Slow: Test suites often take hours to complete
  3. It’s Unhelpful: Hard to pinpoint cause of failure

If You Want Something Done Right…

We set out to build our own cross-browser end-to-end testing solution that would succeed in the real world. It was going to be awesome!

Our goals were that it should be:

  1. Collaborative: Low barrier-to-entry, allowing QA and Dev to easily write tests together
  2. Fast: Allowing e2e tests to run as part of PR verification builds
  3. Reliable: No test flake!

We tried many of the popular NodeJS wrappers for Selenium that provide nice clean APIs, and settled on NightwatchJS for its simple API and widespread adoption.
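
For a sense of what that simple API looks like, here is a rough sketch of a Nightwatch test in its chained style (the test name and selectors are placeholders, not taken from our real suite):

    // A hypothetical Nightwatch test: each chained call is one user-like step.
    module.exports = {
      'add a can of soup to the cart': function (browser) {
        browser
          .url('https://www.example.com/')
          .waitForElementVisible('body', 5000)
          .click('#add-to-cart')
          .assert.containsText('#cart-count', '1')
          .end();
      }
    };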

Right out of the gate, test flake hit us hard.

Like a Croissant

For an engineer, the only thing more maddening than something that doesn’t work is something that doesn’t work sometimes.

When a testing system produces errors in a non-deterministic manner, without obvious causes, we call this “test flake” and it sucks. Test flake consumes debugging time, and erodes confidence in the system when users start to doubt the validity of reported errors.

We introduced e2e testing as a pilot project in our “cart and checkout” team. While initial enthusiasm was high, test flake quickly started to erode it. Errors would pop up, then disappear the next time the tests ran. Developers began to doubt the validity of what the system was telling them. We needed to fix it, or risk losing the hearts and minds of our dev consumers forever. We rolled up our sleeves and began to tackle the root causes of test flake.

Whack-a-Mole

Over the first few weeks, we quickly learned that many, many people had been dealing with the same flaky behavior for years. Common causes of e2e test flake include:

  • Driver bugs: Selenium’s browser-specific adapters sometimes have bugs of their own. For instance, IEDriver will occasionally claim to have clicked an element when it didn’t.
  • Network hiccups: Selenium communicates over a very chatty HTTP API (a sketch of that traffic follows this list). When you’re running hundreds of tests every day, that’s thousands of HTTP requests, and even the occasional dropped packet can cause a test failure.
  • Service instability: Most organizations choose to outsource browser farm maintenance to services like SauceLabs or BrowserStack. While these services provide a great benefit, they are not immune from their own instabilities. Sometimes remote VMs fail to initialize, or time out for no apparent reason.
  • Other random things: Every now and then, we’d hit an error that didn’t have any Google matches at all, or worse yet, a single result from years ago with no resolution (xkcd captures this well: http://xkcd.com/979/). Even months into our project, we’d continue to find brand new sources of flake like this.
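
To make that chattiness concrete, a single “click a button” step expands into several HTTP round-trips under Selenium’s JSON Wire Protocol, roughly like the following (endpoints abbreviated; exact payloads vary by client):

    POST   /session                                  create the browser session
    POST   /session/:id/url                          navigate to the page
    POST   /session/:id/element                      find the button
    POST   /session/:id/element/:elementId/click     click it
    GET    /session/:id/element/:elementId/text      read text back for an assertion
    DELETE /session/:id                              tear the session down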

Like good engineers, we dutifully tried to play whack-a-mole and address each source of test flake directly. But it quickly became clear that we’d never be able to solve every source of flake, especially when we couldn’t even establish a comprehensive list of its causes.

We were starting to get discouraged. After having come so far, test flake seemed like an insurmountable obstacle.

Zombies and Soup

Our baseline for test reliability came from our experience writing unit tests, where reliability is excellent. You can run a unit test a million times and get the same result because there are so few moving parts: tests run locally on your computer, and no network is involved.

In a moment of frustration, I turned this into a silly metaphor: in the world of unit testing, the process of wanting a test result and then getting it is like being hungry and grabbing a can of soup from the cupboard. It’s right there. You grab it. You’re done.

But with e2e tests, the process for getting your can of soup is more like:

  1. Get in the car
  2. Drive to the grocery store
  3. Along the way, drive through a Zombie apocalypse
  4. Grab a can of soup off the looted shelves (if you’re lucky)
  5. Try to get back home with your brains intact

Obviously nobody wants to deal with all this when all they want is a damn can of soup. But does that mean e2e testing is doomed to remain just out of reach for everyone as a reliable tool?

Back to Math Class

After weeks of banging our heads on the desk, we took a step back and thought about what we were really trying to solve. The main issue was that QA and developers both expected test results to be truthful. By giving them a flaky system we were, in mathematical terms, breaking an axiom they depended on.

axiom (n) a premise or starting point of reasoning… a premise so evident as to be accepted as true without controversy.

We began to think about a new distinction:

  • Previous goal: Make Selenium error-free
  • New goal: Make the test results reliable

Devs should be able to count on test results as an axiom. We should be able to say, “You worry about writing good tests; we’ll worry about the infrastructure.”

So how could we make the test results reliable, when the underlying technology itself was not?

One Weird Trick for Solving Test Flake

If a test fails, retry the test.

That’s it. Really.

We took a seemingly hacky solution and baked it right in.

We wrote a test orchestrator that, in the event of a test failure, would retry the test a certain number of times before reporting it as a failure. We found that 3x was a good number. After making this change, virtually all false-positive test results were eliminated.
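
The idea is simple enough to sketch in a few lines. This is not magellan’s actual implementation, and runOneTest() is a hypothetical function that resolves to true when a single test run passes:

    // Retry a flaky test a few times before declaring a real failure.
    async function runWithRetries(test, runOneTest, maxAttempts = 3) {
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        if (await runOneTest(test)) {
          return { passed: true, attempts: attempt };   // any green run counts as a pass
        }
      }
      return { passed: false, attempts: maxAttempts };  // failed every attempt: report it
    }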

In the early stages of our e2e testing project, we’d get nervous about reporting failures to developers, not knowing if we’d be sending them on a wild goose chase. But with this solution in place, we could now be confident that any reported failure was extremely likely to be a real problem.

Now this is where we lose a lot of street cred amongst the purists:

So you just keep brute-force-hammering away at crappy technology until it works?

I like to answer this with a continuation of our earlier zombie-soup metaphor:

Instead of getting in your car to go the grocery store, you’re now getting in an armored convoy.

Is it overkill?
Yeah.

Is it inefficient?
Absolutely.

But is it going to get you to the grocery store?

Hell. Yes.

At the end of the day, an ugly solution that fixes a previously broken axiom is infinitely better than a thousand failed attempts at a clean solution.

Calling it What it Is

Finding ugly solutions to ugly problems isn’t a particularly glamorous task. It’s really akin to shoveling shit (see http://www.homestarrunner.com/vcr_poop.html).

But once that shit was shoveled, we regained our lost momentum, and began to have significantly more fun with our e2e testing project.

We built a massively parallel test runner to compress long-running test suites into a fraction of their original execution time.
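
The core of that idea can also be sketched briefly. This is not the real runner, just the shape of it: N workers drain a shared queue of tests, each test going through the retry wrapper sketched above:

    // Run a suite across `concurrency` workers instead of one serial queue.
    async function runSuiteInParallel(tests, runOneTest, concurrency = 10) {
      const queue = tests.slice();
      const results = [];

      async function worker() {
        while (queue.length > 0) {
          const test = queue.shift();   // safe: no await between the length check and the shift
          results.push(await runWithRetries(test, runOneTest));
        }
      }

      // Start `concurrency` workers and wait for the queue to drain.
      await Promise.all(Array.from({ length: concurrency }, () => worker()));
      return results;
    }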

We forked the TestSwarm project, and built some beautiful dashboards to show test results over time. This also helps with the problem outlined in the Google Testing Blog, where it’s hard to pinpoint the source of a failure.

Recognizing that tests might sometimes be flaky due to real application bugs and not just flaky infrastructure, we built analytics tools to identify outliers: specific tests or browsers that required retries more often than others.
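
A sketch of the underlying idea: rank tests by how often they needed a retry, so chronic offenders stand out from background noise. The { name, attempts } result shape here is hypothetical, not the schema our real tools emit:

    // Rank tests by retry rate so the flakiest ones float to the top.
    function flakiestTests(results, topN = 10) {
      const statsByTest = new Map();
      for (const { name, attempts } of results) {
        const stats = statsByTest.get(name) || { runs: 0, retries: 0 };
        stats.runs += 1;
        stats.retries += attempts - 1;   // anything beyond the first attempt was a retry
        statsByTest.set(name, stats);
      }
      return [...statsByTest.entries()]
        .map(([name, s]) => ({ name, retryRate: s.retries / s.runs }))
        .sort((a, b) => b.retryRate - a.retryRate)
        .slice(0, topN);
    }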

At this point we started to hear the same thing over and over:

This is awesome! You should really Open-Source this.

At first we thought, “No way. This is all just a bunch of hacks. Nobody wants this.”

But then we stopped to think about all the other people who had wasted time trying to solve test flake, then gave up. What if we could prevent others from wasting that time, and allow them to start from a stable axiom? Then they could invest their time in solving more interesting and rewarding problems!

We began to think of Shoveling Shit not as an inglorious exercise, but as an invaluable practice in the Open Source world where difficult and messy problems get in the way of getting shit done.

We should all strive to practice Shoveling Shit as a Service (#SSaaS) — not in the sense of a hosted service (though that would be interesting) but rather a service you provide to your Open Source community.

Shit Shoveling 101

Here are the basics you need to know to jump on this revolution with us:

  1. Momentum > Perfection : Getting stuck on messy problems is often demoralizing and unproductive. Don’t get stuck chasing a perfect solution to every problem.
  2. Smoothing Over > Giving Up : Think of all those 80%-solved problems that never see the light of day. Smoothing over one problem might let you solve a hundred more.
  3. Useful > Precise : If you can help someone else smooth over a bump, it doesn’t matter how much duct tape and how many rubber bands it took.
  4. Open Source > Closed Source : Release when it’s ready? Nah. Release when it’s useful! Even if in some small way.

Join Us

We’ve come a long way with our e2e testing solution, and now we’re releasing it to the world.

Translating the “armored convoy” metaphor into a nautical theme, we’re releasing the following tools as part of an e2e test ecosystem called “TestArmada”:

  • magellan: A framework-agnostic test runner that handles retries, browser-as-a-service integration, and reporting to dashboards and CI
  • magellan-nightwatch: The first (we hope) of many plugins that let magellan drive an existing Node-based webdriver framework.
  • admiral: Beautiful dashboards for tracking magellan results over time.

We’re hoping to grow this ecosystem over time, and we need your help! Come join us at http://testarmada.github.io

What will your #SSaaS story be?
Follow us and let us know, at https://twitter.com/TestArmada