How We Built Scalable End-to-End Tests

Morris Chen
Bluecore Engineering
Jul 2, 2018 · 9 min read

Two years ago, Bluecore started building UI products. Up to that point we had been good about creating backend unit tests, but creating full-scale UI tests turned out to be a completely different organizational experience. End-to-end (E2E) UI testing is hard, and we ran into several challenges. Our tests were brittle, and debugging failures could be a nightmare. Tests were taking forever to finish or were too expensive to run often. When we searched for help online, most of the results were someone trying to sell us a silver-bullet solution that they claimed would magically make all our testing problems go away (but probably wouldn’t). In this article, I’ll share our experience with three different test frameworks and how well each helped us address these issues.

What do we mean by scalable?

It’s relatively easy to stand up some tests and build initial coverage for features. What’s significantly harder is building tests that stay reliable over time. There are several best practices we had to learn the hard way before our testing became successful: building idempotent tests, following the test pyramid model, keeping tests small, and quarantining noisy tests. Whatever test infrastructure we created may have been great at first, but as our test suite grew, problems started creeping in.

However well-intentioned we were at the start, problems would sometimes snowball. Test results start to become unreliable, which makes people lose faith in the tests. That leads to people spending less time working on the tests, which leads to even less reliable results, and then we’re stuck in a vicious cycle. Eventually the pain reaches a tipping point, and the team decides that it is easier to start over from scratch with a new testing approach than it is to convince everyone to fix the existing problems.

Not a recycling diagram: the vicious cycle of unreliable tests.

To spare yourself from this testing death spiral, we recommend asking one question as your test suite grows over time: are the following criteria still fulfilled?

  • Easy to write: This leads to increased adoption and test coverage.
  • Quick to run: This allows for a shorter feedback loop and iteration cycle.
  • Easy to maintain: This helps people keep tests up-to-date.
  • Repeatable and reliable: This matters because results are useless if they can’t be trusted.
  • Cost efficient: This ensures you can continue (re)running your tests.

Here’s a high-level comparison of how we rated each of our E2E test frameworks:

Framework ratings table. *VS is used to anonymously refer to our Vendor Solution of crowd-sourced tests.

Now let’s take a closer look at each of these.

How we approached building UI tests quickly: Enter our vendor solution

The first versions of our UI were unreliable and buggy. A big reason for this was that we were testing our UI changes manually. As an early stage startup, we were hyper-focused on finding product-market fit. This meant we had to build UI workflows quickly, and our test debt ballooned. With limited resources and no QA team, there weren’t many great options for building automated test coverage. However, we found a software testing vendor that could help us quickly create “automated” tests. I’ll refer to it anonymously as VS. The way VS works is that we write test steps in plain English, and VS farms them out to testers from around the world to execute. Yes, someone is manually running through our tests, but from our point of view the tests are “automated” because executing them is simply an API call from our CI/CD service.

These tests are easy to write but have a higher marginal cost than traditional automated tests, because each run pays for human time rather than computing power. We needed E2E coverage ASAP, and VS made it easy to build this in a pinch. Our approach was to create new tests in VS first and eventually automate them ourselves.
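To make the “automated” part concrete, here is a hypothetical sketch of what kicking off a VS run from CI might look like. The endpoint, payload shape, and VS_API_TOKEN variable are placeholders (the vendor is anonymized); the pattern is simply an authenticated POST from the CI job.

```typescript
// Hypothetical sketch: trigger a crowd-sourced VS test run from a CI job.
// The URL, payload fields, and env var name are placeholders, not the vendor's real API.
async function triggerVendorSuite(suiteId: string): Promise<void> {
  const response = await fetch('https://api.vendor.example/v1/runs', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VS_API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ suite: suiteId, environment: 'qa' }),
  });
  if (!response.ok) {
    throw new Error(`VS run failed to start: HTTP ${response.status}`);
  }
}

// Example usage in a CI step: fail the build if the run can't be started.
triggerVendorSuite('smoke-tests').catch((err) => {
  console.error(err);
  process.exit(1);
});
```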

Let’s see how VS does with our scalability criteria:

  • Easy to write: Yes — Anyone can write tests because it requires no technical expertise. The interface is intuitive and easy to use.
  • Quick to run: Somewhat — All our tests effectively run in parallel because they all get crowd-sourced at once. However, human testers will never match the speed of computers.
  • Easy to maintain: Somewhat — The VS UI is easy to use but has no version control, which was a pain point for coordinating test updates with code pushes. We got around this by creating a GitHub repository for our tests and pushing updates through a VS command line interface as part of our code deployment script (see the sketch after this list). Viewing test results is easy, and VS provides tester feedback, screenshots, and video playback of tests. However, while VS can identify buggy behavior, it cannot tell us where in the source code the failure originates.
  • Repeatable, reliable: Somewhat — While we were eventually able to achieve 99% test reliability rates (defined as results that are not false positives or false negatives), crowd-sourced human testing is inherently imperfect and can produce inconsistent results. Spending time investigating a test case failure only to discover it was caused by a tester’s (mis)interpretation is never fun.
  • Cost efficient: No — At least not for us in the long-term. This cost model depends on how often we run tests. This wouldn’t be so bad if we weren’t trying for CI/CD, but we’re aspiring to achieve a state where we can run all of our tests against each of our private builds. This means instead of running our test suite twice a day on our QA environment, we’re running our test suite tens or hundreds of times a day on every developer check-in. Running VS this often is cost prohibitive.
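As mentioned above, we kept our VS test definitions in a GitHub repository and pushed them to the vendor during deployment. Below is a minimal sketch of that deploy hook; the vs-cli command, its flags, and the GIT_SHA variable are hypothetical stand-ins for the anonymized vendor’s real CLI and our deployment environment.

```typescript
// Hypothetical deploy hook: push version-controlled VS test definitions to the
// vendor so the crowd-sourced tests stay in sync with the code being shipped.
// "vs-cli", its flags, and GIT_SHA are placeholders, not a real CLI or variable.
import { execFileSync } from 'node:child_process';

function pushVendorTests(gitSha: string): void {
  execFileSync('vs-cli', ['push', '--dir', 'tests/vs', '--label', gitSha], {
    stdio: 'inherit', // stream the CLI output into the deploy logs
  });
}

pushVendorTests(process.env.GIT_SHA ?? 'local');
```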

How we addressed the cost efficiency concern: Enter Selenium WebDriver

We never intended for VS to be our long-term solution for automated testing. Our strategy was that VS would be a stopgap to get “automated” sanity tests quickly. Meanwhile, we built a Selenium WebDriver test framework to which we planned to migrate all of our VS tests. To summarize: any new tests for upcoming features could be created quickly in VS but would eventually get migrated to Selenium once the feature entered General Availability.

Let’s see how Selenium does with our scalability criteria:

  • Easy to write: No — Unlike VS, writing a Selenium test requires technical expertise; only people with coding experience are qualified to write them. Furthermore, while we can be up and running with VS in just an hour or two, we found that developers with no prior Selenium experience needed 3–5 days to become comfortable with the tool and the page object model we set up before they could start writing their own tests.
  • Quick to run: Yes — Our Selenium test suite is configured to distribute testing amongst several containers in CircleCI so that tests run in parallel. As we grow our test suite, run times become a function of how much container parallelization we want to utilize. Essentially, total run time is roughly the serial run time of all tests divided by the number of parallel containers, plus some fixed per-container setup overhead.
  • Easy to maintain: Somewhat — We use Allure to notify Slack with test results. It shows us exactly which test step failed, along with screenshots and logs. However, while our test results are more informative from a technical perspective than VS’s, there still isn’t great support for identifying where in the source code the problems are.
  • Repeatable, reliable: Somewhat — As many users will attest, there’s a certain amount of flakiness encountered when using Selenium. However, Selenium receives some unfair blame because it’s just exposing some of the difficulty of E2E UI testing in general. We have been using Selenium for a year now and have gotten our tests to a state of 99% test result reliability. This was accomplished mostly by getting better at the art of writing intelligent waits (see the sketch after this list) and by developing application code that is more test-friendly.
  • Cost efficient: Yes — Writing new Selenium tests requires a higher time investment than VS. However, running tests costs us CircleCI container usage, which is more than a full order of magnitude cheaper than running VS tests. This means we don’t have to worry about our test suite growing prohibitively expensive because it costs pennies on the dollar to run compared to VS.
Cost models of VS and Selenium: Selenium requires a bigger initial time investment but costs less over time because each test run is far cheaper.
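Here is a minimal sketch of what we mean by an intelligent wait, using the selenium-webdriver Node bindings. The selector and URL are hypothetical; the point is to wait for a specific condition instead of sleeping for a fixed amount of time.

```typescript
// Minimal sketch of an explicit ("intelligent") wait with selenium-webdriver.
// The data-test selector and URL are placeholders, not our real app.
import { Builder, By, until, WebDriver } from 'selenium-webdriver';

async function waitForToast(driver: WebDriver): Promise<void> {
  // Wait until the element exists in the DOM (up to 10 seconds)...
  const toast = await driver.wait(
    until.elementLocated(By.css('[data-test=toast]')),
    10000
  );
  // ...then wait until it is actually visible, rather than sleeping blindly.
  await driver.wait(until.elementIsVisible(toast), 10000);
}

async function main(): Promise<void> {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://app.example.com/dashboard'); // placeholder URL
    await waitForToast(driver);
  } finally {
    await driver.quit();
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```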

Our current focus: Enter Cypress

Although Selenium solves our cost scaling issue, there are two major pain points that are preventing higher adoption rates and test coverage from our development teams. Writing tests in Selenium is much harder than in VS, and it can be hard to troubleshoot false positives. During an engineering collaboration with another popular startup, our teammate asked how they addressed UI test case flakiness. One of their developers chimed in that using the open-source project Cypress was working well for them. This was the catalyst for us to investigate how well Cypress could solve our testing needs.

The main architectural difference between Selenium WebDriver and Cypress is that Selenium drives the browser remotely from an external process, while Cypress runs inside the browser itself. For a more detailed explanation of the key differences, check out this summary. After an evaluation period of two months, we concluded that Cypress is the framework that best addresses our E2E UI testing needs so far because it makes testing easier, faster, and cheaper.
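To give a flavor of the in-browser model, here is a minimal sketch of a Cypress spec. The route and data-test selectors are hypothetical, but the automatic retrying of commands and assertions is what replaces most of the hand-written waits we needed in Selenium.

```typescript
// Minimal sketch of a Cypress spec. Route and selectors are placeholders.
describe('campaign builder', () => {
  it('creates a campaign from the dashboard', () => {
    cy.visit('/dashboard'); // assumes baseUrl is set in the Cypress config
    cy.get('[data-test=new-campaign]').click();
    cy.get('[data-test=campaign-name]').type('Summer Sale');
    cy.contains('button', 'Save').click();
    // Cypress retries this assertion until it passes or times out,
    // so no explicit wait is needed for the confirmation to appear.
    cy.contains('[data-test=toast]', 'Campaign created').should('be.visible');
  });
});
```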

Let’s see how Cypress does with our scalability criteria:

  • Easy to write: Yes — Like Selenium, Cypress requires technical expertise in order to write tests. However, installation is ridiculously easy so we can be up and running in a matter of minutes. Tests are intuitive to create, and Cypress provides great online documentation. The open-source community is also very active. When our developers have reached out for support, someone usually responds with an answer in a matter of hours.
  • Quick to run: Yes — We built similar container parallelization as with our Selenium tests, but our Cypress tests run even faster. In fact, many of our Cypress tests run more than 2x faster than their Selenium counterparts.
  • Easy to maintain: Yes — We created the same Allure and Slack reporting as with our Selenium tests. However, one major benefit Cypress provides is a built-in debugging tool. Finally!
  • Repeatable, reliable: Somewhat — Cypress claims to be more reliable than Selenium because it gives native access to every object within the browser and can understand everything that’s going on from the inside. However, we’re going through some growing pains with making our tests reliable, including troubleshooting false positives. That being said, this isn’t any different from the pains we had when ramping up with Selenium. Because Cypress has native access, we’ve been able to narrow down issues more precisely than before. In fact, we’ve been able to find and fix previously undiscovered bugs because of Cypress! Over the course of a year we improved our Selenium test case reliability from 92% to 99%. We anticipate we’ll get to 99% reliability with Cypress even faster.
  • Cost efficient: Yes — Writing a new Cypress test takes about as much time as writing a VS test. We’ve set up our Cypress tests to run like our Selenium tests, but our Cypress tests finish faster. This means our Cypress tests are even cheaper than Selenium because we pay for less CPU usage.
Test tool cost comparison: I’ve hidden the actual dollar costs and used 1 bluecoin (not our cryptocurrency) to represent our Selenium cost as a baseline.

Learnings

There are a lot of testing best practices Bluecore is getting better at as we grow. Triaging test results, fixing or disabling noisy tests, and writing shorter tests are examples of processes we try to improve regularly. Without this attention, we know that our team could lose confidence in our test results and our testing framework. On top of this, we came away with some other key takeaways:

  • While VS gave us a quick way to pay down our test debt, it was a high-interest loan. It’s hard to convince people to prioritize migrating already-passing tests to another framework over new feature work.
  • By regularly exchanging ideas within the NYC tech community, we get a wider range of perspectives and see how other teams are doing things. There’s a lot of room for mutual gain.
  • Keeping a pulse on what’s current provides tangible benefits. With the massive amount of information available via online content, podcasts, professional networks, and so on, we regularly re-evaluate how we’re doing things to see if there’s a better way.
