EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Selenium Isn’t Flakey

How to improve pass/fail rate without rewriting your test suite

Jeffrey (Bongo) Russom
Expedia Group Technology


deeply flaked paint on a wooden wall
Photo by Samuel Schneider on Unsplash

In 2017, the Vrbo™️ search page team (then called the HomeAway® search page team) rewrote the legacy search page application using modern JavaScript technologies such as React and Node. As part of this rewrite, we also wanted to transition from a manually QA'd biweekly release cycle to a continuous deployment model with automated testing. Since we needed to support IE11, Selenium was the obvious choice for end-to-end release validation.

Since the rewrite, these Selenium tests have proven invaluable for catching page-breaking issues (especially in IE11) and have caught many P1 issues before they made it out to production. Unfortunately, at the beginning of 2020 we noticed that the pass rate for these tests had dropped to around 50%. Our tests had become so flakey over time that developers were spending twice as long re-running the suite to get a clean build. Something had to be done to get our pass rate back to an acceptable level.

So begins the investigation…

During a team meeting we discussed what might be causing the flake. Was it our Selenium cloud provider, Saucelabs®? Was it our test environment? Was it Selenium itself? We went so far as to discuss whether it made sense to dump our test suite completely and start over with Puppeteer, but instead we opted to do a deep dive into the failing tests.

What did we find?

After spending almost a week exhaustively combing through Saucelabs screen replays and logs, we discovered several key issues leading to our test flake, which, in order of how much flake they caused, were:

1. Flakey testing environment

At Vrbo, we have two separate testing environments: test and stage. Traditionally, test was reserved for developer testing whereas stage was reserved for manual pre-deploy testing and C-team demonstrations. Because of this distinction, test tended not to be stable, and many services running in test were either under-provisioned or had poor test data. We found that almost half of all the Selenium test failures were caused by intermittent errors and timeouts in the microservices in test that our app depended upon.

Because stage was better maintained due to its importance as a demo environment, we reconfigured our tests to run against an instance of our application running in stage rather than test. Since our tests didn't rely on specific test data, this change was simple and greatly reduced the intermittent errors and timeouts, but it still only cut our failures by roughly 20%.
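For illustration, the switch can be as small as a base-URL lookup in the shared test setup. The environment variable and host names below are placeholders, not our actual configuration:

```js
// jest.setup.js — sketch only; env var and host names are placeholders.
const TARGET_ENV = process.env.TARGET_ENV || 'stage';

const BASE_URLS = {
  test: 'https://search.test.example.com',
  stage: 'https://search.stage.example.com',
};

// Every spec builds its page URLs from this base, so pointing the whole
// suite at stage instead of test is a one-env-var change.
global.BASE_URL = BASE_URLS[TARGET_ENV];
```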

2. 3rd party scripts

The second most common error we noticed in our end-to-end suite was Selenium timeouts caused by 3rd party scripts, particularly the Google Analytics and Google Maps API scripts. Specifically, in roughly 10% of test suite runs, a Google Analytics or Google Maps API script would hang, preventing the DOMContentLoaded event from firing. When this happened, Selenium timed out waiting for the load event even though the rest of the page was still usable. Since most of the end-to-end tests didn't require Google Maps to load and none of them required Google Analytics, we were able to disable Google Analytics in our test environment and turn off Google Maps for all of the non-applicable tests.
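As a rough sketch of the approach (the config flag and helper below are hypothetical, not our actual code), the analytics snippet can be gated behind an environment flag so it is never injected during test runs:

```js
// Sketch only — the flag and helper are hypothetical, not our real code.
// If analytics is disabled for the environment (e.g. in test runs), the
// third-party script is never injected, so a hung Google Analytics request
// can't delay the page load events that Selenium waits on.
function injectAnalytics(config) {
  if (!config.analyticsEnabled) {
    return; // skip third-party analytics entirely in test runs
  }
  const script = document.createElement('script');
  script.async = true;
  script.src = 'https://www.google-analytics.com/analytics.js';
  document.head.appendChild(script);
}
```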

3. Insufficient test idle timeout

a clock
Photo by noor Younis on Unsplash

Most Selenium suites at Vrbo are built using WebDriver.io (WDIO). However, our search page application is unique in that it uses Jest as a test runner rather than the built-in WebDriver.io test runner. This has the advantage of giving us Jest's watch feature and richer assertion library, but at the cost of needing to "tweak" Jest's parallel test execution to play nicely with Saucelabs. Due to the nature of our test setup, we discovered that Jest was setting up test instances in Saucelabs far ahead of when it would actually run the tests for those instances. In practice this meant that by the time a test instance was needed, Saucelabs had already timed out waiting for commands from Jest and closed the instance.

Fortunately for us, this was easy enough to fix by bumping the idleTimeout in our Sauce config.
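For reference, the change amounts to raising Sauce Labs' idleTimeout, which defaults to 90 seconds of inactivity, in the capabilities used to create each session. The capability shape and values below are illustrative rather than our exact config:

```js
// Sketch of the capability change — browser and timeout values are illustrative.
// Sauce Labs tears down a session that receives no commands for idleTimeout
// seconds (90 by default). Raising it gives Jest time to get around to a
// session it created well before running that file's tests.
const capabilities = {
  browserName: 'internet explorer',
  browserVersion: '11.0',
  'sauce:options': {
    idleTimeout: 300, // seconds of allowed inactivity before Sauce closes the session
  },
};
```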

4. Race conditions

runners in a race
Photo by Braden Collum on Unsplash

Several of our search-filter tests had become flakey after we switched the search page from server-side rendering all of its content to client-side rendering of the main content in early 2019. Our tests had originally been written under the assumption that once the page loaded, a search had already been performed server-side and it was safe to apply filters. With client-side rendering, however, the main search API call happens after the initial pageload. This meant it was now possible for Selenium to start applying filters before the client-side search completed, introducing a race condition in which the initial client-side search would blow away any applied filters, depending on how fast or slow Selenium executed.

Once we understood where the race condition lay, it was straightforward to sprinkle a few waitForExist calls into our tests to make test execution wait for the page to be ready to receive commands.
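Here's a sketch of what that looks like in a test. The selectors and the filter helper are placeholders, and the exact waitForExist signature depends on your WebDriver.io version:

```js
// Sketch only — selectors and applyPriceFilter are placeholders.
it('filters results by price', async () => {
  await browser.url('/search?q=austin');

  // Wait for the client-side search to render its results before touching
  // any filter; otherwise the in-flight search response can blow away the
  // filter we just applied.
  const results = await browser.$('[data-test-id="search-results"]');
  await results.waitForExist({ timeout: 30000 });

  await applyPriceFilter(browser); // now safe to interact with filters
});
```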

5. Lack of test retries

Fixing the above four issues took our pass rate from ~50% all the way up to ~85%, a huge increase in the reliability of our tests. At this point, the remaining failures fell into a big "other" category of issues that did not share a single root cause. For example:

  • Occasionally we’d still see HTTP 500 errors from an upstream service; even the stage environment was not 100% reliable, nor could we reasonably expect it to be.
  • In roughly one out of every 200 pageloads, our admission control feature would kick in. Fixing this would have meant provisioning more instances in our staging environment or slowing down our test execution to avoid overloading the instances; neither option was desirable.

The potential root causes of these rare failures were too many and too infrequent to warrant fixing individually. If we counted up all of the possible types of flake-induced test failures and ordered them by frequency of occurrence, we’d get a graph with a long tail:

graph of test failures with long tail on right labeled “Rare, so just retry”
Failure types that occur frequently (green area) should be fixed directly whereas rare failures (yellow area) should be handled by a simple retry mechanism.

Our previous four steps fixed issues in the green area of the curve, but now that we were looking at infrequent failures in the yellow area, it was time to introduce retry logic into our tests.

This retry logic took two forms:

Check for an error page

After the initial page load, check whether a 500 error page was shown; if it was, retry loading the page just once. We baked this retry logic into our existing waitForSearchResults function, which tests use to wait for the client-side search call to return before interacting with the page.
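In sketch form (the selectors are placeholders, not our real test hooks), the check looks something like this:

```js
// Sketch only — selectors are placeholders for our real test hooks.
async function waitForSearchResults(browser) {
  // If the initial page load landed on our 500 error page, reload once.
  const errorPage = await browser.$('[data-test-id="error-page"]');
  if (await errorPage.isExisting()) {
    await browser.refresh();
  }

  // Then wait for the client-side search call to come back and render
  // results before the test starts interacting with the page.
  const results = await browser.$('[data-test-id="search-results"]');
  await results.waitForExist({ timeout: 30000 });
}
```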

Global retry in CI/CD pipeline

The second form of retry was a global retry in our Jenkins CI/CD pipeline, which re-ran the whole Selenium suite once if it failed.

And the Result?

After making all the above changes, we were able to increase our test success rate to 92%. You can see the stark change below when we rolled out the updates on January 31st.

bar graph with days across x axis becoming much greener from Jan 31st

Combined with the Jenkins global retry, this effectively raised the PR pass rate for e2e tests to 100%, meaning no PR was failing due to test instability! Now every day we can open the Saucelabs dashboard and revel in a sea of green.

screen shot of list of test runs with green checkmarks next to them
I love the look of green builds in the morning

Takeaways

Throughout the entire investigation, one thing stood out: every test failure could be explained by a problem with the test itself, the test environment, or the app itself. In none of the test failures did we find any instability in Saucelabs or Selenium. This is an important observation, because developers are often quick to dismiss Selenium as buggy or outdated in favor of other tools such as Puppeteer or Cypress.io, but in our case every single problem was the fault of our tests, app, or environment, not the tool. This shows the importance of fully understanding your tools and the problem you are trying to solve with them before discarding them and reaching for something different. We are not saying that Selenium and Sauce are issue-free, just that they are nowhere near as bug-riddled and flake-riddled as some might lead you to believe.

Learn more about technology at Expedia Group
