UI Regression Testing

Baseline image on the left and incorrect background color of the same page on the right

UI testing is hard. So hard that most companies have entire QA teams devoted to testing new versions of their site. Yes, we have unit tests; yes, we have integration tests. But while these provide great assurances about the functionality of our code, they fail to ensure that a user will see what they expect. When components move around or colors change, users can become frustrated by an unfamiliar interface.

UI regression testing compares a screenshot of a given page to a baseline of what it should be, and outputs a diff image when the two don't match. When a difference is found, you have many options for how to proceed: you can fail the build in your CI/CD pipeline, trigger a PagerDuty alert, or send a Slack message to the team responsible.

In this post, I’ll go over the tooling and setup we used to build our solution, and show how you can do it too!

Tooling

Here’s what we used:

  • Docker / docker-compose
  • Selenium
  • WebdriverIO
  • AWS S3
  • Blink-Diff

Docker provides a sandboxed environment and makes it easy to deploy this to our CI/CD pipeline. WebdriverIO lets us control the headless browsers managed by Selenium, S3 stores our files (baseline images, diff outputs, etc.), and Blink-Diff is our image comparison tool (note that it’s an npm package).

The Process

  • Download baseline images from S3
  • Take screenshots of the local version of the site
  • Test for any differences between the screenshot and baseline
  • Upload any generated diff images
  • Cleanup and alerting

Downloading the baseline images from S3 is simple; whatever language you’re using, AWS likely has an SDK for it. Similarly, dockerizing your website should be fairly easy, though if you’re running a legacy site it could be more involved. I won’t discuss how to do that here since there are many articles and tutorials on the subject, and it’s very much platform-specific based on your stack.
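As a rough sketch in Node.js (the bucket name, prefix, and region are placeholders, and this assumes the AWS SDK for JavaScript v3), the download step could look something like this:

const { S3Client, ListObjectsV2Command, GetObjectCommand } = require("@aws-sdk/client-s3");
const fs = require("fs");
const path = require("path");

const s3 = new S3Client({ region: "us-east-1" }); // placeholder region

// Download every object under `prefix` in `bucket` into `destDir`
// (pagination omitted for brevity)
async function downloadBaselines(bucket, prefix, destDir) {
  fs.mkdirSync(destDir, { recursive: true });
  const { Contents = [] } = await s3.send(
    new ListObjectsV2Command({ Bucket: bucket, Prefix: prefix })
  );
  for (const obj of Contents) {
    const { Body } = await s3.send(
      new GetObjectCommand({ Bucket: bucket, Key: obj.Key })
    );
    const dest = path.join(destDir, path.basename(obj.Key));
    fs.writeFileSync(dest, await Body.transformToByteArray());
  }
}

downloadBaselines("my-ui-baselines", "baselines/", "./baselines").catch(console.error);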

Docker

We’ll be using docker-compose with a couple of pre-built images, all set up to run Chrome and Firefox in Selenium.

Here’s the start of the docker-compose file that sets up the three services (one for each browser, and one that controls them). We’ll be adding to this file as we go on.
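A sketch of that starting point (the image tags here are examples from the Selenium Grid 3 era; check Docker Hub for current versions):

version: "3"
services:
  selenium-hub:
    image: selenium/hub:3.141.59 # example tag
    ports:
      - "4444:4444" # expose the hub so we can inspect it at localhost:4444
  chrome:
    image: selenium/node-chrome:3.141.59
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
  firefox:
    image: selenium/node-firefox:3.141.59
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444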

We set up our Selenium hub and the two browsers to run as their own services. The browsers connect to the hub on port 4444, which we also expose on the hub service so that we can see what’s going on (by visiting localhost:4444 in a browser).

Adding your Site to docker-compose.yml

You’ll need to add your website as another service by adding something like the below to your docker-compose.yml. We define a new service called app and tell it to look in the ./app folder for its Dockerfile. That Dockerfile should copy over the site’s files, install dependencies, and start up its server.

app:
  build: ./app # build using the Dockerfile in the `app` folder
  ports:
    - "3000:3000" # expose port 3000 (the port the app runs on)
  logging:
    driver: none # hide logs

I expose port 3000 here so that while the service is running I can visit the site and make sure I’m able to hit it. Once that’s verified, you can remove the port mapping if you want.

Adding WebdriverIO

Similarly, you’ll need another service for the actual testing process. This service will run WebdriverIO, download and upload files from S3, generate diffs, and handle any other business logic specific to your needs. To get started, add something like the following to your docker-compose.yml.

testing:
  build: ./testing
  depends_on:
    - selenium-hub
  logging:
    driver: none # turns off logs in your terminal

Here we specify that this service depends on selenium-hub so that Selenium starts up before the testing service. Note that depends_on only controls startup order, not readiness, so you may need or want more control than it provides; check out the Docker documentation on controlling startup order for how to do that.

Setting up WebdriverIO

WebdriverIO is pretty cool, but it can be difficult to understand when first getting started. We’ll be using wdio, the test runner built into WebdriverIO (though it can be hard to see where one ends and the other begins). To set it up, run the following:

npm i -g @wdio/cli

wdio config

You’ll then be asked a bunch of questions to set up your environment. Below is an image of the questions you can expect.

WDIO Config Setup

This will generate a wdio.conf.js file with all your settings. The file has a ton of comments explaining each part of the config. If you’re having trouble getting the test runner to work, this file is the likely culprit. I’d spend some time going through the file the CLI tool generates and comparing it to a known-working version like the one below, which I’ve used myself.

Working wdio.conf.js
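Here’s a trimmed-down sketch of such a config (the spec path and timeout values are assumptions, and option names like hostname apply to WebdriverIO v5+, so adjust for your version):

// wdio.conf.js
exports.config = {
  runner: 'local',
  // Point the runner at the selenium-hub service from docker-compose.yml
  hostname: 'selenium-hub',
  port: 4444,
  path: '/wd/hub',
  specs: ['./tests/**/*.js'], // adjust to wherever your test files live
  maxInstances: 1,
  capabilities: [
    { browserName: 'chrome' },
    { browserName: 'firefox' },
  ],
  logLevel: 'warn',
  waitforTimeout: 10000, // bump these if the runner exits prematurely
  connectionRetryTimeout: 120000,
  connectionRetryCount: 3,
  framework: 'mocha',
  reporters: ['spec'],
  mochaOpts: {
    ui: 'bdd',
    timeout: 60000,
  },
};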

You’ll likely have to play around with some of the settings. For example, the timeout values may cause a problem and need to be increased, or the path to your test files might be different.

When going through the setup process, wdio will offer to install any additional packages you need based on your answers, including the reporter package and the testing framework, among others.

The last step I’ll cover is using WebdriverIO to actually run your tests. Take a look at the following code sample to see how I set this up. I define a few helper functions, and then simply loop through all the screen sizes and routes that I want to test.
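A sketch of that loop (the file names, sizes, and screenshot naming scheme are all illustrative):

// tests/screenshots.js
const sizes = require('./sizes.json'); // e.g. [{ "name": "iphone-se", "width": 375, "height": 667 }, ...]
const routes = require('./routes.json'); // e.g. ["/", "/login", "/dashboard"]

// Build a filesystem-safe name for each screenshot
function screenshotPath(size, route) {
  return `./screenshots/${size.name}${route.replace(/\//g, '_')}.png`;
}

describe('UI regression screenshots', () => {
  for (const size of sizes) {
    for (const route of routes) {
      it(`captures ${route} at ${size.name}`, async () => {
        await browser.setWindowSize(size.width, size.height);
        await browser.url(`http://app:3000${route}`);
        await browser.pause(1000); // crude wait for the page to settle
        await browser.saveScreenshot(screenshotPath(size, route));
      });
    }
  }
});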

Next, I define the list of screen sizes in a separate JSON file, which allows me to easily run a diff comparison for any number of screen sizes. This lets me test what the page looks like on an iPhone SE vs. an iPhone X vs. a 13-inch laptop, etc. Similarly, I pull the list of routes to test from an external location. You can set this up any way you want based on your architecture and requirements.

A few notes

  • Depending on the browser, I use a different method to resize the window
  • The URL of the site is http://app:3000/${route}. app is the name of the docker service running the website; I specify the port as well and use the route passed in as a parameter
  • I wait 1 second before taking the screenshot. There are likely better ways to know when the page has loaded than just waiting a bit: you could test for an element on the page, or wait for an event to have fired.

All the Other Bits

A few more scripts are needed for this to work: you’ll need to run the image comparisons, upload and download files from S3, and perform any other tasks specific to your situation. These are fairly trivial to write, so I’ll leave their design up to you.
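To give you a starting point, here’s roughly what a single comparison with Blink-Diff looks like (the paths and the 1% threshold are arbitrary choices):

const BlinkDiff = require('blink-diff');

const diff = new BlinkDiff({
  imageAPath: './baselines/home.png', // the known-good image
  imageBPath: './screenshots/home.png', // the fresh screenshot
  thresholdType: BlinkDiff.THRESHOLD_PERCENT,
  threshold: 0.01, // allow up to 1% of pixels to differ
  imageOutputPath: './diffs/home.png', // where the diff image is written
});

diff.run((error, result) => {
  if (error) throw error;
  if (!diff.hasPassed(result.code)) {
    console.error(`Found ${result.differences} differing pixels`);
    process.exitCode = 1; // fail the build
  }
});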

The Final Piece

The last question in all this is: what happens when a build fails this test? This is more of a process issue than a technical question. Is the diff found intentional? If so, then you need a way to update your baseline image to the new image. This requires another part, a diff reviewer that allows people — application developers, people on your QA team, product owners or anyone else who needs access — to accept or reject the changes found. Accepting the changes would require updating the baseline image to the new screenshot image, while a rejection would have to notify the dev team responsible for the changes.

The Output

Below is an example of the output produced by the diff tool. On the left is the baseline image, in the middle is the baseline overlaid on top of the screenshot, and on the right is the screenshot. The baseline has the drawing on it, since it was easier for me to scribble a quick edit onto the baseline rather than edit the contents of the page.

Example diff output

One cool way to expand on this would be to integrate an AI model that determines when a change is valid. You would either need to train your own model on good and bad UIs, or find a pre-trained model, then integrate it into the process and use it rather than blink-diff. While this would certainly be a cool advancement, it’s by no means a requirement, and simple image comparisons will take you far.

Potential Issues

I’ll end with a few open questions and potential issues that you may encounter.

  • You’ll need some business logic to determine whether baseline images exist. If they don’t, it’s the first run, and the screenshots should be named, labeled, or stored in a way that makes them the baseline images for subsequent runs (see the sketch after this list).
  • Timeouts can cause trouble. It’s possible you’ll have to increase the timeouts in the wdio.conf.js file to prevent the runner from exiting prematurely.
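As a sketch of that first-run handling (the directory layout is an assumption):

const fs = require('fs');

// Map a screenshot path to its corresponding baseline path
function baselinePathFor(screenshotPath) {
  return screenshotPath.replace('/screenshots/', '/baselines/');
}

// Returns true if a comparison should run; on a first run, promote the
// fresh screenshot to be the baseline and skip the diff.
function ensureBaseline(screenshotPath) {
  const baseline = baselinePathFor(screenshotPath);
  if (!fs.existsSync(baseline)) {
    fs.copyFileSync(screenshotPath, baseline);
    return false;
  }
  return true;
}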

While this is a complex problem, and this solution doesn’t solve everything, I hope it’s given you some insight into how you can build your own automated UI regression testing suite. Since all of this runs in Docker, it’s very easy to drop into your CI/CD pipeline, whether that’s Jenkins, Travis, AWS, or any other provider.

Happy testing!