Testing for unforeseen visual change

Your computer can (probably) see better than you

In part 1 I explained how we use unit tests to validate that the markup generated by our design system components matches our expectations from one release to the next. In this part, I’ll show how we check the visual presentation of those same components.

We start again, with a button.

Remember me?

Our humble button has 8 possible main colour variations, such as btn--primary, btn--success and btn--danger. It can also have a number of other modifiers, like btn--outline, btn--small, btn--naked or is-disabled, and it could have one, several or all of them at any one time. Oh, and don’t forget active, hover and focus, yeah?

You want a small, primary, disabled, outline button? We got you.

Certain classes need to take priority over others, regardless of the order in which they’re applied by the developer coding the UI. That’s a lot of variations to check, and a bit of a cascade jigsaw puzzle to code up. We’re pretty protective of the colour combinations for the default states because they all meet the WCAG 2.0 AA standard for colour contrast, something we aim for in all our user interfaces.

So if you use…

btn btn--primary btn--outline btn--small btn--disabled

…it should result in the same visual styling as…

btn btn--disabled btn--outline btn--small btn--primary

To test this I’ve set up grids of buttons, mapping out the various combinations of classes so we can visually scan through them while also checking their interactive states.

Just one of many button grids
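To give a feel for how quickly the combinations pile up, here is a minimal Python sketch that builds the class strings for a grid like this. It is an illustration rather than the script behind our real fixtures: the colour list is only the three variants named above, btn--disabled is assumed as the disabled class, and button_grid and orderings are hypothetical helper names.

```python
from itertools import combinations, permutations

# Only the three colour variants named in this post; the real list is longer.
COLOURS = ["btn--primary", "btn--success", "btn--danger"]
# Modifier classes mentioned in this post (the disabled class name is assumed).
MODIFIERS = ["btn--outline", "btn--small", "btn--naked", "btn--disabled"]


def button_grid(colours, modifiers):
    """Return one class string per colour and per subset of modifiers."""
    grid = []
    for colour in colours:
        for size in range(len(modifiers) + 1):
            for subset in combinations(modifiers, size):
                grid.append(" ".join(["btn", colour, *subset]))
    return grid


def orderings(class_string):
    """Return every ordering of the classes that follow the leading 'btn'."""
    base, *rest = class_string.split()
    return [" ".join([base, *order]) for order in permutations(rest)]


grid = button_grid(COLOURS, MODIFIERS)
print(len(grid), "class combinations")

same_button = orderings("btn btn--primary btn--outline btn--small btn--disabled")
print(len(same_button), "orderings that should all look identical")
```

Every string returned by orderings describes the same button, so each one should produce exactly the same screenshot.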

I have a page full of button combinations, 384 of them in fact. That’s so many that it’s unrealistic to expect anyone to visually check them for each release to make sure nothing has changed. So it’s time to employ one of my guilty pleasures: Visual Regression Testing 🙌

Remember those Twig fixture files we previously used with PHPUnit? I use the BBC’s Wraith tool to take screenshots of each and every one. Wraith uses a headless version of WebKit and stores those images as a visual baseline: a defined and verified standard that we want future screenshots to match, unless we’re making explicit changes, of course.

Every time we make changes to a stylesheet we can re-run Wraith, which takes a second set of screenshots, automatically compares them against the baseline and reports any differences, no matter how small.

This means that if one of our 384 buttons suddenly gets a different label colour than we were expecting, Wraith will tell us. If one of the buttons suddenly becomes wider by 1 pixel, Wraith will tell us.
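The idea behind the comparison can be sketched in a few lines of Python. This is not Wraith's own code, which hands the work to an image-diffing tool. It is a simplified illustration that treats an image as a list of rows of (r, g, b) tuples, and diff_ratio is a hypothetical name.

```python
def diff_ratio(baseline, latest):
    """Fraction of pixel positions that differ between two images.

    Each image is a list of rows; each row is a list of (r, g, b) tuples.
    A position that exists in only one image counts as a difference.
    """
    height = max(len(baseline), len(latest))
    width = max(
        max((len(row) for row in baseline), default=0),
        max((len(row) for row in latest), default=0),
    )
    if height == 0 or width == 0:
        return 0.0

    def pixel(image, x, y):
        if y < len(image) and x < len(image[y]):
            return image[y][x]
        return None  # outside this image's bounds

    changed = sum(
        1
        for y in range(height)
        for x in range(width)
        if pixel(baseline, x, y) != pixel(latest, x, y)
    )
    return changed / (width * height)


WHITE, BLUE = (255, 255, 255), (0, 90, 200)
baseline = [[BLUE, BLUE, WHITE], [BLUE, BLUE, WHITE]]
# The same tiny "button", now one pixel wider.
latest = [[BLUE, BLUE, BLUE], [BLUE, BLUE, BLUE]]

print(f"{diff_ratio(baseline, latest):.1%} of pixels changed")
```

Anything above zero means the new screenshot no longer matches the baseline, and it is then up to a human to decide whether that was intended.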

Wraith is that annoying colleague who hovers over your shoulder and points out the tiniest difference, but only when you ask it to, which is nice.

Shall we play a game?

Here are two screenshots of that button grid, but I’ve made a change. Can you see it? Try not to scroll down too much until you’ve had a good look…

Stop scrolling, can you tell what it is yet?

Figured it out yet?

I can’t tell, but Wraith can.

Wraith: “What the f… m8??”

I lightened the label colour for the naked buttons by 1%, too small a difference to be detected by my eyes, but more than enough for a computer to notice. Admittedly, I had to dial Wraith’s tolerance all the way down to zero to get it to pick up a difference this small (you can tell it to work with a certain percentage ‘fuzz’ factor), but I wanted to make a point ;)
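A rough way to picture what a fuzz factor does is a tolerance applied when two colours are compared. The Python sketch below is a deliberate simplification and not how Wraith measures colour distance: it assumes a per-channel tolerance expressed as a percentage of the full 0 to 255 range, and both the colours_match helper and the colour values are hypothetical.

```python
def colours_match(a, b, fuzz_percent=0.0):
    """True if every channel of a and b differs by at most fuzz_percent of 255."""
    allowed = 255 * fuzz_percent / 100
    return all(abs(x - y) <= allowed for x, y in zip(a, b))


label = (51, 51, 51)    # hypothetical label colour
lighter = (54, 54, 54)  # roughly 1% lighter on each channel

# With no tolerance the tiny change is reported as a difference.
print(colours_match(label, lighter, fuzz_percent=0))
# With a 5% tolerance the same change is treated as a match.
print(colours_match(label, lighter, fuzz_percent=5))
```

The trade-off is the usual one: a tolerance of zero catches everything, including rendering noise you may not care about, while a generous tolerance hides small but real regressions.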

At the end of the process we’re informed about any failures, although “failures” is a strong word: they’re just differences, and it’s down to us to decide whether those differences are intended or not. We’re even given a handy HTML gallery of all the screenshots so we can easily work through and visually check whatever it has found.

Gallery of Fails

Spend time setting up a safety net

If you work on a design system, whatever its size, spending time learning and setting up a tool like Wraith will reduce the amount of time you spend manually checking your components every time you make a change, and it’ll let you release with more confidence that unintended changes haven’t leaked out to other components.

No-one likes leaky styles

We don’t use visual regression testing to fail builds, but it’s becoming an extremely valuable safety net to run before committing changes or performing a release of our design system. We’ve all had issues where a change to CSS in one place affects other components in ways we weren’t expecting, and Wraith helps us catch those.

There’s an investment to be made in setting up the tests. We were lucky enough to be able to tap into the hundreds of helper tests we already had and make them work in Wraith too, and we’re now adding test cases (like the button grids) purely for the visual regression testing.

Finally, the biggest benefit I’ve found is during refactoring. I’m much more confident in making larger-scale changes because I can run my visual tests and have Wraith check all of my components for me.