Holding the performance line by regression testing bundle sizes

Dario Gieselaar
Published in Zoover Engineering
Jan 20, 2019 · 6 min read

Because performance is important both for users and for SEO, we're taking steps to improve it on Zoover.nl. This is part two of a series in which we discuss some of the improvements we are making.

In an effort to increase awareness of — and encourage collective responsibility for — performance, we added tests that break the build on performance regressions for all pull requests. Three months later, we look back at not just the changes in performance, but also at how these tests impact our decision making.

Testing for performance

For every pull request, we deploy a test environment that contains the changes. As part of the CI pipeline that builds that environment, we run Lighthouse against several pages and compare the results to the staging environment. We have added a number of hand-rolled gatherers so we can run audits on custom metrics as well. If those metrics have decreased compared to the reference environment, or stay below a set limit, the tests pass. In all other cases, the build is flagged as failed and the change author is notified.
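As an illustration of what such a run boils down to, here is a minimal sketch using Lighthouse's Node API together with chrome-launcher. The audit ID and the pass/fail rule are simplified placeholders, not our actual configuration.

```ts
import * as chromeLauncher from 'chrome-launcher';
import lighthouse from 'lighthouse';

// Run Lighthouse against a single URL and pull out the numeric values we care about.
async function collectMetrics(url: string): Promise<Record<string, number>> {
  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless'] });
  try {
    const result = await lighthouse(url, {
      port: chrome.port,
      onlyCategories: ['performance'],
    });
    if (!result) throw new Error(`Lighthouse returned no result for ${url}`);
    // 'total-byte-weight' is a built-in audit used here as a placeholder;
    // custom gatherers/audits would surface their own values the same way.
    return {
      totalByteWeight: result.lhr.audits['total-byte-weight'].numericValue ?? 0,
    };
  } finally {
    await chrome.kill();
  }
}

// Compare the pull-request environment against the reference (staging) environment.
async function findRegressions(testUrl: string, referenceUrl: string): Promise<string[]> {
  const test = await collectMetrics(testUrl);
  const reference = await collectMetrics(referenceUrl);
  return Object.keys(test).filter((metric) => test[metric] > reference[metric]);
}
```

In practice the comparison also allows a small tolerance, which we come back to below.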

Measuring the right things

When we first explored this option a while ago, one of the issues we ran into was that Lighthouse’s default metrics (First Paint, Time To Interactive) were not particularly useful for comparing changes. Not only do they depend on response times (which are very unpredictable in our slower test environments), but we also ship a tag manager that executes third-party scripts on our page, in addition to our own code. The variation in these factors makes it impossible to reliably assess the impact of any proposed change. We decided to zoom in on our own code and compare the following metrics:

  • Critical JavaScript Size: This includes the size of all our bundled JavaScript. Keeping this down is important, because it impacts how long it takes for our site to become “interactive” — meaning that search works, drop-down menus work, display advertisements are loaded, etc. Before that is possible, our JavaScript needs to be downloaded, parsed, compiled, and finally executed. The more JavaScript we ship, the longer this takes, essentially blocking the user from interacting with the site.
A bar chart showing the impact of JavaScript on page interactivity (via)
  • Critical CSS Size: We also measure the total size of the CSS shipped to the browser. Until the CSS is downloaded and applied to the page, the browser shows a blank screen. The longer this takes, the more likely it is that the user just gives up and leaves.
  • Inline Styles: When this project was started a few years back, inline styles were the go-to solution for styling in React land. That has changed, however, as they are pretty bad for performance. Because they’re not supported by AMP either, and we want to keep that option open, we are moving away from inline styles. This metric measures the number of elements on a page that have inline styles (a simplified version of this check is sketched below).

Additionally, for posterity and debugging purposes, we also record the number of elements on a page, the response size of the HTML, and the size of the images on the page.
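The inline-style and element counts boil down to evaluating a small expression in the page. The sketch below is illustrative only: it uses Puppeteer directly rather than our actual Lighthouse gatherer, and the URL handling is a placeholder.

```ts
import puppeteer from 'puppeteer';

// Count elements carrying a style="" attribute, plus the total element count.
async function countPageElements(url: string) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.evaluate(() => ({
      // The "Inline Styles" metric: elements with a style attribute.
      inlineStyled: document.querySelectorAll('[style]').length,
      // Recorded for posterity and debugging.
      totalElements: document.getElementsByTagName('*').length,
    }));
  } finally {
    await browser.close();
  }
}
```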

The developer experience

While our initial implementation was pretty rough around the edges, we made various improvements over time that help us run accurate and consistent tests, and produce detailed logs so anyone can dive into a failed performance test and figure out why it’s failing:

  • We output the metrics to the console, with lighthearted emojis for the feelgood factor. We’re all human, so seeing a nice little green heart when you actually make things better is a pretty nice bonus.
A table displaying a metric comparison between the source and reference environments.
  • We report uncompressed sizes. This is more of a practical choice; our staging environment (and production) is behind a CDN that applies its own compression, and the compression rate is different from the compression on our test environments, which makes a direct comparison unreliable. We wrote our own Lighthouse gatherer to make this work.
  • We disable external scripts. There’s too much variation in how external scripts impact page speed, so we disable them in order to focus on the things that we can control. This comes with a disclaimer though: in our case, external scripts have an even bigger impact on page speed (50–70%), so it’s definitely important to keep them in check as well, but because managing this requires a more holistic approach, with many stakeholders, we had to remove them from the equation for now.
  • There’s a small threshold of 0.2%. If the change is within that threshold, the tests will still pass. Without a little wiggle room, even very small bug fixes would fail the performance tests and block us from getting critical things out the door (see the sketch after this list).
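A minimal sketch of that pass/fail rule, assuming each metric is compared against the reference environment with a 0.2% tolerance; the metric names and numbers are illustrative.

```ts
const THRESHOLD = 0.002; // 0.2% wiggle room

interface MetricResult {
  name: string;
  source: number;    // value measured on the pull-request environment
  reference: number; // value measured on the reference (staging) environment
}

// Collect every metric that grew beyond the allowed tolerance.
function evaluateRun(results: MetricResult[]): { passed: boolean; failures: string[] } {
  const failures = results
    .filter(({ source, reference }) => source > reference * (1 + THRESHOLD))
    .map(({ name, source, reference }) => `${name} grew from ${reference} to ${source}`);
  return { passed: failures.length === 0, failures };
}

// Example: a 100,000-byte bundle may grow by at most 200 bytes before the build fails.
console.log(evaluateRun([
  { name: 'Critical JavaScript Size', source: 100_500, reference: 100_000 },
]));
```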

Additionally, we store some artifacts that we can download from our CI tool:

  • A Webpack Bundle Analyzer report: For a great visual overview of what actually ends up in your bundle, you can create a report with webpack-bundle-analyzer. It shows a treemap of included modules, sized by their share of the minified asset. This helps us quickly see which modules got bigger or smaller, and why. It’s especially helpful in analyzing duplicate modules (like React DOM ending up twice in the bundle, or various Lodash instances). A configuration sketch follows after this list.
A report from Webpack Bundle Analyzer.
  • Diffs for the HTML, CSS and JavaScript. Our custom resource gatherer not only stores the uncompressed size of the response, but also the response’s body. We store this in the Lighthouse report, and then use it to generate diffs when comparing to the reference branch. The output is similar to git diff, and it’s especially helpful for CSS and HTML to figure out what exactly has changed (see the sketch after this list).
Example output when diffing a CSS file
  • The entire Lighthouse report as a JSON file. In this 1 MB+ file, Lighthouse stores a bunch of information about the run: all audits and metrics, but also things like screenshots and runtime errors. As a nice side effect, we were able to use these files to retroactively upload the performance test results to Elasticsearch, so we can analyze the changes over time.
  • Network errors and console messages. In some cases, we saw some weird, inconsistent results. To gather more debugging data, we’ve added custom gatherers for console messages and network errors, so we can figure out retrospectively what happened during a CI run. This was particularly useful, for instance, when we needed to figure out why a lazily loaded image was sometimes unexpectedly included in the reported size.
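Generating the analyzer report as a CI artifact can be as simple as adding the plugin in static mode. Here is a sketch, assuming a webpack configuration written in TypeScript; the report file name is a placeholder.

```ts
import { BundleAnalyzerPlugin } from 'webpack-bundle-analyzer';
import type { Configuration } from 'webpack';

const config: Configuration = {
  // ...entry, output, loaders, etc.
  plugins: [
    new BundleAnalyzerPlugin({
      analyzerMode: 'static',               // write a self-contained HTML report
      reportFilename: 'bundle-report.html', // picked up as a CI artifact
      openAnalyzer: false,                  // don't try to open a browser on CI
    }),
  ],
};

export default config;
```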
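The diff artifact can be produced with the jsdiff package, which generates git-style unified diffs. A minimal sketch, assuming the response bodies stored by our gatherer are available as strings:

```ts
import { createTwoFilesPatch } from 'diff';

// Produce a git-style unified diff for one resource, or nothing if it is unchanged.
function diffResource(
  name: string,
  referenceBody: string, // body stored during the run against staging
  sourceBody: string,    // body stored during the run against the PR environment
): string | undefined {
  if (referenceBody === sourceBody) return undefined;
  return createTwoFilesPatch(`reference/${name}`, `source/${name}`, referenceBody, sourceBody);
}
```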

Assessing the impact

To see which metrics we moved, we downloaded all the results from our CI tool, uploaded the scores to Elasticsearch, and created a dashboard that shows how our metrics progressed. The results are pretty good: JavaScript is down by almost 30% on average. A similar change can be seen in bootup time and main-thread work. CSS is up by 5%, but still under 50 kB. This is pretty great considering that the number of elements with inline styles dropped from almost 300 to 75.

A chart showing a steady decline in total JavaScript bytes shipped to the browser
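Getting each run into Elasticsearch is a single index call per result. A minimal sketch using the official @elastic/elasticsearch client; the index name and document shape are assumptions, and depending on the client version the payload key may be document rather than body.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Store one run's metrics so a dashboard can plot them over time.
async function storeRun(metrics: Record<string, number>, branch: string): Promise<void> {
  await client.index({
    index: 'performance-tests', // assumed index name
    body: {
      '@timestamp': new Date().toISOString(),
      branch,
      ...metrics,
    },
  });
}
```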

On the flip side, productivity took a bit of a hit, especially in the beginning. It can sometimes be really hard to figure out how to ship a new feature without running into performance regressions. For some this is a challenge, but for all of us it can be pretty frustrating. We’d like to think things are definitely getting better, though: we now see a failing test as a warning, an incentive for us to ship features economically and fix any lingering performance issues; it’s there to start a conversation, not to pass a final verdict on whether a change is accepted. It has increased awareness of performance not just among us as engineers, but among stakeholders too, and it has exposed some fundamental flaws in how we built things that can now be addressed. Hopefully the downward trend continues as we keep getting better at fixing our performance issues.
