Despite its strengths, our deployment of brotli has been quite challenging. When we enabled brotli in a straightforward manner, it reduced bytes sent as expected. In individual tests to enable it on specific sets of pages, it didn’t hurt conversion rates or performance on about 25% of pages/requests, but doing so on other pages hurt one or both of these metrics. If we had enabled it in this naive manner fully on faith alone, our website would be worse off for it.
This post uses our trial-and-error deployment of brotli as a setting for what is, in essence, an incident review. Although I’m providing a survey of relevant topics in web performance and A/B testing as background, the reality is I didn’t know a lot of this when we started. Most of this post is about how we achieved good performance results in spite of our oversights, misunderstandings, and mistakes. The secret sauce is the use of A/B testing, which provides us a means to observe the effects of our changes, stop those changes if the effects are bad, and test again until the changes yield good results. Because, as the saying began, “To err is human, to recover, is Angelical; to persevere is Diabolical.” So it goes with product development.
Although this post aims to show that success is built on failure, we also hope that the story of our brotli deployment teaches you about the complexity of browsers, the complexity of the web, and the importance of what you measure (and how you measure it).
Booking.com’s journey with brotli compression started with (and has been sustained by) the curiosity of its engineers, who had started looking into brotli by mid-2016. The hopes they had for deploying it were roughly these:
- To make our pages load faster (by fitting the same data into fewer bytes, and thus maybe fewer packets, complete data would arrive sooner)
- To reduce our users’ data usage (bandwidth is expensive relative to incomes in some places, and roaming can make it pricier yet)
- And to reduce our own bandwidth usage (which may reduce or defer spending)
After a few forays introducing brotli, our engineers hit their first roadblock (a test with statistically significant degradations in performance and conversion rate), after which the original volunteers were no longer able to participate and I ended up continuing the quest. Let me catch you up on where we stand today and how we got there.
Our first brotli test (and the basics of performance experimentation on our site)
Our first A/B test with brotli in late 2016 was an attempt to dip a toe in the water. As such, it targeted the endpoints on which it was easiest to measure and implement.
What we measured and how we measured it
Implementation constraint #1: Dynamic, not static, resources
In our case, that meant experimenting on self-hosted endpoints (those whose requests are handled by Booking.com-owned and -operated webservers) that serve dynamic data and not static assets (we mainly delegate serving of static assets to CDNs). There are several reasons for this:
- Dynamic endpoints already integrate with our A/B testing framework, which makes it easier to run tests here than on CDNs. We can still run tests around the use of CDNs, but that requires setting up a new CDN domain (configured with the new behavior), and using an A/B test on our application servers to choose between the old CDN domains and the new CDN domains.
- Dynamic endpoints serve responses with a “Cache-Control: private” header to avoid those responses getting cached by browsers and proxies outside of Booking.com’s control. A/B testing against caches is messy. For one thing, a cache’s hit rate usually depends on how often it’s used, so splitting traffic between control and treatment reduces hit rate for both, potentially invalidating their comparison. Also, you have to decide how the cache should behave at the start of the test: whether the test starts with a warm cache for control and a cold cache for treatment, or both start cold, the difference between treatment and control at the beginning of the test does not generalize to normal operation, which complicates evaluation.
- Our usual metrics are oriented around time to load the page as a whole, which is highly variable and influenced by many factors. We believed that the effects of static asset compression would be hard to spot with such metrics. If testing on static assets, it might be better to set up measurement of asset retrieval latency and success via the ResourceTiming API and/or Network Error Logging.
For reasons similar to the last one, the initial test avoided AJAX-oriented endpoints and focused on endpoints that serve HTML.
Implementation constraint #2: Single-flush pages
There is a last constraint on this test’s recruitment criteria, and it sounds esoteric but it turned out later to be important. Our highest-traffic pages implement a performance optimization called “early-flushing”, and because of the way these pages are implemented in our web framework, it was hard to test both early-flushing and non-early-flushing pages at the same time.
If a page does an early-flush, that means that the page is not delivered all at once, but in two phases. The first phase — the early-flush — delivers the initial part of the document that specifies what blocking assets (mostly, CSS files) the page will need, so that the browser can start retrieving these immediately. The second phase then delivers the rest of the HTML document. In this way, time spent retrieving CSS files happens in parallel with our servers computing the rest of the HTML document, and thus the page is ready to render sooner. In many implementations (ours included), the first phase also delivers enough HTML to render some of the page’s ‘critical path’ elements (stuff at the top of the page that helps users orient themselves). This blog post from Will Hastings does a good job of explaining the optimization and showing its impact.
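To make the two phases concrete, here is a minimal sketch of an early-flushing response written as a Python generator. It is an illustration only, not Booking.com's web framework: the function name, the markup, and the `/static/main.css` path are all hypothetical, and it assumes a server that sends each yielded chunk to the client as soon as it is produced.

```python
import time
from typing import Iterator


def render_page(compute_body_seconds: float = 0.0) -> Iterator[bytes]:
    """Yield an HTML document in two phases, as an early-flushing page would."""
    # Phase 1 (the early-flush): enough of the document for the browser to
    # discover blocking assets and start fetching them right away.
    yield (
        b"<!doctype html><html><head>"
        b'<link rel="stylesheet" href="/static/main.css">'  # hypothetical path
        b"</head><body><header>Site header</header>"
    )

    # Stand-in for the slow server-side work that builds the rest of the page.
    time.sleep(compute_body_seconds)

    # Phase 2: the remainder of the document.
    yield b"<main>Search results go here</main></body></html>"


chunks = list(render_page())
```

The browser can fetch the stylesheet named in the first chunk while the server is still busy producing the second one, which is where the time saving comes from.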
For the first test, it was way easier to implement on pages that do not employ the early-flushing optimization.
Recap of test setup and summary of results
Okay, that was a lot of background. To sum up: in late 2016, our curious engineers ran Booking.com’s first test using brotli. If a request arrived to our server for one of our non-early-flushing endpoints, and the request indicated brotli compatibility, the browser would be recruited into the test and assigned to a condition. If the browser was in the control condition, our servers would not ever brotli-compress the response. If the browser was in the treatment condition, our servers would serve a brotli-compressed response.
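The recruitment rule described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our experimentation framework: the function names, the `brotli_test` experiment name, the 50/50 split, and the hash-based assignment are all hypothetical.

```python
import hashlib
from typing import Optional


def accepts_brotli(accept_encoding: str) -> bool:
    """Return True if the Accept-Encoding header lists br with a nonzero q-value."""
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "br":
            continue
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    return float(param[2:]) > 0
                except ValueError:
                    return False
        return True  # "br" listed without a q-value
    return False


def assign_condition(visitor_id: str, experiment: str = "brotli_test") -> str:
    """Deterministically split visitors 50/50 (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).digest()
    return "treatment" if digest[0] % 2 else "control"


def recruit(visitor_id: str, accept_encoding: str, early_flushing: bool) -> Optional[str]:
    """Return the condition, or None when the request is not part of the test."""
    if early_flushing or not accepts_brotli(accept_encoding):
        return None
    return assign_condition(visitor_id)


condition = recruit("visitor-123", "gzip, deflate, br", early_flushing=False)
use_brotli = condition == "treatment"
```

Hashing the visitor identifier keeps a given browser in the same condition across requests, so its measurements are not mixed between control and treatment.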
Our experimentation framework stored a number of measurements broken down by condition (control vs treatment, aka without/with brotli) to help us understand how these measurements change with the introduction of brotli. Comparing treatment against control, there were subtle movements in the sample distribution of the time-to-window-load metric, but nothing especially good or bad. In the end, we decided that the brotli treatment was better mainly on the basis of sending 10% fewer bytes over the wire.
Our second brotli test
The next test aimed to add brotli to the HTML-serving pages that early-flush. Since the engineers toying with brotli were doing it on the side, it was a few months before they found suitable hacking time to set it up.
Performance metrics: New and Improved!
By the time the test was set up, the company had converged on a different performance metric. The newer metric uses the same telemetry, but instead measures the duration from the navigationStart timestamp to a different ending timestamp, domContentLoadedEventStart. This timestamp is recorded right before the DOMContentLoaded event fires, so it counts the time to download the HTML document, decompress it, parse it, build the DOM, run deferred scripts, and build the CSSOM. Ilya Grigorik has an article about the critical rendering path and why DOMContentLoaded is a better measure of website speed than window load. We believe this metric is a far better approximation of how long it takes to get a page interactive, and the changes it shows are more likely to be actionable and related to the code changes our experimenters make. Here in 2020, performance evangelists recommend measuring page interactivity via time to Largest Contentful Paint, though the tests covered in this blog post didn’t use it.
Additionally, by this time our experiment tool was showing us more information about performance metrics. Instead of showing the mean and change in the mean, it had started to collect and display changes in distributions (visualized as a many-binned histogram) as well as changes in the mean, in sample counts, and percentiles. The distributions we see from the field for time-to-DCL (and other latency metrics) are relatively compatible with what the rest of the industry has reported. Below is one sample distribution for a performance metric:
To give a numerical sense of the dispersion: From 10% to 50%ile, and from 50%ile to 90%ile the values observed grow roughly 2x; but from 90%ile to 99%ile, and 99%ile to 99.9%ile they grow 3x in each jump. This much spread can make decision-making difficult: if a test improves performance for half of the samples, and degrades performance by the same amount for the other half, what do you do? Does your decision depend on which part of the distribution gets which effect? For example, the original brotli whitepaper uses the geometric mean, so that even rare latency increases (if they’re large) will push the mean a lot higher than a large number of small decreases can reduce the mean.
Results of 2nd test
Our engineers ran this test to introduce brotli on pages with early-flushing, and the brotli treatment substantially increased this latency (meaning: worse performance). Digging deeper, the negative effect was concentrated in the mobile website, though performance on desktop and tablet versions of the website wasn’t clearly different. The engineers asked for my help looking at the test, since I had been working on performance at the time, and looking along with them I couldn’t see anything useful. Even though our experimentation framework shows some metrics broken down by the browser, language, or country from which the user is browsing, none of these showed a pattern that’d help us divine a cause from the overall effect. We ended up guessing that decompression was straining the capabilities of mobile devices. But we had no hypotheses specific enough to act on, so we shelved it.
A few months later, some other engineers and I ran a series of experiments that offloaded content from our pages (for that content to be filled in later via AJAX requests), and saw substantial performance improvements correlated to smaller responses on the navigation. The two metrics that moved the most in this test were HTTP response size and time-to-DCL (both reducing, which is good for both metrics).
Based on this experiment’s results, I over-extrapolated and started to believe that sending fewer bytes over the wire had huge potential to improve performance.
Wait. Does sending fewer bytes actually drive performance?
In hindsight, there was a lot of evidence that I was wrong.
People in the field of web performance have known for a long time that the minimum time it takes for data to go to a server and to receive a reply — the round-trip time (RTT) — has a large effect on what ‘effective bandwidth’ you’re able to achieve for an internet connection of a given (maximum) bandwidth. Ilya Grigorik’s book “High-Performance Browser Networking” mentions it with back-references to More Bandwidth Doesn’t Matter (Much)(2010) and It’s the Latency, Stupid (1996–2001).
Using back-of-the-napkin math, if you try to download a 250KB HTML document and you’re able to get full utilization of a 2Mbps link to the server, that’d take about 1 second (250KB is 2 megabits). But by now most countries have an average mobile download speed of 13Mbps or more. The barrier to utilizing that bandwidth is often RTT. RTT drives one-time connection costs and underutilization when you start talking to a server (TCP handshake, HTTP/1.1 connection setup, TLS session setup, TCP slow start), but can also cause underutilization after a connection is up and running (TCP’s congestion control algorithms typically cut transmission speeds sharply, and recover them cautiously).
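The napkin math can be written down as a small sketch. The first function is plain bandwidth arithmetic. The second is an idealised model of TCP slow start and rests on assumptions that are not from our measurements: an initial window of 10 segments of 1460 bytes that doubles every round trip, with no packet loss and no cap from the link itself.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time to move size_bytes over a fully utilised link."""
    return size_bytes * 8 / link_bits_per_second


def slow_start_round_trips(
    size_bytes: int,
    initial_window_segments: int = 10,  # assumed initial congestion window
    segment_bytes: int = 1460,          # assumed payload per segment
) -> int:
    """Round trips needed to deliver size_bytes if the window doubles each RTT."""
    window = initial_window_segments * segment_bytes
    sent = 0
    round_trips = 0
    while sent < size_bytes:
        sent += window
        window *= 2
        round_trips += 1
    return round_trips


# The 250KB document on a fully utilised 2Mbps link, and on a 13Mbps link.
print(transfer_seconds(250_000, 2_000_000))
print(transfer_seconds(250_000, 13_000_000))

# Round trips under the idealised slow-start model, before and after
# shaving 10% off the document.
print(slow_start_round_trips(250_000), slow_start_round_trips(225_000))
```

Under these assumptions a document that is 10% smaller needs the same number of round trips, which is one way to see why a modest size reduction does not automatically become a latency reduction on a high-RTT connection.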
Some of the interventions that have given Booking.com the greatest reductions in time-to-DCL do not involve reducing transfer size at all. Early-flushing is one of these — we send almost the same document, just split into phases. As a byproduct of the surprising brotli result on early-flush pages, we spotted a high-traffic page that didn’t have the early-flush optimization. When we enabled early-flush on this page, the time-to-DCL values observed went down by 6–10% for the middle 80% of the distribution (the tails of the distribution didn’t change much, and the mean was 7.7% lower).
So, when I thought that reducing bytes sent had significant potential to make our pages faster, I was almost definitely wrong. But I didn’t know that, so nevertheless, I persisted. From here on I started playing with brotli on my own as a little R&D project.
Looking at brotli with fresh eyes
The first thing I tried was to look at the prior tests’ results. Since the first brotli test, the experiment tool had changed aspects of performance metric collection which removed a source of possible bias and we had converged on a more relevant metric (time-to-DCL). So it seemed prudent to re-test our existing uses of brotli on pages that don’t use the early-flush optimization. What I found on re-test of non-early-flush pages is that brotli hurt performance slightly on the mobile website, but was similar-or-better on the desktop website. On the basis of these results, I removed brotli from our mobile website. I also re-ran the test to add brotli to early-flushing pages, and got results comparable to the previous run: severely worse performance on the mobile site, but fine on the desktop site.
What I knew at this point was that brotli hurt our mobile website’s performance (slightly on non-early-flush pages and badly on early-flushing pages), but on the desktop site, performance was inconclusive irrespective of page type.
Okay, so it makes our pages slower. But how?
A/B testing should control for everything except for the recruitment criteria (and the treatments) themselves, but this still leaves a lot of possible causes. Factors that remain uncontrolled in our tests would include:
- The mobile devices themselves
- The difference in network connections between mobile and desktop
- The different ways HTML documents are constructed for mobile versus for desktop
- A bias in the set of pages participating in early-flush in mobile versus desktop
- User expectations in general or on these specific pages
- And the treatment (using brotli) itself
I scoured the web looking for evidence of other people having trouble deploying brotli, hoping to find a success story I could learn from and emulate.
One common objection about brotli is that it takes longer to compress at the default quality level (11), but our original tests and updated tests using levels 5 and 7 showed brotli adding very little server-side latency relative to the latency increases collected from clients. Even if testing had found compression latency to be large enough to drive time-to-DCL increases, it would not have explained why brotli only hurt latencies observed on the mobile website. So compression times didn’t seem explanatory for the imbalance in results between mobile and desktop websites.
It’s not easy to collect reliable evidence of bumping into resource limits on a client device in the field. If the device is saturated, that very same saturation may change user behavior. If the device’s user closes the tab, the telemetry isn’t sent and we’ll get less time-to-DCL data for those users who experienced worse performance (what I’ve heard referred to as “coordinated omission” in a talk from Gil Tene). And in fact our experimentation framework had flagged a sample ratio mismatch (SRM for short) in both A/B tests of brotli on early-flushing pages, indicating we were also failing to collect timings from browsers.
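For readers who have not met a sample ratio mismatch check before, one common way to do it is a chi-square goodness-of-fit test on the sample counts per condition. The sketch below is a generic version of that idea, not necessarily what our experimentation framework computes, and the counts in the usage lines are made up.

```python
import math


def srm_p_value(
    control_count: int,
    treatment_count: int,
    expected_treatment_share: float = 0.5,  # assumed intended split
) -> float:
    """P-value for the observed split under the intended split (1 degree of freedom)."""
    total = control_count + treatment_count
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi_square = (
        (control_count - expected_control) ** 2 / expected_control
        + (treatment_count - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi_square / 2))


# Hypothetical counts of timing samples received per condition.
balanced = srm_p_value(50_000, 49_900)
lopsided = srm_p_value(50_000, 49_000)
```

A very small p-value says the split is unlikely to be chance, which is a warning that samples are going missing unevenly and that the surviving timings may be biased.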
Although fast decoding is one of brotli’s explicit objectives, and its reputation matches that, I took a look at synthetic benchmarks of encoders/decoders. On the cp.html corpus (a 40kB HTML document), for device profiles with similar RAM and CPU to an iPhone 4–6 (raspberry-pi-2 and odroid-c1), decoding gzip takes ~25% longer than copying bytes around, and decoding brotli takes ~36% longer than copying.
For some time, that was our most promising lead: brotli can take a little longer to decompress, and maybe that’s slower on mobile devices. We ran tests which enabled it with different values of quality level and window size parameters of the encoder, hoping that these would improve performance by making decompression less resource-intensive. No combination of settings seemed to bring it in line with our baseline performance (though extreme configurations could cause more severe performance degradations, so there’s that).
Has anyone else had this problem but us?
Most posts about brotli are focused on its potential — enthusiastic about the compression ratios and other benchmarks — but don’t really speak of a partial or completed deployment. Those that do rarely give any sort of hint about the empirical methodology or the magnitude of the performance outcomes. If they do mention results, it’s likely to be the substantial size reductions. Bummer. There goes our attempt at “amateurs borrow, professionals steal”.
There is one noteworthy exception to this. In February 2019 (1 month after I had started looking at brotli again), one post stated the effect size of deploying brotli (a 37% latency reduction). Moreover, this post received a comment saying “We just started using Brotli and we see very good stats on desktop platform and somewhat sinkhole in decode metric on touch devices”.
In hindsight, there were clues that mobile devices’ computing power was probably not the main problem. User-Agent headers give a hint to the device requesting our pages, and if this hypothesis were true, the latency increase incurred with brotli use during the A/B test should correlate to device type that registered that latency sample, which doesn’t seem to be true. The benchmark data on decompression could have also been a clue that this was a bad hypothesis: if all of the time-to-DCL increase was due to decompression latency, and the synthetic benchmarks mentioned earlier were about correct, that’d imply that gzip decompression is also very slow, potentially 300ms for 1% of requests. So back-of-the-napkin math — although it doesn’t rule out this hypothesis — does make it hard to swallow.
A “crazy theory”
We got in contact with Google Research with the question. After some investigation, they had a “crazy theory” for us. Their general hypothesis was that the DOM pre-parser (which scans through the HTML document early and fast so it can start fetching critical resources like CSS files as soon as possible) is finding out about resources later on early-flush pages when brotli is in use. If you (like me) personify the pre-parser, you might imagine it is like Usain Bolt braced at the starting blocks, ready for the race to begin, but something makes him hear the starting signal late.
They even suggested a possible mechanism for this: Chrome (and presumably other browsers) buffers the bytes of the response body as they come in from the network, and the pre-parser won’t see anything until that buffer flushes and the data is decompressed. They also shared with us a link to the line in the Chromium source code where this buffer’s size is set at 32KiB.
Why the theory makes a ton of sense
If the browser buffer is indeed the specific mechanism behind the general cause, a browser receiving a highly-compressed response will keep that data buffered until the response is complete or the buffer is full.
With an early-flush, the server transmits some data, then pauses. If the data hasn’t filled this buffer, your browser is not yet acting on that data. Your browser is instead waiting for more, just like you are if someone says “I see you shiver with antici-” to you. In this way compression eats into the performance improvements granted by early-flushing: the better you compress, the slower you are to fill up the buffer, the later the pre-parser sees your <link> tags, the later the file retrievals are started, the later the stylesheets are available, the longer you wait until DCL can fire.
At the same time, enabling better compression on a page that’s served in a single flush should not hurt time-to-DCL since the buffer will be getting filled as fast as the network can deliver the bytes, with no awkward pause in between.
The buffering hypothesis explained the performance discrepancy between pages with early-flushes and pages without, which we had by then seen in multiple tests.
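A toy model makes the hypothesised mechanism easier to see. It is deliberately crude and is not a description of how any browser actually works: it assumes a consumer that hands data onward only when a fixed-size buffer fills or the response ends. The arrival times and chunk sizes in the usage lines are invented for illustration.

```python
from typing import List, Tuple

BUFFER_BYTES = 32 * 1024  # the 32KiB buffer size mentioned above


def first_visible_time(
    arrivals: List[Tuple[float, int]],
    buffer_bytes: int = BUFFER_BYTES,
) -> float:
    """Return when a buffering consumer first hands data onward.

    arrivals is a non-empty list of (time_in_seconds, compressed_bytes_received).
    Toy rule: data is released when the buffer fills or the response ends.
    """
    buffered = 0
    for when, size in arrivals:
        buffered += size
        if buffered >= buffer_bytes:
            return when
    # Buffer never filled: nothing is released until the final chunk arrives.
    return arrivals[-1][0]


# Hypothetical early-flush response: a small head, a pause, then the rest.
early_flush_page = [(0.05, 3_000), (0.40, 20_000)]

with_buffer = first_visible_time(early_flush_page)
without_buffer = first_visible_time(early_flush_page, buffer_bytes=1)
```

In this model the head that arrived at 0.05 seconds is not acted on until 0.40 seconds, so the head start that early-flushing was supposed to buy is lost whenever the flushed head is smaller than the buffer.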
In theory, there could be other mechanisms than the in-browser body buffer driving the pre-parser getting data later. For example, it’s also possible we weren’t instructing the brotli stream-encoder to flush the stream (to finish encoding all pending input and produce a completed output block which the recipient would be able to decompress as-is). But I checked our code (and ran a simulation) and it appears our early-flush does not suffer from this issue.
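The kind of simulation meant here can be sketched with Python's standard-library zlib module, used as a stand-in because brotli bindings are not part of the standard library. It is not the code we ran. The markup and path are hypothetical, and the point is only the difference between writing to a stream encoder and flushing it.

```python
import zlib

head = b'<html><head><link rel="stylesheet" href="/static/main.css"></head>'

# Without a flush, the encoder may hold the input back while it waits for
# more data, so the recipient cannot reconstruct the head yet.
unflushed = zlib.compressobj()
pending = unflushed.compress(head)
decoded_without_flush = zlib.decompressobj().decompress(pending)

# With a sync flush, everything written so far is emitted in a form the
# recipient can decompress as-is, and the stream stays open for more input.
flushed = zlib.compressobj()
early_chunk = flushed.compress(head) + flushed.flush(zlib.Z_SYNC_FLUSH)
decoded_with_flush = zlib.decompressobj().decompress(early_chunk)

print(decoded_with_flush == head)
print(decoded_without_flush == head)
```

If an early-flush is written to the socket without flushing the encoder first, the browser would receive few or none of the bytes it needs, no matter how its own buffering behaves.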
What good is a hypothesis?
This hypothesis really energized me. Not only is it compelling to learn more about how browsers work by seeing it in the code, and then explore its empirical consequences. But it also got under my skin a bit: splitting our pages’ generation into 2 phases for early-flushing had been laborious, and our users should be able to get the best out of both that and brotli, gosh darn it!
If these hypotheses are true, there are multiple test setups that should improve performance. I could have split page generation into more phases so that the first flush is always 32KiB, but it would’ve been a lot more upfront labor than I could commit to. In the spirit of using really bad ideas to prime the pump for better ones, I joked about padding the early-flushed head with an HTML comment full of incompressible random text to exceed the user agent’s buffer size. Ultimately the best idea was offered by Tom van der Woerdt: Link headers can instruct the browser about resources that need to be preloaded, and these would arrive in plaintext before the first byte of the HTTP response body does, presumably kicking off retrievals immediately.
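As a rough illustration of the Link header idea, the sketch below builds a preload header value for a list of stylesheets. The helper name and the asset paths are hypothetical, and it assumes the URLs are already valid and properly escaped.

```python
from typing import Iterable


def preload_link_header(stylesheet_urls: Iterable[str]) -> str:
    """Build a Link header value asking the browser to preload stylesheets."""
    return ", ".join(
        f"<{url}>; rel=preload; as=style" for url in stylesheet_urls
    )


header_value = preload_link_header(["/static/main.css", "/static/search.css"])
print(f"Link: {header_value}")
```

Because response headers are not part of the compressed body, the browser can read them without decompressing anything, whatever happens to the first bytes of the body.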
Running two tests for the price of one
This left me believing that I wanted to test two interventions — enabling brotli; and preloading CSS with Link headers in addition to link tags — and I was especially interested in observing their interaction. I set up an A/B/C/D test on our early-flush-enabled mobile pages to capture all combinations of those, believing that with Link headers, we could use brotli without a performance penalty.
I set that test up and ran it. The changes weren’t uniformly better or worse across the entire distribution for any variant, so it’s hard to pick a single representative summary statistic. That’s further complicated by the dispersal of the distribution (discussed earlier in this post). In the table below I summarize the relative change in time-to-DCL between the treatments and the control, for a small number of illustrative statistics. Recall that these are latency numbers, so if the change is negative/positive our pages got faster/slower, respectively.
Matching the prior tests, enabling brotli showed a substantial slowdown for most users: the brotli treatment had 1–5% higher time-to-DCL values from the 1%ile up to the 50%ile (but had lower time-to-DCL values at some high percentiles). Using Link headers without brotli gave a slight speedup over the control condition (which would support Tom’s idea that browsers act on Link headers slightly before link tags, and may also imply that early-flushing’s effects are partially lost with gzip-compressed responses too). Link headers + brotli was slightly faster than Link headers without brotli. Adding brotli on its own having a strong negative effect, and adding brotli to Link headers having a positive effect greater than the effect of adding Link headers alone, is strong evidence of our interference hypothesis.
Put another way, combining these two interventions provides an effect better than the sum of their effects, which is evidence that the interventions are not independent. More satisfyingly, finding an intervention with brotli that performs better than control (almost always), meant that Booking.com could give users faster pages in addition to lower data plan usage.
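The "better than the sum of their effects" comparison can be expressed as a small calculation over the four conditions. The helper below is a sketch of that arithmetic only. The latencies in the usage lines are made up to show the direction of the effects and are not the results of our test.

```python
from typing import Dict


def interaction_effect(latency: Dict[str, float]) -> Dict[str, float]:
    """Relative change versus control for each variant, plus the interaction term."""
    control = latency["control"]
    brotli = latency["brotli"] / control - 1
    link = latency["link"] / control - 1
    both = latency["link+brotli"] / control - 1
    return {
        "brotli_alone": brotli,
        "link_alone": link,
        "both": both,
        # What the combination does beyond adding the two separate effects.
        "interaction": both - (brotli + link),
    }


# Made-up latencies in milliseconds, NOT the measured results.
example = {"control": 1000.0, "brotli": 1030.0, "link": 990.0, "link+brotli": 980.0}
effects = interaction_effect(example)
```

If the two interventions were independent, the interaction term would be close to zero. A clearly negative value means the combination is faster than the separate effects would predict. A real analysis would apply this per statistic (for example per percentile) and account for sampling noise.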
Where we are today
This test brought brotli adoption on our consumer-facing website up to 47% of responses with a response body. After identifying a few gaps, I’ve gotten it up to 66%. For example, the way that we know what static assets to make Link headers for is by hooking into a particular function call for building static asset urls. But if a request doesn’t use that function for its static asset urls, the web framework has no Link headers to add and will not use brotli to stay on the safe side. This may also mean that pages that use Link headers and brotli are failing to put some critical path assets in Link headers, and we could improve our page speeds by addressing these gaps.
What’s still missing or unknown
The test results demonstrated not only that brotli interferes with something on our early-flushed pages but also that Link headers can remove that interference. However, these two facts are not definitive proof that the buffer was the main or only thing at play. In fact, there are still reasons to doubt that the above understanding is correct! Booking.com’s early-flushes with gzip are almost never larger than 3KB, so switching from gzip to brotli should not change whether browsers see our link tags earlier or not, unless those browsers have a much smaller buffer in use (recall that our experiment tool lets us see some effects grouped by browser, and we didn’t see a pattern here that would corroborate this either). Even if our early-flush was 33KiB with gzip and 30KiB with brotli, the hypothesis about the non-full buffer wouldn’t explain why early-flush pages on our desktop site did not suffer performance degradation with naive application of brotli.
Looking back on our journey, there are a lot of places where we got stuck or got lucky, and I hope that calling these out can help turn them into lessons for myself and others.
Here are some things that worked out great:
- Improvements in how our experiments record performance metrics: Since the first brotli test, the experiment tool changed these metrics, and they are now less biasable by user behavior (so now they’re more about the requests/pages themselves).
- Asking for help: Google Research gave a possible explanation that would never have occurred to us otherwise.
- Watercooler chats with colleagues: After I believed we had interference, I had no idea what might overcome it without being awful. It was by discussing the problem with others that I found out about Link headers, and came up with a reasonable way of applying them.
- “Test Every Change” paid off: Booking.com has a culture of testing every change in production, mostly via A/B tests. It turns out that not all optimizations play well together. If we had taken an article of web performance best practices and implemented all of them without testing, we would have a worse-performing website than we have today. A random walk that looks around at every step turns out to have saved us from walking off a short pier.
The gotchas and sharp edges mostly come down to “boy, the web is complex!”
- Different metrics tell you different things: Time-to-DCL is a good low-level metric for page interactivity for the kinds of site changes we’ve discussed here, but it’s not always right. There are other metrics available that are usable in the field, and time-to-DCL may not be suitable at all if your pages are constructed differently.
- Billions and billions of moving parts: Although we did get lucky in the ways stated above, not all website operators have mature telemetry and experimentation frameworks readily available to them. The way that optimizations like Link headers, early-flushing, and brotli interact with one another is hard to reason about from first principles. This is unfortunate, because it means a website operator spending time to make their site faster may unknowingly be spending their time on low-impact interventions, or even implementing a combination of interventions which work against one another.
- Fixating on the wrong pattern: After all the brotli re-testing, we saw brotli underperforming all across our mobile site, but with a much more severe underperformance on early-flushed pages. And we’ve eventually confirmed that early-flushing interferes with compression. And moreover, the solution that removed brotli’s interference with early-flushing also made brotli stop being worse on non-early-flushed mobile pages as well. This suggests that if we’d considered the most acute category of underperformance instead of (or in addition to) the most general category of underperformance, we might have been able to target our analysis and experimentation better. Then again, it’s too easy to use hindsight and attribute superhuman investigative powers to your counterfactual past self. 😁
- Mysteries remain: We now have a plausible explanation for why brotli interfered with early-flush, but it’s still not clear why the interference was large on our mobile website but negligible on our desktop website. Our desktop site’s HTML documents tend to be substantially larger, but the gzipped size of an early-flush is never more than 4 kilobytes on either platform, so it doesn’t seem like brotli would be a differentiating factor.
Our troubles getting brotli from 0% of pages to 47%, having to backtrack to 25%, and finally up to 66% are, to me, a story about benefitting from the groundwork laid by others (better metrics and metrics collection) and letting your data do the talking (while you practice active listening).
It’s also a story of luck. If I’d known then what I know now about our metrics and the web’s underutilization of bandwidth, I might not have picked up brotli where others left off. Because I didn’t know these things, I kept on. I’ve had the pleasure of finding things out, the website got a little faster via the addition of brotli, we got early-flushing on one of our highest-traffic pages, and you get a little more travel out of your data plan.
Hopefully via this post, your own brotli deployments deliver better results and go a little smoother.
Specific thanks go to:
- Marcell Szathmari, Ming-Ki Chong, Ævar Arnfjörð Bjarmason, and Quim Rovira: For work on the original tests, including updates to the Perl bindings and Booking.com’s internal interfaces.
- The folks we spoke to at Google Research: For listening to our confusing results and investigating to the point of a “crazy theory”. Without this we would have stalled out completely.
- Tom van der Woerdt: For suggesting Link headers.
- Diogo Antunes and Quim Rovira: For ongoing advice and support while I was pulling my hair out in all phases (R&D, implementation, testing, and authoring).