Measuring Web Performance at Airbnb
Learn what web performance metrics Airbnb tracks, how we measure them, and how we consider tradeoffs between them in practice.
How long did it take for this web page to load? It’s a common question industrywide, but is it the right one? Recently, there has been a shift from using single seconds-based metrics like “page load”, to metrics that paint a more holistic picture of performance, representing the experience from a website user’s perspective. At Airbnb, measuring the web performance that our guests and hosts actually experience is critical. Previously, we described how Airbnb created a Page Performance Score to combine multiple metrics from real users into a single score. In this blog post, we describe the metrics that we consider important on our website and how they relate to industry standards. We also discuss some case studies that moved these metrics, and how they impacted the experience of website visitors.
Web Performance Metrics
There are five key performance metrics that we measure on our website. We chose these metrics because they represent performance as our users experience it, and because their definitions are simple, interpretable, and performant to compute.
We record these metrics both for direct requests to the site and for client-side transitions between pages (Airbnb uses a single-page app architecture). Below, we give an overview of these metrics, how we instrument them, and their relative weightings in our overall Page Performance Score.
Time To First Contentful Paint
Time To First Contentful Paint (TTFCP) measures the time between the start of navigation and the time at which anything appears on the screen. This could be text, a loading spinner, or any visual confirmation to the user that the website has received their request. We use the Paint Timing API for direct requests. For client-routed transitions, we have written our own instrumentation that is triggered when a page transition begins.
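To make the two cases concrete, here is a minimal sketch rather than our production instrumentation. The `PaintEntry` shape mirrors the entries the browser returns from `performance.getEntriesByType("paint")`; the `TransitionPaintTracker` class and its method names are hypothetical and only illustrate timing a client transition from the moment it begins.

```typescript
// Minimal shape of a paint timing entry (mirrors the browser's paint entries).
interface PaintEntry {
  name: string;
  startTime: number; // milliseconds since the start of navigation
}

// Direct requests: read the first-contentful-paint entry, if the browser reported one.
function ttfcpFromPaintEntries(entries: PaintEntry[]): number | undefined {
  const fcp = entries.find((e) => e.name === "first-contentful-paint");
  return fcp?.startTime;
}

// Client transitions: hypothetical tracker, not the actual Airbnb instrumentation.
class TransitionPaintTracker {
  private startedAt: number | undefined;

  // Call when the router begins a page transition.
  start(now: number): void {
    this.startedAt = now;
  }

  // Call once the first content of the new page has painted.
  // Returns the elapsed time, or undefined if no transition is in progress.
  firstPaint(now: number): number | undefined {
    if (this.startedAt === undefined) return undefined;
    const elapsed = now - this.startedAt;
    this.startedAt = undefined; // report at most once per transition
    return elapsed;
  }
}
```

In a browser, the timestamps passed to `start` and `firstPaint` would typically come from `performance.now()`.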
Time To First Meaningful Paint
Time To First Meaningful Paint (TTFMP) measures the time from the start of navigation to the point at which the most meaningful element appears on the screen. This is usually the page’s largest image or highest heading. This indicates to a user that useful information has arrived and that they can start consuming the page’s content.
To instrument TTFMP, product engineers tag their page’s meaningful element with an id — we call this the FMP target. We then recursively search for a page’s FMP target.
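The recursive search can be sketched as follows. This is an illustration under stated assumptions: `UINode` is a hypothetical stand-in for a DOM element, and the default id `"FMP-target"` is taken from the name we use for the target.

```typescript
// Hypothetical stand-in for a DOM element: an optional id plus child nodes.
interface UINode {
  id?: string;
  children: UINode[];
}

// Depth-first search for the node tagged as the FMP target.
function findFmpTarget(node: UINode, targetId = "FMP-target"): UINode | undefined {
  if (node.id === targetId) return node;
  for (const child of node.children) {
    const found = findFmpTarget(child, targetId);
    if (found) return found;
  }
  return undefined; // the page did not render an FMP target
}
```

When the search returns `undefined`, no first meaningful paint milestone can be recorded for that page view.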
It’s important to note that this metric requires manual instrumentation by our product engineers: every page must include an “FMP-target”, or we’ll never record the first meaningful paint milestone. To ensure that each page instruments TTFMP correctly, we report on how often this element is found on a given page. If it is found less than 80% of the time, due either to missing instrumentation or to conditional rendering of the FMP target, we trigger alerts to warn that the metric is not valid for that page. This requires developers to keep the TTFMP instrumentation up to date through page redesigns, refactors, and A/B tests.
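The validity check described above amounts to comparing a found rate against the 80% threshold. A minimal sketch, where the `FmpCoverage` shape and function names are hypothetical and the choice not to alert on pages with no traffic is our assumption for this example:

```typescript
// Hypothetical per-page counters collected from real user sessions.
interface FmpCoverage {
  pageViews: number;   // page views recorded for this page
  targetFound: number; // page views in which the FMP target was found
}

const MIN_FOUND_RATE = 0.8; // the 80% threshold described in the text

// Alert when the FMP target is found in fewer than 80% of page views.
function shouldAlertOnFmpCoverage(coverage: FmpCoverage): boolean {
  if (coverage.pageViews === 0) return false; // assumption: no data, no alert
  return coverage.targetFound / coverage.pageViews < MIN_FOUND_RATE;
}
```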
Instrumenting TTFMP automatically is difficult because it is hard to systematically know what element is the most “meaningful” on the page. Largest Contentful Paint addresses this by measuring the largest element on the page. We do not use Largest Contentful Paint because the browser API for this metric only returns the paint timing for initial load and is not available for client transitions in our single page app. If Largest Contentful Paint could be reset and used for client-side routed transitions too, we would use Largest Contentful Paint as a default that requires no manual instrumentation.
First Input Delay
First Input Delay (FID) measures the time it takes for the browser to start responding to user interaction. A low FID signals to the user that the page is usable and responsive. Conversely, anything over 50ms is a perceptible delay to a user. To support client transitions, we forked the first-input-delay instrumentation from web-vitals to reset the observation of the input delay.
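As a rough illustration of a resettable measurement (not the forked web-vitals code itself), the delay of an input is the gap between when the input occurred and when the browser began processing it. The `startTime` and `processingStart` fields mirror the browser's first-input entries; the `ResettableFirstInputDelay` class is a hypothetical name for this sketch.

```typescript
// Mirrors the two timestamps exposed on the browser's first-input entries.
interface FirstInputEntry {
  startTime: number;       // when the user interacted
  processingStart: number; // when the browser began running event handlers
}

function inputDelay(entry: FirstInputEntry): number {
  return entry.processingStart - entry.startTime;
}

// Hypothetical tracker that records only the first input after each reset.
class ResettableFirstInputDelay {
  private delay: number | undefined;

  onInput(entry: FirstInputEntry): void {
    if (this.delay === undefined) this.delay = inputDelay(entry);
  }

  // Call at the start of a client transition so the next page gets its own FID.
  reset(): void {
    this.delay = undefined;
  }

  get value(): number | undefined {
    return this.delay;
  }
}
```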
Total Blocking Time
Total Blocking Time (TBT) measures the total time for which the main thread is “blocked”. When TBT is high, the page may freeze or stop responding when scrolling or interacting, and animations may be less smooth. Tasks that take longer than 50ms are considered “long tasks” and contribute to TBT.
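A small worked example may help. It assumes the commonly used definition in which only the portion of each long task beyond 50ms counts as blocking time; the function and constant names are illustrative.

```typescript
const LONG_TASK_THRESHOLD_MS = 50;

// Sum the time each task spends beyond the 50ms long-task threshold.
function totalBlockingTime(taskDurationsMs: number[]): number {
  return taskDurationsMs.reduce(
    (sum, duration) => sum + Math.max(0, duration - LONG_TASK_THRESHOLD_MS),
    0,
  );
}

// Tasks of 30ms and 50ms contribute nothing; 120ms contributes 70ms; 70ms contributes 20ms.
const exampleTbt = totalBlockingTime([30, 50, 120, 70]); // 90
```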
One difficulty with using TBT is that it can be hard to attribute blocking to specific components or sections on our pages. For this reason, we have created a sub-metric we call interactivity spans, which captures blocking time that occurs within a specified window.
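One possible way to compute blocking time within a window is sketched below. This is an interpretation for illustration only: the `Task` and `Span` shapes and the clipping approach are assumptions, and the blocking portion of a task is again taken to be the time beyond its first 50ms.

```typescript
interface Task {
  start: number;    // ms, when the task began on the main thread
  duration: number; // ms
}

interface Span {
  start: number; // ms, window start (for example, when a component begins rendering)
  end: number;   // ms, window end
}

const BLOCKING_THRESHOLD_MS = 50;

// Blocking time that falls inside the span: clip each task's blocking
// portion (everything after its first 50ms) to the window and sum the overlap.
function blockingTimeInSpan(tasks: Task[], span: Span): number {
  let total = 0;
  for (const task of tasks) {
    if (task.duration <= BLOCKING_THRESHOLD_MS) continue;
    const blockingStart = task.start + BLOCKING_THRESHOLD_MS;
    const blockingEnd = task.start + task.duration;
    const overlap = Math.min(blockingEnd, span.end) - Math.max(blockingStart, span.start);
    if (overlap > 0) total += overlap;
  }
  return total;
}
```

Scoping the sum to a window tied to a specific component or interaction makes it easier to see which part of the page is responsible for the blocking.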
While we report the total blocking time, we know that not all blocking time is equal: time spent blocking user interaction is worse than idle blocking time. Another drawback is that blocking time accumulates indefinitely over the course of the page session, which makes the metric hard to collect synthetically and sensitive to session length. We’re investigating how to attribute specific blocking times to user interaction, and will follow the direction of the animation smoothness metrics in the web vitals initiative.
TBT is currently only available in Chromium-based browsers, and there is no polyfill available. In other browsers, we do not report TBT. However, we have found that even with limited browser support, TBT is a useful measurement of post-load performance.
Cumulative Layout Shift
Cumulative Layout Shift (CLS) measures the layout instability that occurs during a page session, weighted both by the size and distance of the element shift. A low CLS indicates to the user that the page is predictable and gives them confidence to continue interacting with it.
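As a simplified sketch, CLS over a page session can be computed by summing the score of each layout shift the browser reports. The `value` and `hadRecentInput` fields mirror the browser's layout-shift entries; excluding shifts that closely follow user input, and summing over the whole session without any windowing, are assumptions of this example.

```typescript
// Mirrors the fields used from the browser's layout-shift entries.
interface LayoutShiftEntry {
  value: number;           // shift score, combining the size and distance of the shift
  hadRecentInput: boolean; // true if the shift closely followed user input
}

// Sum the scores of unexpected shifts over the page session.
function cumulativeLayoutShift(entries: LayoutShiftEntry[]): number {
  return entries
    .filter((entry) => !entry.hadRecentInput)
    .reduce((sum, entry) => sum + entry.value, 0);
}
```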
CLS is also not available in every browser we support. Because there is no polyfill available, we do not report any value for CLS in those browsers. As with TBT, we find even partial browser coverage useful, since a layout shift in one browser likely also occurs in others.
Web Page Performance Score
We combine these scores using the Page Performance Score (PPS) system, described in the previous post in this series. PPS combines input metrics into a 0–100 score that we use for goal setting and regression detection.
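To illustrate the general shape of such a score, here is a minimal sketch of a weighted combination. Every number in it is a placeholder chosen for the example, not an actual PPS weight or threshold, and the linear scoring curve is an assumption; the only property taken from this post is that TTFCP is weighted higher than TTFMP.

```typescript
type MetricName = "ttfcp" | "ttfmp" | "fid" | "tbt" | "cls";

interface MetricConfig {
  weight: number; // relative weight (placeholder value)
  good: number;   // values at or below this earn full credit (placeholder)
  poor: number;   // values at or above this earn no credit (placeholder)
}

// Placeholder configuration for illustration only.
const CONFIG: Record<MetricName, MetricConfig> = {
  ttfcp: { weight: 30, good: 1000, poor: 4000 },
  ttfmp: { weight: 20, good: 2000, poor: 6000 },
  fid: { weight: 20, good: 50, poor: 300 },
  tbt: { weight: 20, good: 200, poor: 1200 },
  cls: { weight: 10, good: 0.1, poor: 0.5 },
};

// Map a raw metric value to a 0..1 score, linear between good and poor.
function metricScore(value: number, cfg: MetricConfig): number {
  if (value <= cfg.good) return 1;
  if (value >= cfg.poor) return 0;
  return (cfg.poor - value) / (cfg.poor - cfg.good);
}

// Weighted average of the per-metric scores, scaled to 0..100.
function pagePerformanceScore(values: Record<MetricName, number>): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const name of Object.keys(CONFIG) as MetricName[]) {
    weighted += CONFIG[name].weight * metricScore(values[name], CONFIG[name]);
    totalWeight += CONFIG[name].weight;
  }
  return (100 * weighted) / totalWeight;
}
```

With this structure, a regression in one metric and an improvement in another both flow into the same number, which is what makes tradeoffs comparable.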
Web Vitals and Lighthouse
Lighthouse is a tool that audits and scores a web page by running synthetic tests, while PPS scores pages according to real user metrics. Lighthouse is a powerful diagnostic tool; PPS lets us use real user metrics for goal setting and regression detection.
Web Vitals is a library that measures real user metrics, similar to PPS. However, it does not include a numerical scoring system like those of PPS or Lighthouse, and it does not yet account for client transitions inside a single-page application. We make use of Web Vitals by including and prioritizing similar metrics, so that PPS and Web Vitals stay aligned in direction.
Early Flush Case Study
When making changes to improve performance, we often run A/B tests to gather data on how successful our improvements were. Ideally, we would strictly improve performance by improving one or more of the metrics described previously. However, we sometimes see examples where one metric has improved at the expense of another. The PPS system streamlines decision making when considering tradeoffs.
As an example, on pages that have dynamic content (such as our listing pages), we previously CDN cached a generic version of the page that contained a loading state, leading to a fast TTFCP. We then ran an experiment to flush HTML content from the server early and skip this initial loading state.
The result of this experiment was a slower TTFCP without the CDN, but a faster TTFMP because we skip the initial loading state. Though we weight TTFCP higher than TTFMP, we found that the magnitude of improvement in TTFMP outweighed the regression in TTFCP and shipped the change. This type of decision is simple to make when we have a Web Page Performance Score to help us consistently evaluate tradeoffs.
We have seen through experimentation that these metrics correlate with positive user experience changes. Web PPS gives us a single score we can use for goal setting and regression detection, while also capturing many different aspects of user experience: paint timings, interactivity and layout stability. We hope that Web PPS can be used as a reference for implementing similar systems outside of Airbnb.
Our deepest thanks go out to our industry colleagues working on performance. As the industry evolves, Web PPS will also evolve.
Thanks to Luping Lin, Victor Lin, Gabe Lyons, Nick Miller, Antonio Niñirola, Aditya Punjani, Guy Rittger, Andrew Scheuermann, Jean-Nicolas Vollmer, and Xiaokang Xin for their contributions to this article and to PPS.