This is the story of how Glossier got serious about measuring web performance, enabling us to reduce latency by 31% in 2020.
Specifically, we reduced the 90th percentile Time-To-Interactive latency for our homepage, product listings, and product detail pages across desktop and mobile, as measured with Real User Monitoring.
What does that even mean and why is it useful? Welcome to my TED Talk.
Facing the challenge
In early 2019, we recognized the need to focus on site latency. Glossier had clearly matured to a growth-stage company focused on expanding and retaining our customer base — as distinct from an early-stage startup making frequent, dramatic changes to discover product-market fit. For engineers and product managers, that meant a mindset shift to give greater importance to the performance and scalability of our product features.
There was broad agreement among engineers, our colleagues, and execs that the site was slow. So what to do?
“When you can measure what you are speaking about, and express it in numbers, you know something about it.” -Lord Kelvin
As our Tech team dug in, two central decisions emerged: what metrics should we focus on, and how should we measure them?
Choosing a Metric: Simplifying the Complex
The first challenge was deciding what to measure. The team brainstormed a cornucopia of ways to measure performance improvements.
Should we measure how optimized and responsive our images are? The size of our JS bundle? How many React components we render or XHR requests we make? The number of database queries for a backend response?
Each of these metrics affects site performance, and we had worthwhile projects that would have had a meaningful impact on any one of them.
But would customers care if our JS bundle shrank by 100KB, or we removed a SQL query from the home page rendering? Not so much. These are input rather than output metrics: the means to the end. We want our ultimate focus to be on customers’ perception of our site latency, not the many intermediate factors that contribute to site latency.
Performance measurement tools give us a handful of important metrics, including Time to First Byte, Time to First Paint, Time to First Contentful Paint, Time to First Meaningful Paint, First Input Delay, and Time to Interactive. Oof!
We again have a paradox of choice: trying to focus on all these metrics at once isn’t an effective strategy. We needed to focus on one that would be most impactful for customers.
We used these criteria to identify the most useful metric:
- Enduring: we aim to improve this metric for years to come. It lets us set ambitious targets and report quarterly progress to execs. We don’t want to redefine success with each project.
- Meaningful to non-engineers: it should be intuitive, preferably measured in time (like milliseconds) rather than a composite score (like Lighthouse) or an event count. Keep the nerdy explainers to a minimum.
- Creates healthy incentives for engineers: the metric should be strongly correlated with perceived latency. I.e., if we improve the metric, we also improve perceived latency. Moreover, avoid a moral hazard where we could improve the metric yet make little or negative impact on perceived latency. This eliminated some Paint-related metrics since it’s possible to quickly render a fancy loading widget that hits the paint metrics but doesn’t show the true content customers want.
Based on these criteria, we decided on Time to Interactive (TTI) as our most important metric. And gosh, did we have room to improve, with critical pages having a TTI upwards of 30 seconds.
Caveat / totally nerdy explainer: TTI’s implementation belies its simple name. We’ve found Glossier pages are functionally interactive much earlier than the TTI metric says. Calibre has a great summary on how TTI is implemented:
The browser’s main thread has been at rest for at least 5 seconds and there are no long tasks that will prevent immediate response to user input.
Glossier has several 3rd party JS integrations that cause a flutter of work on the main thread, which increases TTI without significantly impacting responsiveness.
Despite TTI being an inflated measure of perceived latency and interactivity, it has served us well as an enduring and intuitive metric for both engineer practitioners and our colleagues.
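To make that definition concrete, here is a minimal TypeScript sketch of the quiet-window idea. It is not how Lighthouse or Calibre compute TTI: it only looks at main-thread long tasks (tasks over 50 ms) and ignores the network-activity part of the real definition. The names `LongTask` and `estimateTti`, and all of the sample timings, are made up for illustration.

```typescript
// Simplified sketch of the "quiet window" idea behind TTI.
// Assumption: only main-thread long tasks are considered; in-flight
// network requests, which the real metric also checks, are ignored.

interface LongTask {
  start: number;    // ms since navigation start
  duration: number; // ms (long tasks are tasks over 50 ms)
}

const QUIET_WINDOW_MS = 5000;

function estimateTti(
  firstContentfulPaint: number,
  longTasks: LongTask[],
  traceEnd: number,
): number | null {
  const sorted = longTasks.slice().sort((a, b) => a.start - b.start);

  // Start at FCP and push the candidate forward past every long task
  // that begins before a full quiet window has elapsed.
  let candidate = firstContentfulPaint;
  for (const task of sorted) {
    const taskEnd = task.start + task.duration;
    if (taskEnd <= candidate) continue;                   // already behind us
    if (task.start - candidate >= QUIET_WINDOW_MS) break; // quiet window found
    candidate = taskEnd;
  }

  // The trace must run long enough to confirm the final quiet window.
  return traceEnd - candidate >= QUIET_WINDOW_MS ? candidate : null;
}

// Hypothetical timings: one first-party task, then two short third-party tasks.
const firstPartyOnly: LongTask[] = [{ start: 1500, duration: 300 }];
const withThirdParty: LongTask[] = [
  ...firstPartyOnly,
  { start: 4000, duration: 120 },
  { start: 8000, duration: 90 },
];

console.log(estimateTti(1200, firstPartyOnly, 20000)); // 1800
console.log(estimateTti(1200, withThirdParty, 20000)); // 8090
```

In this made-up trace, two brief third-party tasks move the estimate from 1.8 seconds to about 8.1 seconds, even though the page would feel responsive well before then. That is the kind of inflation described above.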
Choosing a Method: Synthetic vs Organic
With TTI as our metric of choice, we still needed to find our ground truth. How would we measure the TTI of our site?
There are two main methods:
- Generate highly-controlled results in a lab-like setting using synthetic (fake) traffic.
- Instrument our site so web browsers report actual organic measurements from the field, in all of its uncontrolled messiness. The industry term is Real User Monitoring (RUM).
Synthetic tools like Google Lighthouse, Calibre, and Datadog Synthetics are helpful because they give developers fast feedback where the only variable is your site’s code. And they can easily be run against non-production environments to show the performance impact of a new project. But synthetics could give a misleading picture of our customers’ actual experience since they can miss real world complications like a regional network outage, or an evolving mix of devices and network speeds.
RUM metrics, on the other hand, often vary for reasons outside of our control (like device and network speed). And some browsers and plugins block RUM code. But RUM still gives the clearest picture of our customers’ actual experience.
So we report on RUM data as the ultimate goal, and we use synthetics to guide our development: from initial proof-of-concepts to QA and alerting on regressions.
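As one way to picture the "alerting on regressions" part, here is a hedged sketch that compares the median TTI from synthetic runs of a candidate build against a baseline and flags pages that got noticeably slower. The `SyntheticRun` shape, the `findRegressions` helper, and the 10% tolerance are all hypothetical. They are not Glossier's actual setup or the API of any particular synthetic tool.

```typescript
// Hypothetical shape of one synthetic test run; not tied to any specific tool.
interface SyntheticRun {
  page: string;  // e.g. "homepage", "plp", "pdp"
  ttiMs: number; // Time to Interactive reported by the run, in milliseconds
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function groupByPage(runs: SyntheticRun[]): Record<string, number[]> {
  const groups: Record<string, number[]> = {};
  for (const run of runs) {
    if (groups[run.page] === undefined) groups[run.page] = [];
    groups[run.page].push(run.ttiMs);
  }
  return groups;
}

// Returns the pages whose median TTI grew by more than `tolerance`
// (an assumed 10% by default) relative to the baseline runs.
function findRegressions(
  baseline: SyntheticRun[],
  candidate: SyntheticRun[],
  tolerance: number = 0.1,
): string[] {
  const before = groupByPage(baseline);
  const after = groupByPage(candidate);
  return Object.keys(after).filter((page) => {
    const reference = before[page];
    if (reference === undefined) return false; // new page, nothing to compare
    return median(after[page]) > median(reference) * (1 + tolerance);
  });
}
```

Using the median of several runs per page is a design choice to dampen run-to-run noise. A real setup would tune both the number of runs and the tolerance.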
There were a few mechanical decisions about how to measure our RUM TTI:
- We focus on the 90th percentile of the data (TP90). This strikes a balance: the value reflects the experience of as many customers as possible, yet it still lets us drill into small customer segments without as much variance as a higher percentile would show.
- We usually split the data across key pages in our customer journey, reporting different values for our homepage, product listing pages (PLPs) and product detail pages (PDPs). These pages have quite different purposes, so it’s useful to look at their performance in isolation.
- Similarly, we split the data across desktop and mobile devices since they’re rather different experiences.
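The slicing described above can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: the `RumSample` shape and the `tp90BySegment` helper are hypothetical, and the nearest-rank method is one common way to define a percentile, not necessarily the one our RUM tooling uses.

```typescript
// Hypothetical shape of a single RUM measurement.
type PageType = "homepage" | "plp" | "pdp";
type Device = "desktop" | "mobile";

interface RumSample {
  page: PageType;
  device: Device;
  ttiMs: number; // Time to Interactive in milliseconds
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Groups samples by "page/device" and reports the 90th percentile of each group.
function tp90BySegment(samples: RumSample[]): Record<string, number> {
  const groups: Record<string, number[]> = {};
  for (const sample of samples) {
    const key = `${sample.page}/${sample.device}`;
    if (groups[key] === undefined) groups[key] = [];
    groups[key].push(sample.ttiMs);
  }

  const result: Record<string, number> = {};
  for (const key of Object.keys(groups)) {
    result[key] = percentile(groups[key], 90);
  }
  return result;
}
```

With small segments the nearest-rank TP90 lands on one of the slowest few samples, which is why higher percentiles get noisy quickly as you drill down.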
While we started collecting RUM data in 2019, it was in 2020 that we made significant improvements, especially on the mobile experience. Here are graphs comparing our TTI across different pages and devices, Jan 1-Dec 31 2020.
Let’s zoom in to our busiest day, Black Friday, and compare year-over-year for 2019 and 2020:
Double-digit improvements across the board. Overall, we reduced TTI by 31%. 💖
To be sure, we plan to keep up our momentum to fully meet our own targets and eventually achieve a Google Lighthouse “fast” TTI rating of under 3.9 seconds, especially in synthetic results.
Elevating a single, clear metric is a powerful way to build alignment and momentum on a major latency initiative. Carefully choosing your focus and measuring it well is an important first step, even while the team is eager to jump into implementing solutions.
We’re exploring alternatives to TTI that could more accurately represent perceived latency, namely the Core Web Vitals: Largest Contentful Paint, First Input Delay, and Cumulative Layout Shift. CLS in particular addresses our criterion of creating healthy incentives for engineers.
And while we made major improvements to Glossier’s latency in 2020, we’re still working hard on delivering an even better experience.
Would you like to join our team and make Glossier faster and more delightful? We’re hiring!
Special thanks to Kana Abe, Nicolas Duvieusart Déry, Alix Graham-Tremblay, Roman Korsunsky, Brian Quinn, and Simon Walsh for their contributions improving Glossier’s site latency.