EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Web Performance vs. User Engagement

How Vrbo™ correlates business events to performance data

Carlos Moro
Expedia Group Technology

--

Porsches are beautiful, engaging and fast — three qualities a good website should have. Photo by Francesco Lo Giudice on Unsplash

In this story, we will look at how Vrbo™ (an Expedia Group™ company) implemented an automated process to correlate business events to performance data. I hope it will inspire you to do the same.

But first we need to cover some basics. We all know that faster sites convert more customers. But it’s one thing to look at data from others and trust the same will apply to your website. It’s another thing to do your own research. When you analyze your own data, it suddenly becomes more meaningful.

The holy grail of performance monitoring is finding how performance correlates to conversion.

Disclaimer: If you are reading this, it probably means you have some interest in web performance, and most likely you don’t need convincing about the benefits of having a fast website. So I’ll skip the part where I try to convince you, but just in case you need some convincing, here is an excellent article from Google, or hundreds of case studies from WPO stats.

How do we measure site performance?

Skip this section if you are familiar with RUM and user-centric performance metrics.

RUM

Real user monitoring (RUM) is a passive monitoring technology that records all user interaction with a website or client interacting with a server or cloud-based application (source: Wikipedia). We use it as a gauge to track whether we are actually making improvements in the real world.

RUM events at Vrbo number in the hundreds of millions per day. As you can imagine, deriving performance insights from such a voluminous raw dataset in a reasonable amount of time is difficult. Therefore, we preprocess RUM data into aggregate summary statistics so that querying the data in real time becomes practical. These aggregates (and their subsequent derivations) are what populate our dashboards.

Spark job runs daily and aggregates data into summary statistics. Dashboards consume aggregated data to populate graphs
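The heavy lifting happens in the daily Spark job, but the core idea is simply bucketing raw timings into histogram counts per page and metric. Below is a minimal sketch of that idea in TypeScript; the fixed-width bucket size and field names are illustrative, not our actual schema.

```typescript
// Minimal sketch of the aggregation idea; the real pipeline is a daily Spark
// job, and the bucket size and field names below are illustrative only.
interface RumEvent {
  page: string;
  metric: string;  // e.g. "fcp", "ttfb"
  valueMs: number;
}

interface DailySummary {
  page: string;
  metric: string;
  count: number;
  histogram: Map<number, number>; // bucket lower bound (ms) -> event count
}

const BUCKET_SIZE_MS = 250; // hypothetical fixed-width buckets

function summarize(events: RumEvent[]): DailySummary[] {
  const byKey = new Map<string, DailySummary>();
  for (const e of events) {
    const key = `${e.page}|${e.metric}`;
    const summary = byKey.get(key) ?? {
      page: e.page,
      metric: e.metric,
      count: 0,
      histogram: new Map<number, number>(),
    };
    const bucket = Math.floor(e.valueMs / BUCKET_SIZE_MS) * BUCKET_SIZE_MS;
    summary.histogram.set(bucket, (summary.histogram.get(bucket) ?? 0) + 1);
    summary.count += 1;
    byKey.set(key, summary);
  }
  return [...byKey.values()];
}
```

Dashboards then only have to read these small daily summaries instead of scanning hundreds of millions of raw events.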

We also use synthetic monitoring at Vrbo, but this story is focused on RUM data.

Performance metrics

I have participated in numerous meetings at Expedia Group™ where the following question was asked: what performance metrics should we track? The main issue is not the availability of performance metrics; rather, choosing the ones that are meaningful to your website can be confusing and overwhelming.

“You make what you measure, so measure carefully”

I’m unsure of the original source of this quote, but I saw it for the first time in the book Chaos Monkeys by Antonio García Martínez. I find this quote perfectly encapsulates performance monitoring. It reminds us to carefully choose the performance metrics we track, because those are the metrics that will most likely improve.

Below are the performance metrics deemed most meaningful to Vrbo. The details of how we went about selecting them are somewhat complicated, but the choice was mainly driven by the perceived relationship between a metric and real user experience:

  • PAR (Primary Action Rendered): PAR is a custom metric that measures how long it takes for the most important feature of the page to be rendered to our end users. Since this is a Vrbo-specific metric, we will avoid using it in the examples of this blog post, but it’s one of our most important metrics.
  • FCP (First Contentful Paint): Definition from Google: First Contentful Paint (FCP) measures the time from navigation to the time when the browser renders the first bit of content from the DOM. This is an important milestone for users because it provides feedback that the page is actually loading.
    How quickly can the user see something?
  • FID (First Input Delay): Definition from Google: First Input Delay (FID) measures the time from when a user first interacts with your site (i.e., when they click a link, tap on a button, or use a custom, JavaScript-powered control) to the time when the browser is actually able to respond to that interaction. This metric helps us understand if we are using too much CPU while rendering the page.
  • TTFB (Time to First Byte): TTFB (aka `Backend Time`) is not necessarily a user-centric metric, but we still follow it closely because it allows us to isolate backend vs. frontend changes.
  • FPS (Frames Per Second): Frame rate is most familiar from film and gaming, but is now widely used as a performance measure for websites and web apps. It’s commonly used as a synthetic metric (e.g. Mozilla and Google) but it can also be measured from RUM. We implemented our own optimized utility to measure FPS in a browser environment and we are currently in the process of open-sourcing this code.

These are not the only metrics we measure. They’re just the ones we follow very closely.
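For context, here is roughly how FCP, FID, TTFB, and an FPS sample can be captured in the browser using standard web APIs. This is a simplified sketch, not our production RUM client (which batches events and uses our own optimized FPS utility); the `report` helper and the `/rum` endpoint are placeholders.

```typescript
// Simplified sketch of browser-side collection with standard APIs; the
// `report` helper and `/rum` endpoint below are placeholders.

// FCP: paint timing entries
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      report('fcp', entry.startTime);
    }
  }
}).observe({ type: 'paint', buffered: true });

// FID: delay between the first interaction and when the browser could handle it
new PerformanceObserver((list) => {
  const first = list.getEntries()[0] as PerformanceEventTiming | undefined;
  if (first) {
    report('fid', first.processingStart - first.startTime);
  }
}).observe({ type: 'first-input', buffered: true });

// TTFB: taken from the navigation timing entry
const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
if (nav) {
  report('ttfb', nav.responseStart - nav.startTime);
}

// FPS: count requestAnimationFrame callbacks over a one-second window
function sampleFps(): void {
  let frames = 0;
  const start = performance.now();
  function tick(now: number): void {
    frames += 1;
    if (now - start < 1000) {
      requestAnimationFrame(tick);
    } else {
      report('fps', (frames * 1000) / (now - start));
    }
  }
  requestAnimationFrame(tick);
}
sampleFps();

// Placeholder for whatever beaconing mechanism your RUM client uses.
function report(metric: string, value: number): void {
  navigator.sendBeacon?.('/rum', JSON.stringify({ metric, value }));
}
```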

Performance regions

“My site loads in 5 seconds”

How many times have you heard something similar? First of all, we need to define what “load” means. Is it when the first pixels are painted (FCP), or when the load event fires? Second, what kind of measurement was used? Is that 5 seconds the average (mean), the tp50 (median), the tp90, or something else?

Also, looking at single statistics is highly problematic. Here are a few examples from Rico Mariani’s post about Understanding Performance Regions (which served as a major inspiration for the work behind this blog post):

“Mean: You can commit any crime of variability and keep the mean constant.”

“P90: If you report only P90 you can commit any crime you like before or after the P90 as long as that one point stays fixed you’re fine. For instance if the best 50% all got somewhat worse that wouldn’t affect the P90 at all. Or if you moved the P90 down at the expense of say, P25 and P50, that isn’t good.”

“P50: Ibid mostly… improvements in the top half do not register, nor does worsening of the back half.”

Due to the problems highlighted above, at Vrbo we try to look at the distribution as much as possible:

Distribution divided into 3 regions: Satisfied, Tolerating and Frustrated.

As you can see in the screenshot above, we also separate the distribution into 3 regions: Satisfied, Tolerating and Frustrated. Those are the same regions defined in the Apdex standard. That’s not a coincidence: we use those 3 regions to calculate an Apdex score for long-term trending:

Apdex score for long-term trending.
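The Apdex calculation itself is straightforward: given a target threshold T, samples at or below T are Satisfied, samples between T and 4T are Tolerating, and anything slower is Frustrated; the score is (satisfied + tolerating / 2) / total. Here is a quick sketch, using an illustrative threshold rather than our production value.

```typescript
// Apdex = (satisfied + tolerating / 2) / total
// The 1000 ms threshold is illustrative, not Vrbo's production value.
function apdex(samplesMs: number[], thresholdMs = 1000): number {
  const toleratingCeiling = 4 * thresholdMs;
  let satisfied = 0;
  let tolerating = 0;
  for (const v of samplesMs) {
    if (v <= thresholdMs) satisfied += 1;
    else if (v <= toleratingCeiling) tolerating += 1;
    // anything slower is "frustrated" and contributes 0 to the score
  }
  return samplesMs.length === 0 ? 1 : (satisfied + tolerating / 2) / samplesMs.length;
}

// Example: 70 satisfied, 20 tolerating, 10 frustrated out of 100 samples
// gives (70 + 20 / 2) / 100 = 0.8
```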

How do we measure user engagement?

Understanding the effect performance has on user engagement is notoriously difficult to get right. How you measure user engagement therefore becomes something you must carefully consider, since it has a profound effect on that understanding. Please note that this is different from user conversion. Usually, conversion is attributed to users who bought something, or in Vrbo’s case, users who ended up booking a vacation rental. In this story, however, we are talking about user engagement, also known as micro-conversions. The idea is that users who engage more will have a higher propensity to convert.

The engagement metric can be a measure of any user behavior that signals they are “doing the thing”, and what that means is page-specific. For example, on Vrbo’s home page, if a user submits a search request, we count that as engagement; on a property details page, when a user interacts with the photos, we consider that user engaged; and on the booking page, a user who enters their email in the contact form is also considered engaged.

Examples of user engagement
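Instrumenting an engagement event is usually just a matter of firing a beacon when the page-specific action happens. Here is a minimal sketch for the home-page search example; the selector, endpoint, and payload are hypothetical.

```typescript
// Hypothetical example: count a home-page search submission as engagement.
// The selector, endpoint, and payload are illustrative only.
document.querySelector('#search-form')?.addEventListener('submit', () => {
  navigator.sendBeacon?.(
    '/events',
    JSON.stringify({ event: 'engagement', page: 'home', action: 'search-submitted' })
  );
});
```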

(Finally) Performance vs. engagement

Now that we’ve covered the basics, let’s put everything together.

The holy grail dashboard

Below is a screenshot of the holy grail dashboard we created at Vrbo™. This example is looking at TTFB correlated to user engagement for one of the most important pages at Vrbo™.

  1. Page being analyzed (redacted).
  2. Engagement metric used for this particular page (redacted).
  3. Total engagement change, calculated by combining the data from the entire graph.
  4. Period being analyzed (in this case, the last 30 days vs. the 30 days prior).
  5. Full performance histogram for the current period (blue bars) vs. the previous period (orange bars). This is similar to what we have in the Duration Histogram tab, but each bar is stacked to show the percentage of engaged (light color) vs. unengaged (dark color) events.
  6. Engagement rate for the current period (blue line) vs. the previous period (orange line). The engagement rate is the number of engaged events divided by the total number of events (sketched in code below).

Overall, this visualization gives developers a way to understand how performance changes in their pages/apps affect user engagement.
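To make items 5 and 6 concrete: the engagement rate for a bucket is simply the engaged events divided by the total events in that bucket, and the overall rate combines all buckets. Here is a small sketch with illustrative field names, not the dashboard’s actual schema.

```typescript
// Sketch of the per-bucket and overall engagement rates behind the dashboard;
// field names are illustrative, not the dashboard's actual schema.
interface BucketCounts {
  bucketMs: number; // lower bound of the performance bucket
  engaged: number;  // events where the user "did the thing"
  total: number;    // all events that landed in this bucket
}

function engagementRates(buckets: BucketCounts[]): { bucketMs: number; rate: number }[] {
  return buckets.map((b) => ({
    bucketMs: b.bucketMs,
    rate: b.total === 0 ? 0 : b.engaged / b.total,
  }));
}

function overallEngagementRate(buckets: BucketCounts[]): number {
  const engaged = buckets.reduce((sum, b) => sum + b.engaged, 0);
  const total = buckets.reduce((sum, b) => sum + b.total, 0);
  return total === 0 ? 0 : engaged / total;
}
```

Because the overall rate mixes per-bucket rates with how traffic shifts between buckets, it can move in a surprising direction even when performance improves, which is exactly what the case study below illustrates.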

Case study

In the real world, most changes will have some impact on both performance and functionality. That was the case for a recent change we made to one of our pages: in short, we converted a page from server-side rendering to client-side rendering while iterating on load performance improvements.

As expected, TTFB and FCP improved significantly, since with client-side rendering the browser receives the first bytes much faster, and can start rendering the first pixels earlier.

We know from our dashboard that faster FCP correlates with higher engagement. But to our surprise, we saw a lower engagement overall after our change. Hmm, what went wrong?

If we look carefully, we can clearly see the performance improvement represented by the higher counts in the buckets to the left (blue columns vs. orange columns). But at the same time, we see a lower engagement rate for most buckets (blue line vs. orange line). Those two opposing forces end up producing a slightly lower engagement rate overall.

Our hypothesis for the lower engagement rate is that low-end devices have more difficulty rendering the page client-side, resulting in more users bouncing. This makes sense, since the fastest buckets are not affected.

We hoped to see a major engagement boost with this performance improvement, and we were somewhat disappointed with the results. But at the end of the day, as long as the data is accurate, we have to accept that not all performance improvements are created equal.

Conclusion

  • Look at the distribution whenever possible.
  • Correlating business events to performance data is challenging, but extremely rewarding.
  • Most changes will impact both performance and functionality.
  • Improving performance at the expense of user engagement is an undesirable outcome.

Credits

  • Nan Liu — co-authored this blog post and implemented the visualizations above as part of her intern project at Vrbo during the summer of 2019
  • Brian Quinn — mentored Nan Liu and wrote most of the foundation code for the dashboards reviewed in this blog post
  • Chris Karcher — initial research on performance vs. engagement at Vrbo back in 2018
  • Patrick Ritchie and Seth Hodgson — sponsored this entire effort
