Web Performance Regression Detection (Part 3 of 3)

Michelle Vu | Web Performance Engineer

Pinterest Engineering Blog | Jun 28, 2024

Fighting regressions has been a priority at Pinterest for many years. In part one of this article series, we provided an overview of the performance program at Pinterest. In part two, we discussed how we monitor and investigate regressions in our Pinner Wait Time and Core Web Vital metrics using real time data from real users. In this article, we focus on the systems we have in place to proactively detect regressions and prevent them from being fully released to production.

A/B Experiment Checks

Because we collect performance metrics through our own logging, we can pipe these logs into our internal experiments framework. Pinterest has an excellent culture of wrapping any major user-impacting change in an A/B experiment, which enables us to detect the performance impact of these changes. Below, we describe how experiment regressions are detected and handled.

Graded Experiment Regressions

Experiments that show a significant performance regression on five or more of the last seven days of the experiment trigger Slack alerts and Jira tickets, which communicate information about the regression and track progress toward fixing it. Thresholds are defined per metric to grade the regression, and a specific set of next steps covering experiment ramping, investigation, mitigation, and tradeoff discussions is defined for each severity level (e.g., experiment ramping is blocked for high-severity regressions).

Figure 9: Example Jira ticket that is automatically generated when a performance regression is detected in an A/B experiment
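To make the grading concrete, here is a minimal TypeScript sketch of the five-of-the-last-seven-days rule and per-metric severity grading described above. The metric names and threshold values are illustrative assumptions, not Pinterest's actual configuration.

```
// Hypothetical sketch of the "5 of the last 7 days" rule and per-metric
// severity grading; metric names and thresholds are illustrative only.
interface DailyResult {
  date: string;
  significantRegression: boolean; // statistically significant vs. control
  relativeDeltaPct: number;       // e.g. +3.2 means treatment is 3.2% slower
}

type Severity = "none" | "low" | "medium" | "high";

// Example per-metric thresholds on the relative % increase (assumed values).
const SEVERITY_THRESHOLDS: Record<string, { medium: number; high: number }> = {
  PWT_HOMEFEED: { medium: 2, high: 5 },
  LCP: { medium: 3, high: 7 },
};

function gradeExperiment(metric: string, lastSevenDays: DailyResult[]): Severity {
  const significantDays = lastSevenDays.filter((d) => d.significantRegression);
  if (significantDays.length < 5) return "none"; // not enough evidence to alert

  const worstDelta = Math.max(...significantDays.map((d) => d.relativeDeltaPct));
  const thresholds = SEVERITY_THRESHOLDS[metric] ?? { medium: 2, high: 5 };
  if (worstDelta >= thresholds.high) return "high";   // e.g. blocks further ramping
  if (worstDelta >= thresholds.medium) return "medium";
  return "low";
}
```

A grade like "high" could then drive the automated Slack alert and Jira ticket shown above, with ramping blocked until the regression is addressed.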

Experiments Dashboard

For every experiment, we show all of the top-line performance metrics in the main dashboard of core metrics. The dashboard shows the relative percentage increase (red) or decrease (blue) for each PWT and Core Web Vital metric:

Figure 10: Top line performance metrics shown in the main dashboard for A/B experiments

Additional performance dashboards are available to help investigate any performance metric movements. These provide key submetrics for the chosen top-line metric so the experiment owner can investigate the symptoms of a regression and how the critical path has changed.

Figure 11: Additional performance dashboards are available to help investigate the critical path and symptoms of a regression for the selected performance metric

Real Time Graphs

When the experiment dashboards don’t provide sufficient detail, we can enable real time debugging metrics by tagging the experiment name in our performance logging. This enables detailed comparisons between the control and treatment groups for all the submetrics (e.g., log volume, constraint timings, annotation timings, network request stats, network congestion timings, and HTML streaming timings) mentioned in the previous article on Real Time Monitoring. Typically, this level of logging is only needed for platform-level changes, for which it may be difficult to narrow down the root cause of a regression.
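As a rough illustration of what tagging an experiment in the performance logging could look like, here is a sketch with a hypothetical logger; the names (PerfSample, logPerfSample, DEBUG_TAGGED_EXPERIMENTS) are assumptions, not Pinterest's actual logging API.

```
// Illustrative only: a hypothetical perf logger that attaches the experiment
// name and group to each sample so real-time dashboards can split submetrics
// by control vs. treatment.
interface PerfSample {
  metric: string;                      // e.g. a PWT or Core Web Vital name
  valueMs: number;
  submetrics: Record<string, number>;  // e.g. network request stats, HTML streaming timings
  experiment?: { name: string; group: "control" | "treatment" };
}

// Experiments currently tagged for real-time debugging (hypothetical name).
const DEBUG_TAGGED_EXPERIMENTS = new Set(["web_platform_migration"]);

function logPerfSample(
  sample: PerfSample,
  activeExperiments: Map<string, "control" | "treatment">
): void {
  for (const name of DEBUG_TAGGED_EXPERIMENTS) {
    const group = activeExperiments.get(name);
    if (group) {
      sample.experiment = { name, group };
      break; // tag with the experiment under investigation
    }
  }
  // ...send the sample to the real-time metrics pipeline...
}
```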

Performance regression detection within our A/B experiments has been a major form of protection over the years. In 2023 alone, over 500 experiment regressions were detected and tracked across all of our clients.

Per-Diff JS Bundle Size Checks

Another major form of protection on web has been the JS bundle size check that runs on every PR update via our CI pipeline. Historically, we’ve seen that over 25% of past PWT regressions resulted from increases in the amount of JS we send. It is not uncommon for these regressions to be severe (we’ve seen +800 ms increases to PWT P90 values from a single bundle size regression). In 2021, we turned on blocking alerts for the bundle size check and have since reduced the number of production regressions due to bundle size increases to near zero. Typically, 3–5 MB of bundle size regressions are caught and prevented in a single year. For example, in 2023, 2.8 MB of bundle size regressions were prevented, which would have equated to roughly 60 seconds of additional request duration on a slow 3G network.
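As a back-of-the-envelope check on that figure, assuming a slow 3G connection of roughly 400 Kbps (about 50 KB/s), similar to common browser throttling presets:

```
// Back-of-the-envelope check under an assumed ~400 Kbps "slow 3G" profile.
const preventedBytes = 2.8 * 1024 * 1024;    // 2.8 MB of prevented bundle growth
const slow3gBytesPerSec = 400_000 / 8;       // ~400 Kbps ≈ 50 KB/s download throughput
const extraSeconds = preventedBytes / slow3gBytesPerSec; // ≈ 59 s of extra download time
```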

Implementing the bundle size check was a matter of generating and storing asset sizes during our webpack build, which runs in our CI pipeline for the master branch and for every PR branch. For each PR build, we find the branch’s base commit, download its asset size file from S3, and use it as the baseline to compare the branch commit’s asset sizes against.
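A minimal sketch of that comparison step might look like the following, assuming the asset sizes come from webpack's stats output and that a hypothetical fetchAssetSizesFromS3 helper retrieves the baseline file; this is not Pinterest's actual CI code.

```
import webpack from "webpack";

type AssetSizes = Record<string, number>; // bundle name -> size in bytes

// Pull per-asset sizes out of webpack's stats at the end of the CI build.
function extractAssetSizes(stats: webpack.Stats): AssetSizes {
  const { assets = [] } = stats.toJson({ assets: true });
  return Object.fromEntries(assets.map((a) => [a.name, a.size]));
}

// Compare a PR branch's sizes against the baseline from its base commit.
function diffAssetSizes(base: AssetSizes, branch: AssetSizes) {
  const names = new Set([...Object.keys(base), ...Object.keys(branch)]);
  return [...names]
    .map((name) => ({ name, deltaBytes: (branch[name] ?? 0) - (base[name] ?? 0) }))
    .filter((d) => d.deltaBytes !== 0);
}

// In a PR build (pseudo-usage; fetchAssetSizesFromS3 is a hypothetical helper):
// const base = await fetchAssetSizesFromS3(baseCommitSha);
// const deltas = diffAssetSizes(base, extractAssetSizes(stats));
```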

Any significant change in bundle size, whether an increase or a decrease, is reported in a comment on the PR to help educate developers on how their code changes affect bundle sizes. Bundle size increases on critical pages additionally trigger a Slack alert sent to the PR author and to the surface-owning team’s alert channel, and the surface-owning team is added as a reviewer on the PR.

Figure 12: Example PR comment from the JS bundle size check for a critical regression
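The reporting side of the check could be sketched roughly as follows; the page names, the 10 KB threshold, and the notification helpers are all illustrative assumptions rather than Pinterest's actual values or integrations.

```
// Hypothetical reporting logic: any significant size change gets a PR comment;
// increases on critical pages additionally trigger a Slack alert and add the
// surface-owning team as a reviewer.
const CRITICAL_PAGES = new Set(["homefeed", "pin_closeup", "search"]); // illustrative
const SIGNIFICANT_DELTA_BYTES = 10 * 1024; // assumed reporting threshold: 10 KB

interface BundleDelta {
  name: string;
  page: string;       // surface the bundle is needed for
  deltaBytes: number; // positive = increase
}

// Placeholders for whatever CI / GitHub / Slack integrations a team already has.
declare function postPrComment(prNumber: number, deltas: BundleDelta[]): void;
declare function sendSlackAlert(author: string, delta: BundleDelta): void;
declare function addReviewer(prNumber: number, team: string): void;

function reportDeltas(deltas: BundleDelta[], pr: { number: number; author: string }): void {
  const significant = deltas.filter((d) => Math.abs(d.deltaBytes) >= SIGNIFICANT_DELTA_BYTES);
  if (significant.length === 0) return;

  postPrComment(pr.number, significant); // educate the author on the size impact

  for (const d of significant) {
    if (d.deltaBytes > 0 && CRITICAL_PAGES.has(d.page)) {
      sendSlackAlert(pr.author, d);               // notify the author and owning team's channel
      addReviewer(pr.number, `${d.page}-owners`); // hypothetical team naming
    }
  }
}
```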

These alert messages link to guidance on how to resolve the regression. Typically the regression comes from a new module import, which can usually be lazy-loaded. Root-causing and fixing the regression is simple enough that almost all regressions are resolved by the PR author without assistance (hooray for self-serve performance!). For cases in which the root cause is not obvious, the PR author is guided on how to run webpack-bundle-analyzer to investigate where the size increase is coming from:

Figure 13: A webpack-bundle-analyzer report used in investigating an actual bundle size regression that occurred
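For readers unfamiliar with the tool, one common way to produce a report like the one above is to add webpack-bundle-analyzer's plugin to the webpack config, or to run its CLI against a stats file. This is a generic setup, not necessarily how it is wired up at Pinterest.

```
// webpack.config.ts — generating a static treemap report with webpack-bundle-analyzer.
import { BundleAnalyzerPlugin } from "webpack-bundle-analyzer";
import type { Configuration } from "webpack";

const config: Configuration = {
  // ...existing entry/output/module configuration...
  plugins: [
    new BundleAnalyzerPlugin({
      analyzerMode: "static",              // write an HTML report instead of starting a server
      reportFilename: "bundle-report.html",
      openAnalyzer: false,
    }),
  ],
};

export default config;

// Alternatively, from the CLI:
//   webpack --profile --json > stats.json
//   npx webpack-bundle-analyzer stats.json
```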

This system has been a huge improvement over our old approach of monitoring bundle sizes in production, which was limited to a handful of critical, statically named bundles. With the per-diff bundle size check, we can easily check the sizes of all bundles we know a page needs at build time, and PR authors can detect and fix regressions on their own. This saves the Performance team a significant amount of work: detecting and root-causing production regressions, working with PR authors on fixes, and validating that a regression was resolved by monitoring the fix as it rolls out to production. It also prevents regressions from impacting users, since bundle size increases are typically resolved before the PR is merged.

Per-Diff Performance Regression Tests

While many regressions can only be detected when changes are released to real users, we are able to catch certain regressions in synthetic environments via performance integration tests. Previously, these performance integration tests ran on every master branch commit. Similar to the JS bundle size check, we have since migrated them to run per-diff (before PRs are merged) to prevent regressions from reaching users, promote self-serve performance, improve the regression catch rate, and reduce investigation time. We are preparing to turn on regression alerting for PR authors very soon and hope to share the implementation details and efficacy of these tests in an upcoming article.
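Since those implementation details are still to come, the following is only a generic sketch of what a per-diff synthetic comparison can look like, with illustrative names and a hypothetical noise tolerance; it does not describe Pinterest's actual tests.

```
// Purely illustrative: compare a lab-measured metric on the PR build against
// the base commit's measurement, tolerating some run-to-run noise.
interface LabRun {
  metric: string;   // e.g. a synthetic Pinner Wait Time measurement
  medianMs: number; // median over repeated runs to reduce variance
}

function hasSyntheticRegression(base: LabRun, branch: LabRun, tolerancePct = 5): boolean {
  const deltaPct = ((branch.medianMs - base.medianMs) / base.medianMs) * 100;
  return deltaPct > tolerancePct; // only flag changes beyond the expected noise band
}
```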

Overall Learnings

Year after year, the Performance team at Pinterest works on a combination of optimizations, tooling, and regression firefighting. As we’ve invested in better tooling over the years, we’ve been able to spend less time firefighting and more time optimizing. A few key learnings from our work on performance tooling include:

  • Real time, real user monitoring with granular time intervals and rich submetrics is invaluable for root-causing production regressions when changes are continuously deployed
  • Automated, proactive systems, such as per-diff and A/B experiment performance checks, are very effective as they:
    - Provide earlier detection, typically preventing regressions from fully reaching production and impacting users
    - Isolate the possible root causes for a regression
    - Enable self-serve performance, ultimately saving on engineering resources
    - Scale well with increases in the rate of commits, experiments, and other internal changes that occur as the company grows
  • Regressions are more likely to be investigated promptly and resolved if the alerts are actionable and the next steps are finite: regression alerts should be clear and come with easy-to-follow guidance that can be completed in a reasonable amount of time

These systems have provided immense protection against web performance regressions at Pinterest and, as a result, have improved both our internal velocity and the experience we deliver to our users.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
