Performance Regression in a React App: Investigation and Remediation Strategies

Justin Dang
Hootsuite Engineering
Mar 20, 2024 · 9 min read
An image of a code editor with the React logo.
Photo by Lautaro Andreani on Unsplash

Introduction

At Hootsuite, ensuring high performance and minimal latency across our platform is a top priority. We achieve this by closely monitoring our latency Service Level Objective (SLO), which helps us maintain optimal performance standards throughout our platform.

An SLO serves as an important performance metric that outlines the level of service our platform aims to provide to users within a defined timeframe. It establishes measurable targets for performance, availability, or quality aspects of a service.

In the case of our Planner application, a React application for viewing scheduled, published, or draft content in a calendar view, our latency SLO states that 90% of users should experience a load time within a reasonable timeframe when accessing the app.

However, we’ve observed a concerning trend in our SLO violation rate. Historically, it hovered around 75%, but around mid-November 2023 it spiked to approximately 95% and has remained elevated since. This significant increase in latency violations indicates that users are experiencing slower load times when accessing the Planner app.

Investigation

In this section, we go through the tools we used to measure and monitor the Planner Latency SLO, along with the various areas we looked into to determine the root cause of the degradation of the Planner load time.

Monitoring and Alerting

Monitoring provides real-time visibility into the performance and health of an application. By continuously monitoring key metrics such as response times, error rates, and resource utilization, teams can quickly identify any anomalies or deviations from expected behaviour.

Along with monitoring, alerting plays an important role in proactively addressing potential issues identified through monitoring. By setting up alerts based on predefined thresholds or conditions, teams can be notified immediately when performance metrics exceed acceptable levels or when certain conditions indicative of problems are met. This allows teams to take prompt action to investigate and resolve issues before they escalate and impact users.

At Hootsuite, we use Grafana, an open-source analytics and visualization platform, to measure and monitor latency SLO metrics across our services. For alerting, we use Prometheus, an open-source monitoring and alerting toolkit, to send alerts to Slack whenever we exceed a certain threshold of the SLO error budget.

On November 23, 2023, an alert was triggered indicating that we were close to surpassing the threshold for the SLO violation. This alert prompted us to initiate our investigation into the latency issues affecting the Planner load time.

An image of an example Slack alert of the Planner Latency SLO violation.
Example Slack alert of the Planner Latency SLO violation

As seen in the graph below, the SLO violation rate began to rise on November 23, 2023, climbing from 80% to 120%.

An image of a graph illustrating an upward trend in the SLO violation rate before optimizations.
SLO violation rate started to increase on November 23, 2023 from 80% to 120%

Darklaunch Codes

We use darklaunch codes, also known as feature flags, to safely toggle features and bug fixes. During our investigation, we checked whether any significant darklaunch codes were enabled or disabled around mid-November, when the violation rate began to increase. However, we found no darklaunch code changes that could have impacted the load time of Planner.

Commit History and Git Bisect

Our investigation extended to inspecting the commit history for any notable changes that could have influenced the load time of Planner. To pinpoint potential problematic commits, we used git bisect, a powerful tool for identifying the introduction of bugs or performance regressions within the codebase.

By analyzing the commit history and utilizing git bisect, we aimed to isolate any changes that might have contributed to the observed degradation in load time. This methodical approach allowed us to narrow down our focus to specific code changes and assess their impact on the latency SLO metrics.

How to use git bisect:

  1. In your terminal, checkout master and enter git bisect start
  2. Tag the master branch as a bad commit with git bisect bad
  3. Find a commit that you think is a good commit and enter git bisect good <commit>
  4. Git automatically chooses a commit in the middle of the range between the good and bad commits. It checks out that commit, and you can test the code to determine if the regression is present
  5. Based on the result of your tests, mark the chosen commit as either good or bad with git bisect good|bad
  6. Git will continue choosing middle commits and asking you to mark them as good or bad until it identifies the specific commit where the regression was introduced

External Variables Affecting Planner Load Time

The Hootsuite platform features a range of asynchronous apps developed by different teams, all integrated within the primary app, known as the dashboard. Since the dashboard loads before any other app, delays in its loading process directly impact the loading times of the asynchronous apps. Changes to shared resources, dependencies, or underlying infrastructure within the dashboard can therefore introduce performance regressions to specific asynchronous apps, such as Planner. As part of our investigation, we worked with other teams to assess whether there were any infrastructure changes to the dashboard that could affect the load time of Planner, but we didn’t find any such changes.

Investigation Concluded

After several weeks of investigation, we were unable to pinpoint the exact cause of the regression. It’s possible that there was an influx of specific users with slow CPUs or internet connections, which could have skewed the latency metrics. Unfortunately, we lacked the necessary data or metrics to confirm this hypothesis.

Summary of the steps we took to narrow down the regression:

  • Checking if there were any darklaunch codes that were changed around the time of the SLO violation rate increase
  • Going through each commit (with git bisect) and running performance tests
  • Considering external variables that may have affected the Planner load time, such as infrastructure changes to the dashboard or changes to shared dependencies across our asynchronous apps

Despite our inability to identify the root cause, we still needed to bring the latency SLO back down. To achieve this, we recognized the importance of gathering more data and metrics to pinpoint where users were experiencing slowness within the Planner app.

In the following section, we outline the insights gained and strategies implemented to reduce the Planner latency SLO.

Addressing Latency Challenges: Insights and Strategies

The insights and strategies outlined in this section aim to mitigate latency and performance issues within the Planner app, offering a proactive approach to improving performance and meeting SLO objectives.

Auditing Render Blocking Resources

We brainstormed ideas on how to speed up the initial render of Planner. Since our performance tests showed no rendering regressions, we knew rendering performance wasn’t the issue and instead considered whether any resources or requests might be blocking the initial render.

We identified four network requests contributing to the delay, all of them internal Hootsuite requests containing data necessary for rendering Planner. The first was the JavaScript bundle, which cannot be deferred as it’s required to render Planner. The second contained user-specific data required for subsequent asynchronous calls. Further analysis revealed that one of the remaining requests could be made asynchronous and did not need to block the initial render of Planner.

Diagram A represents an example network waterfall before any optimizations. In this diagram, each request causes a delay for the following request.

A diagram representing an example network waterfall before any optimizations.
Diagram A

In Diagram B, the optimization is implemented by moving request 3 to execute in parallel with request 2, immediately after request 1. Request 3, which requires data from request 1, is called asynchronously and no longer blocks rendering. This change resulted in a reduction of the Planner render time from 800ms to 650ms.

A diagram representing an example network waterfall after optimizations.
Diagram B
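
This kind of change can be sketched roughly as follows, using hypothetical helpers fetchUserData (request 2), fetchPlannerContent (request 3), renderPlanner, and renderPlannerContent; the actual Planner code differs, but the idea is the same:

async function loadPlanner() {
  // Requests 2 and 3 are fired in parallel (both can start once the bundle, request 1, has loaded)
  const userDataPromise = fetchUserData();      // request 2: user-specific data
  const contentPromise = fetchPlannerContent(); // request 3: no longer blocks the initial render

  // The initial render only waits for the user data
  const userData = await userDataPromise;
  renderPlanner(userData);

  // The remaining content fills in asynchronously when request 3 resolves
  contentPromise.then((content) => renderPlannerContent(content));
}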

Gathering More Metrics

At Hootsuite, we utilize Sumo Logic for logging data to troubleshoot issues. Our objective was to analyze the Time to Interactive (TTI) of the Planner app by breaking down the request response times and render times. This breakdown would enable us to identify areas of slow performance and implement optimizations as needed.

By leveraging the Performance API, we can retrieve the response time (measured in milliseconds) of each request. While we already have metrics for our backend services, we wanted to capture the actual time experienced by each user, along with the TTI.

Additionally, we wanted to measure the initial rendering time of Planner without any content and the time it took to fetch and render content to the page. This analysis would help us assess whether any React performance optimizations were necessary for the Planner app. The TTI for Planner is calculated by adding together the request response times and render times.
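
One way to capture timings like these is with the browser’s User Timing API (performance.mark and performance.measure); the mark names below are illustrative rather than the ones used in Planner:

// Mark the start of the app load (e.g. at the top of the entry bundle)
performance.mark('planner-start');

// ...mark again once the empty Planner shell has rendered
performance.mark('planner-initial-render');

// ...and once the content has been fetched and rendered
performance.mark('planner-content-rendered');

// Measure the two phases between the marks
performance.measure('initialRenderTime', 'planner-start', 'planner-initial-render');
performance.measure('contentFetchAndRenderTime', 'planner-initial-render', 'planner-content-rendered');

const initialRenderTime = performance.getEntriesByName('initialRenderTime')[0].duration;
const contentFetchAndRenderTime = performance.getEntriesByName('contentFetchAndRenderTime')[0].duration;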

Below is example JavaScript code demonstrating how to retrieve the duration of each request (the sum of the queue time, stall time, request sent time, server response time, and content download time) and log the data:

const requests = performance.getEntriesByType('resource').reduce((acc, resource) => {
  const { duration, name } = resource;
  // Only log the requests we're interested in
  if (name.includes('request1') || name.includes('request2') || name.includes('request3')) {
    acc[name] = duration;
  }
  return acc;
}, {});

// `log` is our internal logging helper (backed by Sumo Logic); TTI, initialRenderTime,
// and contentFetchAndRenderTime are computed elsewhere in the app
log('Planner TTI metrics', {
  TTI,
  requests: JSON.stringify(requests),
  initialRenderTime,
  contentFetchAndRenderTime,
});

If you only need the time between when a request was sent and when its response started arriving (roughly the server response time), you can use the following code:

performance.getEntriesByType('resource').forEach((entry) => {
  // Time from when the request was sent to when the first byte of the response arrived
  const requestTime = entry.responseStart - entry.requestStart;
  if (requestTime > 0) {
    console.log(`${entry.name}: Request time: ${requestTime}ms`);
  }
});

Note that if the value of the requestStart property is 0, the resource may be a cross-origin request. To view cross-origin timing information, ensure that the Timing-Allow-Origin HTTP response header is set on the resource’s response.

There are also other ways to profile a React app and identify performance issues, such as the React DevTools Profiler or React’s built-in <Profiler> component.
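
As a rough illustration of the latter, the <Profiler> component reports how long a subtree takes to render on each commit (the Planner component below stands in for the real app root):

import { Profiler } from 'react';

// Called after each commit of the profiled subtree
function onRender(id, phase, actualDuration) {
  console.log(`${id} (${phase}) rendered in ${actualDuration}ms`);
}

function App() {
  // "Planner" is a placeholder for the real root component of the app
  return (
    <Profiler id="Planner" onRender={onRender}>
      <Planner />
    </Profiler>
  );
}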

Virtualization

Virtualization is a technique used to efficiently render large sets of data or complex UI components. It’s especially useful when dealing with lists, tables, or grids with a large number of items.

The primary goal of virtualization is to improve performance and memory usage by rendering only the items that are currently visible to the user, rather than rendering the entire list or dataset at once. This approach helps reduce the initial load time, improves scrolling performance, and reduces the memory footprint of apps.

A Hootsuite user can have over 1000 cards displayed in the Planner calendar, and each card can contain text and images. Rendering 1000 cards at once, each with an image, would be very slow: the browser has to lay out every card and download 1000 images, which puts a significant load on the CPU and leaves the page unresponsive while the cards render.

To address this issue, we used the react-intersection-observer library to only render cards that are visible on the page. This change resulted in a smoother and quicker experience for our users.
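
A simplified sketch of this approach, assuming a hypothetical Card component and post object (the real implementation differs): each card renders a lightweight placeholder until it scrolls into view.

import { useInView } from 'react-intersection-observer';

function Card({ post }) {
  // inView becomes true once the card enters the viewport; triggerOnce keeps it rendered afterwards
  const { ref, inView } = useInView({ triggerOnce: true, rootMargin: '200px 0px' });

  return (
    <div ref={ref} className="card">
      {inView ? (
        <>
          <img src={post.imageUrl} alt="" />
          <p>{post.text}</p>
        </>
      ) : (
        // A placeholder keeps the layout stable until the card is visible
        <div className="card-placeholder" />
      )}
    </div>
  );
}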

A GIF displaying a series of cards appearing as the page is scrolled downward.
The cards are rendered as the user scrolls down the page. In practice the animation is much quicker; it has been slowed down in this recording to showcase the virtualization.

The main drawback of virtualization is that the browser’s find-in-page search can’t match text in cards below the viewport, since unrendered content isn’t in the DOM.

Some other notable virtualization libraries include react-window, react-virtualized, and TanStack Virtual.

Outcome After Optimizations

With the render-blocking request optimization and virtualization in place, the SLO violation rate decreased from 110% to 60%. Additionally, Planner now loads much faster, allowing users to see their content sooner.

An image of a graph illustrating a downward trend in the SLO violation rate following the optimizations.
SLO violation rate started to decrease after the optimizations 🎉

Conclusion

The investigation into the latency issues affecting the Planner app at Hootsuite provided valuable insights and led to the implementation of effective remediation strategies. Despite the challenges encountered in pinpointing the exact cause of the regression, the proactive approach to monitoring, analyzing, and optimizing performance significantly improved the user experience.

By leveraging tools such as Grafana, Prometheus, and Sumo Logic, we were able to closely monitor latency metrics, identify deviations from the latency SLO, and promptly initiate investigations.

Furthermore, optimizations such as identifying and addressing render blocking resources, as well as implementing virtualization, played an important role in reducing the SLO violation rate and improving the responsiveness of the Planner app. These optimizations not only resolved immediate performance issues but also established a foundation for ongoing efforts to enhance performance.

By monitoring latency metrics and adopting a proactive and data-driven approach, we can ensure that the Planner app continues to meet user expectations for speed, reliability, and responsiveness.

Thanks for reading!

Justin Dang
Hootsuite Engineering

Senior Software Developer II on the Plan & Create team at Hootsuite