Latency Whack-a-Mole: Our Renderer’s Sawtooth Mystery Solved

Arnau Guadall Godino
Typeform's Engineering Blog
7 min read · Jul 4, 2024
Diving into the mystery

Introduction

Ever notice how technology keeps an eye on itself? Tools like Datadog help us track important metrics, like how fast our systems respond to requests. But how do you dig deeper and find a solution when those numbers start acting strange?

This is what happened to us recently, and in this post we would like to share the process of our investigation into an interesting performance issue in one of Typeform’s central services, the Renderer. It is responsible for displaying forms to end users, making it one of our core services. As the entry point for all forms, it handles a substantial amount of traffic: millions of unique visits per day! Given its critical role, ensuring the Renderer responds quickly and efficiently is vital for a good user experience.

After we deployed a new version of this service, updating many dependencies and some adjacent code, a curious sawtooth pattern appeared when plotting the response time. One morning, while sipping a cup of coffee and looking at the Datadog graphs, we couldn’t help but notice this strange pattern. We were lucky because we deploy the Renderer quite frequently, which meant the accumulation of response times resolved itself for a few hours whenever we deployed. But after paying more attention, we saw the Renderer’s response time climb and climb for about a day until a sudden drop back to “normal”, then the same pattern repeating. A quick analysis showed that the drops in response time were caused by container restarts, either intentional ones during deployments or restarts triggered when Kubernetes deemed the containers unresponsive.

Average request latency per endpoint

Reverting the change would have been an option, but we needed to update those dependencies at some point, and the nature of the issue made it difficult to reproduce outside production. And anyway, why would we let this opportunity pass to try out some nice tooling and learn something in the process?

In this post, we’ll walk you through how we figured out what was happening, from spotting the problem to fixing it. It’s a journey that reminds us why keeping a close watch on our systems is so important.

Root Cause Analysis

To tackle the issue, we first started collecting all the information we had available:

Metrics and Tracing

Tools like Datadog Metrics are really great for detecting that there is a problem and even pinpointing the time it first occurred. With distributed tracing, we could confirm the issue and narrow it down further.

For example, we found that the Renderer was at times taking about 100ms to call forms-api, the service that stores form definitions, while forms-api itself only took a few milliseconds to process the request. This discrepancy between outgoing and incoming span durations was suspicious to us. Even though most of that time is naturally spent on the network and waiting for responses, the Renderer was spending much more time in the code calling the API and processing the response than we had expected.

forms-api GET (red) is the span that the renderer generates for making a call to Forms API and it takes 103 ms. Forms API itself retrieves the form from the cache and returns the form definition after only 6ms. The discrepancy is usually spent in network transmissions, but almost 100ms spent on the network seemed excessive and did not match our experience from interactions between other services.
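To make that comparison concrete, here is a minimal sketch, not our actual instrumentation, of how the client-side wall-clock time of a call can be compared against the processing time the upstream service reports for itself. The endpoint URL and the timing header are hypothetical:

import axios from 'axios'
import { performance } from 'node:perf_hooks'

// Hypothetical helper: fetch a form definition and log how much of the
// wall-clock time was spent outside the upstream service itself
const fetchFormDefinition = async (formUid: string) => {
  const start = performance.now()
  const response = await axios.get(`https://forms-api.internal/forms/${formUid}`)
  const clientMs = performance.now() - start

  // Assumes the upstream exposes its own processing time in a response header
  const serverMs = Number(response.headers['x-processing-time-ms'] ?? 0)

  // If clientMs (~100ms) dwarfs serverMs (~6ms), the difference is spent on
  // the network or inside our own process (serialization, event loop delay, etc.)
  console.log({ formUid, clientMs, serverMs, overheadMs: clientMs - serverMs })

  return response.data
}

In practice, distributed tracing gave us this comparison for free, but the mental model is the same: the outgoing span measures everything our process and the network add on top of the upstream’s own processing time.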

We still didn’t have enough to locate the issue; we needed more detail about where the service was behaving oddly, so we turned to a profiler. However, we had failed to reproduce the issue locally with artificial traffic, and even if we had succeeded, it seemed the problem would only appear after about a day. This is why we chose to collect additional Node.js runtime metrics and set up the live, sampling Datadog profiler directly in production.
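For context, enabling both features in a Node.js service that is already instrumented with dd-trace is roughly a one-line change at startup. The sketch below reflects our understanding of the dd-trace-js options; the exact configuration depends on your setup:

// tracer.ts: this module must be imported before anything else in the service
import tracer from 'dd-trace'

tracer.init({
  profiling: true, // continuous, sampling CPU and heap profiler
  runtimeMetrics: true, // Node.js runtime metrics (event loop, GC, heap, CPU)
})

export default tracer

The same toggles are also available as environment variables (DD_PROFILING_ENABLED and DD_RUNTIME_METRICS_ENABLED), which lets you enable them without touching the code.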

Runtime Metrics

We also added Runtime Metrics to our toolkit. These metrics gave us important information such as how busy the event loop was, any delays in the event loop, and details about memory and CPU usage. With this complete set of data, we began a thorough investigation to understand why our system was experiencing latency issues.

The CPU usage showed no congestion, only a slight decrease after the deployment
The memory usage was stable at all times
The event loop delay kept building up and dropped immediately after each deployment

During the period in which the response latency accumulated, we observed an increase in CPU usage and event loop delay. The more the delay grew, the more congested the instance running our service became. However, we did not observe any increase in memory usage. At this point, we started to suspect expensive object creation, hidden somewhere in the code, as the culprit.
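Event loop delay is also something you can observe straight from Node.js itself, independently of any vendor tooling. Here is a minimal sketch using the built-in perf_hooks histogram:

import { monitorEventLoopDelay } from 'node:perf_hooks'

// Sample the event loop delay at a 20ms resolution and log percentiles every minute
const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  const toMs = (nanoseconds: number) => nanoseconds / 1e6
  console.log({
    meanMs: toMs(histogram.mean),
    p99Ms: toMs(histogram.percentile(99)),
    maxMs: toMs(histogram.max),
  })
  histogram.reset()
}, 60_000)

A p99 that keeps creeping up between restarts is exactly the kind of slow congestion we were seeing in Datadog.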

Profiler

The Datadog Profiler allows us to choose different time frames and compare their flame graphs. It even conveniently highlights functions in which more time is spent in a given sample compared to another.

In our case, we knew that the system sometimes behaved well (after any deployments) and would degrade gradually afterward. This allowed us to choose two distinct timeframes: A, representing the immediate aftermath of a deployment, and B, coinciding with the peak latency period.

Datadog Profiler

Solution implementation

As there isn’t any tooling (just yet) that automatically fixes the issues it detects, this is where we needed to get creative and dig into the code ourselves to find the most likely cause, or at least anything that stood out.

Under the H (server.js:L#5) frame running our service, something caught our attention: trackPrivateForm (server.js#1). This was a keyword we could actually search for in our codebase, because the other files in the flame graph were autogenerated, which limited our debugging options.

Datadog Profiler — trackPrivateForm

We had finally localized the issue!

The function trackPrivateForm was taking a lot more time in sample B, the degraded “before deployment” timeframe, than in sample A.

After identifying the potential source of the problem, the next step was to examine the trackPrivateForm function. The culprit turned out to be the repeated instantiation of the RudderstackAnalytics client, and with it a new Axios instance, on every single call:

// imports and stuff

const trackConversationFailed = (
  reason: string,
  { account_id: legacyAccountId, form_uid: formUid }: eventDataProps
) => {

  // Something smells fishy in here...
  // A new RudderstackAnalytics client (and the Axios instance it relies on)
  // is created on every single call
  const rudderAnalytics = new RudderstackAnalytics(
    RUDDERSTACK_KEY,
    `${RUDDERSTACK_DATA_PLANE_URL}/v1/batch`
  )

  const eventData = {
    // payload
  }

  rudderAnalytics.track(eventData)
}

export const trackPrivateForm = (eventData: eventDataProps) => {
  trackConversationFailed('private', eventData)
}

So we moved the creation of the RudderstackAnalytics instance outside trackConversationFailed, ensuring it is created only once and reused by every call. This optimizes both performance and resource utilization:

// imports and stuff

// Created once at module load time and reused by every call
const rudderAnalytics = new RudderstackAnalytics(
  RUDDERSTACK_KEY,
  `${RUDDERSTACK_DATA_PLANE_URL}/v1/batch`
)

const trackConversationFailed = (
  reason: string,
  { account_id: legacyAccountId, form_uid: formUid }: eventDataProps
) => {
  const eventData = {
    // payload
  }

  rudderAnalytics.track(eventData)
}

export const trackPrivateForm = (eventData: eventDataProps) => {
  trackConversationFailed('private', eventData)
}

Whenever a conversation failed, we tracked that event with RudderStack, a middleware analytics tool, and the client instantiation lived inside each of those calls. This also explains why the latency build-up always took many hours to become noticeable: the event was only sent on such failures, for example when the connection had been reset by the client (e.g., a user’s browser navigating away from a form).
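As a side note on the design: if creating the client eagerly at module load time were ever undesirable, for instance because the code path is only hit on these relatively rare failures, a lazy singleton achieves the same “create once, reuse forever” behavior. A minimal sketch, not what we actually shipped:

// imports and stuff

let rudderAnalytics: RudderstackAnalytics | undefined

const getRudderAnalytics = () => {
  // Instantiate the client on first use only, then reuse it for every call
  if (!rudderAnalytics) {
    rudderAnalytics = new RudderstackAnalytics(
      RUDDERSTACK_KEY,
      `${RUDDERSTACK_DATA_PLANE_URL}/v1/batch`
    )
  }
  return rudderAnalytics
}

// inside trackConversationFailed:
// getRudderAnalytics().track(eventData)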

Below we can see the sawtooth pattern before the fix and, after the fix, no more relevant peaks.

Average request latency per endpoint deployment

Conclusion

This investigation was a fascinating journey into the nuances of system performance and the importance of thorough monitoring. By leveraging Datadog’s metrics, tracing, runtime metrics, and profiling tools, we were able to identify and address the root cause of the latency issue effectively.

The key takeaway from this experience is the significance of proactive technical management. Regular monitoring and detailed analysis not only help in identifying performance bottlenecks but also provide valuable insights for optimizing system performance. Addressing issues promptly ensures that we maintain high-quality service for our users, even under heavy traffic.

Ultimately, this experience reinforced the importance of continuously improving our systems and staying vigilant in monitoring their health. As we continue to evolve and update our services, the lessons learned from this investigation will guide us in maintaining robust and efficient systems.

👋 Arnau Guadall — Frontend Engineer at Typeform

This article is co-authored by Philipp Langer, who provided guidance, acted as an advisor throughout the process, and participated actively during the investigation and bug-solving phases.
