The Birth of the Performance Lab at Spring Health

Moving Fast and (not) Breaking Things

Rob Durst
Spring Health Engineering
7 min read · Sep 27, 2022


“Building the plane while flying is easier said than done, especially when you’re used to working on a Boeing 737 versus a stealth fighter jet.” — Anonymous

Coming from a much larger, more mature engineering organization, I had grown accustomed to the luxuries of big tech development processes: comprehensive test suites in the CI pipeline, iterative environment promotion en route to production… safety checks and hand-holding nearly all the way up.

While fairly restrictive, it made shooting yourself in the foot (more) challenging. On top of that, our product was established. New features were typically less risky and developed with timelines an order of magnitude longer. It was the proverbial Boeing 737.

Spring Health is, in many ways, the opposite.

Moving fast to save lives

We are a hyper-growth late-stage startup seeking to define ourselves and the entire mental healthcare industry. One of our core values is “move fast to save lives,” and we adhere to this value.

As a member of the platform team, my job is to keep our plane in the air — while the product teams continue to mold the plane into what our customers want.

A key component of my team’s charter is enabling our engineering organization to ship with confidence. A particularly interesting facet of this work is the development of a performance lab to help us answer questions in a safe, controlled, and isolated setting.

As Spring Health matures, our scaling footprint must grow. By developing a performance lab, we can begin conducting experiments and exploring the answers to questions such as:

  • At what load will our authentication service become overwhelmed?
  • If we double our traffic, what will break first?
  • If we turn on auto-scaling, can we trust it to behave as we expect?

The Performance Lab at Spring Health

Spring Health’s performance lab is an isolated, seeded, production-scale environment, consisting primarily of our backend services. We have chosen to mock out most of our third-party dependencies to improve the determinism, and thus reproducibility, of our experiments. To generate load, we have reproduced the most common user flows to realistically emulate everyday traffic.
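A mocked dependency can be as simple as a tiny service that always answers with the same canned response. Here is a minimal TypeScript sketch of that idea; the endpoint and payload are hypothetical, not one of our real integrations:

```typescript
// Illustrative stub for a third-party dependency (hypothetical endpoint/payload).
import { createServer } from "node:http";

const server = createServer((req, res) => {
  // Always return the same payload so repeated runs see identical behavior.
  if (req.url?.startsWith("/v1/availability")) {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(
      JSON.stringify({ providerId: "demo-provider", slots: ["2022-09-27T15:00:00Z"] })
    );
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(8080, () => console.log("third-party stub listening on :8080"));
```

Because the stub always answers the same way, a run’s behavior depends only on our own services and the seeded data.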

Ultimately, Spring Health’s main objective is to enable members to schedule appointments with a well-matched care provider — quickly, efficiently, and with as little friction as possible. Thus, when designing user flows for performance testing, it made sense to focus almost exclusively on the hot code paths around appointment scheduling.

Applying the Pareto principle, we found that the following three flows give us ~80% coverage of our traffic’s code paths:

  1. Member browses for an appointment
  2. Member schedules an appointment
  3. Async workflows complete to ensure that all subsystems around appointments reach eventual consistency

We chose to use K6 as our load test runner due to positive past experiences, its ease of use, and its integration with our APM. Below is a snapshot highlighting some of the information from K6, which surfaces in our performance investigation dashboard. It’s incredibly helpful to have this data side-by-side with our other metrics.
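To give a sense of what these flows look like as load scripts, here is a minimal, purely illustrative K6 sketch of the browse-and-schedule flow. The endpoints, payloads, stages, and thresholds are hypothetical, not our real API surface:

```typescript
// Illustrative K6 script covering flows 1 and 2 (hypothetical endpoints).
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 100 }, // ramp up to 100 virtual users
    { duration: "10m", target: 100 }, // hold steady-state load
    { duration: "2m", target: 0 }, // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<1000"], // fail the run if p95 exceeds 1s
  },
};

export default function () {
  // 1. Member browses for an appointment
  const browse = http.get("https://perf-lab.example.com/api/appointments/availability");
  check(browse, { "browse succeeded": (r) => r.status === 200 });

  // 2. Member schedules an appointment
  const schedule = http.post(
    "https://perf-lab.example.com/api/appointments",
    JSON.stringify({ providerId: "demo-provider", slot: "2022-09-27T15:00:00Z" }),
    { headers: { "Content-Type": "application/json" } }
  );
  check(schedule, { "schedule succeeded": (r) => r.status === 201 });

  sleep(1); // think time between iterations
}
```

The stages and thresholds would be tuned per experiment; the point is simply that each flow is an ordinary scripted request sequence whose metrics flow into the dashboard alongside our other data.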

Furthermore, while we have a templated dashboard of general, high-level metrics most pertinent to performance investigations, we can also investigate specific subsystems after the fact since many additional metrics, logs, etc. are collected during a run.

To better understand how we use our performance lab, let’s dive into an example.

Authentication Service Performance Woes

When conducting performance investigations, we tend to model our approach after the scientific method, a six-step empirical method of acquiring knowledge.

The Scientific Method (Image Source)

We’ll borrow this structure to demonstrate a significant performance win from a few months back. Note the cyclical design of the method — this is not an accident. In our investigation here, it took us a couple of attempts to achieve the desired outcome.

Observation

During early runs in the performance lab, it became immediately apparent that our authentication service’s database had hotspots that could result in scalability issues. Applying a fairly light workload to our system, we observed the following:

While the CPU limit in that graph is not accurate, the TL;DR is that we observed saturation of the database CPU way before any other service even broke a sweat.

Research

Having identified the database’s CPU as the area of interest, we opened up our APM and began zeroing in on the hot code path: token validation.

Analyzing the source code, we identified a couple of potential improvements:

  1. We improved our query to fetch the token, reducing the auxiliary SQL calls that resulted from how the ORM constructed its queries
  2. We removed an unnecessary call to update token metadata

Thus, we reduced the number of database calls from over three down to just one.
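In node-postgres-flavored TypeScript, the shape of the change looks roughly like the sketch below. This is purely illustrative: our service goes through an ORM, and the table and column names here are hypothetical.

```typescript
// Illustrative before/after of collapsing token validation into one query.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Before: multiple round trips per token validation.
async function validateTokenBefore(tokenValue: string) {
  const token = await pool.query("SELECT * FROM access_tokens WHERE value = $1", [tokenValue]);
  const user = await pool.query("SELECT * FROM users WHERE id = $1", [token.rows[0].user_id]);
  // Metadata write on every request (later found to be unnecessary).
  await pool.query("UPDATE access_tokens SET last_used_at = now() WHERE value = $1", [tokenValue]);
  return { token: token.rows[0], user: user.rows[0] };
}

// After: a single query joins the data we need, and the metadata update is gone.
async function validateTokenAfter(tokenValue: string) {
  const result = await pool.query(
    `SELECT t.*, u.email
       FROM access_tokens t
       JOIN users u ON u.id = t.user_id
      WHERE t.value = $1`,
    [tokenValue]
  );
  return result.rows[0];
}
```

Roughly speaking, the join collapses the auxiliary reads into one statement and removing the metadata update eliminates the write, which is how more than three calls become one.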

Hypothesis: Attempt 1 (Reduction of SQL Calls)

With fewer database calls, we assumed we’d observe reduced strain on the database’s CPU. Furthermore, we expected increased TPS for the same workload as a side effect of eliminating this bottleneck.

Test: Attempt 1 (Reduction of SQL Calls)

For this scenario, we wanted to perform an A/B comparison, or regression test. First, we deployed the latest production build and confirmed that we could reproduce the observed CPU saturation.

Then, we reseeded the environment and deployed the production build with our changes. Executing the load, we observed the following (we ran the two scenarios back-to-back).

Note: we had executed the baseline first, but far enough in advance that we decided to re-execute it against the latest production bits. Thus, for this test, we actually started with version B (our changes) and then ran version A (the baseline).

For starters, we did see the decrease in SQL calls, as expected:

A closer look at which specific database calls saw a reduction in load aligned with our expectations, so the code was functioning properly. It’s always good to sanity-check your results first.

In terms of performance changes, we observed a notable reduction in overall p50 and p95 statement execution latency:

Analysis: Attempt 1 (Reduction of SQL Calls)

While the database statement execution latency decreased (a nice win), resource usage and TPS were fairly consistent between the two versions. Even though the average request time decreased from 587ms to 508ms, a ~13.5% improvement, that did not seem to surface in a noticeable way from the vantage point of the overall call at the authentication service level.

Furthermore, we observed a similar saturation of the database CPU. Back to the research phase!

More Research

This time, we stepped back and considered what we might have missed the last time around. Opening up our APM again, we began by looking at a trace. Here, we discovered that each call to the database consistently spent more than 50ms instantiating a database connection:

Research yielded two promising paths to investigate:

  1. PGBouncer
  2. Persistent database connections

While one group experimented with PGBouncer, a few others and I looked into enabling persistent database connections.

Hypothesis: Attempt 2 (Persistent Database Connections)

Following our research, we hypothesized that configuring persistent database connections in our auth service might lead to a substantial performance win: instantiating a new database connection on every call is costly and inefficient.
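To make the idea concrete, here is a purely illustrative TypeScript sketch of the difference between opening a fresh connection per call and reusing persistent, pooled connections. Our service enables persistence at the database driver level rather than in application code like this, so treat it strictly as an illustration:

```typescript
// Conceptual sketch: fresh connection per call vs. reused persistent connections.
import { Client, Pool } from "pg";

// Before: a brand-new connection is established (TCP, TLS, auth handshake)
// on every request, paying tens of milliseconds of setup each time.
async function queryWithFreshConnection(sql: string) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect(); // connection setup cost incurred on every call
  try {
    return await client.query(sql);
  } finally {
    await client.end();
  }
}

// After: connections are created once and reused across requests.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

async function queryWithPersistentConnection(sql: string) {
  return pool.query(sql); // checks out an already-established connection
}
```

With connections established ahead of time, the per-request cost drops to a pool checkout, which is why we expected the 50ms-per-call connection overhead to largely disappear.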

We believed we would observe something like the following:

Test: Attempt 2 (Persistent Database Connections)

Executing the load test following the same approach as attempt one (albeit this time remembering to execute the baseline first), we observed the following:

Analysis: Attempt 2 (Persistent Database Connections)

Our hypothesis was spot on! To ensure the system was functioning as we expected, we did a side-by-side comparison of the two. Of note, the PDO construction time decreased by an order of magnitude:

Report

After analyzing the data, we created a short write-up with graphs and data from our test runs linked in the document, and attached it to the ensuing pull request. Once reviewed, we shipped with confidence to production.

Here is the database CPU from our production environment:

The line in the graph above demarcates the time of deployment. A nice win for Spring Health Engineering!

The astute reader may be wondering: what about PGBouncer? In the end, we did not pursue the PGBouncer route past the POC developed during this investigation. However, we documented our learnings, which may prove useful sometime down the road.

Conclusion

The performance lab at Spring Health is still in the very early stages. However, we’ve already amassed a respectable list of wins. Day by day, it continues to become more useful as more teams become familiar with it.

If you find this type of work interesting, check out our open positions. We’re hiring for several different engineering roles, and we’d love to speak with you.
