How profiling could forecast latency impact of infrastructure downsizing

Save money with limited performance impact

--

Profiling is first and foremost associated with performance optimization. However, another use case for profiling is cost saving. This is something we’ve been exploring at work recently.

Let’s say you have a backend service for some API endpoints. You allocate some number of CPU cores to this service.

3 things can happen:

  1. You’re overprovisioned. You requested 10 cores, but the maximum number of cores you observe your code using concurrently is 5. In this case, you’re likely overprovisioned by 5 cores. 50% of the money you spend is “wasted.”
  2. You’re underprovisioned. You set a request of 10 cores, and you’re consistently saturating your CPU and observing high latency and timeouts.
  3. In between the other two. For simplicity, let’s call this scenario “well-provisioned.” Your observed p9x (p95, p99, p99.9, etc.) CPU usage is near the requested core count.

If you observe your service to be overprovisioned by X percent, how confident can you be that downsizing by X percent will have limited impact on latency?

This is the question profiling can help you answer.

The rest of this article is split into 3 parts:

Part 1: Cost saving without profiling and its limitations.

Part 2: An overview of how CPU profiling works.

Part 3: How profiling can help forecast latency impact of downsizing.

Note: I will be using CPU as the example metric throughout. For simplicity, let’s say this service is constrained by CPU, not memory or I/O.

Part 1: Cost saving without profiling and its limitations

The way to save cost on infra is to use less infra. If you want to cut down on CPU cost, you have to downsize your machine to run on fewer CPU cores. But how do you evaluate the performance impact of downsizing a service?

You can get infra metrics like CPU usage from the machine itself without profiling. For example, the Linux kernel exposes interfaces such as /proc/stat for system-level metrics like the machine’s CPU usage. With these alone, you can get a good idea of your average and p9x CPU usage over time, and therefore how well-provisioned your system is.
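As a minimal sketch (in Python, assuming a Linux machine), here is how machine-level CPU utilization could be computed from /proc/stat. It is deliberately simplified and ignores fields like iowait and steal:

```python
import time

def read_cpu_ticks():
    # The first line of /proc/stat aggregates all cores:
    # "cpu  user nice system idle iowait irq softirq steal ..."
    # (values are cumulative clock ticks since boot)
    with open("/proc/stat") as f:
        fields = f.readline().split()
    user, nice, system, idle = (int(x) for x in fields[1:5])
    busy = user + nice + system
    return busy, busy + idle

# Utilization over a 1-second window: delta of busy ticks / delta of total ticks.
busy0, total0 = read_cpu_ticks()
time.sleep(1.0)
busy1, total1 = read_cpu_ticks()
print("CPU utilization: {:.1%}".format((busy1 - busy0) / (total1 - total0)))
```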

However, infra metrics are limited, and they might not capture the reality of your service’s CPU usage. This is especially true for services with sporadic, short-lived bursts of high activity. The reason is the resolution of the data.

Many observability systems are set up to query infra metrics every few seconds to every minute. Data from a continuous profiler, by contrast, has a resolution of roughly 10ms without much overhead, and can therefore better capture sporadic, high-intensity activity.

[Figure: High resolution vs. low resolution data]

Without capturing those occasional bursts of activity, you cannot guarantee that your service won’t be throttled after downsizing.

The chart above shows how, if you query infra metrics every 10 seconds, you will miss all of the occasional bursts of CPU activity. This means infra metrics alone will not give you an accurate estimate of the latency impact.
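To make this concrete, here is a toy simulation (all numbers made up): a service that idles at half a core but occasionally bursts to 8 cores for 100ms. Polling the series every 10 seconds will usually miss every burst, while polling at 10ms resolution catches them:

```python
import random

random.seed(42)
MS = 10 * 60 * 1000  # a 10-minute window, one sample per millisecond
usage = [0.5] * MS   # baseline: half a core in use

# Scatter twenty 100ms bursts of 8-core activity across the window.
for _ in range(20):
    start = random.randrange(MS - 100)
    for i in range(start, start + 100):
        usage[i] = 8.0

def observed_max(series, interval_ms):
    """Max usage seen when polling the series once every `interval_ms`."""
    return max(series[i] for i in range(0, len(series), interval_ms))

print("max @ 10s polling :", observed_max(usage, 10_000))  # likely 0.5: bursts missed
print("max @ 10ms polling:", observed_max(usage, 10))      # 8.0: bursts captured
```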

If your CPU utilization is only 10–20% according to infra metrics polled at a 10-second interval, then you can perhaps more safely argue for downsizing without the need for profiling. The value of profiling becomes more apparent as you use a higher percentage of your CPU.

Even in the well-provisioned scenario, profiling can better forecast the downsizing vs. latency tradeoff, enabling developers to make an informed decision on when to save and when not to.

Thus, to summarize: the value profiling can provide is a more accurate forecast of cost vs. performance tradeoffs.

Part 2: How CPU profiling works

At a high level, a continuous profiler collects CPU-related events, among other types of events. From these events, you can estimate how many cores are running concurrently at any given time.

Every runtime works slightly differently; for simplicity, I will describe how Java works. In Java, the profiler by default collects an event for every 10ms of CPU time per thread. Note that CPU time is different from wall time.

  • CPU Time: A function’s CPU time is the time it spends executing on the CPU itself.
  • Wall Time: A function’s wall time is the time it takes to execute from start to end, as measured by a wall clock.

[Figure: 90 seconds of CPU time for 60 seconds of wall time]
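The distinction is easy to demonstrate. In a minimal Python sketch, time.process_time() measures CPU time while time.perf_counter() measures wall time:

```python
import time

wall_start = time.perf_counter()   # wall time: elapsed clock time
cpu_start = time.process_time()    # CPU time: time spent executing on a core

total = sum(i * i for i in range(5_000_000))  # busy work: burns CPU time
time.sleep(1.0)                               # idle wait: burns wall time only

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"wall time: {wall:.2f}s, CPU time: {cpu:.2f}s")

# The sleep adds ~1s of wall time but almost no CPU time. Conversely, a
# process running hot on several threads can accumulate more CPU time than
# wall time, like the 90s of CPU time over 60s of wall time pictured above.
```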

The CPU-related event is tagged with the timestamp when the event is emitted. This is the end time of the event. We estimate the start time to be the event timestamp minus 10ms (because we emit an event for every 10ms of CPU time).
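In code, this start-time estimation is a one-liner. A minimal sketch, assuming each event only carries its emission timestamp:

```python
QUANTUM_MS = 10  # one event is emitted per 10ms of CPU time per thread

def to_interval(event_end_ms: int) -> tuple[int, int]:
    """Estimated [start, end) window of CPU activity one event represents."""
    return (event_end_ms - QUANTUM_MS, event_end_ms)
```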

The CPU-related events are what will help estimate the latency impact of downsizing. Let’s see how!

Part 3: How profiling forecasts latency impact of downsizing

From the CPU-related events, you build CPU utilization metrics. From these utilization metrics, you can estimate the latency impact of downsizing.

Step 1: Compute CPU utilization metrics

The CPU-related events from the profiler give you an idea of how saturated your CPUs are, and therefore how close you are to hitting your CPU core allocation.

If your profiles are 60 seconds long, then all the events collected within those 60 seconds are combined to form a single profile.

For each profile, you can build a histogram like the one below using the events that make up that profile.

[Figure: CPU utilization metrics]

In the above example of a single CPU profile, you have 8 seconds when 1 core was active, 6 seconds when 2 cores were active, 3 seconds when 3 cores were active, etc. This also means that for approximately 40 seconds (of a 60-second profile) your CPU was idle.

These are your utilization metrics: i.e. how saturated your CPU cores are and for how long.
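Here is a minimal sketch of how such a histogram could be computed from the per-event intervals derived in Part 2, using a sweep line over interval start/end points (the function names are mine, not any particular profiler’s):

```python
from collections import Counter

def utilization_histogram(intervals, window_start_ms, window_end_ms):
    """Seconds spent at each concurrency level within one profile window.

    `intervals` is a list of (start_ms, end_ms) pairs, one per CPU event,
    as produced by to_interval() above.
    """
    # Sweep line: +1 active core at each interval start, -1 at each end.
    deltas = []
    for start, end in intervals:
        deltas.append((max(start, window_start_ms), +1))
        deltas.append((min(end, window_end_ms), -1))
    deltas.sort()  # ties sort -1 before +1, so touching intervals don't overlap

    hist_ms = Counter()  # concurrency level -> milliseconds spent at that level
    active = 0
    prev_ts = window_start_ms
    for ts, delta in deltas:
        hist_ms[active] += ts - prev_ts
        active += delta
        prev_ts = ts
    hist_ms[active] += window_end_ms - prev_ts  # tail of the window
    return {cores: ms / 1000 for cores, ms in hist_ms.items()}
```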

Step 2: Estimate latency impact from utilization metrics

For each profile, you have 1 set of utilization metrics. If you want to forecast the latency impact of downsizing from N cores to Y cores, you need to look at many utilization metrics across many profiles.

You can combine some or all of the utilization metrics collected during a specific time period (e.g. the past 24 hours) using some kind of aggregation method, such as sum, average, or median.
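As a sketch, a time-weighted (sum-based) aggregation could look like this:

```python
from collections import Counter

def combined_distribution(histograms):
    """Combine per-profile histograms (seconds at each core count) into one
    distribution: core count -> fraction of total observed time."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    grand_total = sum(total.values())
    return {cores: secs / grand_total for cores, secs in total.items()}
```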

From utilization metrics, you can compute something like the following:

  • 75% of the time, your service was using 1 core.
  • 21% of the time, your service was using 2 cores.
  • 3% of the time, your service was using 3 cores.
  • 1% of the time, your service was using 4 cores.
  • 0% of the time, your service was using 5 cores.

In this example, if you downsize from 5 cores to 4, you wouldn’t see a latency impact. If you downsize from 5 cores to 3, then 1% of the time your service would be impacted. The exact method to estimate the latency increase from this 1% impact area is an implementation detail.
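A minimal sketch of that first step, reading the impacted fraction of time straight off the distribution above:

```python
def impacted_fraction(distribution, target_cores):
    """Fraction of time demand exceeded `target_cores`, i.e. the time the
    downsized service would have been throttled."""
    return sum(frac for cores, frac in distribution.items()
               if cores > target_cores)

dist = {1: 0.75, 2: 0.21, 3: 0.03, 4: 0.01, 5: 0.00}
print(impacted_fraction(dist, 4))  # 0.0  -> downsizing 5 -> 4 looks safe
print(impacted_fraction(dist, 3))  # 0.01 -> throttled ~1% of the time
```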

If you also have tracing, which measures endpoint latency, you can combine the CPU utilization metrics from profiling with tracing data to forecast the per-endpoint latency impact of downsizing.

[Figure: High level process of estimating latency]

Tracers work by creating a span at the start of the endpoint execution and closing the span at the end. When a CPU-related event gets captured, the profiler can interact with the tracer and annotate the event with relevant metadata from the active span (e.g. trace id, span id).

From these annotated CPU-related events, you go through the same process of computing the number of cores active at any given time, and then derive utilization metrics scoped to that endpoint.
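A minimal sketch of the per-endpoint scoping, assuming each annotated event carries the endpoint of its active span (the event shape here is hypothetical):

```python
from collections import defaultdict

QUANTUM_MS = 10

def intervals_by_endpoint(events):
    """Group estimated CPU-time intervals by the endpoint whose span was
    active when each event was emitted.

    `events` is a list of dicts like {"end_ms": 1234, "endpoint": "/foobar"}.
    """
    grouped = defaultdict(list)
    for event in events:
        interval = (event["end_ms"] - QUANTUM_MS, event["end_ms"])
        grouped[event["endpoint"]].append(interval)
    return grouped

# Each per-endpoint interval list can then be fed through
# utilization_histogram() from Step 1 to get endpoint-scoped metrics.
```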

In other words, you build the same bar chart as the one shown above, for each endpoint:

[Figure: CPU utilization metric for endpoint /foobar]

Once you have the utilization metrics for each endpoint, you can estimate the per-endpoint latency impact of downsizing from N cores to Y cores, as outlined before.

Conclusion

Even though profiling is most often associated with performance optimization, it can have many other use cases too, including cost optimization.

High-resolution profiling data can give you a more accurate forecast of the performance impact of downsizing. The goal is to build a chart like this:

[Figure: Profiling can estimate cores needed at X% throttle]

You’re requesting a certain number of cores, but your data tells you that you can downsize by N cores with only X% (e.g. 0.1%) throttling.

This methodology of using high resolution profiling data to forecast overall and per-endpoint latency impact is something we’ve been exploring at work. I’m excited to see where it leads us.
