What is Sustainable Throughput?

Brian Taylor
Engineers @ Optimizely
Jul 29, 2022

Last time we learned that latency distributions are NEVER NORMAL. Now we begin the climb to discover, through measurement, what the latency distribution of our system actually is.

Correctly measuring latency is hard. But, overcoming this difficulty repays the investment: Discovering your system’s true latency distribution feels like graduating from debugging with print to using a proper debugger. I’ll be telling stories about how we did it at P99Conf this year.

The first step is finding your system’s sustainable throughput. This article is part 1 of the third law of latency.

The Laws

  1. There is no THE LATENCY of the system
  2. Latency distributions are NEVER NORMAL
  3. DON’T LIE when measuring latency (most tools do… and that’s not ok)
  4. DON’T LIE when presenting latency (most presentations do… and that’s not ok)
  5. You can’t capacity plan without a LATENCY REQUIREMENT
  6. You’re probably not asking for ENOUGH 9s.
Before we can measure anything, we need to distinguish three kinds of throughput:

  • maximum throughput — the throughput the system reaches in a flash-in-the-pan moment when its state happens to align perfectly for unusually high throughput. Most of the time our system is not capable of maximum throughput. If we drive the system at this level then, over time, our latency will be unbounded because of queue growth. This number is useless for our purposes.
  • sustainable throughput — the throughput that our system can sustain indefinitely without latency becoming unbounded. This number is always less than the maximum throughput.
  • target throughput — the real-world throughput we use for scaling decisions. We plan to divide the externally provided load among enough instances of our service so that each instance, on average, remains below its target throughput. We choose this value conservatively so that brief spikes in load or reductions in capacity do not cause us to violate our latency requirements. This value must be less than the sustainable throughput (see the sketch after this list).
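
As a rough illustration of how these numbers drive scaling decisions, here is a minimal sketch; the headroom fraction and load figures are invented for the example:

```python
import math

# All numbers below are hypothetical, for illustration only.
sustainable_throughput = 1200.0   # req/s one instance can sustain indefinitely
headroom = 0.30                   # spare capacity kept for spikes and lost instances
expected_peak_load = 15000.0      # req/s of externally provided load

# Target throughput must sit below sustainable throughput.
target_throughput = sustainable_throughput * (1.0 - headroom)   # 840 req/s

# Divide the external load among enough instances that each one,
# on average, stays below its target throughput.
instances = math.ceil(expected_peak_load / target_throughput)   # 18 instances

print(f"target throughput per instance: {target_throughput:.0f} req/s")
print(f"instances needed: {instances}")
```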

Our measurement goal: we want to find the throughput our system can handle indefinitely while meeting our latency requirements.

“But, I don’t have a latency requirement.”

We all have a requirement on maximum latency; at the very least, you have the requirement “max latency < infinity”. This requirement is surprisingly useful. Another way to say it is “latency must be bounded.” We don’t necessarily need to know what the bound is, but it’s critical to do our latency measurements at a point where a bound exists.

To do this, we must find the constant throughput that can be reliably maintained by the system without its latency becoming unbounded. This is the sustainable throughput. If a system is driven above its sustainable throughput, it will either fail requests or see its latency grow without bound.

Figure: a system I maintain being driven far above its sustainable throughput. In this state, latency is proportional to how long the experiment has been running. If we run our system for infinite time at this load, our latency will become infinite.

If latency is increasing over the life of your test, then your load is greater than your system’s capacity.
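
One way to spot this in a load test is to check whether latency trends upward over the run. A minimal sketch, assuming you have collected (seconds since test start, latency) samples from a constant-throughput run; the slope threshold is an arbitrary choice:

```python
def latency_is_growing(samples, slope_threshold=0.001):
    """samples: list of (seconds_since_start, latency_seconds) pairs.

    Returns True when latency trends upward over the life of the test,
    which means the applied load is above the sustainable throughput.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    # Least-squares slope of latency versus time
    # (seconds of latency gained per second of test).
    cov = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var > slope_threshold
```

If it returns True for a constant-throughput run, the load was above the system’s capacity and the run tells you nothing about latency within the sustainable regime.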

“I don’t have unbounded queues. When I’m over capacity, I load shed!”

This is a perfectly valid engineering tradeoff. But it doesn’t eliminate the queue; it merely shifts the queue to the user (otherwise the user is losing data). For this article series we’re going to pretend the queues are only server side, so that we can observe them and talk about them more clearly. If a system load sheds, then both queue growth and overload errors are indicators that the system is operating outside its sustainable regime.

“Are you saying that queues are bad?”

Fundamentally, non-blocking queues trade increased latency for increased availability. When a downstream resource is over capacity, a queue lets us wait for capacity instead of failing the request. When upstream demand spikes, queues let systems absorb that spike by increasing processing latency instead of by failing the requests.

This has a startling implication: under constant load, the latency of a system running within its capacity should not be a significant function of load.

Said another way: the measured latency at any throughput should be approximately the same, as long as the applied throughput is within the system’s capacity (the toy simulation after this list makes this concrete). This is because:

  1. Throughput dependent latency is always caused by queue growth
  2. Queues should not be growing if the system is running within its capacity.
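
Here is a toy single-server queue simulation that makes those two points concrete. This is just an illustrative sketch, not the author’s measurement tooling: it models a deterministic server with fixed capacity and ignores service-time variance.

```python
def end_of_run_backlog(arrival_rate, service_rate, seconds=300):
    """Toy discrete-time queue under constant load.
    Returns the backlog (seconds of queued work) after `seconds` of running."""
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrival_rate / service_rate   # work arriving this second
        backlog = max(0.0, backlog - 1.0)        # one second of service capacity
    return backlog

capacity = 1000.0  # requests/second the server can process

for load in (200.0, 500.0, 900.0, 1200.0):
    backlog = end_of_run_backlog(load, capacity)
    print(f"{load:>6.0f} req/s -> queued work after 5 min: {backlog:5.1f} s")

# At every load below 1000 req/s the queue never grows, so the queueing
# contribution to latency is the same (~zero) regardless of load.
# At 1200 req/s the backlog grows by 0.2 s every second: latency is now a
# function of how long the test has been running, not just of the load.
```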

“My dashboard clearly shows latency increasing as a function of load!”

The real world doesn’t exert constant load. It sends clumps of requests followed by gaps. Queues let us absorb the clumps and then finish processing them in the gaps. During a clump the system is briefly over capacity, so queues grow and latency really is a function of throughput. In the real world, we capacity plan with headroom so that we can tolerate being briefly driven above our sustainable throughput and then catch up in the gaps, all while staying within our latency requirements. The queues are not giving us extra capacity. They are letting us trade latency for availability, and with enough headroom we can make that trade briefly without violating our latency requirements.
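
The same toy queue, driven with clumps and gaps instead of constant load, shows that trade in action (again just an illustrative sketch; the burst sizes are invented): during a clump the backlog, and therefore latency, climbs; during the following gap the headroom drains the queue and latency recovers.

```python
def bursty_backlog(capacity=1000.0, base_load=700.0, clump_load=1500.0,
                   clump_every=60, clump_length=10, seconds=300):
    """Toy queue under clumpy load. Yields (second, seconds of queued work)."""
    backlog = 0.0
    for t in range(seconds):
        load = clump_load if (t % clump_every) < clump_length else base_load
        backlog += load / capacity           # work arriving this second
        backlog = max(0.0, backlog - 1.0)    # one second of service capacity
        yield t, backlog

# Average load is ~833 req/s, comfortably below the 1000 req/s capacity.
# During each 10-second clump the backlog climbs by ~0.5 s/s; during the
# 50-second gap the 300 req/s of headroom drains it again, so latency
# spikes briefly and recovers instead of growing without bound.
print(f"worst queued work seen: {max(b for _, b in bursty_backlog()):.1f} s")
```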

“I’m driving my system as hard as I can and I’m seeing occasional latency spikes but not infinitely growing queues.”

Your load test is suffering from coordinated omission. We’ll talk about that next time.

Sustainable throughput is the throughput our system can handle indefinitely without latency becoming unbounded. It is an upper bound on the throughput we will select as our target throughput. Next time we will talk about how to measure this.
