On Scalability, Capacity, and Sensitivity

Often, when one hears the word “Scalable” used in a context, it is used in an informal way. It is a fuzzy term, used by people to say that their system can handle the future. As if one has divine knowledge of what happens.

What is often meant is how the system handles additional load, for some measure of load. That is, if a company is succesful, and it gains 10 times as many users as now, the system the company uses to run their business keeps operating with next to no perceivable drop in behavior. It is simply business as usual.

There are two ways to achieve this design goal: either design your system for handling the additional load vertically, or design your system so it can be copied horizontally. Scaling along the vertical axis amounts to buying more capable hardware and improving the software so it uses less resources. Scaling along the horizontal axis requires the system to interoperate in a distributed setting, where more than a single physical machine collaborates on handling the load.

But how does one really become scalable in the setting of the real world? As with so many others things in computer software, it is a property of your system architecture, not your programming language choice. That is, the limit is defined by how you constructed the system in the first place, and how you put together your technology stack. People may look down on PHP, but Facebook managed to use it in a way that scaled well, by coupling it with other solutions in the architecture.


One key indicator of a system is its capacity. This measure defines how much simultaneous load your system can cope with. A telephony switch can be snapshotted in time and we can measure how many phone-call-channels are currently in use. This is the load in this setting, maybe set at 70 simultaneous calls. The capacity of the system is how many calls the system can ultimately handle. Maybe it has been certified up to 500 simultaneous calls, but under an emergency accident it has managed to handle 1243 calls. The number 500 is called the engineering capacity whereas 1243 is the peak capacity. For normal operation, you want to be within the engineered capacity of the system. But emergencies do happen and you get close to the peak of the system.

If load increases beyond the peak capacity we say we have an overload situation. How a system operates at the peak capacity and under overload is important for gauging how a system scales, vertically. If the system breaks when it is overloaded, it is likely the system was not built to withstand additional load. And usually the problems start showing up at lesser loads.

Two things often contribute negatively to the load one can handle:

If a data structure or algorithm operates on size n and n is small, the complexity of accessing elements in the data structure, or carrying out the algorithm, may appear to be O(1). But in practice, as n grows, the real complexity class will rear it’s ugly head. It may be O(n) or worse. In modern architectures the actual behavior is further perturbed by the presence of cache hierarchies and lock contention. As load increases, you may end up being pushed out of caches, or you may end up contending on a lock everyone wants access to. In a system nearing its engineering or peak capacity, it will be easier to discern how the algorithms or data structures operate.

The other case has to do with outliers in your data. Maybe the average number of zots a user of your system has is 5. But there is this one user, who has managed to amass 60000 zots. This user, and their zot-abundant siblings, are going to wreak havoc in your system. Your system will try to cope with them and in turn use lots of resources doing so. Some times crashes will occur, even though the system doesn’t appear to be especially loaded in the first place.

If you want a scalable system, you need predictability. In turn, it may not be that you want the fastest algorithm in the common case, if it proves to have bad behavior when the zot-abundant enters. It may be that you want to pick an algorithm that is stable, no matter what happens over one which is blindingly fast. A naive hash-table implementation can degrade if many collisions occur in ways a self-balancing binary tree cannot, for instance. But access in the binary tree is more expensive due to the additional need for key compares and memory reads. The naive hash table may also double its size when it runs out of space due to density. The binary tree uses more memory due to the high number of pointers, but on the other hand, its resource usage is more linear, allocating new internal nodes as it needs them in small sizes.

Erlang (the unit)

In queueing theory, an Erlang is defined as E = λ·h.

The λ defines a “call arrival rate” over some unit of time: web server requests per second, telephony calls per minute, and so on. The ‘h’ defines the average call-holding-time in the same unit as λ. So for web server requests where the average request time is 30ms, h = 30/1000. For the telephony example, the average call may be 146 seconds, so h = 146/60.

The Erlang unit, E, measures slightly different things. One way to interpret it is as “instantaneous traffic”, in which we snapshot the system at an instant in time, and measure the amount of simultaneous work. Another way to interpret the unit is that it measures “carried traffic” which is the average number of simultaneous work over a reasonably time period, for instance 15 minutes, one hour, or 1 minute, depending on what it is used for.

The formula E = λ·h defines the relationship between the arrival rate and the holding time. For example, if our system processes 50 reqs/s and each request takes 60ms on average, the system is loaded at 50 · (60/1000) = 3 erlang. If we can improve the system such that the average request takes 45ms, then the load is 50 · (45/1000) = 2.25 erlang.

Aside: One might see that this formula is essentially identical to Little’s Law. I don’t know the exact relationship here, and my searches for how they relate turns up bleak. The remarkable thing about the theorem of Little is that we don’t need to know the distrubtion of the call arrival rate, nor the distribution of the service holding time. Both things also applies to the above relationship. End of aside.


The notion of load in a system is a way to see how much work it is processing. The problem is when load increases to the point where the system hits capacity and then breaks down. If one request requires full use of one processor core for its duration, then a modern Core i7 CPU hits its peak capacity at 8 erlang[0].

As we pack more requests vertically on one system, we make it more sensitive to fluctuations in holding time.

Imagine we have two systems A and B. In system A, we have 150 reqs/s at 45ms per request. This gives 6.75 erlang. System B is 5 times faster, completing each request in 9ms on average, so we have decided to load B with 400 reqs/s instead. It is thus loaded with 3.60 erlang.

Now, we suddenly hit a hiccup which makes the systems misbehave. Both sees 20ms increased holding time for systems A and B. This means 65ms and 29ms respectively. And now, system A has a load of 9.75 whereas system B has a load of 11.6.

What we see is that the tighter packing of requests on system B means it is more sensitive to small fluctuations in holding time. The tighter we pack requests, the worse the load when we let the holding time fluctuate. In other words, systems with high arrival rates doesn’t cope well with holding time fluctuations.

Wrapping up

One may wonder if we should always celebrate systems able to process a higher number of requests per second. The added onus of operating such systems is that we need to optimize every code path, or the system topples for outliers, or at system capacity. That is, the price of operating a system may be far higher than we think, simply by the notion of additional optimization needs.

In addition, we may want to celebrate a system having high engineering and peak capacities. Such a system will generally be far more predictable under load that systems with lower capacity limits.

Finally, the simply notion of increasing capacity by going faster isn’t always the right solution. As we have seen, such solutions make the system more sensitive.

The key insight is that in the question of scaling vertically or horizontally, you need to understand what capacity limits your system can cope with, realistically. And in practice, it may be necessary to avoid loading systems to much vertically in order to achieve correct operation when you strike gold when slaving away in the venture capitalist mine.

[0] Perhaps the peak capacity is less, due to the SMT of modern CPUs.