Rethinking When to Scale Your Servers

If I asked you to describe this photo, one of the words you might use is ‘traffic’. But how do you know it’s traffic? I’m going to assume your brain saw a bunch of cars packed close together on the freeway and associated that with traffic, but that can’t be what defines traffic. Traffic must be defined by how fast the cars are moving relative to the speed limit of the road they’re on. So without knowing the speed of the cars, we can’t really know whether this is traffic. If the cars were all moving at 65 mph and the speed limit was 65 mph, I don’t think we would call this traffic. You wouldn’t tell your friend you were stuck in traffic if you were driving at 65 mph; you’d only say you were in traffic if you were involuntarily cruising at a speed significantly lower than the speed limit.

You might be thinking ‘so what?’. Well, think about how most of us scale our servers today: we spin up a new instance once CPU usage reaches a certain threshold. AWS, for example, lets you automatically add an instance once you hit something like 70% CPU usage. And that, IMO, is the wrong way to scale.
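To be concrete, the status quo looks something like this. Here’s a rough sketch using boto3 that sets up the kind of 70% CPU target-tracking policy I mean; the group and policy names are placeholders, and I’m going from memory on the exact parameters.

```python
import boto3

# The usual approach: let AWS add instances once average CPU crosses ~70%.
# 'my-web-asg' is a placeholder Auto Scaling group name.
autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-web-asg",
    PolicyName="cpu-70-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,  # scale out as average CPU approaches 70%
    },
)
```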

Imagine your CPU usage was at 90% but all of your incoming requests were coming back within a time you deemed excellent. In that case you might still want to scale, since you’re about to hit 100% usage. But now imagine you were only using 10% of your CPU and were responding to users in a time you deemed terrible; a CPU threshold would never fire, even though your users are suffering. Relying on CPU usage alone to decide when to scale is the wrong approach, and it leaves your users unsatisfied.

So what’s the solution? I propose that we calculate a percentage of traffic and scale once that percentage becomes unacceptable. By traffic I don’t mean the number of requests but the speed of the request-response lifecycle.

On a recent trip to Big Sur I was wondering how Google calculates how bad traffic on a road is. Here’s the solution I came up with:

In an ideal world, every car would have its own lane and be able to drive the speed limit or exceed it. Similarly, in an ideal world, every web connection would get its own server, hampered by nothing but the algorithms and the hardware. But resources are limited, so we have to decide when to scale. And since CPU usage only measures throughput, not latency, we should scale when the percentage of traffic, which I’m defining in terms of the time of the request-response lifecycle, exceeds the level we’ve decided is acceptable.
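To put a number on the road version first, here’s my back-of-the-napkin formula for how bad traffic is, the same shape as the one I land on for servers in the conclusion:

```python
def traffic_percentage(avg_speed_mph: float, speed_limit_mph: float) -> float:
    """How 'bad' traffic is: how far below the speed limit the cars are moving."""
    return 1 - (avg_speed_mph / speed_limit_mph)

# Everyone doing the speed limit: 0% traffic.
print(traffic_percentage(65, 65))  # 0.0

# Crawling at 20 mph in a 65 zone: roughly 69% traffic.
print(traffic_percentage(20, 65))  # ~0.69
```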

In order to achieve this we’ll need to create something like a speed limit: the time within which we expect a request-response lifecycle to complete. So let’s say we expect the user to log in to our site and view their newsfeed in under a second. If we start to see requests coming back slower than a second, we could scale immediately instead of waiting for CPU usage to hit some arbitrary number like 70%.
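Here’s a rough sketch of what that speed limit might look like in code. It isn’t tied to any particular framework, and the one-second number and the rolling window size are just stand-ins for whatever you benchmark.

```python
import time
from collections import deque
from functools import wraps

SPEED_LIMIT_SECONDS = 1.0             # how fast "log in and view the newsfeed" should be
recent_latencies = deque(maxlen=500)  # rolling window of request-response times

def timed(handler):
    """Wrap a request handler and record how long the request-response lifecycle took."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        finally:
            recent_latencies.append(time.monotonic() - start)
    return wrapper

def average_latency() -> float:
    return sum(recent_latencies) / len(recent_latencies) if recent_latencies else 0.0

def over_the_speed_limit() -> bool:
    """Scale as soon as traffic is slower than the speed limit, regardless of CPU."""
    return average_latency() > SPEED_LIMIT_SECONDS
```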

In order to implement this we would need to benchmark our user stories and then monitor them in real time. As I think about this some more, I’m starting to realize why microservices might be a good approach to complicated web apps. If one small part of my application is the bottleneck, then I only want to scale that piece while the rest of the system waits.
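Sketching that out, each service would get its own benchmarked speed limit, and only the one blowing past it would get scaled. The service names and numbers below are made up.

```python
# Benchmarked "speed limits" per user story / service, in seconds (hypothetical values).
SLAS = {
    "auth":     0.3,
    "newsfeed": 1.0,
    "search":   0.8,
}

def services_to_scale(avg_latencies: dict[str, float]) -> list[str]:
    """Return only the services whose measured latency exceeds their own speed limit."""
    return [name for name, latency in avg_latencies.items()
            if latency > SLAS.get(name, float("inf"))]

# Only the bottlenecked piece gets scaled; everything else waits.
print(services_to_scale({"auth": 0.12, "newsfeed": 1.6, "search": 0.4}))  # ['newsfeed']
```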

In conclusion, traffic is latency, not throughput. Scaling should therefore happen when latency becomes unacceptable, and that should be calculated as 1 minus the average request-response lifecycle divided by the service level agreement: the closer that number gets to zero (or the further it falls below it), the more urgently you need to scale.
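Or, in code, a sketch of that formula with an arbitrary cushion I’m picking for what counts as unacceptable:

```python
def latency_headroom(avg_latency_seconds: float, sla_seconds: float) -> float:
    """1 minus the average request-response time divided by the SLA.
    Positive means we're faster than the agreement; zero or negative means we're not."""
    return 1 - (avg_latency_seconds / sla_seconds)

def time_to_scale(avg_latency_seconds: float, sla_seconds: float,
                  acceptable: float = 0.1) -> bool:
    # 'acceptable' is an illustrative cushion: scale once less than 10% of the SLA is left.
    return latency_headroom(avg_latency_seconds, sla_seconds) < acceptable

print(time_to_scale(0.95, 1.0))  # True  -> only 5% of the SLA left, scale now
print(time_to_scale(0.50, 1.0))  # False -> plenty of headroom
```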