Balancing Load is Hard
Imagine your business provides a lot of valuable data for free based on the assumption that you will ultimately make a transaction against that data. Amazon’s website is a (quite lucrative) example but there are many others.
Without care, it’s possible to ruin the company finances by serving huge quantities of unprofitable traffic. In the case of one of my gigs, we elected to implement a throttling system to discourage excessive consumption for no return. And so our troubles began…
Not long after the throttling was rolled out we started to see premature assertion of limits with attendant customer upset. Our monitoring which was known accurate showed request rates significantly below the level necessary for throttling to be triggered. Obviously, given it was new software, the first place the engineering team looked was the code. A lot of reviewing and testing started but it was quickly clear there were going to be no immediate answers. Not good…
There was something I’d seen in the data that had been bothering me:
It indicated a single machine was throttling more requests than made sense for the load.
Let’s say you have 200 machines and a load of 200k requests per minute. That’s 1k of requests per minute per machine, yet we’d see throttling on a single host at rates well above that. If our load was being spread as intended, that shouldn’t be happening.
Load balancers offer a number of different policies for distributing requests. Ours was doing it based on the number of connections per machine. The assumption here is that connections are a reasonable indicator of how busy a machine is. More connections, more load. There’s a problem though, what if your requests do not exert even load? What if you have some that are dispatched in 100ms and others taking 1000ms?
It’s tempting to think that given randomness in the load it’ll even out nicely. In fact it might but, only if there’s an even distribution of requests across the range of response times, like this:
Many loads unfortunately don’t behave like this though, they are long tailed:
In the simplest terms, imbalance is created because only a few of the machines in your cluster see the long-lived requests over a timeframe of minutes. These machines will have longer running connections thus the load balancer perceives them to be busier and requests are sent elsewhere.
Was this the culprit? The detailed monitoring required to determine this directly from production was missing. Damn, have I hit a dead-end?
Validating The Thinking
Setting up a test environment with all the attendant load generation and such was going to be difficult. There were too many moving parts to start with but also it would need to be a scale-unit of production processing power. Yikes.
I elected instead to build a simulation of production on my laptop. It was pretty bare bones:
- Load balancer and appropriate policy
- A number of front-end nodes
- An appropriate mixture of request durations
- An implementation of the throttling policy
I plugged in the numbers and kicked-off the program and saw a completely different picture from what was going on in production. Uh oh. I did a cursory code review and found a bug that explained why. One fix later and now I did see an accurate reproduction of our real-world issue. Hooray!
With repeated runs using various configurations I was able to find a throttling policy that would provide a modicum of protection and avoid undesirable customer impact whilst we engineered a proper solution.
The original simulation is the property of my then employer so I cannot share it (a hurried hack, you wouldn’t want to see it anyway). However, I wrote an example from scratch which you can see here. There are some improvements over the first edition including support for parallel simulations which speeds things up substantially.
One More Twist
There was one more interesting aspect to this problem. When the throttling system activated, we’d see request load on other healthy machines go down. Why would throttling on one machine be interacting with the request loads of its peers?
Our throttling terminated appropriate requests instantly, returning a suitable HTTP error code. A side-effect of this was a machine that had started to throttle suddenly looked ultra-fast to the load balancer. This created a nasty amplification where significant numbers of requests were vectored from other machines toward the apparent colossus. These would also be throttled and we’d suddenly have a chunk of machines doing nothing.
We fixed this by arranging for a small pause before returning the HTTP error code. It was sized to match the typical request processing time observed in our systems. The load balancer would no longer see a machine appear fast once its throttling activated.
The Simplest Thing That Could Possibly Work
I did the original design-work for the throttling system and constructed a prototype implementation to prove its viability. Importantly, it featured a globally-shared, replicated and sharded back-end for maintaining request counts and thresholds which would have avoided the problem we encountered.
In contrast, the engineering team chose to start with a simpler, quicker to deliver solution based on machine-local counting and limiting, removing the need for the back-end. That might seem like a bad decision given our painful experience above but it did have the merit of protection for the servers and the company’s bottom-line in a shorter timeframe.
So, if the engineering team had developed a solution based on my design would things have gone better? Difficult to tell because starting with a more complex system has its own potential for negative customer impact.
As I was writing this, it occurred to me the model above could be made capable of predicting a minimum safe limit for the machine-local throttles automatically. Consequently you’ll find it implemented in my re-hash
If you managed to read this far, you might wish to continue on to LinkedIn where there’s some additional discussion.