Improving load balancing with a new consistent-hashing algorithm

We run Vimeo’s dynamic video packager, Skyfire, in the cloud, serving almost a billion DASH and HLS requests per day. That’s a lot! We’re very happy with the way that it performs, but scaling it up to today’s traffic and beyond has been an interesting challenge. Today I’d like to talk about a new algorithmic development, bounded-load consistent hashing, and how it eliminates a bottleneck in our video delivery.

Dynamic packaging

Vimeo’s video files are stored as MP4 files, the same format used for download or “progressive” playback in the browser. DASH and HLS, however, don’t use a single file — they use short segments of video, delivered separately. When a player requests a segment, Skyfire handles the request on the fly. It fetches only the necessary part of the MP4 file, makes a few adjustments for the DASH or HLS format, and sends the result back to the user.

But how does Skyfire know which bytes it needs to fetch when a player requests, say, the 37th segment of a file? It needs to look at an index that knows the location of all of the keyframes and all of the packets in the file. And before it can look at it, it needs to generate it. That takes at least one HTTP request, and a bit of CPU time — or, for very long videos, a lot of CPU time. Since we get many requests for the same video file, it makes sense to cache the index and re-use it later.

When we first started testing Skyfire in the real world, we took a simple approach to caching: we cached the indexes in memory on the cloud server where they were generated, and used consistent hashing in HAProxy to send requests for the same video file to the same cloud server. That way, the cached data could be used again.

Understanding consistent hashing

Before moving forward, let’s dig into consistent hashing, a technique for distributing load among multiple servers. If you’re already familiar with consistent hashing, feel free to go ahead and skip to the next section.

To distribute requests among servers using consistent hashing, HAProxy takes a hash of part of the request (in our case, the part of the URL that contains the video ID), and uses that hash to choose an available backend server. With traditional “modulo hashing”, you simply consider the request hash as a very large number. If you take that number modulo the number of available servers, you get the index of the server to use. It’s simple, and it works well as long as the list of servers is stable. But when servers are added or removed, a problem arises: the majority of requests will hash to a different server than they did before. If you have nine servers and you add a tenth, only one-tenth of requests will (by luck) hash to the same server as they did before.

Then there’s consistent hashing. Consistent hashing uses a more elaborate scheme, where each server is assigned multiple hash values based on its name or ID, and each request is assigned to the server with the “nearest” hash value. The benefit of this added complexity is that when a server is added or removed, most requests will map to the same server that they did before. So if you have nine servers and add a tenth, about 1/10 of requests will have hashes that fall near the newly-added server’s hashes, and the other 9/10 will have the same nearest server that they did before. Much better! So consistent hashing lets us add and remove servers without completely disturbing the set of cached items that each server holds. That’s a very important property when those servers are running in the cloud.

Consistent hashing — less-than-ideal for load balancing

However, consistent hashing comes with its own problem: uneven distribution of requests. Because of its mathematical properties, consistent hashing only balances loads about as well as choosing a random server for each request, when the distribution of requests is equal. But if some content is much more popular than others (as usual for the internet), it can be worse than that. Consistent hashing will send all of the requests for that popular content to the same subset of servers, which will have the bad luck of receiving a lot more traffic than the others. This can result in overloaded servers, bad video playback, and unhappy users.

By November 2015, as Vimeo was getting ready to launch Skyfire to more than a hand-picked set of members, we decided that this overloading issue was too serious to be ignored, and changed our approach to caching. Instead of consistent-hashing based balancing, we used a “least connections” load-balancing policy in HAProxy, so that the load would be distributed evenly among servers. And we added a second-level cache using memcached, shared among the servers, so that an index generated by one server could be retrieved by a different one. The shared cache requiredsome additional bandwidth, but the load was balanced much more evenly between servers. This is the way we ran, happily, for the next year.

But wouldn’t it be nice to have both?

Why wasn’t there a way to say “use consistent hashing, but please don’t overload any servers”? As early as August 2015, I had tried to come up with an algorithm based on the power of two random choices that would do just that, but a bit of simulation said that it didn’t work. Too many requests were sent to non-ideal servers to be worthwhile. I was disappointed, but rather than wasting time trying to rescue it, we went ahead with the least-connections and shared cache approach above.

Fast forward to August 2016. I noticed a URL that the inestimable Damian Gryski had tweeted, of an arXiv paper titled Consistent Hashing with Bounded Loads. I read the abstract, and it seemed to be exactly what I wanted: an algorithm that combined consistent hashing with an upper limit on any one server’s load, relative to the average load of the whole pool. I read the paper, and the algorithm was remarkably simple. Indeed, the paper says

while the idea of consistent hashing with forwarding to meet capacity constraints seems pretty obvious, it appears not to have been considered before.

The bounded-load algorithm

Here is a simplified sketch of the algorithm. Some details are left out, and if you intend to implement it yourself, you should definitely go to the original paper for information.

First, define a balancing factor, c, which is greater than 1. c controls how much imbalance is allowed between the servers. For example, if c = 1.25, no server should get more than 125% of the average load. In the limit as c increases to ∞, the algorithm becomes equivalent to plain consistent hashing, without balancing; as c decreases to near 1 it becomes more like a least-connection policy and the hash becomes less important. In my experience, values between 1.25 and 2 are good for practical use.

When a request arrives, compute the average load (the number of outstanding requests, m, including the one that just arrived, divided by the number of available servers, n). Multiply the average load by c to get a “target load”, t. In the original paper, capacities are assigned to servers so that each server gets a capacity of either ⌊t⌋ or ⌈t⌉, and the total capacity is ⌈cm⌉. Therefore the maximum capacity of a server is ⌈cm/n⌉, which is greater than c times the average load by less than 1 request. To support giving servers different “weights”, as HAProxy does, the algorithm has to change slightly, but the spirit is the same — no server can exceed its fair share of the load by more than 1 request.

To dispatch a request, compute its hash and the nearest server, as usual. If that server is below its capacity, then assign the request to that server. Otherwise, go to the next server in the hash ring and check its capacity, continuing until you find a server that has capacity remaining. There has to be one, since the highest capacity is above the average load, and it’s impossible for every server’s load to be above average. This guarantees some nice things:

  1. No server is allowed to get overloaded by more than a factor of c plus 1 request.
  2. The distribution of requests is the same as consistent hashing as long as servers aren’t overloaded.
  3. If a server is overloaded, the list of fallback servers chosen will be the same for the same request hash — i.e. the same server will consistently be the “second choice” for a popular piece of content. This is good for caching.
  4. If a server is overloaded, the list of fallback servers will usually be different for different request hashes — i.e. the overloaded server’s spillover load will be distributed among the available servers, instead of all landing on a single server. This depends on each server being assigned multiple points in the consistent hash ring.

Real-world results

After testing the algorithm in the simulator and getting more positive results than my simpler algorithm, I started figuring out how to hack it into HAProxy. Adding code to HAProxy wasn’t too bad. The code is pretty clean and well-organized, and after a few days of work I had something that worked well enough that I could replay some traffic through it and see the algorithm in action. And it worked! Mathematical proofs and simulations are nice, but it’s hard to truly believe until you see real traffic hit real servers.

Armed with that success, in September I sent a proof-of-concept patch to HAProxy. The HAProxy maintainer, Willy Tarreau, was a real pleasure to work with. He recognized the value of the algorithm, and didn’t tell me how terrible my patch was. He did a thorough review and provided some very valuable feedback. It took a little while to work in those suggestions and get things up to snuff, but after a few weeks I had a polished version ready to send to the list. A few more minor tweaks and it was accepted in time for HAProxy 1.7.0-dev5, released on October 26. On November 25, HAProxy 1.7.0 was designated as a stable release, so bounded-load consistent hashing is now generally available.

But what I’m sure you want to know is, what did we actually gain from all of this?

Here’s a graph of the cache behavior before and after changing our HAProxy configuration.

The daily variation is caused by autoscaling: during the day, there’s more traffic, so we start more servers to handle it, and fewer requests could be served by local cache. At night, there’s less traffic, so we shut servers down, and the local cache performance went up somewhat. After switching to the bounded-load algorithm, a much bigger fraction of requests hit local cache, regardless of how many servers were running.

Here’s a graph of the shared cache bandwidth over the same time:

Before the change, each memcached server reached as high as 400 or 500 Mbit/s in outgoing bandwidth during peak hours (about 8Gbit/s in total). Afterwards, there’s less variation, and the servers stay comfortably below 100 Mbit/s each.

What’s not graphed is performance, in terms of response times. Why? Because they stayed exactly the same. The least-connection policy was doing a good job of keeping servers from getting overloaded, and fetching things from memcached is fast enough that it doesn’t have a measurable effect on the response times. But now that a much smaller fraction of the requests rely on the shared cache, and because that fraction doesn’t depend on the number of servers we run, we can look forward to handling a lot more traffic without saturating the memcached servers. In addition, if a memcached server ever goes down, the overall effect it has on Skyfire will be much less.

All in all, I’m very happy to see how a little bit of algorithm work turned a single point of failure into something a whole lot better.