FAIR: Allocating Resources Fairly at Scale

Mihir Sathe
Oct 5, 2024


I recently published a Go library called FAIR, designed to maintain fair resource distribution using a randomized algorithm called stochastic fair BLUE, with constant memory requirements. My vision for this project is to make it a “fit-and-forget” solution for most applications, requiring minimal tuning or operational overhead. The project has already gained notable attention, including a stint at the top of Hacker News, which was incredibly exciting.

In this post, I’ll take a deep dive into the algorithm behind FAIR and discuss both the optimizations I’ve made and those in the pipeline to ensure it works effectively in real-world scenarios. If you’re curious about the project, feel free to check out the GitHub repo and give it a star to stay updated!

What is Fairness?

Fairness in a multi-tenant service is a complex topic. AWS has an excellent paper on Amazon’s view of fairness, which, given my long tenure there, also informs my perspectives. In the context of FAIR, we’re primarily concerned with non-quota-based admission control scenarios.

Simply put, the job of a service is to “serve” resources. These resources can be high-level business objects like sports event tickets, map tiles, or slots on a worker node to run compute jobs. They also include implicit resources like the CPU and memory of the service node itself needed to handle a request. Both types of resources are often limited (the latter always are). When a service starts running out of resources, it must either keep requests waiting or reject them and ask clients to try again later. It’s almost always better to fail fast (i.e., throttle) rather than wait for a resource to become available.

Now, let’s consider multi-tenancy, where a single service handles requests from a large number of uncorrelated clients. When you have to start throttling requests because supply can’t keep up with demand, unless you explicitly address fairness, you might end up allocating virtually all available resources to a minority of clients with very high request rates (often called “heavy hitters”), while starving the rest of the clients who are making requests at a reasonable rate. This discourages good behavior and forces your clients to aggressively retry, risking a full outage.

Figure 1 shows a simulated example of such an event: the service can only handle 600 requests per minute, and client-0 is trying to consume all of that capacity, which works fine while it’s the only client. As soon as more clients making more modest requests arrive, they go largely empty-handed, while client-0 continues to receive a disproportionately large share of the resources due to its higher request rate.

Figure 1

Now, consider how fairness throttling works when we implement it using FAIR. In this approach, we aim to minimize the number of clients that experience significant throttling by concentrating most of the throttling on the heavy hitters. As a result, clients making requests at lower rates see very few throttles and can largely acquire the resources they need. Meanwhile, heavy hitters can still access resources, but at reduced rates.

Once the resource constraints are resolved, all clients — including the heavy hitters — can return to consuming resources at their desired rates. This method ensures a fair distribution of resources among clients with varying request rates. Seems fair, right?

Figure 2

Quota-free Fairness

A common method to maintain fairness is the token bucket algorithm. Each client has a bucket that regenerates tokens at a rate of X per second. Every incoming request consumes a token; if none are available, the request is either delayed or throttled (usually throttled). The bucket can accumulate unused tokens up to a fixed capacity called the burst. With a burst of zero, the token bucket enforces a strict request rate.
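
For readers who haven’t implemented one, here is a minimal token bucket sketch in Go. It is a generic illustration of the algorithm just described, not code from FAIR; the type and field names are mine.

```go
package tokenbucket

import (
	"sync"
	"time"
)

// TokenBucket refills at `rate` tokens per second, up to `burst` tokens.
type TokenBucket struct {
	mu     sync.Mutex
	tokens float64   // current token count
	rate   float64   // tokens added per second
	burst  float64   // maximum tokens the bucket can hold
	last   time.Time // time of the last refill
}

func NewTokenBucket(rate, burst float64) *TokenBucket {
	return &TokenBucket{tokens: burst, rate: rate, burst: burst, last: time.Now()}
}

// Allow consumes one token if available; otherwise the caller should throttle.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := time.Now()
	// Refill proportionally to the elapsed time, capped at the burst size.
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now

	if b.tokens < 1 {
		return false // no tokens left: delay or throttle the request
	}
	b.tokens--
	return true
}
```

Note that this state is per client, so a service with many clients has to create, store, and eventually expire one bucket per client.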

This approach works well when clients require a predictable resource rate — such as external customers or paid services. However, for internal services, restricting all clients to a fixed rate can lead to underutilized resources. You might consider increasing everyone’s quota, but this risks resource exhaustion if more clients join. Allowing large bursts lets clients temporarily exceed their quotas, but if many do so simultaneously, you again risk depleting resources.

The alternative approach, which FAIR adopts, is to let clients consume as many resources as they need while resources are plentiful. Once we start running low, we use that signal to throttle incoming requests across clients based on their request rates, thus maintaining fairness. The key challenge is accurately accounting for these throttled requests and attributing them properly. This ensures we minimize the impact on clients making requests at reasonable rates while still serving heavy hitters at a reduced rate. In essence, paraphrasing the AWS paper, we aim to maintain the illusion of single tenancy for most clients.
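
In practice this pattern boils down to two hooks around each request: ask whether to throttle before doing the work, and report afterwards whether the work hit a resource limit. The interface below is a hypothetical illustration of that wiring, not FAIR’s actual API.

```go
package fairness

import "errors"

// ErrResourceExhausted is a hypothetical sentinel a service might return
// when it runs out of the resource being served.
var ErrResourceExhausted = errors.New("resource exhausted")

// Tracker is a hypothetical interface for a quota-free fairness mechanism:
// it decides whether to throttle and learns from request outcomes.
type Tracker interface {
	ShouldThrottle(clientID string) bool
	ReportOutcome(clientID string, resourceExhausted bool)
}

// handle shows where the two hooks sit relative to the actual work.
func handle(t Tracker, clientID string, serve func() error) error {
	if t.ShouldThrottle(clientID) {
		return errors.New("throttled: please retry later")
	}
	err := serve()
	// Only resource exhaustion feeds the fairness signal; other errors do not.
	t.ReportOutcome(clientID, errors.Is(err, ErrResourceExhausted))
	return err
}
```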

The Probabilistic Data Structure

One of the standout features of FAIR is its constant memory usage, regardless of the number of clients. This efficiency is achieved by utilizing a probabilistic data structure inspired by the Stochastic Fair BLUE algorithm. I’ve previously written about probabilistic data structures, and the one we’ll explore here is closely related to the counting Bloom filter.

Figure 3

Figure 3 illustrates the design of the FAIR data structure. We employ a multi-level structure with l levels, where each level contains a fixed number of buckets. Each bucket holds two pieces of information: the throttling probability (a value between 0 and 1) and the last update time.
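
As a rough mental model, such a structure can be sketched in Go as follows. The types and names are hypothetical and only illustrate the shape described above, not FAIR’s actual implementation.

```go
package fairsketch

import "time"

// bucket holds the two pieces of state described above.
type bucket struct {
	throttleProb float64   // throttling probability in [0, 1]
	lastUpdate   time.Time // used later for time-based decay
}

// table is an L-level structure with B buckets per level. Its memory use is
// fixed at L * B buckets, independent of the number of clients.
type table struct {
	levels [][]bucket // levels[i][j] is bucket j at level i
	seed   uint64     // hash seed; replaced on every rotation
}

func newTable(numLevels, bucketsPerLevel int, seed uint64) *table {
	t := &table{levels: make([][]bucket, numLevels), seed: seed}
	for i := range t.levels {
		t.levels[i] = make([]bucket, bucketsPerLevel)
	}
	return t
}
```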

When a request arrives, its client ID (you can define what constitutes a client ID based on your needs) is hashed into a bucket at each of the l levels. To make this process efficient, we use the technique from Kirsch et al., which allows us to compute all l hashes using a single Murmur3 hash function.
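
Continuing the sketch, here is one way to derive all l bucket indices from a single 128-bit Murmur3 hash: the two 64-bit halves h1 and h2 are combined as h1 + i*h2, per Kirsch and Mitzenmacher. The choice of the spaolacci/murmur3 package and the way the seed is folded into the key are my assumptions, not necessarily what FAIR does.

```go
package fairsketch

import (
	"encoding/binary"

	"github.com/spaolacci/murmur3"
)

// bucketIndices maps a client ID to one bucket index per level. A single
// 128-bit Murmur3 hash yields two 64-bit halves h1 and h2, and the i-th
// index is derived as h1 + i*h2 (the Kirsch-Mitzenmacher construction).
func (t *table) bucketIndices(clientID string) []int {
	// Fold the structure's seed into the key so that a new seed remaps clients.
	key := make([]byte, len(clientID)+8)
	copy(key, clientID)
	binary.LittleEndian.PutUint64(key[len(clientID):], t.seed)

	h1, h2 := murmur3.Sum128(key)
	idx := make([]int, len(t.levels))
	for i := range idx {
		idx[i] = int((h1 + uint64(i)*h2) % uint64(len(t.levels[i])))
	}
	return idx
}
```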

We then examine the throttling probabilities from all levels and take the minimum value as the final throttling probability for the request. The library also provides an option to use the mean instead of the minimum, which can be beneficial in certain scenarios.

  • If the request succeeds, we decrease the throttling probability in all associated buckets by a fixed decrement, Pd.
  • If the request fails due to resource exhaustion (which you define based on your criteria), we increase the throttling probability in all associated buckets by a fixed increment, Pi.

Our goal is to quickly mitigate the impact of misbehaving clients while cautiously restoring their access. Therefore, Pi is typically larger than Pd; by default it is 100 times larger, and the defaults are chosen so that roughly 25 consecutive failures drive the throttling probability from 0 to 1. All of these parameters are adjustable to suit your specific requirements.
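
Putting those rules together, the decision and update paths might look like the sketch below, continuing the hypothetical types from earlier. The Pi and Pd constants are illustrative values consistent with the defaults described in the text, not values read from the library.

```go
package fairsketch

import (
	"math"
	"math/rand"
	"time"
)

// Illustrative constants consistent with the defaults described above:
// roughly 25 consecutive failures drive a bucket from 0 to 1, and the
// decrement is 100 times smaller than the increment.
const (
	pIncrement = 0.04   // Pi: added on a resource-exhaustion failure
	pDecrement = 0.0004 // Pd: subtracted on a success
)

// shouldThrottle hashes the client into one bucket per level, takes the
// minimum probability across levels, and throttles with that probability.
func (t *table) shouldThrottle(clientID string) bool {
	minProb := 1.0
	for level, j := range t.bucketIndices(clientID) {
		if p := t.levels[level][j].throttleProb; p < minProb {
			minProb = p
		}
	}
	return rand.Float64() < minProb
}

// reportOutcome adjusts every bucket the client maps to: up on a failure
// caused by resource exhaustion, down on a success, clamped to [0, 1].
func (t *table) reportOutcome(clientID string, resourceExhausted bool) {
	now := time.Now()
	for level, j := range t.bucketIndices(clientID) {
		b := &t.levels[level][j]
		if resourceExhausted {
			b.throttleProb = math.Min(b.throttleProb+pIncrement, 1)
		} else {
			b.throttleProb = math.Max(b.throttleProb-pDecrement, 0)
		}
		b.lastUpdate = now
	}
}
```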

Avoiding False Positives

A key challenge when using probabilistic data structures like counting Bloom filters is the potential for false positives. In our context, a false positive occurs when an innocent client’s requests are mistakenly throttled because their client ID hashes to the same buckets as a misbehaving client. This can unfairly penalize well-behaved clients and undermine the fairness that FAIR aims to achieve.

Understanding the Probability of Collision

If we have L levels and every level has B buckets, the probability that another client collides with a given client at every level is (1/B)^L. With 4 levels and 512 buckets per level, that works out to about 1.5 x 10^-11, which most would consider practically zero. However, that is only the whole picture if there is a single throttled heavy hitter in the entire system. With a handful of heavy hitters the probability is still vanishingly small, but as their number grows, cross-overlaps (different heavy hitters landing in the same client’s buckets at different levels) start to dominate, and we have to account explicitly for M, the estimated number of heavy hitters. The probability that an innocent client shares a bucket with some heavy hitter at every level then becomes:

P ≈ (1 - (1 - 1/B)^M)^L

So with 100 heavy hitters, 512 buckets, and 4 levels, the probability of a given client overlapping with a heavy hitter is about 0.001. FAIR provides a function called GenerateTunedStructureConfig that takes M, B, and p and returns an appropriate L for the structure, so you can control this probability directly. The default B is 1000 and the minimum L is 3, which grows as needed based on the calculation above. The default structure takes about 384 KB of space and should be enough for most cases, even at a fairly large scale.
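
The numbers above are easy to reproduce. The short program below computes the collision probability and the smallest L that meets a target p; it mirrors the back-of-the-envelope math, not the actual GenerateTunedStructureConfig implementation.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb returns the probability that a given innocent client shares
// a bucket with at least one of m heavy hitters at every one of l levels,
// which is when the minimum rule can no longer save it.
func collisionProb(m, b, l int) float64 {
	// Probability that a single level's bucket is shared with some heavy hitter.
	perLevel := 1 - math.Pow(1-1.0/float64(b), float64(m))
	return math.Pow(perLevel, float64(l))
}

// minLevels returns the smallest number of levels that keeps the collision
// probability under a target p, mimicking what a tuning helper would compute.
func minLevels(m, b int, p float64) int {
	l := 1
	for collisionProb(m, b, l) > p {
		l++
	}
	return l
}

func main() {
	fmt.Printf("M=100, B=512, L=4: %.4f\n", collisionProb(100, 512, 4))
	fmt.Printf("levels needed for p=1e-6 with M=100, B=1000: %d\n", minLevels(100, 1000, 1e-6))
}
```

Running it prints roughly 0.0010 for the first case, matching the figure quoted above.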

Hash Rotation

In probabilistic data structures like the one used in FAIR, the worst-case scenario is when an innocent client is consistently hashed alongside a misbehaving heavy hitter. This persistent collision could result in the innocent client being unfairly throttled indefinitely, severely impacting their experience. While the probability of this happening is extremely low, it’s crucial to address this edge case to ensure fairness for all clients.

To mitigate this issue, FAIR implements a strategy called hash rotation. Here’s how it works:

  1. Dual Data Structures: The system maintains two instances of the data structure at all times — a “live” structure and a “shadow” structure.
  2. Parallel Updates: Both structures receive and process the same incoming traffic, updating their internal states accordingly. However, only the live structure is used to make throttling decisions.
  3. Periodic Rotation: Every few minutes (five minutes by default, but this is configurable), the system discards the current live structure and promotes the shadow structure to become the new live structure.
  4. New Hash Seeds: Each new structure uses a fresh random seed for its hash functions, ensuring that client IDs are hashed differently in each rotation cycle.

By rotating the hash functions and effectively re-mapping client IDs to new buckets, we break any persistent coupling between clients. This means that any innocent client affected by an unfortunate collision will, at most, experience throttling for the duration of a single rotation interval.

One might wonder why we maintain a shadow structure in parallel instead of starting fresh after each rotation. The reason is to avoid giving misbehaving clients a “free pass” during the learning phase of a new structure. If we began with an empty structure, heavy hitters would temporarily evade throttling until the system re-learns their behavior.

By updating the shadow structure in parallel, it accumulates accurate throttling probabilities based on real-time client behavior. When it becomes the new live structure, it already has the necessary state to make informed throttling decisions without any downtime.

Maintaining two structures simultaneously does require double the memory usage. However, considering the relatively small memory footprint of the data structures involved (about 384 KB by default), this overhead is minimal compared to the significant benefits in fairness and user experience.
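
Tying the rotation together with the earlier sketches, the bookkeeping might look roughly like this. The names, the background goroutine, and the coarse locking are illustrative assumptions rather than FAIR’s actual code.

```go
package fairsketch

import (
	"math/rand"
	"sync"
	"time"
)

// rotator keeps a live structure for decisions and a shadow structure that
// learns in parallel, swapping them every rotation interval with a new seed.
type rotator struct {
	mu              sync.Mutex
	live, shadow    *table
	levels, buckets int
}

func newRotator(levels, buckets int, interval time.Duration) *rotator {
	r := &rotator{
		live:    newTable(levels, buckets, rand.Uint64()),
		shadow:  newTable(levels, buckets, rand.Uint64()),
		levels:  levels,
		buckets: buckets,
	}
	// Rotate on a fixed interval (five minutes by default in FAIR).
	go func() {
		for range time.Tick(interval) {
			r.rotate()
		}
	}()
	return r
}

// rotate promotes the shadow to live and starts a fresh shadow with a new
// seed, so any unlucky collision lasts at most one interval.
func (r *rotator) rotate() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.live = r.shadow
	r.shadow = newTable(r.levels, r.buckets, rand.Uint64())
}

// shouldThrottle consults only the live structure.
func (r *rotator) shouldThrottle(clientID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.live.shouldThrottle(clientID)
}

// reportOutcome feeds both structures, so the shadow is warm when promoted.
func (r *rotator) reportOutcome(clientID string, resourceExhausted bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.live.reportOutcome(clientID, resourceExhausted)
	r.shadow.reportOutcome(clientID, resourceExhausted)
}
```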

Redemption for the Heavy Hitters

So far, we’ve focused on how FAIR effectively throttles workloads during periods of resource contention, ensuring a fair distribution of resources among all clients. However, when the resource crunch subsides — whether through auto-scaling, reduced demand, or other factors — it’s important to allow throttled heavy hitters to return to their desired request levels.

FAIR includes two recovery mechanisms that gradually decrease the throttling probabilities for clients as resources become available. Here’s how they work:

  • Decrease on Success: When a previously throttled request now succeeds (because resources are no longer constrained), the algorithm decreases the throttling probability in all associated buckets by the fixed decrement Pd. Even if a heavy hitter is fully throttled with probability 1, some of its buckets may still see decreases due to collisions with other workloads; since we take the minimum of all probabilities, that gives the workload a chance to come back.
  • Gradual Decay: If a heavy hitter is stuck at probability 1 and, unluckily, none of its buckets collide with any other workload, it could stay throttled forever. To prevent this, we apply an exponential decay over time that slowly decreases the probability in every bucket no matter what (see the sketch after this list). A non-zero probability means some requests get through, and if they succeed the probability starts dropping in Pd steps. This is in fact the reason why we store the lastUpdateTime in every bucket.
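
As a rough illustration of the second mechanism, a lazily applied decay could look like the sketch below, again continuing the hypothetical types from earlier; the half-life constant is an arbitrary example, not FAIR’s default.

```go
package fairsketch

import (
	"math"
	"time"
)

// decayHalfLife is an arbitrary illustrative value: a bucket that sees no
// traffic halves its throttling probability every five minutes.
const decayHalfLife = 5 * time.Minute

// decayedProb applies exponential decay based on how long ago the bucket was
// last updated, guaranteeing every probability eventually drifts back to 0.
func decayedProb(b bucket, now time.Time) float64 {
	elapsed := now.Sub(b.lastUpdate)
	if elapsed <= 0 {
		return b.throttleProb
	}
	factor := math.Exp2(-float64(elapsed) / float64(decayHalfLife))
	return b.throttleProb * factor
}
```

Because the decay in this sketch is computed from the stored last update time whenever a bucket is read, no background sweep over the buckets is required.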

Conclusion

In this post, we’ve delved into how FAIR implements fairness throttling in multi-tenant environments using probabilistic data structures. By allowing clients to consume resources freely when available and intelligently throttling heavy hitters during resource contention, FAIR ensures equitable resource distribution without the need for per-client quotas or excessive memory usage.

We explored the design of FAIR’s multi-level counting Bloom filters, the mechanisms to avoid false positives through hash rotation, and the strategies for heavy hitters to regain normal request rates when resources become plentiful again. These features collectively maintain the illusion of single tenancy for most clients, providing a seamless and fair experience.

FAIR aims to be a “fit-and-forget” solution that requires minimal tuning and operational overhead. Its constant memory usage, adaptability, and focus on fairness make it a valuable tool for managing resources in distributed systems.

I encourage you to check out the FAIR GitHub repository to see how it can benefit your projects. Feel free to contribute, provide feedback, or star the project to stay updated on future developments. Your support and input are greatly appreciated!


Mihir Sathe

I love software, coding and building large systems. I am a software engineer at Snowflake. All opinions are my own. My Twitter: https://twitter.com/_mihir_sathe