Designing a Distributed Rate Limiter — Deep Dive

Hiresh Trivedi
Published in wineofbits · Aug 4, 2021

Pre-requisite

This article follows Designing a distributed rate limiter — Introduction. The beginning might seem abrupt if the first article is skipped.

In the introduction we covered the motivation for implementing a rate limiter, the questions that arise before building one, real-world incidents where rate limiters saved the day, and examples of existing services that use them. We can now dive into the implementation details.

Where to place the rate limiter in your architecture?

Image courtesy: Microsoft cloud architecture guide

We can place the rate limiter in the API gateway, as an external service, or inside the application server. A mature architecture typically already uses an API gateway for authentication, IP whitelisting, SSL termination, and so on. If we have sufficient control over the API gateway, we can run the rate limiter inside it, with a fallback mechanism in place because gateways can fail too (Slack once had an outage because of an AWS gateway failure).

If we do not want to commit to the API gateway forever and would rather keep the rate limiter decoupled, we can place it as an external service after the API gateway.

It is also worth addressing the urge to place the rate limiter inside the application container using the sidecar pattern. The argument is that if a threshold is violated we never forward requests to the database, cache, or other services, so invalid requests are rejected quickly without overworking downstream dependencies. However, this defies the core principle of keeping the application servers shielded from such requests in the first place, which is what allows more valid user requests to reach them.

Based on the tech stack, the control we have over the API gateway, and the infrastructure, we can decide where to place the rate limiter. That said, it is tough to find substantial cons to running it as a separate service after the API gateway.

Algorithms to implement rate limiting logic

We have various options for the underlying algorithm; some of them are:

  • Token bucket algorithm
  • Leaky bucket algorithm
  • Sliding window counter
  • Fixed window counter

The token bucket algorithm is arguably the most widely used, so we will discuss token buckets here. The other algorithms are excluded intentionally to keep the article brief.

Token Bucket Algorithm

Identifiers can be a user, an IP, an API path, or any other entity we wish to track request counts for.

We specify two parameters:

  1. Bucket size (the maximum number of tokens, i.e. requests, allowed for an identifier)
  2. Refill rate (the bucket is refilled back to bucket size every X seconds)

At each incoming request we check the remaining number of tokens for the identifier; if tokens remain, we consume one and allow the request, otherwise we drop it.

Finally, we schedule a timer that refills the bucket with the specified number of tokens after each fixed time interval.
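A minimal in-memory sketch of this logic, assuming a single process and one bucket per identifier (the names TokenBucket and is_allowed are illustrative, not from any particular library); instead of a background timer, this sketch refills the bucket lazily when a request arrives:

```python
import time

class TokenBucket:
    """In-memory token bucket for a single identifier (illustrative sketch)."""

    def __init__(self, bucket_size: int, refill_interval_secs: float):
        self.bucket_size = bucket_size                # maximum tokens an identifier can hold
        self.refill_interval = refill_interval_secs   # bucket is refilled to full every X seconds
        self.tokens = bucket_size
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill lazily on each request instead of running a separate timer.
        if now - self.last_refill >= self.refill_interval:
            self.tokens = self.bucket_size
            self.last_refill = now
        if self.tokens > 0:
            self.tokens -= 1   # consume one token for this request
            return True
        return False           # bucket empty: throttle the request

# One bucket per identifier (user, IP, API path, ...)
buckets = {}

def is_allowed(identifier: str, bucket_size: int = 4, refill_secs: float = 1.0) -> bool:
    bucket = buckets.setdefault(identifier, TokenBucket(bucket_size, refill_secs))
    return bucket.allow_request()
```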

Where are the rules to limit requests stored?

We will need a specification of the request threshold allowed for each identifier, answering questions like:

  1. How many times is a user allowed to access premium content for free?
  2. How many tweets per second can a user post?

There are existing rate limiters that store such configuration in files; we can also keep these rules in an in-memory cache, given that they won't change frequently.
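As a rough illustration, such rules could be loaded once at startup into an in-memory map; the structure and field names below are hypothetical, not a standard format:

```python
# Hypothetical rule specification, loaded once at startup and cached in memory.
RATE_LIMIT_RULES = {
    "premium_content_free_access": {"identifier": "user_id", "limit": 3,  "window_secs": 86400},
    "post_tweet":                  {"identifier": "user_id", "limit": 5,  "window_secs": 1},
    "login_attempt":               {"identifier": "ip",      "limit": 10, "window_secs": 60},
}

def get_rule(action: str) -> dict:
    """Look up the threshold and window for a given action."""
    return RATE_LIMIT_RULES[action]
```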

How are the remaining request counters for each identifier stored?

We will need to store the available tokens for each identifier and update the count on every request from that identifier, so that we can detect when it hits its limit. This is an important decision.

Using a database might not be the best choice here because of disk reads; storing the counters in a persistent cache like Redis, with built-in commands like INCR and EXPIRE, can substantially reduce latency.
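A hedged sketch of that path with the redis-py client, using INCR and EXPIRE; note that this keeps a simple per-window counter rather than a full token bucket, and the key naming is an assumption for illustration:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def is_allowed(identifier: str, limit: int, window_secs: int) -> bool:
    key = f"ratelimit:{identifier}"
    count = r.incr(key)             # atomically increment the counter for this identifier
    if count == 1:
        r.expire(key, window_secs)  # start the window on the first request
    return count <= limit           # allow while the counter stays within the rule's limit
```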

How does a client know it is being throttled?

Clients can also regulate the rate at which they send requests if they are informed about the limits. Certain HTTP response headers can be used to achieve this:

  1. X-Ratelimit-Remaining: the number of allowed requests remaining in the current window
  2. X-Ratelimit-Limit: the number of calls the client can make per time window
  3. X-Ratelimit-Retry-After: the number of seconds to wait before retrying

and others that can help the client react accordingly.
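For illustration, a throttled response might carry these headers alongside an HTTP 429 (Too Many Requests) status; the helper below and its values are hypothetical:

```python
def throttled_response(limit: int, remaining: int, retry_after_secs: int):
    """Build a hypothetical 429 response informing the client about its limits."""
    headers = {
        "X-Ratelimit-Limit": str(limit),                    # calls allowed per window
        "X-Ratelimit-Remaining": str(remaining),            # calls still available
        "X-Ratelimit-Retry-After": str(retry_after_secs),   # seconds to wait before retrying
    }
    body = {"error": "rate limit exceeded"}
    return 429, headers, body
```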

We can now visualize our design as pictured below:

Rate limiter in a distributed environment

Now that we have the design in a form we can scale further, we need to anticipate the problems we could face in a distributed environment.

Synchronization or consistency issues

Consistency of data might not be maintained if a user's second request arrives before the first update has propagated to all the nodes.

To demonstrate the problem, consider the following events in order:

0. Rule for the user — a maximum of 4 requests allowed per second.

  1. ‘User 1’ sends a request to rate limiter server “A”.
  2. Server “A” updates the counter in its Redis cache node; the number of tokens is now 3 (was previously 4).
  3. The same ‘User 1’ sends another request, but to server “B”, whose Redis cache instance does not have the updated count for this user, so it also makes the counter 3 (it should be 2).

One solution is sticky sessions, where the user always sends requests to the same server, but this is not fault-tolerant, and adding or removing servers causes issues; a stateless web tier is generally recommended.

We can use locks and a gossip protocol to propagate the change to the other servers so that every node has up-to-date data when the next request arrives, but this increases inter-service communication and thereby latency.

We can instead have a centralized Redis store that all rate limiter servers reach out to, but this does not scale well: our rate limiter nodes can overwhelm the centralized Redis store.

As we can see, with these approaches we trade consistency for low latency or vice versa.

We can shard the Redis store based on the identifier and route requests to the Redis store that holds the data for that identifier. This solves the consistency issue, since all requests for a particular identifier reach the same Redis store, and low latency is preserved because there are still multiple rate limiter servers. What if a shard fails? We might lose the data for a set of identifiers, so we maintain replicas. This gives us a good balance of low latency, fault tolerance, and consistency (eventual consistency).
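A minimal sketch of that routing, assuming a fixed list of Redis shards; a stable hash (rather than Python's per-process hash()) is used so every rate limiter server maps an identifier to the same shard:

```python
import hashlib
import redis

# One client per Redis shard; in practice each would point at a different host or replica set.
SHARDS = [
    redis.Redis(host="redis-shard-0"),
    redis.Redis(host="redis-shard-1"),
    redis.Redis(host="redis-shard-2"),
]

def shard_for(identifier: str) -> redis.Redis:
    # Stable hash so all rate limiter servers agree on which shard owns this identifier.
    digest = hashlib.sha1(identifier.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```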

For possible race conditions, please refer to an interesting solution — Better rate limiting with Redis sorted sets.
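As a rough sketch of that idea (not a copy of the linked solution), a sliding window of request timestamps can be kept in a Redis sorted set and trimmed, recorded, and counted inside one pipeline, so the check-and-update is not a separate read followed by a write; the key naming and window handling here are assumptions:

```python
import time
import redis

r = redis.Redis()

def is_allowed_sliding_window(identifier: str, limit: int, window_secs: int) -> bool:
    key = f"ratelimit:zset:{identifier}"
    now = time.time()
    pipe = r.pipeline(transaction=True)               # queue commands and run them atomically
    pipe.zremrangebyscore(key, 0, now - window_secs)  # drop timestamps outside the window
    pipe.zadd(key, {str(now): now})                   # record this request (members should be unique in practice)
    pipe.zcard(key)                                   # count requests left in the window
    pipe.expire(key, window_secs)                     # let idle keys expire
    _, _, count, _ = pipe.execute()
    return count <= limit
```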

We should now have a decent idea of the features, implementation details, and high-level design of rate limiters, and we are equipped with the basic concepts needed to find answers if further curiosity on the topic arises.

I wish Medium had a feature where all the embedded links could be listed in one place in the article; there are some interesting links in both of my articles, so consider this a reminder to visit them if you overlooked them earlier.
