Distributed rate limiting: Why is it needed?

Tushar Kamra · Published in Internshala Tech · Aug 18, 2022

There was a person who ran a restaurant. It was a small setup, and they could comfortably serve 20–30 regular customers at a time. The process worked well until, one day, a campaign in the neighborhood drew hundreds of people to the restaurant. The staff tried hard but could not serve everyone, and many people left empty-handed. In the process, the staff also failed to serve their regular customers. They had no control over the incoming traffic, and so their service crashed.

This is exactly what happens to websites that don’t have rate limiting. Rate limiting is the process of controlling the requests coming to a server.


Let’s consider another example.

There is a website that runs on 3 servers, each capable of handling thousands of requests. If a burst of requests hits a specific server, it meets the same fate as the restaurant. When that server goes down, its traffic is automatically diverted to the other two servers, which then see roughly a 50% jump in resource consumption. If either of those servers cannot handle the extra load and shuts down, all the traffic is diverted to the only remaining server, which eventually shuts down as well. This is called a cascading failure, where one failure leads to another, and it is why rate limiting is important.

Rate limiting is a defensive measure for APIs and services. It provides a mechanism to limit the number of requests to an API or service in a given time period. This protects the API or service from malicious attacks, overuse, occasional request spikes, DoS attacks, and other abusive behavior targeting the application layer. Rate limiting can be applied per user, per IP address, per region, or globally.

Without rate limiting, each user may send as many requests as they like, which can lead to “spikes” of requests that starve other consumers or even crash the server. Once rate limiting is enabled, users are limited to a fixed number of requests per unit of time.

How to mitigate this?

The issue with the conventional approach of implementing rate limiting in NGINX or the web server itself is that the limits are local to each server. The servers don’t interact with each other, so they are unaware of which requests have been blocked by the others. A request that one server would have blocked could be fulfilled by another.

Consider an example where three servers are running with an allowed rate of 3 requests per second and an attacker fires 9 requests in a second. If the requests get distributed evenly across the 3 servers, all 9 will be fulfilled. If all 9 requests go to a single server, 3 will be served and 6 will be declined.

To overcome this drawback, the servers need to share state. One option is to use locks and a gossip protocol to propagate every change to the other servers, so that by the time the next request arrives every server has up-to-date data. That way, the moment 3 requests have been made, irrespective of which servers handled them, the remaining 6 are denied. The trade-off is increased inter-server communication and therefore higher latency.
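To make the difference concrete, here is a toy simulation in plain Python (the limit and server names are made up for illustration) contrasting per-server counters with a single shared counter for the 9-request scenario above:

from collections import defaultdict

LIMIT = 3  # allowed requests per second, as in the example above

# Per-server counters: each server only sees its own share of the traffic.
local_counters = defaultdict(int)

def allow_local(server: str) -> bool:
    if local_counters[server] < LIMIT:
        local_counters[server] += 1
        return True
    return False

# Shared counter: every server checks and updates the same counter.
shared_counter = 0

def allow_shared() -> bool:
    global shared_counter
    if shared_counter < LIMIT:
        shared_counter += 1
        return True
    return False

# 9 requests spread evenly across 3 servers within the same second.
servers = ["A", "B", "C"] * 3
print(sum(allow_local(s) for s in servers))   # 9 -- every request slips through
print(sum(allow_shared() for _ in servers))   # 3 -- the limit is enforced globally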

Distributed Rate Limiting

Distributed rate limiting is an approach where all the rate-limit counters are stored in a distributed store that is accessible to all the servers or services. Since the counters live in that shared store, the need for inter-server communication or sticky sessions goes away.

The three key components of a rate limiter

  1. A configuration store, which keeps all the rate limiter configurations (an example entry is sketched below)
  2. A request store, which keeps API request data against keys, must handle heavy reads and writes, and typically uses an in-memory datastore like Memcached or Redis
  3. A decision engine, an algorithm that combines data from the configuration store and the request store to make the allow/deny decision
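For illustration, a configuration-store entry might look like the snippet below; the routes and numbers are hypothetical, not Internshala’s actual configuration:

# Hypothetical rate limiter configuration, keyed by API route.
# number_of_requests requests are allowed per time_window seconds for each key.
RATE_LIMIT_CONFIG = {
    "POST /api/applications": {"number_of_requests": 100, "time_window": 3600},
    "GET /api/internships":   {"number_of_requests": 300, "time_window": 60},
}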

At Internshala, the request store is managed by AWS ElastiCache.

Amazon ElastiCache is a fully managed in-memory data store and cache service by Amazon Web Services. The service improves the performance of web applications by retrieving information from managed in-memory caches, instead of relying entirely on slower disk-based databases.

Advantages of implementing distributed rate limiting using ElastiCache (request store) and the codebase

  1. Independence from load-balancer routing
    ElastiCache provides a distributed memory store, and using it makes the rate-limiting algorithm server-independent (see the sketch after this list). For instance, if the current rate of a user at time ‘t’ is ‘n’ requests on server ‘A’, and the next request in the same time window lands on server ‘B’, the count is not re-initialized to 1; the shared counter is simply incremented to n+1.
  2. Advantages over Nginx DDoS protection
    Nginx provides mechanisms such as limiting the request rate, the number of connections, and so on, but these preventive measures apply to a single client on a single instance only. The logic for inter-server communication would still have to be implemented separately.
  3. Advantages of implementation in the codebase over Nginx
    With the codebase one can customize how clients are tracked, using parameters like user id, location, session id, IP address, or any combination of these.
    Consider a scenario where there is a burst of requests coming from a single college. Since every user on the college network shares the same IP address, Nginx may block all the requests exceeding the limit, legitimate or not. With the programmatic approach, each user can be distinguished by a combination of IP address, session id, or cookies.
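As a rough illustration of point 1, here is a minimal sketch using the redis-py client against a Redis-compatible store such as ElastiCache. The endpoint, key format, and limits are assumptions, and this simple counter is only meant to show the shared-state idea; the actual decision engine, described in the next section, uses a sliding window.

import redis

# Any Redis-compatible endpoint works; an ElastiCache endpoint is assumed here.
store = redis.Redis(host="my-cache.example.com", port=6379)

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 3600) -> bool:
    """Shared counter: whichever server handles the request, the same key is incremented."""
    key = f"rate:{user_id}"                  # hypothetical key format
    count = store.incr(key)                  # atomic increment, visible to all servers
    if count == 1:
        store.expire(key, window_seconds)    # start the window on the first request
    return count <= limit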

Decision engine: Sliding Window Algorithm

There are many algorithms to implement this: token bucket, fixed window, stop and wait, and sliding window, among others. We implemented the sliding window algorithm.

If the number of requests served on <key> in the last <time_window> is more than the <number_of_requests> configured for it, the request is discarded; otherwise it goes through and the counter is updated.

The sliding window is based on the fixed window algorithm. Instead of resetting the counter after every window, the information from the previous window’s counter is used to estimate the request rate for the current window. In other words, instead of a fixed window, there is a rolling window of time, which smooths out bursts.

The windows are typically defined by the floor of the current timestamp, so 12:03:15 with a 60-second window length would be in the 12:03:00 window.

Let’s say we want to limit an API to 100 requests per hour, and assume there were 84 requests in the previous window [12:00–1:00) and there have been 36 requests so far in the current window [1:00–2:00), which started 15 minutes ago.

Now imagine a new request arrives at 1:15. Whether the request is accepted or denied is decided using an approximation, calculated like this:

limit = 100 requests/hour
rate  = 84 * ((60 - 15) / 60) + 36
      = 84 * 0.75 + 36
      = 99
rate < limit, hence the request is accepted

Since the estimated number of requests in the sliding window [12:15–1:15) is 99, which is less than the limit of 100 requests/hour, the request is accepted. But a new request arriving during the next second will not be accepted.
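The same approximation can be written as a small helper. The function below is only a sketch of the estimate itself, assuming the counts for the previous and current fixed windows are already available:

def estimated_rate(prev_count: int, curr_count: int,
                   elapsed_in_window: float, window_seconds: float) -> float:
    # Weight the previous window by the fraction of it still covered by the sliding window.
    overlap = (window_seconds - elapsed_in_window) / window_seconds
    return prev_count * overlap + curr_count

# The example above: 84 requests in [12:00–1:00), 36 in [1:00–2:00), new request at 1:15.
print(estimated_rate(84, 36, elapsed_in_window=15 * 60, window_seconds=60 * 60))  # 99.0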

Pros of the sliding window:

  1. It smooths out traffic spikes compared to the fixed window method
  2. It gives an approximate value, but one that is very close to the accurate value (an analysis of 400 million requests from 270,000 distinct sources showed that only 0.003% of requests were wrongly allowed)
  3. It uses very little memory: only 2 numbers per counter

Flow diagram & Pseudocode
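As one possible rendering of the pseudocode, here is a minimal sliding-window sketch over a Redis-compatible request store using redis-py. The endpoint and key naming are assumptions, and error handling is omitted:

import time
import redis

store = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical endpoint

def is_allowed(key: str, limit: int, window: int) -> bool:
    # Sliding-window check: weight the previous fixed window's count by its overlap
    # with the rolling window, exactly as in the worked example above.
    now = int(time.time())
    curr_window = now - (now % window)           # floor of the timestamp
    prev_window = curr_window - window

    curr_count = int(store.get(f"{key}:{curr_window}") or 0)
    prev_count = int(store.get(f"{key}:{prev_window}") or 0)

    elapsed = now - curr_window
    estimated = prev_count * ((window - elapsed) / window) + curr_count

    if estimated >= limit:
        return False                             # discard the request

    # Request goes through: update the counter for the current window.
    pipe = store.pipeline()
    pipe.incr(f"{key}:{curr_window}")
    pipe.expire(f"{key}:{curr_window}", 2 * window)  # keep it around for the next window
    pipe.execute()
    return True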

Conclusion

There isn’t a perfect algorithm for rate limiting in distributed systems. The algorithm will evolve gradually, and one just needs to stay a step ahead of the attacker.

Happy coding :)
