System Design Task: API Rate Limiter

Alexey Martishin
Verbetcetera Tech
May 23, 2020

Task

Design a rate limiter

What is a rate limiter?
A rate limiter is a tool that monitors and limits the number of requests within a time window.

It blocks requests once the request count in a given time window exceeds a threshold agreed upon by the service owner and the sender.

Requirements for the system

  • Limit the number of requests the sender can send to an API within a time window
  • Be highly available
  • The rate limit should be enforced across different servers, in case of a distributed system
  • Introduce low latency, so that requests do not take significantly longer to process

Algorithms for rate limiting

Leaky Bucket

Creates a FIFO queue of requests, allowing them to be processed at a constant average rate. Requests are added to the end of the queue and processed from the front. If the queue is full, new requests are discarded.

Example:
We have a rate limiter with a limit of two requests per minute
["request_1"]
We want to process request_2 and add it to the queue
["request_1", "request_2"]
If after this we want to add a new request request_3 to the queue, it will be discarded

Pros:

  • easy to implement
  • memory efficient

Cons:

  • no guarantee that a request will be processed within a certain amount of time:
    old requests can fill the queue, blocking recent requests from being processed
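
Here is a minimal Python sketch of the idea (the class and method names are illustrative, not from any specific library):

from collections import deque

class LeakyBucket:
    # A bounded FIFO queue of pending requests (illustrative sketch).
    def __init__(self, capacity):
        self.capacity = capacity  # max requests allowed to wait in the bucket
        self.queue = deque()

    def add(self, request):
        # Discard the request if the bucket is already full.
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        # Called at a fixed rate by a worker: process the oldest request.
        return self.queue.popleft() if self.queue else None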

Fixed Window

A window (bucket) with fixed size is used to track the rate of requests. Each incoming request increments the counter for the window. If the counter exceeds the window’s threshold, new requests will be discarded. When a new window is created, the old one is deleted.

Example:
We have a rate limiter that allows 10 requests per hour

{
"1AM-2AM": 9
}

If we want to process a new request at 1:45AM we add it to the bucket

{
"1AM-2AM": 10
}

If a new request arrives at 1:50AM we wouldn’t process it, because the bucket for this hour is full.

Pros:

  • more recent requests get processed
  • memory efficient

Cons:

  • it doesn’t distribute requests fairly over time: a burst of traffic at the end of a window doesn’t count against the next one, so a consumer can send close to twice the limit around the window boundary
  • consumers may stampede the API if they all send their requests at the same time right after the window resets
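
A minimal sketch of a fixed-window counter in Python (illustrative names, single-process only):

import time

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = None  # index of the active window
        self.count = 0

    def allow(self):
        window = int(time.time()) // self.window_seconds
        if window != self.current_window:
            # A new window starts; the old counter is deleted.
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False  # the bucket for this window is full
        self.count += 1
        return True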

Sliding window log

A per-consumer queue of request timestamps is used to track the rate of requests.
When a new request comes in, two things happen:

  1. all timestamps older than the window time are deleted
  2. the number of elements in the queue is calculated

If the count exceeds the threshold, the new request is discarded.

Example:
We have a rate limiter with a limit of two requests per minute

[
(1590084187, "request_1"),
(1590084213, "request_2")
]

Suppose the current timestamp is 1590084250 and we want to process request_3. Since request_1 is now older than one minute, we remove it and add request_3 to the queue

[
(1590084213, "request_2"),
(1590084250, "request_3")
]

If we want to process request_4 at 1590084255 we wouldn’t be able to do it, because less than a minute has passed since request_2.

Pros:

  • the rate limit is enforced precisely
  • the log is tracked per consumer, so there is no stampede effect like the one after a window reset in the fixed window algorithm

Cons:

  • expensive to store logs for every request
  • expensive to count the consumer’s prior requests on every new request
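
A minimal sketch in Python (illustrative names):

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now=None):
        now = time.time() if now is None else now
        # 1. Drop timestamps older than the window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        # 2. Accept only if the count is still below the limit.
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True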

Sliding window counter

A combination of ideas from Fixed Window and Sliding window log.
The entire window is broken into smaller buckets, and each bucket stores the request count for its time range.
When a new request comes in, we sum the counts of the buckets covering the window (optionally weighting the oldest bucket by how much of it still overlaps the window) and process the request if the threshold hasn’t been reached yet.

Example:
We have a rate limiter allowing 10 requests per hour, with a bucket size of 20 minutes, so the window spans 3 buckets.

{
"2AM-2:20AM": 2,
"2:20AM-2:40AM": 3,
"2:40AM-3:00AM": 4
}

If a request is received at 2:50AM, we sum the requests in the last 3 buckets, including the current one. In this case they add up to 9 (< 10), so the request is accepted and added to the 2:40AM–3:00AM bucket

{
"2AM-2:20AM": 2,
"2:20AM-2:40AM": 3,
"2:40AM-3:00AM": 5
}

For the rest of the hour, we will not be able to process new requests.

Pros:

  • memory efficient
  • more recent requests get processed

Cons:

  • it can’t enforce a hard limit precisely, because some requests fall into time ranges we no longer consider. In the above example, when we process a request at 2:50AM we would need to take requests from 1:50AM to 2AM into account, but we don’t.
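
A minimal sketch in Python, using a plain unweighted sum over the buckets as in the example above (illustrative names):

import time
from collections import defaultdict

class SlidingWindowCounter:
    def __init__(self, limit, window_seconds, bucket_seconds):
        self.limit = limit
        self.bucket_seconds = bucket_seconds
        self.buckets_per_window = window_seconds // bucket_seconds
        self.buckets = defaultdict(int)  # bucket index -> request count

    def allow(self, now=None):
        now = time.time() if now is None else now
        current = int(now) // self.bucket_seconds
        # Evict buckets that are no longer part of the window.
        for key in list(self.buckets):
            if key <= current - self.buckets_per_window:
                del self.buckets[key]
        # Sum the remaining buckets, including the current one.
        if sum(self.buckets.values()) >= self.limit:
            return False
        self.buckets[current] += 1
        return True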

The sliding window counter approach is generally considered the best of these, because it offers the flexibility to scale rate limiting while keeping good performance.

Rate Limiting in Distributed Systems

We want to be able to enforce limits across a cluster of nodes serving a client’s traffic.

One solution is to use consistent hashing on the load balancers, so that each consumer’s requests are sent to exactly one node.
Pros:

  • easy to implement

Cons:

  • Scaling problems when nodes get overloaded.

The better solution is to use a centralised data store to keep the counts for each window and consumer.
Pros:

  • flexible balancing rules

Cons:

  • increased latency, because of making requests to the data store
  • race conditions

Race condition: a situation in which two or more nodes read and increment a counter in the store at the same time, each writing back a stale (lower) value.
We can deal with this problem by taking a “lock” on the counter while accessing it, preventing other nodes from touching it mid-update.
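
As a sketch of the centralised approach, here is what a fixed-window check could look like on top of Redis with the redis-py client (the key scheme is made up for illustration). Redis’s INCR is atomic, so the read-modify-write race above disappears without an explicit lock:

import time
import redis  # redis-py client, assumed to be installed

r = redis.Redis(host="localhost", port=6379)

def allow(consumer_id, limit, window_seconds):
    # Fixed-window check against a shared Redis store (illustrative).
    window = int(time.time()) // window_seconds
    key = f"rate:{consumer_id}:{window}"  # hypothetical key scheme
    pipe = r.pipeline()
    pipe.incr(key)                        # atomic increment: no lost updates
    pipe.expire(key, window_seconds * 2)  # old windows expire on their own
    count, _ = pipe.execute()
    return count <= limit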

Conclusion

So, the right approach is to use the sliding window counter algorithm with a centralised data store.
But one question remains: is there anything we can do to optimize the performance of our data store? With the current solution, every node makes a call to it on every request, which can add latency to the system.

Can you solve this problem by yourself?
