System Design — Rate Limiter

Tahir Rauf
7 min read · Jan 25, 2024

--

An API rate limit refers to the maximum number of API calls that a user (identified by a userId, API token, IP address, or device ID) is permitted to make within a specified time interval. If this limit is exceeded, subsequent requests from that user are temporarily blocked or ‘throttled’ to prevent overuse of the API.

Why?

Rate limiting ensures fair resource allocation and maintains the API’s performance and reliability. It restricts the number of requests a user can make, preventing server overload and preserving response times for all users. Additionally, rate limiting helps protect against malicious attacks such as DDoS and prevents even legitimate, subscribed users from overloading the system.

Implementation strategies

The basic idea is to maintain a counter for the requests made over a specified time interval. If the count is below the set limit, the request is allowed. Otherwise, the request will be disallowed (either dropped or enqueued), requiring the user to wait. Typically, the counter is stored in an in-memory cache like Redis, rather than a database, due to its faster performance and support for time-based expiration strategies.

High-level implementation strategy for the rate limiter.
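To make the counter idea concrete before the Redis-based designs below, here is a minimal in-process sketch (my own illustration; the class name SimpleCounterLimiter is hypothetical): one counter per user, reset after each interval.

import time

class SimpleCounterLimiter:
    # Minimal sketch: one counter per user, reset after each interval.
    def __init__(self, limit=5, interval=60):
        self.limit = limit          # max requests per interval
        self.interval = interval    # interval length in seconds
        self.counters = {}          # user_id -> (window_start, count)

    def allow_request(self, user_id):
        now = time.time()
        window_start, count = self.counters.get(user_id, (now, 0))
        if now - window_start >= self.interval:
            window_start, count = now, 0   # interval elapsed: reset the counter
        if count < self.limit:
            self.counters[user_id] = (window_start, count + 1)
            return True
        return False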

Below, we’ll discuss a few algorithms for implementing a rate limiter.

Token Bucket

The system maintains a bucket for each user in a centralized store. Each bucket has a maximum capacity c. New tokens are added to the bucket at a preset rate r; once the bucket is full, no new tokens are added. Every request consumes t tokens. If the bucket does not have enough tokens, the request is rejected.

A bucket has a capacity of c=5 tokens and a refill rate of r=0.5 tokens per second. Each box represents one request, consuming t=1 token. A red box indicates that the bucket is out of tokens.

Redis-based implementation

A “bucket” is stored in the cache for each client. A bucket is an object that consists of two attributes:

  • the number of available “tokens” (or remaining calls that can be made)
  • the timestamp of the last refill.

Each user has a record (or bucket); the userId can be an API key, an IP address, or a location.
import redis
import time

class TokenBucket:
    # Defaults: 1 token every 2 seconds (rate=0.5), at most 5 tokens in the bucket.
    def __init__(self, redis_host='localhost', redis_port=6379, rate=0.5, capacity=5):
        self.redis = redis.StrictRedis(host=redis_host, port=redis_port, db=0)
        self.rate = rate
        self.capacity = capacity

    def _refill_tokens(self, key):
        # Get the available tokens and the last refill timestamp.
        tokens, last_refill_timestamp = self.redis.hmget(key, ['tokens', 'last_refill_timestamp'])
        # If this key (user) does not exist yet, start with a full bucket.
        tokens = float(tokens) if tokens is not None else self.capacity
        last_refill_timestamp = float(last_refill_timestamp) if last_refill_timestamp is not None else time.time()

        now = time.time()
        elapsed = now - last_refill_timestamp
        # No matter how much time has passed since the last refill, the bucket
        # can never exceed its capacity.
        tokens = min(self.capacity, tokens + self.rate * elapsed)
        self.redis.hset(key, mapping={'tokens': tokens, 'last_refill_timestamp': now})
        return tokens

    def allow_request(self, user_id):
        key = f"token_bucket:{user_id}"
        tokens = self._refill_tokens(key)

        if tokens >= 1:
            self.redis.hincrbyfloat(key, 'tokens', -1)
            return True   # You can call a 'forward_callback' function before returning.
        return False      # You can call a 'drop_callback' function here.

# Example usage for 3 different users:
bucket = TokenBucket(rate=0.5, capacity=5)  # Same rate and capacity for each user

user_ids = ['user1', 'user2', 'user3']
for user_id in user_ids:
    allowed = bucket.allow_request(user_id)
    print(f"Request for {user_id} allowed: {allowed}")

Pro:

  • Memory efficient: only a token count and a timestamp are stored per user.

Cons:

  • In a distributed system, requests from the same user may be served by multiple servers, yet the rate limit must apply to the cluster as a whole. The algorithm faces race conditions when concurrent requests from the same user update the same bucket.
  • The system may still have to process bursts of requests (up to the bucket capacity at once).

Leaky Bucket

The Leaky Bucket algorithm is based on a FIFO queue. Incoming requests are admitted into the queue at any rate; if no slot is available for a new request, it is discarded. The algorithm processes requests at a constant rate, thereby enforcing a consistent output rate.

This algorithm prevents sudden spikes in traffic and enforces a smooth flow of requests, ensuring stability in the system.

Leaky Bucket: random input rate, constant output rate. Smooths out bursts of requests to the server.
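The article relies on the figure above to describe the behavior. As a minimal single-process sketch (my own illustration; the class name LeakyBucket and the leak_rate parameter are assumptions, not from the original post), the algorithm can be modeled as a bounded FIFO queue drained at a constant rate:

import time
from collections import deque

class LeakyBucket:
    # Minimal sketch: a bounded FIFO queue drained at a constant rate.
    def __init__(self, capacity=5, leak_rate=1.0):
        self.capacity = capacity      # maximum queue length
        self.leak_rate = leak_rate    # requests processed per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        # Dequeue as many requests as the constant outflow rate allows
        # for the time elapsed since the last drain.
        now = time.time()
        to_process = int((now - self.last_leak) * self.leak_rate)
        if to_process > 0:
            for _ in range(min(to_process, len(self.queue))):
                self.queue.popleft()   # in a real system, forward the request here
            self.last_leak = now

    def allow_request(self, request_id):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request_id)  # admitted; processed at the leak rate
            return True
        return False                       # queue full: the request is discarded

In production the queue would be drained by a background worker (and stored centrally, e.g. in a Redis list) rather than on each call, but the admission logic is the same.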

Pros:

  • Memory efficient given the limited queue size.
  • Requests are processed at a fixed rate, so it is suitable for use cases where a stable outflow rate is needed.

Cons:

  • If some requests take longer than expected or get stuck, they occupy the queue and newer requests are starved.

Fixed Window Counter

The rate limit is enforced within a fixed time interval. A certain number of tokens is assigned to the user (identified by userId, API token, or IP address) within a fixed time window. The user can consume these tokens throughout the duration of that window. Once all tokens have been consumed, no new requests are served during that window; the user must wait for the next window to have requests processed.

WindowSize=1 minute. NumberOfRequestsAllowedPerMinute=5. Once the tokenCount reaches its limit, subsequent requests will be dropped. At the start of a new time window, a new count will begin.

Implementation

A key like user123 doesn’t convey that it is related to rate limiting. The prefix 'rate_limit' clarifies that the key exists for rate-limiting purposes. With a <timewindow> suffix, you effectively create a new key for each time window, so the count resets every window, which is crucial for the rate-limiting functionality.
import redis
import time

class FixedWindowRateLimiter:
    def __init__(self, redis_host='localhost', redis_port=6379, window_size=60, limit=100):
        """
        :param window_size: Duration of the window in seconds.
        :param limit: Maximum number of requests allowed per window.
        """
        self.redis = redis.StrictRedis(host=redis_host, port=redis_port, db=0)
        self.window_size = window_size
        self.limit = limit

    def is_allowed(self, user_id):
        """
        Determines if a request is allowed under the rate limit.
        :param user_id: The identifier of the user making the request.
        :return: True if the request is allowed, False otherwise.
        """
        # int(time.time() / self.window_size) is the number of windows elapsed
        # since the Unix epoch, so all requests within the same window share the
        # same key. When time rolls over into the next window, a new key is
        # generated and the count starts fresh for that window.
        key = f"rate_limit:{user_id}:{int(time.time() / self.window_size)}"
        current_count = self.redis.get(key)

        # Note: GET followed by INCR is not atomic; see the Challenges section.
        if current_count is None or int(current_count) < self.limit:
            self.redis.incr(key)
            # Expire the key after one window so stale counters are garbage-collected.
            self.redis.expire(key, self.window_size)
            return True
        else:
            return False

# Example usage
rate_limiter = FixedWindowRateLimiter(window_size=60, limit=10)  # 10 requests per minute

user_id = "user123"
for i in range(15):
    if rate_limiter.is_allowed(user_id):
        print(f"Request {i+1} for {user_id}: Allowed")
    else:
        print(f"Request {i+1} for {user_id}: Denied")

Con: A spike in traffic at the edges of a window could allow more requests than the permitted quota to go through.

WindowSize=1 minute. NumberOfRequestsAllowedPerMinute=5. Between 7:05:30 and 7:06:30, over the span of one minute, 10 requests went through (the intended quota is 5 requests per minute). The service was overused at the window edges.

Sliding Window Log Algorithm

To avoid the ‘exceeding quota’ issue associated with the fixed window counter, the sliding window log algorithm keeps a rolling count of requests over a window starting at now - window_size. Essentially, the window is recalculated at the time of each request: the algorithm checks the number of requests served within the window leading up to the exact time of the latest request, and if this number has reached the predefined threshold, the new request is discarded.

Each request timestamp is stored in a sorted set.
At the time of each request, the number of timestamps within the sliding window is counted; the request is allowed if the count is below the limit. For example, with a window size of 60 seconds and a limit of 5 requests, a request at 7:05:35 will be disallowed. At 7:06:20, the window spans from 7:05:20 to 7:06:20; older timestamps outside this window are deleted, the count drops to just 1, and the request at 7:06:20 is allowed.
import redis
import time

class SlidingWindowRateLimiter:
    def __init__(self, redis_host='localhost', redis_port=6379, window_size=60, limit=100):
        """
        :param window_size: Duration of the window in seconds.
        :param limit: Maximum number of requests allowed per window.
        """
        self.redis = redis.StrictRedis(host=redis_host, port=redis_port, db=0)
        self.window_size = window_size
        self.limit = limit

    def is_allowed(self, user_id):
        """
        Determines if a request is allowed under the rate limit.
        :param user_id: The identifier of the user making the request.
        :return: True if the request is allowed, False otherwise.
        """
        key = f"rate_limit:{user_id}"  # Key points to a sorted set of request timestamps.
        current_time = time.time()
        window_start = current_time - self.window_size

        # Remove timestamps outside the current window, i.e. all members of the
        # sorted set with a score between 0 and window_start.
        self.redis.zremrangebyscore(key, 0, window_start)

        # Count the requests remaining in the current window.
        current_count = self.redis.zcard(key)

        if current_count < self.limit:
            # Add the current request's timestamp to the sorted set (member and
            # score are both the timestamp). The first ZADD creates the set.
            # Note: at very high request rates two requests could share a
            # timestamp; appending a unique suffix (e.g. a UUID) would avoid that.
            self.redis.zadd(key, {str(current_time): current_time})
            # Expire the whole log after one window of inactivity.
            self.redis.expire(key, self.window_size)
            return True
        else:
            return False

# Example usage
rate_limiter = SlidingWindowRateLimiter(window_size=60, limit=10)  # 10 requests per minute

user_id = "user123"
for i in range(15):
    if rate_limiter.is_allowed(user_id):
        print(f"Request {i+1} for {user_id}: Allowed")
    else:
        print(f"Request {i+1} for {user_id}: Denied")

Pro:

  • Fixes the boundary problem of the fixed window counter: the quota is enforced over any rolling window, not just over aligned windows.

Green boxes are allowed requests; red boxes are disallowed. The sliding window log will not allow requests after 7:06:29, as the rolling window starts from 7:05:29.

Cons:

  • Needs to store more data: a timestamp is kept for every request in the window.
  • Expensive to compute: expired timestamps must be pruned on every request.

Challenges

Implementing a rate limiter on a single machine is easy. A distributed implementation is challenging because:

  1. In a highly concurrent environment, race conditions can happen.

Locks are the most obvious solution, though they slow down the system. Lua scripts and sorted sets in Redis can also solve the issue, because Redis executes each script as a single atomic operation.
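As an illustrative sketch (my own, not from the article; the script body and the helper name fixed_window_check are assumptions), the fixed window check-and-increment from earlier can be moved into a Lua script so Redis runs it atomically:

import time
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# GET + INCR run as a single atomic unit: Redis executes the whole script
# without interleaving commands from other clients, removing the race.
FIXED_WINDOW_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current < tonumber(ARGV[1]) then
    redis.call('INCR', KEYS[1])
    redis.call('EXPIRE', KEYS[1], ARGV[2])
    return 1
end
return 0
"""
fixed_window_check = r.register_script(FIXED_WINDOW_LUA)

def is_allowed(user_id, window_size=60, limit=10):
    key = f"rate_limit:{user_id}:{int(time.time() / window_size)}"
    return fixed_window_check(keys=[key], args=[limit, window_size]) == 1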

2. You might need multiple rate limiter servers to support millions of users. A client may send its requests to different servers, so the counters must be synchronized across servers.

Client1 sends its first request to rateLimiter1 and its second request to rateLimiter2. Both rate limiters need to be synchronized to track the total number of requests.

Sticky sessions and centralized data stores like Redis (preferable) solve the synchronization problem.

