Implementing a sliding log rate limiter with Redis & Go

Published in Circles.Life · Nov 3, 2020

Written by Akshat Agarwal

Introduction

The circles-payments-service is an API-only application that communicates with multiple payment providers, each of which imposes its own rate limit on us. We did not want to exhaust our rate limit with any of the payment providers, while still making the most of the limits we are allowed. We could afford to delay requests to a payment provider for a short amount of time, since bulk payments are processed offline as async jobs.

During Circles.Life billing day, we run a large number of payments in a short span of time. Are we breaching our payment processors’ rate limits? Should we do something about that? I’ve heard about something called a rate limiter.

What is a rate limiter? A rate limiter caps the number of requests a sender can issue in a specific time window. It then blocks requests once the limit is reached.

Usually, the rate limiter component sits at the receiving system, but since we depend so heavily on third-party processors, it also makes sense to accommodate each provider’s rate limit and control our outbound requests, so that we receive as few 429 responses as possible, and optimistically none at all.

We started designing the rate limiter with these requirements in mind:

  1. It should limit the number of requests made to a particular payment provider in a given time period.
  2. Since our systems run on a cluster (as opposed to a single server), the rate limit should apply to the sum total of requests made, not per application process.
  3. Our rate limiting logic should be atomic, and should not fail even if multiple requests land on our systems concurrently.
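To make these requirements concrete, here is a minimal sketch of the interface such a rate limiter could expose. The names (RateLimiter, Allow) are illustrative, not our production API:

package ratelimit

import "context"

// RateLimiter decides whether one more outbound call to a payment
// gateway may be made right now (requirement 1). Implementations must
// share their state across all application processes (requirement 2)
// and make the check-and-record step atomic (requirement 3).
type RateLimiter interface {
	// Allow reports whether a request to the given gateway may be
	// sent now, recording the request if it is allowed.
	Allow(ctx context.Context, pgName string) (bool, error)
}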

System architecture

Before we dive deep into the rate limiter design, here is a bird’s eye view of our async payment processing system:

Any message that lands on the payment consumers’ queues gets processed by the payment consumers. A payment consumer fetches some data from the persistent database in order to make a request to the corresponding Payment Gateway (hereafter referred to as PG).

There are several standard algorithms used in the industry for rate limiting, like the leaky bucket, fixed window and sliding log. Without getting into the details of the decision making, I chose the sliding log because it suits our needs and is simple to implement.

The algorithm

Let’s say that for a PG pg1, we’re allowed to make 100 API calls each minute. Every time we make an API call, we add the timestamp (in nanoseconds) of the call to a list.

timestamps << Time.now

When we’re about to make an API call in our consumer, we need to check whether we’re allowed to make one:

calls_already_made = timestamps.count { |t| t > now - 1.minute }

Then we compute the remaining number of API calls allowed:

allowed = max_calls_per_minute - calls_already_made

Where max_calls_per_minute comes from payment gateway configuration.

if allowed > 0
  # do magic
else
  # Oops! Rate limit breached, please try later!
end
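Putting the pieces together, here is the same algorithm as a self-contained, in-memory Go sketch. Note that it only works within a single process, which is exactly why the rest of this article moves the log into Redis:

package main

import (
	"fmt"
	"sync"
	"time"
)

// slidingLog keeps one timestamp per call and counts how many fall
// inside the current window.
type slidingLog struct {
	mu         sync.Mutex
	timestamps []time.Time
	limit      int
	window     time.Duration
}

// allow trims timestamps that have slid out of the window, then
// records and permits the call only if we are still under the limit.
func (s *slidingLog) allow(now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := now.Add(-s.window)
	kept := s.timestamps[:0]
	for _, t := range s.timestamps {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	s.timestamps = kept
	if len(s.timestamps) >= s.limit {
		return false // rate limit breached, please try later!
	}
	s.timestamps = append(s.timestamps, now) // do magic
	return true
}

func main() {
	limiter := &slidingLog{limit: 100, window: time.Minute}
	fmt.Println(limiter.allow(time.Now())) // true until 100 calls/minute
}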

Storing the sliding log

We need some kind of database to store a list of timestamps grouped per payment gateway. Let’s choose from something we already have. Could we use MySQL or MongoDB? We could, but then we would need to periodically purge outdated timestamps ourselves: MySQL has no native TTL for rows, and MongoDB’s TTL indexes only expire documents through a coarse background job. What about Redis? Redis is a key/value store that supports lists, sets, sorted sets and more. We can use a sorted set to hold our timestamps!
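For illustration, here is roughly what those sorted set operations look like with the go-redis client we use later. This is a fragment, assuming the rdb client and ctx from the snippets below; pg1 and the one-second window are example values:

// add the current call, using its nanosecond timestamp as both
// the member and the score of the sorted set entry
now := time.Now().UnixNano()
rdb.ZAdd(ctx, "pg1", &redis.Z{Score: float64(now), Member: now})

// entries older than the window can be trimmed away by score...
windowStart := now - int64(time.Second)
rdb.ZRemRangeByScore(ctx, "pg1", "0", fmt.Sprint(windowStart))

// ...and counting the calls in the window is just the set cardinality
count, err := rdb.ZCard(ctx, "pg1").Result()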

Since one of our key requirements is that the window is populated atomically, we could use either Redis transactions or a Lua script EVALed on the server. A transaction would not be a wise choice, because we would need the output of one command (ZCARD) to determine whether or not to run a subsequent command (ZADD).

Deeper into Redis

Here is what a transaction would look like:

// start the transaction
MULTI
// remove all values to the left of the start of the window
ZREMRANGEBYSCORE <pgName> 0 <windowStartTimestamp>
// count the values in the window
ZCARD <pgName>
// add the current timestamp
ZADD <pgName> <currentTimestamp> <currentTimestamp>
// expire the whole set after the window size
EXPIRE <pgName> <windowSize>
// execute the transaction
EXEC

Note how we are performing the ZADD regardless of whether the limit has been reached. This means that our sliding log gets flooded with the timestamps of rejected requests, leaving no room to fulfil legitimate ones.

Instead, we want to only add the current timestamp if we are performing the request.

Hence the need for EVALing a Lua script: it allows us to use the output of one command before executing the subsequent command, and even lets us write an if condition.

-- set variables from arguments
local pgname = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- remove entries older than now - window
local clearBefore = now - window
redis.call('ZREMRANGEBYSCORE', pgname, 0, clearBefore)

-- count the requests already sent within the window
local already_sent = redis.call('ZCARD', pgname)

-- only record this request if we are still under the limit
if already_sent < limit then
  redis.call('ZADD', pgname, now, now)
end

-- for cleanup, expire the whole set after <window> secs
redis.call('EXPIRE', pgname, window)

-- return the remaining number of requests; if > 0, the request is allowed
return limit - already_sent

Since all our payment consumers are connected to the same Redis cluster, the usage data is shared across all processes. A return value greater than zero from the Lua script means the current request is allowed to go ahead.

Here is how we can run this in redis-cli:

eval "local pgname = KEYS[1]; local now = tonumber(ARGV[1]); local window = tonumber(ARGV[2]); local limit = tonumber(ARGV[3]); local clearBefore = now - window; redis.call('ZREMRANGEBYSCORE', pgname, 0, clearBefore); local amount = redis.call('ZCARD', pgname); if amount < limit then redis.call('ZADD', pgname, now, now) end; redis.call('EXPIRE', pgname, window); return limit - amount" 1 <pgname> <timestamp_secs> <window_size_secs> <rate>

This is how we embed it into our Go program:

rdb := redis.NewClient(…)

now := time.Now().UnixNano()
windowSize := int64(time.Second) // 1000000000 nanoseconds
rateLimit := 1000                // get from config

luaScript := `
local pgname = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local clearBefore = now - window
redis.call('ZREMRANGEBYSCORE', pgname, 0, clearBefore)
local amount = redis.call('ZCARD', pgname)
if amount < limit then
  redis.call('ZADD', pgname, now, now)
end
redis.call('EXPIRE', pgname, window)
return limit - amount
`

vals, err := rdb.Eval(ctx, luaScript, []string{pgName}, now, windowSize, rateLimit).Result()
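Continuing from the snippet above: the Eval result comes back as an interface{}, with Lua numbers mapped to int64. A minimal sketch of interpreting it (the handling shown is illustrative, not our production logic):

remaining, ok := vals.(int64) // go-redis returns Lua numbers as int64
if ok && remaining > 0 {
	// within the limit: the script has already recorded this call,
	// so go ahead and hit the payment gateway
} else {
	// limit reached (or unexpected reply): delay and requeue the job
}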

Ok, this works. But I don’t want to pollute my neat Go file with an inline Lua script. Surely there is a way to maintain it as a separate .lua file and load it in from there:

filename := "ratelimiter.lua"
content, err := ioutil.ReadFile(filename)
if err != nil {
	// handle error
	log.Errorf(err, "Could not read file %s", filename)
}
luaScript := string(content)
vals, err := rdb.Eval(ctx, luaScript, []string{pgName}, now, windowSize, rateLimit).Result()
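As an aside, go-redis also ships a Script helper that invokes the script by its SHA1 via EVALSHA, falling back to a plain EVAL the first time Redis sees it, which saves re-sending the script body on every call. A sketch:

// NewScript computes the script's SHA1 up front; Run tries EVALSHA
// and transparently falls back to EVAL on a NOSCRIPT error.
rateLimiter := redis.NewScript(luaScript)
vals, err := rateLimiter.Run(ctx, rdb, []string{pgName}, now, windowSize, rateLimit).Result()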

…and we’re done!

Numbers

After we deployed the rate limiter to production, we have not faced any scale issues. It currently sees peaks of ~450 RPM, but that is bounded by the rate of incoming payment requests rather than by the limiter itself.

So how do we know what rate our limiter is actually capable of? We tried to do some benchmarking.
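One way to produce such numbers is to drive the limiter through Go’s built-in benchmark harness; the sketch below assumes a hypothetical checkRateLimit helper that wraps the rdb.Eval call shown earlier:

func BenchmarkRateLimiter(b *testing.B) {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// checkRateLimit (hypothetical) wraps the Lua script EVAL
		if _, err := checkRateLimit(ctx, rdb, "pg1"); err != nil {
			b.Fatal(err)
		}
	}
}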

The average rate limit check took 55.37 microseconds, measured over a total of 2,236,110 runs. At 55.37 microseconds per operation, that translates to over 18,000 requests per second from my local system, using a local Redis instance.

I tried the same benchmark from my local system against our staging Redis setup, and the results were alarming: each operation took over 10 milliseconds.

This means that most of the time is spent on the network round trip to Redis. We then benchmarked simple Redis commands, SET and GET, which came to 34.01 microseconds in my local setup, and again over 10 milliseconds in the local-to-remote setup.

Therefore it is safe to say that, compared to a simple Redis SET operation, the rate limiter takes about 21 microseconds longer. This is not going to be a bottleneck for our payment system in the near future; we just need to make sure that the connection between our servers and Redis is fast enough.

Hope this article was useful, please feel free to reach out for questions or feedback!

About the Author

Akshat Agarwal is a polyglot software engineer with a passion for scaling.

Appendix

Alternate implementations

Pure Go, memory based

Redis & Lua, GCRA based

Resources

https://rafaeleyng.github.io/redis-pipelining-transactions-and-lua-scripts

https://konghq.com/blog/how-to-design-a-scalable-rate-limiting-algorithm/
https://redis.io/commands/zadd
https://redis.io/commands/zcard
https://redis.io/commands/zcount
https://engagor.github.io/blog/2017/05/02/sliding-window-rate-limiter-redis/
https://engagor.github.io/blog/2018/09/11/error-internal-rate-limit-reached/
https://medium.com/@saisandeepmopuri/system-design-rate-limiter-and-data-modelling-9304b0d18250
https://www.figma.com/blog/an-alternative-approach-to-rate-limiting/
https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d
https://stripe.com/blog/rate-limiters
https://github.com/go-redis/redis

https://github.com/abhirockzz/redis-geo.lua-golang/blob/master/redis-geo-lua-example.go
