Redlock — the silver bullet

Mikk Mangus
Pipedrive R&D Blog
Oct 12, 2021

There are countless technologies, languages and frameworks out there, each designed with a specific purpose in mind. So how do you go about finding the right one?

While you might be tempted to believe the answer is “it depends on your needs,” in practice it rarely is. Most probably, the answer is to reuse one of the technologies from your existing stack.

The following is a write-up of how the Redis lock (Redlock) can be used excessively yet successfully to achieve more than seems feasible at first.

What is a Redis lock

Redis is an in-memory data store that supports different data structures and functions. It can be used in various ways: as a key-value database, cache storage, message broker, etc.

A combination of its powerful set of commands (such as SETNX), along with being single-threaded and performance-oriented, makes Redis an ideal tool for implementing a distributed lock. The ability to set a time-to-live (TTL) value for a key is built into Redis and can be used to add extra functionality or fail-safety to the locking mechanism.

Note that in real-life scenarios, using SETNX alone might not be enough for some edge cases. Consider using a library for a bulletproof Redlock.
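For illustration only, a minimal single-instance lock might look roughly like the following (a sketch using the ioredis Node.js client; key names, TTLs and error handling are simplified and not taken from our codebase). The later sketches in this post reuse these two helpers.

```typescript
import Redis from "ioredis";
import { randomUUID } from "crypto";

const redis = new Redis();

// Try to acquire a lock with SET key value NX PX <ttl>:
// succeeds only if the key does not exist yet, and expires automatically after ttlMs.
async function acquireLock(key: string, ttlMs: number): Promise<string | null> {
  const token = randomUUID(); // unique token so we only ever release our own lock
  const result = await redis.set(key, token, "PX", ttlMs, "NX");
  return result === "OK" ? token : null;
}

// Release the lock only if it still holds our token (atomic check-and-delete via Lua).
async function releaseLock(key: string, token: string): Promise<void> {
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    end
    return 0
  `;
  await redis.eval(script, 1, key, token);
}
```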

Why use explicit locking

Most databases, file systems and programming languages come with a built-in locking mechanism for shared resources. All you need to do is follow the best practices for the technology in question.

For instance, when updating fields in a relational database, there’s usually no need to try to figure out how to lock a row for other parallel updates or worry that an update would result in a deadlock. Such issues can, in most cases, be resolved by the database engine.

Yet, cloud computing and microservices are a whole other story, as they usually involve multiple instances of a piece of code running in parallel, where data is split between databases and work needs to be synchronized. In such cases, the struggle for resources is real, and explicit distributed locking of resources with a tool like Redis becomes a lifesaver.

How Redis performs

Implementing distributed locking is easiest when there’s a single instance of Redis providing the synchronized locking mechanism. This brings us to how well Redis handles the load if we simplify our distributed locking system and have only one Redis instance to communicate with.

On today's hardware, a single Redis instance can serve around 100k to 1M requests per second.

At Pipedrive, we serve up to 50k companies from a single datacenter. As such, a single instance of Redis as an in-memory key-value store per service or per use case is usually more than enough to cover all our locking needs.

Redlock use cases

The following is a partial list of Redlock use cases that have accumulated in the Pipedrive application backend over time.

Processing every message once

One of the more common locking use cases is when multiple instances of a microservice consume the same message stream without the stream being sharded between different pods. In such a case, without locking in place, every instance would consume and process the same messages, so the work would be done multiple times.

To tackle this, we can shard the stream into multiple sub-streams, each consumed by a separate instance of the service. However, although this solution splits the task between multiple processes, it does not use every process to its maximum capacity. For example, one consumer might end up handling most of the load while others remain idle.

Stream split into substreams with each having a dedicated consumer process

In some scenarios, a better approach is letting every pod read the messages from the stream. Before proceeding to the job at hand, the pod tries to acquire a message-specific Redlock. If it succeeds, it processes the message. If not, it skips to the next one, assuming another instance has already taken the job for this message.

Stream consumed by all the pods synchronizing the work using Redlock
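In code, this second approach might look roughly like the following (a sketch reusing the acquireLock helper from the earlier snippet; handleMessage and the message shape are illustrative placeholders, not our actual consumer):

```typescript
// Placeholder for the real message handler.
declare function handleMessage(payload: unknown): Promise<void>;

// Every pod reads every message, but only the pod that wins the
// message-specific lock processes it; the others skip ahead.
async function onMessage(messageId: string, payload: unknown): Promise<void> {
  const token = await acquireLock(`message-${messageId}`, 60_000);
  if (!token) {
    return; // another instance already holds the lock for this message
  }
  await handleMessage(payload);
  // The lock is left to expire via its TTL, so a redelivery of the same
  // message within that window is still deduplicated (an assumption here).
}
```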

Recurrence recalculator

For Pipedrive’s calendar, we calculate the recurrences of events based on the synced-in RRULE string. For performance reasons, the recurrences are pre-calculated and materialized. The downside is that we cannot calculate and materialize the upcoming occurrences of an event until the end of time; calculating just two years ahead is the norm in most cases.

Materializing the event recurrences

However, even when we have materialized occurrences for two years ahead, time keeps passing. If the user doesn’t change the event (which would trigger a recalculation by RRULE), the materialized occurrences will never be recalculated after the initial syncing.

To address this, we use another service/mechanism that calculates and materializes future occurrences at least once per month.

Recalculating the event recurrences

We could keep only one instance of that service running, but to better balance the load and make it less prone to failures, we run multiple instances in each datacenter. And, as multiple instances only make sense if they can divide the load between them, the code acquires a Redlock before each recalculation, so only one instance recalculates a given user. Great success! No duplicate work!
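A rough sketch of that per-user guard (function and key names are made up for illustration; this is not the actual Pipedrive code):

```typescript
// Placeholder for the real materialization logic (e.g. two years ahead).
declare function materializeUpcomingOccurrences(userId: number): Promise<void>;

async function recalculateForUser(userId: number): Promise<void> {
  const key = `recalculate-recurrences-${userId}`;
  const token = await acquireLock(key, 10 * 60_000); // TTL also covers a stuck worker
  if (!token) {
    return; // another instance is already recalculating this user
  }
  try {
    await materializeUpcomingOccurrences(userId);
  } finally {
    await releaseLock(key, token);
  }
}
```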

Lock the field

In the same area of activities and calendar sync, RRULE has a mechanism for excluding some occurrences of an event, called EXDATE. It is filled when you “delete only this instance” of an event in your calendar.

In the database, the EXDATE was initially designed as a single text field keeping all the EXDATE values in a comma-separated format.

It all worked fine until the day we found out that multiple EXDATE updates for a single event could arrive within a heartbeat of each other, making parallel updates override one another. To mitigate this issue we used… that’s right! Redlock!

Before every update, a lock is set with the key update-activity-${activityId}, limiting changes to the text field in the database to one concurrent update.
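As a sketch, the guarded update might look something like this (updateExdateField and the fail-fast behaviour when the lock is held are assumptions for illustration):

```typescript
// Placeholder for the real read-modify-write of the comma-separated EXDATE field.
declare function updateExdateField(activityId: number, exdate: string): Promise<void>;

async function updateExdate(activityId: number, exdate: string): Promise<void> {
  const key = `update-activity-${activityId}`;
  const token = await acquireLock(key, 5_000);
  if (!token) {
    // Another process is updating this activity right now; retry later.
    throw new Error(`Activity ${activityId} is locked for update`);
  }
  try {
    await updateExdateField(activityId, exdate);
  } finally {
    await releaseLock(key, token);
  }
}
```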

Although this solution might not be the cleanest one around, it required the least implementation effort. Thanks to Redlock, we could keep using the existing database structure. Even more importantly, it has been working in production without issues for a few years now.

Sending notifications

We also used Redlock to fix another service that was struggling under high load. The service is designed to detect which notifications need to be sent out to users and trigger their sending at the right time, with the respective metadata persisted in a MySQL database.

The service was built with the intention of fetching a batch of notifications every minute, triggering their sending, then repeating. However, this setup ran into a few problems:

  • When sending a batch took more than one minute, the service would start on another batch, burdening the downstream service that sends out the notifications with too much work
  • A batch could only be sent out once per minute
  • Only one instance of the service could run, with no viable way to scale it up

Increasing the batch size would have meant sending takes more than a minute, while decreasing it would have resulted in some notifications not being sent out on time. Similar issues arose while attempting to clean old, already sent notifications from the database, resulting in a clogged service.

Rather than refactoring every piece of the chain, the way the service operates was modified to enable scaling it up.

Instead of triggering the sending every minute, the service now requests a batch and tries to set a separate Redis lock for each item. Sending is only triggered for the items whose lock was acquired.

The actual code acquires a lock per user, allowing other instances to do the job for other users.
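A simplified sketch of that pattern (fetchPendingNotifications and sendNotification are illustrative placeholders, not the production code):

```typescript
interface PendingNotification {
  userId: number;
  // ... other metadata persisted in MySQL ...
}

declare function fetchPendingNotifications(): Promise<PendingNotification[]>;
declare function sendNotification(notification: PendingNotification): Promise<void>;

// Each instance fetches a batch, but only sends the items it managed to lock;
// the rest are handled by whichever instance locked them.
async function processBatch(): Promise<void> {
  const batch = await fetchPendingNotifications();
  for (const notification of batch) {
    const key = `send-notification-${notification.userId}`;
    const token = await acquireLock(key, 60_000);
    if (!token) {
      continue; // another instance is handling this user
    }
    try {
      await sendNotification(notification);
    } finally {
      await releaseLock(key, token);
    }
  }
}
```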

Using the Redis lock allowed us to upscale the service to multiple instances, resulting in more granular processing and zero hiccups.

The deletion is now processed in much smaller batches and is only triggered when the service (or, to be more precise, the MySQL database) is not overloaded by pending notification updates. As a standalone job, the deletion sets a Redlock to avoid multiple instances deleting the same items.

The result is an efficient cleanup of all notifications and a robust, issue-free service.

Statistics

Sometimes, a service needs to collect statistics from its database. When the service is scaled up to multiple instances, every instance would end up collecting the same statistics.

There are a few potential solutions to this issue: for example, assigning a specific role to one of the instances so that only it collects the statistics, or inverting the control with a pull-based tool like Prometheus.

However, when inverting the control doesn’t fit and the service is scaled up to multiple instances, Redlock is one of the best solutions available. Every instance of the service tries to push the statistics within a timeframe, but only the one that successfully acquires the lock actually does so.
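A sketch of this “once per timeframe” pattern, where the lock’s TTL matches the reporting interval and the lock is deliberately never released (the interval, names and push target are assumptions):

```typescript
declare function collectStatistics(): Promise<Record<string, number>>; // placeholder
declare function pushStatistics(stats: Record<string, number>): Promise<void>; // placeholder

const STATS_INTERVAL_MS = 60 * 60 * 1000; // assumed: push at most once per hour

async function maybePushStatistics(): Promise<void> {
  // Every instance calls this periodically, but only the one that acquires
  // the lock pushes the statistics for the current interval.
  const token = await acquireLock("push-statistics", STATS_INTERVAL_MS);
  if (!token) {
    return; // some instance already pushed within this interval
  }
  const stats = await collectStatistics();
  await pushStatistics(stats);
  // No release: the key simply expires when the next interval begins.
}
```

This is the same TTL trick described in the tips section below.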

Redlock tips

Persistence

Usually, Redis is used without being persisted to disk. In such cases, make sure the data kept in Redis is only temporary, so that losing it on a restart does not cause an outage.

Remember to warm up the service. If the Redis data is lost (e.g., upon restarting Redis), a specific lock could possibly be set twice, resulting in the job being executed multiple times.

Time-to-live (TTL) usage

If a specific job is scheduled once in a timeframe (for example, sending statistics), it can be achieved by simply utilizing the TTL of the Redlock.

We advise setting a TTL for every lock to make sure the lock is eventually released and the service heals itself when a job executor dies or gets stuck for whatever reason.

Multi-DC routing

At Pipedrive, if a microservice serves a request for a company that cannot be found in the region it is running in, it can respond to our multi-DC router with a specific HTTP response code, indicating that the company is not in this region.

Once, we had an interesting issue related to locking a resource/job. The lock for a specific job was acquired multiple times in multiple regions because the locking occurred before routing the request to another region.

If the job is expected to be executed in only one region, handle the routing before acquiring the lock for the job.

Container health

From time to time, something might interfere with your service’s Redis connection. For example, we’ve had a few cases where the connection wouldn’t drop, yet Redis neither responded nor threw errors when the service attempted to set a lock.

To mitigate this, if your service uses Redlock, we advise acquiring a random lock and releasing it with every health check request of the container. If setting the lock throws an error or hangs, the container will be marked unhealthy, and the new/restarted instance can establish a fresh Redis connection.
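One way such a health check could look (a sketch reusing the earlier helpers; the key name and one-second TTL are arbitrary):

```typescript
import { randomUUID } from "crypto";

// Called from the container's health-check endpoint: set and release a
// throwaway lock to prove the Redis connection actually works. A hang is
// caught by the health check's own timeout rather than by this function.
async function redisHealthCheck(): Promise<boolean> {
  try {
    const key = `healthcheck-${randomUUID()}`; // random key, never collides with real locks
    const token = await acquireLock(key, 1_000);
    if (token) {
      await releaseLock(key, token);
    }
    return true;
  } catch {
    return false; // Redis unreachable or erroring -> mark the container unhealthy
  }
}
```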

Conclusion

As our Kubernetes backend setup makes it easy to add an instance of Redis next to a service, Redis has become our tool of choice whenever there is a need for synchronization between parallel-running processes. When there’s a single Redis instance available for every instance of a service, using Redlock is easy.

Sometimes, instead of rewriting an entire piece of code, a technology like Redlock can become a vital addition to the mix and help the code keep running for years to come.

Have fun locking resources!

Many thanks to Laura Vicente Mesonero, Jevgeni Demidov and Yael Ilani!

Interested in working at Pipedrive?

We’re currently hiring for several positions in different countries and cities.

Take a look and see if something suits you

Positions include:

  • Back-end developer
  • Full-Stack Developer
  • Senior Front End Developer
  • Junior Developer
  • Quality Engineer
  • And several more
