In Pursuit of Perfect Locking

Expected Behavior
Expected Behavior Blog
5 min read · May 19, 2020

By Nathan Acuff and Jason Gladish

Phil Karlton famously noted that there are only two hard problems in Computer Science: cache invalidation and naming things. To this list, many have added off-by-one errors. What happens when you decide to take on a problem that involves all three? You start looking for a good distributed locking library.

Sometimes, you really do need perfect locking — maybe you’re writing code that controls a critical piece of medical or military hardware. Other times, doing the same work twice isn’t that big of a deal — maybe your output is idempotent by default, or you’re counting how many people have smashed that like button, and a few extra likes per million hardly matters. Most of the time, you’re somewhere in the middle.

It’s a question of trade-offs. Think of it in terms of reliability engineering — how many nines do you really need from your locking system? If using something like RedLock lets you add another nine without much additional overhead, is it worth it? Or does it give you a false sense of security that leads to a sloppier implementation elsewhere, like in how you verify results and handle side effects?

We’re big fans of Redis and have an existing locking mechanism built on top of it. In looking to switch to a more complete system of locking, we stumbled across a very interesting discussion from 2016. It starts with this post by Salvatore Sanfilippo, a.k.a. Antirez (the creator of Redis), proposing RedLock, a Redis-based locking mechanism. This was followed by an analysis of Antirez’s proposal by Martin Kleppmann, a distributed systems expert and researcher. The discussion was mostly civil, but there was no agreement on whether RedLock would be a valuable addition to the distributed tools landscape.

Since we’re in the process of making this decision right now, we thought we’d summarize the discussion for future investigators while we’re at it.

The Arguments

Before we get into the heart of it, it’s worth noting that all of their posts (and this post) are in the context of locks that have expiration. Most of the time, we’d rather execute twice (and hopefully have some safety on the other end of the execution) than not execute at all, so expiring locks is important.

Throughout the related discussions, there were debates on several detailed points, but the majority of them seem to come down to theory vs practice. In theory, this means that they should agree, but in practice, they do not.

Kleppmann’s point is that RedLock isn’t fundamentally different or better than a single Redis instance because they suffer from the same set of potential problems. Antirez counters that even if it is vulnerable to the same kinds of problems, the likelihood of them occurring can be greatly reduced or potentially eliminated.

Ultimately, they’re both right — a network partition at JUST the right time is fatal to both a naive single-server locking implementation AND to RedLock. The question is, how big of a target is the “right time”? There are two main categories of potential disaster:

Network Splits

If you lock with one Redis server and that server has issues, no one can do locked work.

If you use RedLock, have five servers, and need to lock the majority, you can lose two servers and still make progress. A network failure at just the right time during the locking process can still cause problems, but the system should be more resilient in practice.
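To make the quorum idea concrete, here’s a rough sketch in Python. It simulates five lock servers in memory rather than talking to real Redis instances — the names (`FakeRedis`, `acquire_redlock`) are ours, and a production client would also release partial locks on failure and subtract elapsed time from the TTL.

```python
import time
import uuid

class FakeRedis:
    """In-memory stand-in for one server's SET key value NX PX behavior."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry time)

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        held = self.store.get(key)
        if held is not None and held[1] > now:
            return False  # someone else holds an unexpired lock
        self.store[key] = (value, now + ttl_ms / 1000.0)
        return True

def acquire_redlock(servers, key, ttl_ms):
    """Take the lock on a majority of servers; return a token or None."""
    token = str(uuid.uuid4())  # unique value identifying this holder
    acquired = sum(1 for s in servers if s.set_nx_px(key, token, ttl_ms))
    if acquired >= len(servers) // 2 + 1:
        return token
    return None  # no quorum; a real client would release partial locks here

servers = [FakeRedis() for _ in range(5)]
print(acquire_redlock(servers, "job:42", 10_000) is not None)  # True
print(acquire_redlock(servers, "job:42", 10_000))              # None
```

Even with two of the five servers down (or already claimed), the remaining three still form a majority — which is the resilience Antirez is pointing at.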

Process/Network Pausing

In the case where a process gets a lock, does some work, and acts on some other system, it’s always possible that acting on the other system gets delayed such that the effect is realized after the lock has expired.

Of course, even with a single-server lock implementation, it is possible to add some additional error-checking (fencing tokens, lock checking/extending before writing, using monotonically increasing lock keys), but most of the discussion contrasts a naive implementation with RedLock (or, rather, points out their shared flaws).
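As a sketch of the fencing-token idea (the class and method names below are ours, not from either post): the storage layer tracks the highest token it has seen and refuses writes carrying an older one, so a client that paused past its lock expiry can’t clobber a newer holder’s work.

```python
class FencedStore:
    """Toy storage layer that rejects writes with stale fencing tokens."""
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        # The lock service hands out monotonically increasing tokens;
        # a paused client presenting an old token is refused.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
print(store.write("balance", 100, token=33))  # True: current holder
print(store.write("balance", 90, token=34))   # True: next holder
print(store.write("balance", 100, token=33))  # False: stale holder rejected
```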

Assumptions

Kleppmann makes some very strong points about the unbounded nature of things you can’t control. Network and storage issues may cause pauses of unlimited length. Likewise, stop-the-world garbage collection may lead to multi-minute waits and concurrency nightmares. These are real things that really do happen, but other locking systems that are considered much safer also assume them away. For example, Zookeeper makes these assumptions:

1. Only a minority of servers in a deployment will fail. Failure in this context means a machine crash or some error in the network that partitions a server off from the majority.

2. Deployed machines operate correctly. To operate correctly means to execute code correctly, to have clocks that work properly, and to have storage and network components that perform consistently.

The issue about clocks, storage, and network components performing consistently is exactly the surface area that Kleppmann exposes as RedLock’s fatal flaw. Is Zookeeper less vulnerable than RedLock to these issues? Almost certainly. The question is, what’s the level of effort, and what’s the risk?

Conclusion

If you read Kleppmann’s and Antirez’s posts and felt like they were talking past each other, you weren’t alone. They’re both right. In theory, RedLock could be a disaster waiting to happen. In practice, it is probably much more reliable than a naive single-server implementation. Reading their posts was interesting, and the points of agreement between the two provide some guidance, whatever system you choose:

  • Manage your server clock carefully
  • Pick your expiration TTLs wisely
  • If at all possible, use a monotonically-increasing lock key
  • Use fencing tokens when writing to storage, and understand your storage layer’s consistency model
  • Don’t be afraid to check that you still have a lock and extend the TTL as you go
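That last point can be sketched as a compare-and-extend: confirm the lock value is still yours before refreshing the TTL. Against real Redis this check must be atomic (typically a small Lua script comparing the stored value); the in-memory version below just illustrates the shape, and all names are ours.

```python
import time

class FakeLockServer:
    """In-memory stand-in for a single expiring-lock server."""
    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, token, ttl_s):
        now = time.monotonic()
        if self.holder is None or self.expires_at <= now:
            self.holder, self.expires_at = token, now + ttl_s
            return True
        return False

    def extend(self, token, ttl_s):
        # Refresh the TTL only if we still hold an unexpired lock.
        now = time.monotonic()
        if self.holder == token and self.expires_at > now:
            self.expires_at = now + ttl_s
            return True
        return False

lock = FakeLockServer()
lock.acquire("worker-1", ttl_s=5)
# Heartbeat between steps of a long job:
print(lock.extend("worker-1", ttl_s=5))  # True: still ours, TTL refreshed
print(lock.extend("worker-2", ttl_s=5))  # False: not the holder
```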

If it seems like we didn’t reach a satisfying conclusion on what system to use, you’re right. Ultimately, if you’re in a system that already uses Zookeeper to manage Kafka or some other service, go ahead and use Zookeeper. If you’re comfortable with Redis and don’t want to maintain additional tooling, RedLock is probably a reasonable choice as long as you understand the potential issues. Distributed locking is hard — who knew?
