Distributed locks
To begin with, consider the following example.
Consider we have an application that is running in three different instances which are physically placed in different parts of the world. These machines are trying to read these input files, process them, and write the output in 1.out,2.out, and 3.out respectively.
In an ideal situation, each instance will pick a different file for processing and write the output in the output file. Consider a situation where i1 picks up 1.in and i2 also picks the same file. we end up in a situation where two of the machine works on the same file. There can be a situation where the processing of a file say 1.in in this case takes more than an hour. Two of our instances are working on the same thing and are eventually making the system inefficient. Another issue is Integrity. The output of the 1.in will get written in 1.out. Since two instances are working on the same file, i.e 1.out will get corrupted.
The only solution to avoid these situations is to put a lock once an instance picks up the file.
why distributed lock?
Let's assume, we are maintaining one DB/common access location where as soon as any machine is picking up the file we are making an entry there. But doing so we have added a single point of failure. If this goes down all of the data gets corrupted. Again we end up with efficiency and integrity as a problem. Also, we can’t use DB here it will be not efficient as there will be too much read and write. We can use an in-memory cache, but again if this instance goes down we hit the same issue.
Hence, we need to distribute lock here.
So similar to this DB/machine, we will be having two/n more DB/machines where we will be maintaining the same data and this data gets transferred from one machine to another may be in an asynchronous way. As soon as i1 picks a file say i1.in, it makes the entry in one file and it gets replicated in another machine as well. These machines again can lie anywhere physically, hence we end up having “no single source of truth” as replication takes time. What will happen if before the information gets replicated to machine2 from machine1, machine1 crashed. We end having the same issue of efficiency and integrity.
Before resolving this issue, let’s become familiar with the properties of a lock.
- Mutual exclusion: We don’t want two or more processes to acquire the same lock or to access the same resource.
- Deadlock free: We don't want our system to wait for the lock which is acquired by another process.
- Fault tolerance: We don't want any fault to happen in our system.
Consider this above image, here we have two instances i1 and i2, a central lock manager and a cache(just for the sake of fast read and write). When i1 needs to work on file i1.in, i1 asks for a lock from the lock manager, who inturns marks an entry in cache LOCK and returns the lock to the i1. Once i1 finishes the task, i1 releases the lock, and entry from the cache gets removed. The same happens with i2 as well. However, we still have a single point of failure. Another issue with this system is what if i1 is holding a LOCK for eternity i.e i2 checked at time t1, finds LOCK in cache again at time t2, and so on. Because of the issue of one i1, all the other instances are getting affected. To resolve this issue we can introduce, TTL in the cache. However with TTL, we introduce another issue, let’s say our TTL is 5sec. i1 acquires the LOCK, we have an entry in cache ‘LOCK 5’, after 5-sec lock got released, however, i1 is unaware of this at 5th sec since i1 is still processing/completing the job. At 6th sec, i2 acquires the lock since at this point of time there’s no LOCK in cache i2 successfully acquires the lock and at 7th sec i1 completes the job and tells cache to release the lock and cache will release the LOCK which is basically acquired by i2. So as we can see here we have created a mess. We need to have a unique id as well so that cache should be aware that lock is acquired by which instance.
let’s reiterate through the above scenario, i1 asked lock manager for a lock,i1 gets it's with the name ‘lock_1’ and an entry is created in the cache with ‘lock_1 i1’, at 5th-sec entry from the cache gets removed and when i2 asked from the lock from lock manger he gets it. At 7th sec when i1 asked the lock manager to release the lock, the lock manager searches from the entry won't find it and rejects the request.
Who is responsible for generating a unique id?
There should be some mechanism/coordination service which gives a random number/range of number to generate a random number or else we can take the time up to micro sec precision i.e LOCK-ms1
How to make the system fault-tolerant? To resolve this issue we need to have multiple lock managers.
Now before acquiring the lock, instance i1, need to check with n/2 + 1 machine i.e in this case,i1 checks with 3 LM if is it possible to acquire the lock or not.
This is just one way to tackle the problem. I hope you understand the concepts and like this blog as well. Do share your views in the comments. For more articles like these stay tuned.