Chubby: A lock service for distributed coordination
In this post, we will cover Google’s Chubby lock service. The paper introduces a lock service that can help with coordination in a distributed environment using locking semantics. Chubby is used extensively inside Google in systems such as GFS and BigTable. The primary goal is to provide a reliable lock service; Chubby is NOT optimized for high-performance, frequent-locking scenarios. Thousands of clients use Chubby, but they use it only occasionally, for coarse-grained locking. Such coarse-grained locks are generally held for hours or days, NOT seconds. One typical use of Chubby is for multiple applications to elect a master: the first one to get the lock wins and becomes the master.
Paxos or Lock Service?
If we expand on the example mentioned in the last section, this specific problem really converges to the problem of establishing consensus in a distributed system. So one could solve it with Paxos directly rather than building a centralized lock service: build a Paxos library and have all applications use that library to participate in consensus. The following points give a good idea of why a lock service was chosen for the Google ecosystem.
- Developer friendliness: Generally, it is far easier for developers to add locking semantics to their code than consensus-based mechanics. In particular, a lot of applications don’t start out big, and only as they grow do they add code to establish coordination. Most developers are also far more familiar with locking semantics.
- Event notification with data: In most cases, applications in such distributed environments need to access small amounts of data, which means writing and reading data in a consistent manner. This fits very well into locking mechanics.
- Application progress: In a Paxos-like setup, you need a majority of participants to be up to make progress. With a centralized lock service, even a single client can make progress on its own, as long as it can correctly acquire locks from Chubby.
Design decisions
The following main design decisions come out of the topics mentioned in the last section.
- Coarse-grained locking: applications don’t need short-duration locks. For example, electing a master is not a frequent event.
- Small data storage (small file manipulation) capabilities in addition to the lock service
- Allow thousands of clients to observe changes, so the lock service needs to scale to many clients, even though the transaction rate may not be that high
- A notification mechanism by which a client knows when a change occurs in a shared file, e.g. when the primary changes
- Support client-side caching to deal with clients that may want to poll aggressively
- Strong caching guarantees to simplify developer usage
System Architecture
There are two main components in the system: the Chubby master and the Chubby client library. Each application interested in distributed coordination links with the Chubby client library. The client library then executes the locking protocol on the client application’s behalf.
Master: A Chubby cell consists of multiple replicas, with one of them getting elected as the master using a distributed consensus protocol like Paxos. All replicas also grant a lease to the master, during which they don’t elect a new master.
Once the master is elected, it is responsible for writing any persistent state it needs to the database, which is then replicated at the other replicas. A write needs to be replicated at a majority of the replicas before being acknowledged back to the client. A read can be served back to the client by the master alone as long as its lease hasn’t expired, since the lease guarantees that no other master is around.
If the master fails, the consensus protocol is run again to elect a new master.
Client: A Chubby cell serves thousands of clients, and these clients connect to the master for all their coordination needs. Clients use DNS to find the replicas; replicas respond to master-location queries by redirecting the client to the current master. Once a client finds the master, all requests go to that master. The client library runs the locking protocol on the application’s behalf and notifies the application of certain events, such as a master fail-over.
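As a rough illustration, here is a minimal sketch (in Python) of how a client library might locate the master, assuming the cell’s DNS name resolves to its replicas and that replicas answer a master-location request by pointing at the current master. The function names and the RPC itself are assumptions, not the real (internal) interface.

```python
import socket


def resolve_replicas(cell_name: str) -> list[str]:
    """Resolve the cell's DNS name to the addresses of its replicas."""
    try:
        _hostname, _aliases, addresses = socket.gethostbyname_ex(cell_name)
        return addresses
    except socket.gaierror:
        return []


def ask_who_is_master(replica_address: str) -> str | None:
    """Placeholder for a master-location request sent to one replica.

    A non-master replica redirects the client to the current master; the real
    RPC is internal to Google, so this stub just returns None.
    """
    return None


def find_master(cell_name: str) -> str | None:
    """Try each replica in turn until one tells us who the master is."""
    for replica in resolve_replicas(cell_name):
        master = ask_who_is_master(replica)
        if master is not None:
            return master  # all further requests go to this address
    return None
```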
File based interface
Chubby exports a UNIX-file-system-like API. Files and directories are called nodes. There are no links in the system. Nodes can be permanent or ephemeral; an ephemeral node goes away once no client that has it open remains. A file can be opened in write (exclusive) or read (shared) mode. Clients get a handle to the given node. The following metadata is also kept per node:
- Instance number — always increasing for the same name
- Content generation number — Increased anytime content is overwritten
- Lock generation number — Increases when lock transitions from free to held
- There are also ACLs on nodes, like in a traditional file system, for controlling access, and an ACL generation number that increases on ACL changes.
There is other metadata associated with the handles that clients use. This mostly has to do with cases where a new master sees a handle that was generated by a previous master.
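To make the metadata above concrete, here is a small sketch of what a node’s metadata and a client handle might carry. The field names are mine; the paper only describes these counters conceptually.

```python
from dataclasses import dataclass


@dataclass
class NodeMetadata:
    instance_number: int     # increases each time a node with this name is recreated
    content_generation: int  # increases each time the file contents are overwritten
    lock_generation: int     # increases on every free -> held lock transition
    acl_generation: int      # increases whenever the node's ACLs change


@dataclass
class Handle:
    node_name: str           # e.g. "/ls/cell/my-app/leader" (hypothetical path)
    metadata: NodeMetadata
    # Real handles also carry check digits and master-epoch information so that
    # a new master can recognize handles issued by an old master; omitted here.
```

The instance number is what lets a client tell a recreated node apart from the node it originally opened under the same name.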
Locks, Lock-Delays and Sequencers
A client can create a node/file in write (exclusive) or read (shared) mode. All the locks are advisory, i.e. participating entities need to follow the locking protocol to access the distributed critical section. Holding a lock on a file does not prevent other, unfettered accesses to the file.
One of the issues with locking in distributed systems is that applications holding locks can die. Consider a sequence where client R1 acquires the lock on node N and issues an update, but the update gets delayed and R1 dies; the master then grants the lock on N to client R2. When R1’s delayed update finally lands on the master, it can corrupt data, even though R1 no longer holds a valid lock.
One way this is handled is using a lock-delay. When an application holding a lock dies without releasing it, nobody else can acquire the locks held by the now-defunct application for some configurable time. This makes for a simple and effective (but not perfect) solution, where the client can specify the threshold up to which a faulty application can hold on to a lock.
Another solution Chubby provides is sequencer-based checking. When a client acquires a lock, it can request a sequencer from the Chubby master. This is a string consisting of the lock name, the lock generation number (which changes on every transition from free to held) and the mode of acquisition. This string can be passed on to the modules that need the lock for protected transactions. These modules can check the validity of the sequencer against the Chubby master or against the module’s own Chubby cache; a minimal sketch of that check follows.
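The sketch below only illustrates the idea, assuming invented names (`Sequencer`, `current_lock_generation`, `handle_protected_write`): a downstream server rejects any request whose sequencer no longer matches the lock’s current generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    mode: str              # "exclusive" or "shared"
    lock_generation: int   # changes on every free -> held transition


def current_lock_generation(lock_name: str) -> int:
    """Placeholder: ask the Chubby master (or a local Chubby cache) for the
    lock's current generation number."""
    raise NotImplementedError("stands in for the internal client library")


def sequencer_is_valid(seq: Sequencer) -> bool:
    return seq.lock_generation == current_lock_generation(seq.lock_name)


def handle_protected_write(seq: Sequencer, payload: bytes) -> None:
    """A downstream server checks the sequencer before acting on a request."""
    if not sequencer_is_valid(seq):
        raise PermissionError("stale sequencer: the client no longer holds the lock")
    # ... apply the write ...
```

Servers that cannot be modified to check sequencers can fall back on the lock-delay described earlier.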
Detection of changes using events
Chubby also allows a limited form of publish/subscribe. Files in Chubby can store a small amount of data, which makes them more useful than a pure indicator of whether a lock is taken. As we discussed earlier, clients are interested in knowing when a new master has been elected or when the contents of the file they are using have changed. This is accomplished using events and callbacks that are registered at the time the file is opened (a sketch of subscribing to these events follows the list). The following events are used:
- File contents have changed: Used to describe the new locations for the given service
- Child node added to a directory: Used to describe the addition of a new replica
- Chubby master fail-over: Used for client to go into recovery mode
- Invalid Handle: Some communication issues
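The sketch below shows what registering for these events might look like from the client’s side. The `Event` flags mirror the list above, but `open_node_with_events` and the callback signature are assumptions, not the real library interface.

```python
from enum import Flag, auto
from typing import Callable


class Event(Flag):
    FILE_CONTENTS_MODIFIED = auto()   # e.g. the primary's location changed
    CHILD_NODE_ADDED = auto()         # e.g. a new replica appeared
    MASTER_FAILED_OVER = auto()       # client should go into recovery mode
    HANDLE_INVALID = auto()           # usually a communication problem


def open_node_with_events(name: str,
                          events: Event,
                          callback: Callable[[Event, str], None]) -> None:
    """Placeholder: open node `name`, subscribing to `events`; the library
    would invoke `callback(event, name)` after the change has taken place."""
    raise NotImplementedError("stands in for the internal client library")


def on_event(event: Event, name: str) -> None:
    if event & Event.FILE_CONTENTS_MODIFIED:
        print(f"{name} changed; re-read it to learn the new primary's location")


# Example registration (would raise here, since open_node_with_events is a stub):
# open_node_with_events("/ls/cell/my-service/leader",
#                       Event.FILE_CONTENTS_MODIFIED, on_event)
```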
Electing a primary using Chubby
Using the mechanisms described so far, clients can now elect a primary. It is fairly straightforward:
- All the entities that want to become the master try to open a file in write mode.
- Only one of them gets write access; the others fail.
- The one with write access then writes its identity to the file.
- All the others get the file-modification event and now know who the current master is.
- The primary uses either a sequencer or a lock-delay based mechanism to ensure that out-of-order messages don’t cause inconsistent access, and services can confirm whether the sequencer for the current master is still valid (a minimal sketch of the whole flow follows this list).
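Here is a minimal sketch of that election flow on top of a hypothetical client API; `open_node`, `try_acquire`, `set_contents` and `watch_contents` are invented names standing in for the real, internal library calls.

```python
LEADER_FILE = "/ls/cell/my-service/leader"   # hypothetical lock file


def open_node(name: str):
    raise NotImplementedError    # open (or create) the node, return a handle


def try_acquire(handle, mode: str) -> bool:
    raise NotImplementedError    # attempt to take the lock; False if someone else holds it


def set_contents(handle, data: bytes) -> None:
    raise NotImplementedError    # overwrite the file's small contents


def watch_contents(handle, on_change) -> None:
    raise NotImplementedError    # register a contents-modified callback (see events above)


def try_become_primary(my_identity: str) -> bool:
    handle = open_node(LEADER_FILE)
    if try_acquire(handle, mode="exclusive"):
        # Winner: publish its identity so everyone else can discover it.
        set_contents(handle, my_identity.encode())
        return True
    # Losers: watch the file; the contents-modified event reveals the primary.
    watch_contents(handle, on_change=lambda data: print("primary is", data))
    return False
```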
Caching and KeepAlive calls
Clients keep a cache that can be used for reading and is always consistent. A write is propagated to the master and doesn’t complete until the master acknowledges it. The master maintains state information about all clients and can therefore invalidate a client’s cache when someone else writes to the same file. The client that issued the write is blocked until all invalidations have been sent to the other clients and acknowledged by them.
Clients also make KeepAlive calls to the master. At any point, a well-behaved client has exactly one outstanding KeepAlive call at the master. Essentially, a client acknowledges the master’s response by issuing the next KeepAlive call. The server can send information back as the response to this call at a later time, e.g. an invalidation can be sent to the client as the response to a prior KeepAlive call. The client sees the response, invalidates its own cache, and then opens another KeepAlive call at the master for future communication from the master. Another advantage of this mechanism is that no additional holes need to be punched in firewalls: outbound calls from clients are generally allowed, and clients don’t need to listen on ports for the master to initiate connections to them.
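The toy sketch below only illustrates the ordering described above: a write does not complete until every client caching the file has been invalidated. All class and method names are invented, and the real master tracks far more state (leases, handles, acknowledgements).

```python
class ToyClientSession:
    def __init__(self, name: str):
        self.name = name
        self.cached_files: set[str] = set()

    def invalidate(self, file_name: str) -> None:
        # In real Chubby this rides on the blocked KeepAlive reply, and the
        # client acknowledges by issuing its next KeepAlive call.
        self.cached_files.discard(file_name)


class ToyMaster:
    def __init__(self):
        self.files: dict[str, bytes] = {}
        self.sessions: list[ToyClientSession] = []

    def write(self, file_name: str, data: bytes) -> None:
        # 1. Invalidate every cache holding this file. (The real master waits
        #    until each holder acknowledges or lets its lease expire.)
        for session in self.sessions:
            if file_name in session.cached_files:
                session.invalidate(file_name)
        # 2. Only then apply the write and acknowledge the writer.
        self.files[file_name] = data
```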
Sessions
We discussed KeepAlive RPCs in the last section. These establish a client-master Chubby session. When a client makes a KeepAlive call to the master, the master blocks the call. The master also assigns a lease to the client. This master lease guarantees that the master won’t unilaterally terminate the session. When the lease is about to expire, or when there is an event to which the client is subscribed, the master can use this blocked call to send the information back. In the former case the master may extend the lease; in the latter case the master can send information such as which files have changed.
The client cannot be sure that the master is alive and that its lease is still valid, so it keeps a slightly shorter local lease timeout. If this timeout passes and the master hasn’t responded, the client doesn’t know whether the master is still around and whether its local lease is valid. At this point the client considers its session to be in jeopardy and starts a grace period (45 seconds by default). It also disables its cache. If the client hears back from the master during the grace period, it can enable the cache once more. If it doesn’t, the master is assumed to be inaccessible and the client returns errors back to the application. Applications are informed of both the jeopardy and the expired events by the Chubby client library.
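A rough sketch of the client-side timing logic follows, with invented names; the 45-second grace period follows the paper, and `keepalive_reply_received` stands in for whatever the real library uses to notice that the master answered the outstanding KeepAlive.

```python
import time

GRACE_PERIOD_S = 45.0


def session_loop(local_lease_deadline: float, keepalive_reply_received) -> str:
    """Return "ok" if the session survives, "expired" if the grace period runs out.

    `local_lease_deadline` is a time.monotonic() deadline for the client's local
    lease; `keepalive_reply_received()` returns True once the master has replied
    (thereby extending the lease).
    """
    while time.monotonic() < local_lease_deadline:
        if keepalive_reply_received():
            return "ok"                      # lease extended, nothing to do
        time.sleep(0.1)

    # Local lease ran out: the session is in jeopardy. Disable the cache,
    # inform the application, and wait out the grace period.
    print("jeopardy: cache disabled, waiting for the master")
    grace_deadline = time.monotonic() + GRACE_PERIOD_S
    while time.monotonic() < grace_deadline:
        if keepalive_reply_received():
            print("session recovered: cache re-enabled")
            return "ok"
        time.sleep(0.1)

    # Grace period over: report errors back to the application.
    return "expired"
```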
Master fail-over
One of the most disruptive events that can happen is the master failing for a considerable time, with a new master getting elected only after a while. Let’s look at what that sequence looks like. Local lease timeouts and the jeopardy/grace-period logic trigger on the clients during this time. All clients using the old master need to switch over to the new master, and the new master needs to recreate its state information about the clients from the replicated database that the master uses for persistent state.
Let’s walk through the KeepAlive calls (KA1 through KA8 in the paper’s fail-over figure) from left to right.
- KA1: Client makes the call. No leases have been assigned yet. Master assigns lease M1. Client currently has the lease C1.
- KA2: In the response to KA1, a smaller local lease C2 is assigned for the client. The master also extends its lease to M2.
- KA3: The client acknowledges KA2 and makes the RPC call that needs to be outstanding at the master. The master dies after getting the call, so no new leases can be assigned. The client’s C2 lease expires and the client library informs the application that it has entered jeopardy. The grace period starts on the client.
- KA4: In the meanwhile a new master gets elected and the client finds it. The new master gets this KA4 message and responds to it. The master also assigns lease M3 conservatively, based on the previous master’s lease promises; remember that client caching depends on this.
- KA5: The master doesn’t know whether this is just a delayed packet or really an attempt by the client to establish a new session. For this the master uses an epoch number; if it notices an old epoch number (which clients send to the master), then that call doesn’t extend the lease on either side and the response is a rejection (a small sketch of this check follows the list).
- KA6: This one uses the new epoch number sent by the master, so it succeeds, but leases are not yet extended by the master.
- KA7: This one allows the client to extend its local lease to C3. The client is now out of jeopardy and can exit the grace period.
- KA8: This one again keeps the session alive with the master and we are back to normal operation. Since all of this happened before the grace period ended, the application didn’t see any errors. If KA6 had happened after the grace period was over, the application would have seen errors.
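Here is a small sketch of the epoch check behind KA5 and KA6, with invented names: a KeepAlive carrying a stale epoch number extends no lease and is rejected, and the reply carries the current epoch for the client to use next time.

```python
from dataclasses import dataclass


@dataclass
class KeepAliveRequest:
    session_id: str
    epoch: int          # clients echo the master epoch they last saw


def handle_keepalive(request: KeepAliveRequest, master_epoch: int) -> dict:
    if request.epoch != master_epoch:
        # KA5: stale epoch, so neither side extends a lease; the reply tells
        # the client the new epoch.
        return {"status": "rejected", "epoch": master_epoch}
    # KA6 onward: correct epoch; the real master would hold this call open
    # until it has something to send (a lease extension, an event, ...).
    return {"status": "accepted", "epoch": master_epoch}
```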
A newly elected master goes through a fairly detailed reconstruction process (a rough sketch follows the list). This seems like pretty tricky code to get right. At a high level:
- Master creates the epoch number mentioned above
- Reconstructs the in-memory state about client leases etc. by reading the database maintained by the last master
- It now lets KeepAlive calls through, but continues to reject all other session-related operations.
- Sends out fail-over event information to clients. Clients can then invalidate their cache because some events might have been missed.
- After all clients acknowledge or time out, the system is in a stable state and all other RPCs are allowed by the master
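As a very rough sketch, the recovery sequence can be thought of as the state transitions below; the class and its fields are invented, and the real reconstruction (handles, locks, ephemeral files) is considerably more involved.

```python
class RecoveringMaster:
    def __init__(self, previous_epoch: int):
        self.epoch = previous_epoch + 1        # 1. pick a new epoch number
        self.sessions: dict[str, float] = {}   # session id -> lease deadline
        self.accepting = "keepalives_only"     # 3. reject everything else for now

    def rebuild_from_database(self, persisted_sessions: dict[str, float]) -> None:
        # 2. recreate in-memory state (client leases etc.) from the replicated
        #    database maintained by the previous master
        self.sessions = dict(persisted_sessions)

    def finish_failover(self, all_clients_acked_or_timed_out: bool) -> None:
        # 4. fail-over events have been sent and clients have flushed caches;
        # 5. once they acknowledge or time out, accept all other RPCs again
        if all_clients_acked_or_timed_out:
            self.accepting = "all"
```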
Conclusion
I found the idea of a centralized coordination service with locking semantics very useful. Rather than participating in consensus-based mechanisms, locking semantics seem to simplify development of applications in distributed environments, since most developers are very familiar with them. Also, the way the server and client protocol operates via client-side libraries and related callbacks presents a nice mechanism to abstract the intricacies of the protocol away from the applications and allows for easier integration.