Consistency & Consensus for System Design Interview (9): membership & coordination services
PREV | HOME | NEXT | System Design Resource List
Don’t forget to get your copy of Designing Data Intensive Applications, the single most important book to read for system design interview prep!
Check out ByteByteGo’s popular System Design Interview Course
Introduction
Services such as ZooKeeper, etcd, and Consul, etc are called as coordination and configuration services or distributed key/value stores. These services are set up to hold small amounts of data in-memory but also write to disk for durability. The small amount of data is replicated to multiple nodes of the system using a fault-tolerant total order broadcast algorithm, i.e. the writes are applied in exactly the same order across all the nodes, which ensures all replicas are consistent with the leader.
Zookeeper in particular is modeled after Google’s Chubby lock service. In general, developers rarely work directly with Zookeeper as it’s not supposed to be treated like a general purpose database for storing information. Projects such as Hadoop, YARN, Kafka, OpenStack Nova, and HBase internally depend on Zookeeper for storing configuration data and also for leader election and consensus. Zookeeper also offers some other interesting features that are useful in the context of distributed systems:
- Failure detection: Clients have long-lived sessions with Zookeeper servers and periodically exchange heartbeats to check if the other node is alive. For transient failures the session isn’t terminated, however, if the heartbeats aren’t received for a period greater than the timeout, then the session is terminated. Locks held by a session can be configured to be released automatically when the session times out.
- Watch Notifications: Zookeeper clients can read values created by other clients and look-up the locks. Clients create ephemeral nodes called zNodes in Zookeeper which disappear when a client’s session terminates or times out. Other clients can keep a watch on the zNode created by a client and be notified of any changes. This way clients don’t have to continuously poll for changes.
- Linearizable atomic operations: Zookeeper can be used to implement a distributed lock. The lock has an expiry time after which the lock is released — a mechanism that helps safeguard against situations where a lock held by a dead/failed node is eventually released after timeout. The lock can be created using an atomic compare-and-set operation. The operation is guaranteed to be atomic and linearizable even in the face of network failures by the consensus protocol. Several clients can attempt to acquire the lock using the atomic compare-and-set operation but only one would succeed.
- Total ordering of operations: Zookeeper totally orders all operations and assigns a monotonically increasing transaction ID called zxid and a version number called cversion. These Zookeeper produced values can be used as fencing tokens. We have talked about fencing tokens earlier and they come in use when say, multiple clients attempt to acquire a resource (e.g. a lock or serially write to file) but only one is successful. The successful client holds onto the resource but then experiences a process pause e.g. GC pause. While the first client is paused, its lease on the resource expires and a second client acquires the resource. When the paused process resumes, it thinks it still has exclusive access to the resource and proceeds potentially causing corruption or data loss. This situation can be averted by using a fencing token that is sent with each request to the resource. If the resource observes that it has already processed a higher numbered token than the one received, it can reject the request.
Use Cases
Zookeeper can be helpful in situations where several instances of a process or service exist and one of them needs to be chosen as the leader. In case the leader fails then one of the other nodes gets selected as the new leader.
Land a higher salary with Grokking Comp Negotiation in Tech.
Another application of membership and coordination services is in the case of partitioned databases. Partitions are assigned to nodes that make-up the database. However, when a new node is added, the existing nodes delegate some of their assigned partitions to the new node or when a node leaves the cluster, the leaving node’s partitions are distributed among the rest of the nodes in the cluster. Zookeeper has the concept of ephemeral nodes and watches — a watch triggers when a change to the ephemeral node (also called a zNode) is detected. The functionalities can help build a layer on top that can rebalance partitions or recover from faults without human intervention. An open source library Apache Curator offers tooling atop Zookeeper for implementing such higher-level tasks. Also, it’s not possible for every distributed system to implement voting and consensus, especially when nodes in the system number in the thousands. Zookeeper offers distributed systems a way to outsource these activities to ZooKeeper rather than implement themselves. Generally, Zookeeper stores information on behalf of systems, that changes at the scale of minutes or hours, e.g. the “IP address of the leader is xxx”. Zookeeper isn’t designed to store information that may change at the rate of thousand times or more per second.
Check out the course Coderust: Hacking the Coding Interview for Facebook and Google coding interviews.
Another use case of Zookeeper and similar services is to keep record of node membership within a cluster. Distributed systems can experience unbounded network delay and usually timeouts are used to detect failed/dead nodes. A service like Zookeeper can help achieve consensus among the nodes of the cluster as to which nodes are alive and active members of the cluster. It is entirely possible that the consensus about a dead node is incorrect, i.e. the node is just fine but experiencing a prolonged network failure/partition and thus considered to be dead. However, having a consensus exist among nodes as to which nodes are currently the members of a cluster is useful. For instance, a leader could be selected simply based on the highest or the lowest ID of the active nodes in a cluster.
Yet another use case for Zookeeper and related services is service discovery. Traditionally, a domain name server (DNS) has been used for retrieving the IP of a service to connect to. However, the reads from a DNS aren’t linearizable and that is fine as long as the DNS delivers a very high uptime, i.e. it is reliably available and continues to operate in face of network failures. The tradeoff is that some reads from the DNS can be stale. Zookeeper’s ephemeral zNodes can be used for service discovery. An ephemeral zNode disappears when the session of its owner ends. When using ZooKeeper for discovery of hosts in a distributed system, each server publishes its IP address as an ephemeral node, and should a server lose connectivity with ZooKeeper and fail to reconnect within the session timeout, then its information is deleted. Other services can lookup the zNodes to find out the IP addresses of the currently active nodes in the cluster.
Get a leg up on your competition with the Grokking the Advanced System Design Interview course and land that dream job!
Consensus is not required for service discovery but for leader election and it can be helpful for other services to be able to know which node is the current leader. For this purpose, some consensus systems also operate read-only caching replicas. These replicas don’t participate in the voting process and asynchronously receive the log of all the decisions made by the consensus algorithm. Other services can make read queries against these read-only cache replicas which are not linearizable.
Your Comprehensive Interview Kit for Big Tech Jobs
0. Grokking the Machine Learning Interview
This course helps you build that skill, and goes over some of the most popularly asked interview problems at big tech companies.
1. Grokking the System Design Interview
Learn how to prepare for system design interviews and practice common system design interview questions.
2. Grokking Dynamic Programming Patterns for Coding Interviews
Faster preparation for coding interviews.
3. Grokking the Advanced System Design Interview
Learn system design through architectural review of real systems.
4. Grokking the Coding Interview: Patterns for Coding Questions
Faster preparation for coding interviews.
5. Grokking the Object Oriented Design Interview
Learn how to prepare for object oriented design interviews and practice common object oriented design interview questions
6. Machine Learning System Design