Replication Strategies in Distributed Systems: A Comprehensive Overview
Replication involves maintaining copies of the same data on multiple machines connected via a network. The key reasons for replication include reducing latency by keeping data geographically close to users, increasing availability in the face of failures, and increasing read throughput by scaling out the number of machines that can serve reads.
The discussion assumes the dataset is small enough for each machine to hold a copy of the entire dataset; later chapters explore partitioning (sharding) of larger datasets.
To handle changes to replicated data, there are three popular algorithms:
- single-leader
- multi-leader
- leaderless replication
Single-leader replication
Properties:
1- Write requests are handled only by the leader, while reads can be served by any node (the leader or any follower).
2- When a write request reaches the leader, it can be processed in one of three ways (sketched in code below):
- synchronous replication: The leader sends the change to all followers, waits for an acknowledgment from every one of them, and only then responds to the client confirming that the write has been processed.
- asynchronous replication: The leader accepts the incoming write, sends the change to all followers, but responds to the client without waiting for any follower to confirm that the change has been applied.
- semi-synchronous replication: The leader does not wait for acknowledgments from all followers, only from at least one.
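A minimal sketch of how a leader might implement these three modes, assuming hypothetical Leader and follower objects (follower.apply(change) is an assumed interface, not any particular database's API). The only difference between the modes is when the client is acknowledged:

```python
import concurrent.futures

class Leader:
    """Hypothetical leader node; each follower is assumed to expose apply(change)."""

    def __init__(self, followers):
        self.followers = followers
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(len(followers), 1))

    def write(self, change, mode="semi-sync"):
        self._apply_locally(change)
        futures = [self.pool.submit(f.apply, change) for f in self.followers]
        if mode == "sync":
            # Synchronous: wait for every follower before acknowledging the client.
            concurrent.futures.wait(futures)
        elif mode == "semi-sync":
            # Semi-synchronous: wait for at least one follower acknowledgment.
            next(concurrent.futures.as_completed(futures))
        # Asynchronous: acknowledge immediately; followers catch up in the background.
        return "ok"

    def _apply_locally(self, change):
        pass  # write the change to the leader's own storage engine
```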
Different methods of replication:
1. Statement-based Replication:
- The leader logs every executed write request and sends this statement log to its followers.
- Challenges arise with nondeterministic functions, auto-incrementing columns, and side effects in statements.
- This approach was used in MySQL before version 5.1 and is still employed in some cases like VoltDB.
2. Write-ahead Log (WAL) Shipping:
- The leader appends every write to a log, which is also sent to followers for replication.
- This method is used in PostgreSQL and Oracle.
- The disadvantage is its tight coupling to storage engine details, making upgrades challenging.
3. Logical (Row-based) Log Replication:
- Using a different log format for replication than for the storage engine decouples the two; this replication log is called a logical log.
- The logical log describes writes to database tables at the row level, making it easier to keep backward compatible across versions.
- MySQL’s binlog with row-based replication is an example (a record-format sketch follows this list).
4. Trigger-based Replication:
- Replication can be moved to the application layer using triggers and stored procedures.
- Tools like Oracle GoldenGate, or built-in database features such as triggers and stored procedures, enable this.
- Trigger-based replication offers flexibility but has higher overhead and potential for bugs.
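To make the contrast between statement-based and logical (row-based) logs concrete, here is a sketch of what the two record kinds might contain; the field names are illustrative, not any database's actual wire format:

```python
import datetime

# Statement-based: followers re-execute the SQL text, so nondeterministic
# functions such as NOW() or RAND() can evaluate differently on each replica.
statement_record = {"sql": "UPDATE users SET last_seen = NOW() WHERE id = 42"}

# Logical (row-based): the leader resolves all values first and ships the
# resulting row change, so every replica applies exactly the same data.
logical_record = {
    "table": "users",
    "op": "update",
    "key": {"id": 42},
    "new_values": {"last_seen": datetime.datetime.now(datetime.timezone.utc).isoformat()},
}
```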
Replication Lag
Replication lag is the delay, under asynchronous replication, between a write being applied on the leader and its being reflected on the followers. This lag can lead to temporary inconsistencies; the resulting weak guarantee is known as eventual consistency.
Challenges related to replication lag:
1. Reading Your Own Writes:
- Users may perceive their data as lost if they view their submission on a lagging follower shortly after making a write.
- Solutions include reading user-modifiable data from the leader, or using timestamps to ensure the serving replica is sufficiently up to date (see the router sketch after this list).
2. Monotonic Reads:
- Users may observe data moving backward in time when successive queries hit replicas with different lags.
- Monotonic reads guarantee that a user making several reads in sequence will not see time go backward, e.g., by always routing that user's reads to the same replica.
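A minimal sketch of a read router that combines both mitigations, assuming hypothetical leader/replica handles; the one-minute window and the hashing scheme are illustrative choices:

```python
import hashlib
import time

class ReadRouter:
    """Hypothetical router; leader and replicas are handles to database nodes."""

    def __init__(self, leader, replicas, recent_window=60.0):
        self.leader = leader
        self.replicas = replicas
        self.recent_window = recent_window  # seconds; illustrative choice
        self.last_write_at = {}  # user_id -> timestamp of that user's last write

    def record_write(self, user_id):
        self.last_write_at[user_id] = time.time()

    def pick_node_for_read(self, user_id):
        # Read-your-writes: shortly after a user writes, serve their reads
        # from the leader, which is guaranteed to have their changes.
        if time.time() - self.last_write_at.get(user_id, 0.0) < self.recent_window:
            return self.leader
        # Monotonic reads: hash each user to a fixed replica so successive
        # reads never land on a more-lagged node and appear to go back in time.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return self.replicas[int(digest, 16) % len(self.replicas)]
```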
Consistent Prefix Reads
Consistent prefix reads address a potential anomaly in systems with replication lag. Consider a conversation between Mr. Poons and Mrs. Cake: replication lag can cause an observer to see Mrs. Cake’s answer before Mr. Poons’s question, which is confusing. Consistent prefix reads guarantee that if a sequence of writes happens in a certain order, anyone reading those writes sees them in the same order. Solutions include writing causally related data to the same partition (sketched below) or using algorithms that track causal dependencies.
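A sketch of the same-partition approach: hashing a conversation ID (a hypothetical key) so all causally related messages land on one partition, where a single leader orders them:

```python
import hashlib

def partition_for(conversation_id, num_partitions):
    # All messages of one conversation hash to the same partition, so that
    # partition's single leader assigns them one order: the question is
    # always stored, and therefore read, before its answer.
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# The same conversation always maps to the same partition.
assert partition_for("poons-cake-thread", 8) == partition_for("poons-cake-thread", 8)
```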
Solutions for replication lag
It is important to understand how the application behaves as replication lag grows, and to design for stronger guarantees where user experience would otherwise suffer. Addressing these challenges in application code is complex and error-prone; transactions exist precisely so the database can provide stronger guarantees instead.
Multi-Leader Replication
Multi-leader configurations allow multiple nodes to accept writes, with each leader acting as both a leader and a follower.
Use cases for multi-leader replication:
- Multi-datacenter operations
- Offline operation for clients
- Collaborative editing applications.
Handling Write conflicts in multi-leader replication
- Synchronous vs. asynchronous conflict detection
Conflicts are typically detected asynchronously. In principle, conflict detection could be made synchronous by waiting for a write to replicate to all leaders before acknowledging it, but that forfeits the main advantage of multi-leader replication: letting each leader accept writes independently.
- Conflict avoidance
A strategy that mitigates conflicts by ensuring all writes for a particular record go through the same leader. It breaks down when the designated leader for a record has to change, for example due to a datacenter failure or a user moving to a different location.
- Converging toward a consistent state
In a multi-leader configuration there is no defined ordering of writes, so replicas must resolve conflicts in a convergent way to remain consistent with one another. Approaches include giving each write a unique ID and keeping the one with the highest ID (last write wins; a minimal sketch appears after this list), giving each replica a unique ID and letting the higher-ID replica win, merging the conflicting values, or recording the conflict for later resolution by application code.
- Custom conflict resolution logic (on read, on write)
Custom conflict resolution logic, often implemented using application code, allows developers to tailor conflict resolution to specific application requirements. This logic can be executed either on write, immediately resolving conflicts when detected, or on read, presenting conflicting versions to the application, which then resolves them and writes the result back to the database.
Conflicts can arise from concurrent modifications to the same field in the same record, and detecting conflicts becomes crucial for maintaining data integrity. Subtle conflicts, such as overlapping bookings in a meeting room application, may require more nuanced conflict resolution strategies.
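A minimal sketch of the last-write-wins rule described above; the record fields are illustrative, and the replica ID serves only as a deterministic tiebreaker:

```python
def last_write_wins(versions):
    """Pick the version with the highest (timestamp, replica_id) pair.

    Deterministic on every replica, so all replicas converge to the same
    value, at the cost of silently discarding the losing concurrent writes.
    """
    return max(versions, key=lambda v: (v["timestamp"], v["replica_id"]))

conflicting = [
    {"value": "B", "timestamp": 1700000002, "replica_id": 1},
    {"value": "C", "timestamp": 1700000001, "replica_id": 2},
]
print(last_write_wins(conflicting)["value"])  # -> "B"
```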
Multi-Leader Replication Topologies
Multi-leader replication topologies define communication paths for propagating writes between nodes. Examples include all-to-all, circular, and star topologies. The choice of topology influences fault tolerance, network efficiency, and conflict resolution.
Leaderless replication
Inspired by systems like Amazon’s Dynamo, leaderless replication eliminates the concept of a leader and allows any replica to accept writes directly. Writes are sent to multiple replicas in parallel, and read requests are likewise sent to multiple nodes.
Writing When a Node is Down
The replication scheme must ensure that eventually every piece of data is copied to every replica, even when some nodes were down at write time.
There are two mechanisms used in Dynamo-style systems:
- Read Repair
When a client reads from several replicas in parallel, it can detect stale responses and write the newer value back to the out-of-date replica (sketched below).
- Anti-entropy process
A background process constantly looks for differences in the data between replicas and copies missing data over. Unlike read repair, this also covers values that are rarely read, though there may be a significant delay before data is copied.
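A sketch of read repair, assuming each replica exposes get(key) returning a (version, value) pair and a matching put; real systems use richer version metadata:

```python
def read_with_repair(key, replicas):
    """Read-repair sketch; each replica is assumed to expose
    get(key) -> (version, value) and put(key, version, value)."""
    responses = [(replica, *replica.get(key)) for replica in replicas]
    _, newest_version, newest_value = max(responses, key=lambda t: t[1])
    for replica, version, _ in responses:
        if version < newest_version:
            # Copy the fresher value back onto the stale replica.
            replica.put(key, newest_version, newest_value)
    return newest_value
```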
Quorums
Quorums ensure consistency in leaderless systems: with n replicas, w is the number of nodes that must acknowledge a write, and r is the number of nodes that must respond to a read. As long as the condition below holds, the set of nodes written to and the set read from overlap in at least one node, so reads see the latest acknowledged write. Even so, these systems may still exhibit edge cases where stale values are returned.
w + r > n
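A worked check of the quorum condition; n=3, w=2, r=2 is a common Dynamo-style configuration:

```python
def is_strict_quorum(n, w, r):
    # If w + r > n, the w nodes written to and the r nodes read from must
    # overlap in at least one node, so every read sees the latest write.
    return w + r > n

print(is_strict_quorum(n=3, w=2, r=2))  # True: a common configuration
print(is_strict_quorum(n=3, w=1, r=1))  # False: reads may return stale values
```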
Limitations of quorum consistency
Limitations of quorum consistency include potential scenarios of stale reads, especially with smaller w and r values. Dynamo-style databases are designed to prioritize eventual consistency and might not provide guarantees like reading your writes, monotonic reads, or consistent prefix reads. Stronger consistency guarantees typically require transactions or consensus mechanisms.
Monitoring Staleness
Monitoring staleness matters in both replication models. In leader-based replication, writes are applied to the leader and followers in the same order, so replication lag can be measured as the distance between a follower's position in the log and the leader's. In leaderless replication, monitoring staleness is harder because there is no fixed order in which writes are applied.
Sloppy Quorums and Hinted Handoff
The concept of “sloppy quorums” is introduced as a fault-tolerant mechanism for databases with leaderless replication. This approach allows databases to accept writes even when they can’t reach a quorum of nodes, and it involves hinted handoff to transfer temporarily accepted writes to the appropriate nodes once network issues are resolved.
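A sketch of a sloppy-quorum write, assuming hypothetical node objects with alive() and put() methods; real implementations track hints durably rather than as a function argument:

```python
def sloppy_quorum_write(key, value, home_nodes, fallback_nodes, w):
    """Sloppy-quorum sketch; nodes are assumed to expose alive() and
    put(key, value, hint=None)."""
    acks, down = 0, []
    for node in home_nodes:
        if node.alive():
            node.put(key, value)
            acks += 1
        else:
            down.append(node)
    # For each unreachable home node, park the write on a reachable fallback
    # node with a hint naming the intended owner; hinted handoff forwards the
    # write once the home node recovers.
    fallbacks = iter(fallback_nodes)
    for home in down:
        if acks >= w:
            break
        for fallback in fallbacks:
            if fallback.alive():
                fallback.put(key, value, hint=home)
                acks += 1
                break
    return acks >= w  # write succeeds if w nodes (home or fallback) accepted it
```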
The text also touches on multi-datacenter operation, explaining how some databases implement support for cross-datacenter replication within the leaderless model. It mentions examples like Cassandra and Voldemort, which include nodes from different data centers in the replication process.
Concurrent writes in Dynamo-style databases are discussed, highlighting the possibility of conflicts due to variable network delays and partial failures. The text explores conflict resolution methods, such as “last write wins” (LWW), where the write with the largest timestamp prevails. However, it notes that LWW can compromise durability and lead to data loss.
The text concludes by discussing how to determine whether two operations are concurrent, and introduces an algorithm based on version numbers to track concurrent writes. With multiple replicas, a version number is kept per replica for each key; the collection of version numbers, called a version vector, captures which writes happened before which and which were concurrent. Concurrently written values must be merged, either in application code or automatically using data structures like CRDTs (Conflict-free Replicated Data Types), and deletions require tombstones so that a removal is not undone during a merge. Version vectors thus allow reads and writes to proceed safely across replicas.
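A minimal sketch of how version vectors determine whether two writes are ordered or concurrent; representing a vector as a dict of per-replica counters is a common simplification:

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dicts mapping replica_id -> counter).

    Returns "a<=b" if a happened before (or equals) b, "b<=a" for the
    reverse, and "concurrent" if neither dominates, which is exactly the
    case where the application (or a CRDT) must merge the two values.
    """
    keys = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(k, 0) <= vv_b.get(k, 0) for k in keys)
    b_le_a = all(vv_b.get(k, 0) <= vv_a.get(k, 0) for k in keys)
    if a_le_b and not b_le_a:
        return "a<=b"
    if b_le_a and not a_le_b:
        return "b<=a"
    return "equal" if a_le_b else "concurrent"

print(compare({"r1": 2, "r2": 0}, {"r1": 2, "r2": 1}))  # a<=b (ordered)
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # concurrent (must merge)
```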