Data Replication in distributed systems (Part-1)

Sandeep Verma
12 min read · Oct 12, 2019


Nothing is as easy as it looks. Everything takes longer than you expect. If anything can go wrong, it will, at the worst possible moment.

— Murphy’s Law

Introduction

Copying the same data over multiple nodes in a distributed system is critical to keeping the database up and serving queries even during faults. Following are the other reasons behind data replication in a distributed system:

  1. Higher Availability: the system keeps working even if one or more nodes fail
  2. Reduced Latency: replication helps reduce the latency of data queries by keeping data geographically closer to the user. For example, CDNs (Content Delivery Networks) keep a copy of replicated data close to users. Ever wondered how Netflix streams videos with such low latencies!
  3. Read Scalability: read queries can be served from replicated copies of the same data, which increases the overall query throughput
  4. Network Interruption: the system keeps working even under network faults

Replication is fairly easy when the data being replicated from a node to other nodes doesn’t change. It becomes trickier when the data is in constant flux, which is the case for most web applications. Single-leader, multi-leader and leaderless are the well-known approaches to data replication, and each of them can copy data either synchronously or asynchronously. Furthermore, replication is not a cure-all for the problems of distributed systems; it brings its own set of problems, like replication lag and consistency issues. In this article, we will dig deeper into the general principles of single-leader replication and its associated issues.

Part-1 covers only single-leader replication. Multi-leader and leaderless replication will be covered in the follow-up parts.

Single leader replication

In single-leader replication, the leader (master or primary) replicates data to all of its follower (slave, read replica, or secondary) nodes. This is the most commonly used mode of replication. Whenever a new write reaches the leader node, the leader applies that write to its local storage and sends the same data to all its replicas as a change stream or replication log. Each follower then updates its own local copy of the data in the same order as the writes were processed on the leader node.
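To make that flow concrete, here is a minimal sketch of the idea in Python. The Leader and Follower classes and the in-memory log are invented for illustration; they are not any particular database's implementation.

```python
# Minimal sketch of leader-based replication over a replication log.
# All class and method names here are illustrative, not from a real database.

class Follower:
    def __init__(self):
        self.data = {}           # local copy of the data
        self.applied_offset = 0  # how far into the leader's log we have applied

    def apply(self, offset, key, value):
        # Followers apply changes strictly in log order.
        assert offset == self.applied_offset + 1, "out-of-order change"
        self.data[key] = value
        self.applied_offset = offset

class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []            # append-only replication log
        self.followers = followers

    def write(self, key, value):
        # 1. Apply locally and append to the replication log.
        self.data[key] = value
        self.log.append((key, value))
        offset = len(self.log)
        # 2. Stream the change to every follower (over the network in practice).
        for follower in self.followers:
            follower.apply(offset, key, value)

followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", {"name": "Alice"})
print(followers[0].data)  # {'user:1': {'name': 'Alice'}}
```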

Many relational databases like PostgreSQL, MySQL, and Oracle (Data Guard), and NoSQL databases like MongoDB, RethinkDB, and Espresso use this mode of replication. Message brokers like Kafka and replicated queues like RabbitMQ also employ single-leader replication. Data is copied from the leader to the replicas either synchronously or asynchronously, and each method has its own set of pros and cons.

Master-Slave replication

Synchronous vs asynchronous leader-based replication

When a write comes to the leader, the leader can send the same write request to all its followers and wait for an acknowledgment from each of them. Only once the leader has received those acknowledgments does it notify the client of a successful write. This is the synchronous mode of replication. In asynchronous replication, the leader sends writes to all its followers but doesn’t wait for acknowledgments. This way the leader can continue to accept writes even if all of its follower nodes fail. Asynchronous replication has its own demerits, like weaker durability and consistency issues, replication lag, etc.
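A tiny sketch of the difference, with a made-up Follower class (real systems replicate over the network; threads merely stand in for background shipping here). The key point is when the client gets its acknowledgment.

```python
# Illustrative sketch only: the key difference is *when* the client is acked.
import threading

class Follower:
    def __init__(self):
        self.data = []
    def apply(self, change):
        self.data.append(change)   # pretend this involves a network round trip

def write_sync(followers, change):
    # Synchronous: wait for every follower before acknowledging the client.
    for f in followers:
        f.apply(change)            # a slow or failed follower stalls the write
    return "ack"

def write_async(followers, change):
    # Asynchronous: acknowledge immediately, ship the change in the background.
    for f in followers:
        threading.Thread(target=f.apply, args=(change,)).start()
    return "ack"                   # followers may still lag behind (weaker durability)

followers = [Follower(), Follower()]
print(write_sync(followers, "x=1"))
print(write_async(followers, "x=2"))
```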

Benefits of using synchronous replication:

  1. Data consistency: all followers keep a copy of the data that is as up to date as the leader’s, so we can be sure of no data loss if the leader fails
  2. Easy to promote a follower to a new leader: if the leader fails, any of the follower nodes can be promoted to become the new leader without any data loss. Thus a system with n nodes can cope with the failure of up to n-1 replicas
  3. Reads without any data inconsistency: since all followers are in sync with the leader, a read query sent to any follower always gives consistent results. Issues due to stale reads don’t arise here.
Master-slave synchronous replication

Looking at the above advantages, it may appear at first glance that synchronous replication must be heavily employed by distributed systems; in reality, however, it is rarely used. Following are the pitfalls of synchronous replication:

  1. Increased latency: the leader is forced to wait for an acknowledgment from every one of its followers, and this waiting time adds to the overall write latency
  2. Reduced Availability: if any follower becomes unavailable and doesn’t respond (because it crashed, there was a network fault, or for any other reason), the leader blocks all writes and waits until that synchronous replica is available again. Thus, the failure of any single replica brings the whole system to a standstill.

Due to this impact on latency and availability, fully synchronous replication is rarely used (MySQL can be configured this way). More common is a combination of synchronous and asynchronous replication, sometimes known as semi-synchronous replication.

For example, if a leader has two followers (replicas), the leader can write synchronously to one follower and asynchronously to the other. An acknowledgment from just that one synchronous follower is then enough to confirm the write, resulting in reduced latency and higher availability. If the synchronous follower fails, the asynchronous follower can be promoted to take its place (this approach has its own shortcomings, but it is still better than fully synchronous replication).
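A hedged sketch of such a semi-synchronous write path, again with invented names; the leader waits only for its one synchronous follower.

```python
# Hypothetical semi-synchronous write: one designated synchronous follower,
# the rest asynchronous. Names are illustrative, not a real database API.
import threading

class Follower:
    def apply(self, change):
        pass  # pretend: network call + local write

def write_semi_sync(sync_follower, async_followers, change):
    # Wait only for the synchronous follower's acknowledgment...
    sync_follower.apply(change)
    # ...and ship the change to the others in the background.
    for f in async_followers:
        threading.Thread(target=f.apply, args=(change,)).start()
    return "ack"  # latency is bounded by one follower, not by the slowest one

print(write_semi_sync(Follower(), [Follower(), Follower()], "x=3"))
```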

Booting up a new follower node in single-leader replication

The beauty of distributed systems is that a node can come and go at any time while the system as a whole keeps on working. When a node goes down (due to a crash, or because a coordinator node in the cluster forces it down), we might need to boot up a new node.

This new node starts with no data. We could copy the data from the leader node, but since data on the leader is in constant flux, by the time the copy reaches the new follower the data on the leader has already changed, leaving the two nodes out of sync. You could lock the leader database until the data is copied to the follower, but that approach goes against the holy grail of availability.

In practice, data copying is done in the following manner:

  1. A consistent snapshot of the data is taken without locking the database (made possible by the append-only log). This snapshot is then copied to the follower node.
  2. The follower node then requests from the leader all the data changes that have occurred since the snapshot was taken (i.e., from the snapshot’s offset in the log). Once the follower has received all those changes it is in sync with the leader, and the leader can push all new data directly to the follower (this catch-up flow is sketched right after the list).
  3. The position in the log from which the follower requests data goes by different names in different databases: it’s called the log sequence number in PostgreSQL and the binlog coordinates in MySQL.
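Here is that catch-up flow as a rough sketch. The Leader and NewFollower classes, the snapshot method and the offset bookkeeping are all invented for illustration.

```python
# Rough sketch of bootstrapping a new follower: copy a snapshot, then replay
# everything the leader logged after the snapshot's offset.

class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Consistent snapshot plus the log offset it corresponds to.
        return dict(self.data), len(self.log)

    def changes_since(self, offset):
        return self.log[offset:]

class NewFollower:
    def __init__(self, leader):
        # 1. Copy a snapshot without locking the leader.
        self.data, self.offset = leader.snapshot()
        # 2. Ask for everything that happened since the snapshot.
        for key, value in leader.changes_since(self.offset):
            self.data[key] = value
            self.offset += 1
        # 3. From here on, the leader can stream new changes directly.

leader = Leader()
leader.write("a", 1)
follower = NewFollower(leader)
leader.write("b", 2)  # would be streamed to the follower from offset 2 onward
```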

Failures and recovery in single-leader replication

A master or slave node can fail at any point, either due to a crash or due to maintenance. The failure of a node should not stop the system or degrade its performance, and the system should recover from such failures. This is the expected behavior of nodes in distributed systems.

Failure of a follower and its recovery

Each follower keeps a copy of the data on its local storage. If a follower crashes, is restarted, or cannot sync with the leader for a long duration due to network issues, it can easily catch up with the leader using its log offset: from the last offset in its log, it requests from the leader all the data changes it missed during the temporary failure. After applying those changes, the follower is back in sync with the leader and can continue to receive further data changes.

Failure of leader and its recovery

Managing a slave node failure is fairly straightforward; handling the failure of the leader node is complex. It requires promoting one of the followers to be the new leader, getting the other followers to switch to the newly appointed leader, and getting all clients to reset their configuration and send all writes to this new leader.

Even before performing any of the above actions, the most important step is to detect that the leader has failed. A leader can fail due to a crash, a power outage or a network interruption. Since no one can say for sure why a node (the leader) stopped responding, most systems opt for a timeout (what should the ideal timeout be?). Nodes pass messages among themselves at regular intervals, and if one node doesn’t respond within a given duration, it is assumed to be dead (the node may still be working but temporarily unreachable due to a network issue; it is assumed dead regardless, because no other node can know for sure the reason behind its non-responsiveness).
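A toy timeout-based failure detector, sketched under the assumption of periodic heartbeats; the class name and the 10-second timeout are arbitrary choices for illustration.

```python
# Toy failure detector based on a timeout: if a node hasn't been heard from
# within `timeout` seconds, it is *assumed* dead. It may merely be slow or
# partitioned away; the detector cannot tell the difference.
import time

class FailureDetector:
    def __init__(self, timeout_seconds=10):
        self.timeout = timeout_seconds
        self.last_heard = {}  # node id -> time of last heartbeat

    def heartbeat(self, node_id):
        self.last_heard[node_id] = time.monotonic()

    def suspected_dead(self, node_id):
        last = self.last_heard.get(node_id)
        return last is None or (time.monotonic() - last) > self.timeout

detector = FailureDetector(timeout_seconds=10)
detector.heartbeat("leader-1")
# Right after a heartbeat the leader is considered alive; if heartbeats stop
# for longer than the timeout, suspected_dead() starts returning True and a
# failover (promoting the most up-to-date follower) can be triggered.
print(detector.suspected_dead("leader-1"))  # False
```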

A newly appointed leader could be a predefined controller node, or it could be a follower with the most up-to-date changes.

Once a new leader is decided, clients send all writes to this new leader, and if the old leader comes back, it is forced to become a follower (to avoid the split-brain issue).

In asynchronous systems, the newly appointed leader might not have all the writes from the previous leader. If the previous leader comes back online and becomes a follower, it will have conflicting writes. Most systems that give weaker durability guarantees simply discard those changes from the previous leader.

How does replication work?

There are three ways in which a leader can replicate data to its followers:

  1. Statement-based replication
  2. Write-Ahead log
  3. Row-based replication

Statement-based replication

In this strategy, the leader logs every write statement (INSERT/UPDATE/DELETE) that it executes and sends the same statement to its followers, and each follower then executes it on its own node. This seems like a good strategy, but it has many pitfalls.

For example, any non-deterministic function can produce different results on the follower and the leader (think of RAND() or NOW()). Moreover, if a write statement (like an UPDATE) depends upon a previous write (an INSERT) and the two reach a follower out of order (which can happen with asynchronous replication), the result on the follower node is unpredictable. Furthermore, a statement can be lost in the network, and a follower has no way of asking the leader to resend such a lost write. This again results in data inconsistency.

Non-determinism can be tackled by having the leader replace non-deterministic calls with fixed return values before sending the statement to a follower, but it is still difficult to cover all edge cases. VoltDB uses statement-based replication by following this philosophy and by requiring transactions to be deterministic, avoiding out-of-order or lost writes.
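A toy sketch of that substitute-the-values idea. The SQL statement and the rewriting function are invented for illustration and are not VoltDB's (or any database's) actual mechanism.

```python
# If the leader ships the raw statement, each follower evaluates NOW()/RAND()
# itself and diverges; the workaround is to substitute fixed values first.
import random
import re
from datetime import datetime, timezone

statement = "INSERT INTO logins (user_id, at, token) VALUES (42, NOW(), RAND())"

def make_deterministic(stmt):
    # Replace non-deterministic calls with the values the leader actually used.
    stmt = stmt.replace("NOW()", f"'{datetime.now(timezone.utc).isoformat()}'")
    stmt = re.sub(r"RAND\(\)", str(random.random()), stmt)
    return stmt

print(make_deterministic(statement))
# Every follower now executes exactly the same literal values as the leader.
```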

Write-Ahead Log (WAL)

Whenever a write comes to the system, it is first recorded in an append-only log file, known as the write-ahead log, even before it is applied to the database. This log file can also be used to replicate data to follower nodes.

PostgreSQL and Oracle employ this strategy of data replication. The issue with WAL-based replication is that the log describes the data at a very low level: it contains details such as which bytes were changed in which disk blocks on the leader. It is therefore tightly coupled to the storage engine, which makes upgrading the software version on the leader and followers separately (a rolling upgrade) nearly impossible and forces downtime during such upgrades.
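To get a feel for how low-level such records are, here is a toy record layout, entirely invented for illustration; real WAL formats differ in detail but share this block-and-bytes flavor.

```python
# Toy illustration of how low-level a physical WAL record is: it refers to a
# disk block and a byte range, not to rows or columns. The layout is made up.
import struct

def wal_record(block_no, offset_in_block, new_bytes):
    # [block number][offset][length][raw bytes]
    return struct.pack(">IHH", block_no, offset_in_block, len(new_bytes)) + new_bytes

rec = wal_record(block_no=812, offset_in_block=96, new_bytes=b"\x00\x00\x03\xe8")
print(rec.hex())
# A follower running a different storage-engine version may lay out block 812
# differently, which is why WAL shipping makes rolling upgrades hard.
```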

Row-based replication

Row-based replication uses a sequence of records describing writes at the level of individual rows. For example, the record for an inserted row contains the new values of all its columns, a delete record contains enough information to identify the rows to be deleted, and an update record contains the new values of all columns.

If a transaction touches multiple rows, all of the affected rows are added to the log. MySQL’s binlog (when configured for row-based replication) uses this strategy.

Since row-based replication doesn’t rely on byte positions at particular locations on disk, it doesn’t suffer from the same issues as WAL shipping. Moreover, because log offsets are used for transferring data from the leader to the followers, it doesn’t suffer from out-of-order issues either.
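As a rough illustration of what a logical row-based change record might carry (the field names and structure below are invented, not MySQL's actual binlog format):

```python
# Sketch of logical (row-level) change records; format invented for illustration.

update_event = {
    "table": "accounts",
    "type": "UPDATE",
    "identify_row_by": {"id": 42},          # enough to find the affected row
    "new_values": {"balance": 100, "updated_at": "2019-10-12T10:00:00Z"},
}

delete_event = {
    "table": "sessions",
    "type": "DELETE",
    "identify_row_by": {"token": "abc123"},  # which row(s) to delete
}

# A follower replays these records in log order; because they contain the
# resulting values (not expressions like NOW()), replay is deterministic.
for event in (update_event, delete_event):
    print(event["type"], "on", event["table"])
```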

Replication Lag issues with Single leader replication

Asynchronous replication beats synchronous replication on write latency and overall availability. But when it comes to the consistency of reads from a follower (used for read scalability), it only provides eventual consistency: all writes will eventually reach all followers, but no one can say when.

If you run the same query concurrently on the leader and on a follower, you can get different results. The reason behind this discrepancy is replication lag (a write on the master has yet to reach the slave). Usually this lag is only a fraction of a second, but it can grow to minutes if the system is operating near capacity or if there is a problem (such as congestion) in the network between the leader and its followers. These inconsistencies due to replication lag can give bitter experiences to the users of your application. Following are the issues that arise due to replication lag:

  1. Inconsistent behavior of reads after writes
  2. Different results for the same query across multiple reads / violation of causality

Inconsistent behavior for reads after writes

Have you ever observed these anomalies?

  1. Seeing the same old view of your profile just after submitting profile changes and refreshing the page on social media, and then another refresh finally reflecting the new changes?
  2. Commenting on a post, only for the comment to disappear after one page refresh and reappear on the next?
  3. Getting an SMS notification from your bank about a successful fund transfer while the bank’s app still shows the old balance (a moment of happiness), and then a page refresh shows the actual, correct balance?

All these anomalies occur because of replication lag on asynchronous followers: reads are served by followers, and the new data might not have reached them yet. One solution is to keep reading from the leader for some duration after a write (a minute or so) and only then move read queries back to the followers (this adds complexity to the application). Social media websites can choose to serve all reads of a user’s own profile from the leader and reads of other users’ profiles from followers, since only a user can edit his own profile. This approach fails for other kinds of writes and reads, like posting comments. Moreover, it also fails for cross-device reads after writes (like writing from a browser and then reading from a mobile app). You could use a central server to keep track of the latest write, but then that server becomes a single point of failure.
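A minimal sketch of that read-from-the-leader-for-a-while idea; the one-minute grace period, the routing function and the in-memory bookkeeping are assumptions for illustration.

```python
# Approximate read-your-own-writes: remember when each user last wrote, and
# route their reads to the leader for a grace period afterwards.
import time

GRACE_PERIOD = 60   # seconds to keep reading from the leader after a write
last_write_at = {}  # user_id -> timestamp of that user's last write

def on_write(user_id):
    last_write_at[user_id] = time.monotonic()

def route_read(user_id):
    last = last_write_at.get(user_id)
    if last is not None and time.monotonic() - last < GRACE_PERIOD:
        return "leader"    # the user's own write may not have replicated yet
    return "follower"      # everyone else scales out to the replicas

on_write("alice")
print(route_read("alice"))  # "leader"
print(route_read("bob"))    # "follower"
```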

The difference in results of the same query for multiple reads

Another aberration we observe with asynchronous replication is going back in time. For example, let’s say we are following a cricket match on a cricket website and we keep refreshing the page for the latest update on the runs scored. What if at one moment we see that our team has won the match, and on the next refresh we see the match is still going on? This is bound to confuse the user, who will feel as if they have gone back in time.

The reason behind this incongruity is that the reads were served by different followers. The first read was served by a follower with a lower lag, the second by one with a higher lag. The follower with the higher lag hadn’t yet received the latest update, and the second refresh happened to land on it. It would have been much better for the user to keep seeing that the match is still going on rather than getting contradictory results.

One way of handling this absurdity is to serve all requests from one user from the same follower. Different users may be served by different follower nodes, but the same user’s requests always land on the same node (we can hash the user’s UUID for this purpose). However, if that follower fails, the user’s requests must be rerouted to a different replica.
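A small sketch of that idea: hash the user's id to pick a replica deterministically, so the same user keeps hitting the same follower. The function and replica names are invented for illustration.

```python
# Pin a user's reads to one replica so they never "go back in time".
import hashlib

REPLICAS = ["follower-1", "follower-2", "follower-3"]

def replica_for(user_id, replicas=REPLICAS):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

print(replica_for("user-42"))  # same user always lands on the same follower...
print(replica_for("user-42"))  # ...so successive reads cannot move backwards
```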

Handling Replication Lag in applications

Asynchronously replicated systems are eventually consistent systems. There is always a chance of serving stale data to a user if the replication lag is long. We can read from the leader for some duration just after a write, but this still doesn’t guarantee strong consistency (it’s difficult to predict how far a replica can lag). Moreover, this approach is quite complex and you can always miss edge cases. A hashing-based routing approach is one way to handle causality violations across reads.

Transactions are another way of dealing with such anomalies. They are fairly easy to implement on single-node databases but are often avoided in distributed systems due to the performance penalty and the impact on availability.

In the second part (Part-2) of this series we will discuss multi-leader replication, and in the third part (Part-3) leaderless replication.

Don’t forget to clap if you enjoyed reading this article!

References:

Designing Data-Intensive Applications by Martin Kleppmann

https://www.coursera.org/lecture/data-manipulation/eventual-consistency-yzid6

https://cs.nyu.edu/~mwalfish/papers/falcon-sosp11.pdf

https://en.wikipedia.org/wiki/Network_congestion
