Data Replication in Distributed Systems

InterviewReady
Jul 26, 2022

By Avash Mitra

What is data replication?

Data replication is the process of making multiple copies of a data item to ensure availability. Copied data is usually stored in different database instances, so even if one instance fails, we can get the data from other instances.

One popular architecture for implementing data replication is the master-slave architecture.

Master-Slave Architecture

To understand this architecture, let’s take an example.

Suppose,

  • We have four clients each connected to a load balancer.
  • The load balancer then distributes the requests to three application servers.
  • All three servers are connected to the same single database instance.

Do we notice any issues here?

YES. Our database is a single point of failure. If it crashes, our entire system stops working.

To avoid this single point of failure, we can add a second database (preferably a separate instance running on different hardware) that stores a copy of the original data. Now if the original database crashes, we can route the requests to the secondary database.
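The failover idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `Database`, `FailoverRouter`, `DatabaseDown`, and the `alive` flag are hypothetical names, the data lives in memory, and a crash is simulated by flipping the flag.

```python
class DatabaseDown(Exception):
    """Raised when a (simulated) database instance is unreachable."""


class Database:
    # Hypothetical in-memory stand-in for a real database instance.
    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})
        self.alive = True  # set to False to simulate a crash

    def read(self, key):
        if not self.alive:
            raise DatabaseDown(self.name)
        return self.data[key]


class FailoverRouter:
    """Send reads to the primary; fall back to the secondary if it is down."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def read(self, key):
        try:
            return self.primary.read(key)
        except DatabaseDown:
            return self.secondary.read(key)


primary = Database("primary", {"x": 100})
secondary = Database("secondary", {"x": 100})  # copy of the original data
router = FailoverRouter(primary, secondary)

before_crash = router.read("x")  # served by the primary
primary.alive = False            # simulate the primary crashing
after_crash = router.read("x")   # served by the secondary
```

The read still succeeds after the crash only because the secondary already holds a copy of the data, which is why keeping the two databases in sync matters.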

But how do we keep the secondary database in sync with the main database? There are two approaches.

Replicating data synchronously

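  • In this method, data is written to the primary storage and the replica as part of the same operation. The write is acknowledged only after both copies have it.
  • Data is always consistent, i.e., if a write to the primary storage succeeds, the same write is also on the replica.
  • Load on the database is high, because every write has to wait for the replica.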
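As a rough sketch of the synchronous approach, the primary below acknowledges a write only after the replica has applied it. The class names (`Replica`, `SyncPrimary`) and the `alive` flag are assumptions made for this example, and the storage is a plain in-memory dictionary.

```python
class Replica:
    # Hypothetical in-memory replica.
    def __init__(self):
        self.data = {}
        self.alive = True  # set to False to simulate an unreachable replica

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError("replica unreachable")
        self.data[key] = value


class SyncPrimary:
    """Acknowledge a write only after the replica has applied it too."""

    def __init__(self, replica):
        self.data = {}
        self.replica = replica

    def write(self, key, value):
        # Replicate first. If this raises, the primary is left unchanged,
        # so the two copies never diverge.
        self.replica.apply(key, value)
        self.data[key] = value
        return "ack"


replica = Replica()
primary = SyncPrimary(replica)
primary.write("x", 100)

replica.alive = False
try:
    primary.write("x", 80)
    write_failed = False
except ConnectionError:
    write_failed = True  # the client sees an error instead of an ack
```

The price of this consistency is visible in the second write: when the replica is unreachable, the write fails instead of being accepted by the primary alone.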

Replicating data asynchronously

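  • In this method, the data is first written to the primary storage, and the updates are written to the replicas at regular intervals.
  • Since replication occurs at fixed intervals, there is a chance of data loss and inconsistency: writes that have not yet been copied are lost if the primary crashes, and until the next interval a replica can return stale data.
  • Load on the database is comparatively lower.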
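A minimal sketch of the asynchronous approach, assuming a hypothetical `AsyncPrimary` class whose `flush` method stands in for whatever timer or background job ships the updates in a real system:

```python
class AsyncPrimary:
    """Acknowledge writes immediately; ship them to the replica later."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes not yet sent to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))
        return "ack"  # the client does not wait for the replica

    def flush(self, replica_data):
        # Would be called at regular intervals to ship the queued writes.
        for key, value in self.pending:
            replica_data[key] = value
        self.pending.clear()


replica_data = {}
primary = AsyncPrimary()

primary.write("x", 100)
primary.flush(replica_data)           # replica catches up: x is 100

primary.write("x", 80)                # acknowledged, not yet replicated
stale_read = replica_data["x"]        # replica still returns the old value
unreplicated = list(primary.pending)  # lost if the primary crashes now
```

Between two flushes the replica lags behind the primary, which is exactly the window in which stale reads and data loss can happen.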

The database that receives the write requests is the master, and the replicas are known as slaves.

The master maintains a transaction log. To update the data in the slaves, it sends them the commands from this log, and the slaves execute the commands in the same order.
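Log-based replication can be sketched as follows. The command format (`op`, `key`, `amount`) and the names `Master`, `Slave`, and `apply_command` are made up for this example; real databases use their own log formats.

```python
def apply_command(data, command):
    """Apply one logged command to a dictionary of values."""
    op, key, amount = command
    if op == "set":
        data[key] = amount
    elif op == "deduct":
        data[key] -= amount
    else:
        raise ValueError(f"unknown op: {op}")


class Master:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of commands, oldest first

    def execute(self, op, key, amount):
        command = (op, key, amount)
        apply_command(self.data, command)
        self.log.append(command)


class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0  # index of the next log entry to apply

    def catch_up(self, log):
        # Replay only the entries not seen yet, in the master's order.
        for command in log[self.applied:]:
            apply_command(self.data, command)
        self.applied = len(log)


master = Master()
master.execute("set", "x", 100)
master.execute("deduct", "x", 20)

slave = Slave()
slave.catch_up(master.log)  # slave replays both commands in order
```

Order matters here: replaying `deduct` before `set` would fail or give a different result, which is why the slaves must execute the commands in the same sequence as the master.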

What happens if a server sends a write request to a slave?

There are two ways to handle this situation:

  • Not allowing write requests to slaves. A slave cannot write to the database; it just stores copies of the data.
  • Allowing slaves to write data. The slave accepts the write and then propagates the change to the master. In this case, the slave is also acting as a master, so it is no longer a master-slave architecture but a master-master architecture.
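The two options can be contrasted in a short sketch. `ReadOnlySlave`, `WritableNode`, and `ReadOnlyError` are hypothetical names, and propagation is reduced to copying a value into each peer's dictionary.

```python
class ReadOnlyError(Exception):
    """Raised when a write reaches a node that does not accept writes."""


class ReadOnlySlave:
    # Option 1: the slave only stores copies and refuses writes.
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        raise ReadOnlyError("writes must go to the master")


class WritableNode:
    # Option 2: every node accepts writes and propagates them to its peers.
    def __init__(self):
        self.data = {}
        self.peers = []

    def write(self, key, value):
        self.data[key] = value
        for peer in self.peers:
            peer.data[key] = value  # propagate the change


slave = ReadOnlySlave()
try:
    slave.write("x", 1)
    rejected = False
except ReadOnlyError:
    rejected = True

a, b = WritableNode(), WritableNode()
a.peers, b.peers = [b], [a]
b.write("x", 1)  # a write accepted by the second node reaches the first
```

In the second option neither node is special any more: each one accepts writes and pushes them to the other.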

Problem with Master-Master architecture

Communication failure can cause data inconsistency in a master-master architecture.

Let’s understand this using an example.

Suppose we have two database instances, A and B.

  • Both are masters.
  • The router between them has failed. So A believes B is offline and B believes A is offline.
  • They have a data item X whose initial value is 100.

Now a user sends the following requests:

  • Deduct 20 from X. This request is routed to A. Now the value of X in A is 80.
  • Deduct 80 from X. This request is routed to B (since both are masters, write requests can be routed to either database). Now the value of X in B is 20.

Since there is a communication failure, A and B cannot sync. They now hold different values for X (80 in A and 20 in B) and are therefore inconsistent. Had both updates been replicated, X would be 0 in both. Now if the user makes a read request, they will get a different value depending on which database they connect to.

This problem is known as the Split-Brain Problem.
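The arithmetic of this example can be replayed with a small simulation. The function `run` and its `link_up` parameter are invented for this sketch; replication between the two masters is reduced to applying the same deduction on the other node.

```python
def run(link_up):
    """Apply the two deductions from the example; return (X on A, X on B)."""
    x = {"A": 100, "B": 100}

    def deduct(node, amount):
        other = "B" if node == "A" else "A"
        x[node] -= amount
        if link_up:
            x[other] -= amount  # the change reaches the other master

    deduct("A", 20)  # first request, routed to A
    deduct("B", 80)  # second request, routed to B
    return x["A"], x["B"]


healthy = run(link_up=True)       # (0, 0): both masters agree
partitioned = run(link_up=False)  # (80, 20): split brain
```

With the link up, both masters end at 0. With the link down, each master applies only the request it received, so the two copies disagree.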

Solving the Split-Brain Problem

We can solve the Split-Brain Problem by adding a third node (database instance).

Here we are assuming that it is highly unlikely for one node to crash at the same time as the router between the other two nodes fails.

Let us consider three database instances A, B, and C.

  • If C crashes, A and B are still masters and they are in sync, so they are in a consistent state. When C comes back online, it can read the current state from A or B.
  • If there is a communication failure between A and B:
      • When A gets a write request, it propagates its state to C. Initially the state is S0, then it moves to Sx. So now both A and C have Sx.
      • When B gets a write request, it moves its state from S0 to Sy. It tries to propagate its state to C, but this fails because B’s previous state (S0) is not equal to C’s current state (Sx). B then aborts the write request and updates its state to Sx. After that, B can accept write requests and propagate the changes to C.

This is known as Distributed Consensus: multiple nodes agree on a particular value. In this case, A, B, and C agree on the final state.
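The three-node scenario can be sketched as a compare-before-update check on C. This is a simplification under stated assumptions, not a full consensus protocol: `Arbiter` (node C), `Master`, and `propose` are hypothetical names, states are plain strings, and a master checks with C before changing its own state rather than rolling back afterwards.

```python
class Arbiter:
    """Node C: accepts an update only if the sender's previous state matches."""

    def __init__(self, state):
        self.state = state

    def propose(self, previous, new):
        if previous != self.state:
            return False  # the sender is out of date
        self.state = new
        return True


class Master:
    def __init__(self, state, arbiter):
        self.state = state
        self.arbiter = arbiter

    def write(self, new_state):
        if self.arbiter.propose(self.state, new_state):
            self.state = new_state
            return "committed"
        # Rejected: abort the write and adopt the arbiter's state instead.
        self.state = self.arbiter.state
        return "aborted"


c = Arbiter("S0")
a = Master("S0", c)
b = Master("S0", c)  # A and B cannot talk to each other, only to C

result_a = a.write("Sx")      # C moves from S0 to Sx
result_b = b.write("Sy")      # rejected: B still thinks the state is S0
state_after_abort = b.state   # B has caught up to Sx
result_retry = b.write("Sz")  # B is up to date now, so this commits
```

Because C accepts an update only from a node whose previous state matches its own, A and B can never both commit conflicting writes on top of the same state.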

That’s it for now!

You can check out more designs on our video course at InterviewReady.
