DISTRIBUTED DATABASE REPLICATION

Intro to Replication

Introduction to ideas behind Replication in Database Systems to improve Availability and Fault Tolerance

Sriram R
4 min readMar 3, 2023

In the previous article, we explained how partitioning can help improve the scalability of a system. In this article, we will discuss why replication is important in Distributed Systems, especially in databases.

Replication

Replication is a crucial technique used in distributed systems, particularly databases, to ensure high availability and fault tolerance.

The process involves redundantly storing the same data in multiple nodes (called replicas). This way, if one of the nodes crashes, the data isn’t lost and can still be served by other replicas.

Although replication provides significant benefits to a distributed system, it comes at a cost. Replicated systems are complex to manage and require careful handling.

Benefits of Replication

Availability and Fault Tolerance

System availability refers to its ability to continue working as expected despite faults. For example, if you have 10 database instances that are replicated, even if one crashes, you can switch over to other instances easily, ensuring high availability and fault tolerance.

A single point of failure is a major reason why a system may not be highly available. To eliminate such points in any data system, we can use replication.

Scalability

As we have seen, a single machine can only handle a limited amount of load. If your system is read-heavy and requires many concurrent reads, replication can help spread the load of reads across multiple machines. This makes it easier for your system to scale.

Drawbacks of Replication

Data loss

Assuming that we are building a system that follows the Crash-Stop/Fair Loss/Asynchronous System Models, we want to create a database that supports replication. Let’s start with the case of having three replicas, where it takes 10ms for each replica to communicate with each other when everything works correctly.

Suppose a piece of data is written to Replica A. If Replica A goes down, that piece of data is lost. To prevent this, we decide that as soon as a node receives some data, it sends that data to other replicas as well, which on average takes 10ms. This ensures that all replicas are in sync within 10ms, and no data will be lost if Replica A fails.

However, there is also a possibility that Replica A fails before sending the data to Replica B or Replica C. In this case, the data is lost forever.

Inconsistency

Consistency is a crucial property of most databases. This essentially means that when you ask for a piece of data after it’s written, we should always return the updated data, not old and stale data.

Suppose we write to Replica A, and the above scenario dictates that it takes 10ms on average (and maybe even more) to replicate the data to Replica B and Replica C since we’re assuming a Fair Loss Link. Before the data is replicated, if we get a read request to Replica B within 5ms of writing to Replica A, Replica B doesn’t have the updated value yet and will respond with the previous state of the data.

Hence, replication isn’t perfect. It does give us higher availability and fault tolerance at a cost. Understanding these costs and how they impact your system is crucial when you build distributed systems.

Types of Replication

There are two main strategies for implementing replication:

  1. Pessimistic Replication
  2. Optimistic Replication

Pessimistic Replication

Pessimistic replication attempts to guarantee that all replicas are identical to each other, as if there was only one copy of data all along. However, achieving this replication type is almost impossible without sacrificing both read and write performance.

Optimistic Replication

Optimistic Replication, also known as Lazy Replication, allows replicas to diverge or be out of sync from time to time. This guarantees that they will eventually get into sync if the system doesn’t receive any further updates for a period of time.

Replication Algorithms

There are two main types of algorithms widely used in replication:

  • Single Master Replication
  • Multi-Master Replication

We will discuss both of these in detail in upcoming articles.

--

--

Sriram R

A software engineer on the way to a jack-of-all-trades.