The Impact and Mitigation of Replication Lag in High-Availability Systems

Tackling Replication Lag in High-Availability Systems

Published in Plumbers Of Data Science · 5 min read · Jul 5, 2023

Replication lag is the time that elapses between a write being stored in the primary database and that write being replicated to the standby databases. In simple words, it is the delay between a write happening on the leader and being reflected on a follower.
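One common way to put a number on this delay is a heartbeat: the leader periodically writes the current time, and the follower's lag is estimated from the age of the newest heartbeat it has received. The sketch below is a minimal in-memory illustration of that idea, not a production monitor. The class and method names are hypothetical, and the two attributes stand in for a one-row heartbeat table on each node.

```python
from typing import Optional


class HeartbeatLagMonitor:
    """Estimates replication lag from a heartbeat timestamp written on the leader."""

    def __init__(self) -> None:
        # Hypothetical stand-ins for a one-row heartbeat table on each node.
        self.leader_heartbeat: Optional[float] = None
        self.follower_heartbeat: Optional[float] = None

    def write_heartbeat(self, now: float) -> None:
        """The leader records the current time (seconds)."""
        self.leader_heartbeat = now

    def replicate(self) -> None:
        """Simulates the follower applying the leader's latest heartbeat."""
        self.follower_heartbeat = self.leader_heartbeat

    def lag_seconds(self, now: float) -> Optional[float]:
        """Age of the newest heartbeat visible on the follower, or None if none yet."""
        if self.follower_heartbeat is None:
            return None
        return max(0.0, now - self.follower_heartbeat)


monitor = HeartbeatLagMonitor()
monitor.write_heartbeat(100.0)
monitor.replicate()              # follower has seen the heartbeat from t=100
monitor.write_heartbeat(103.0)   # newer heartbeat not yet replicated
print(monitor.lag_seconds(now=104.0))  # 4.0
```

The estimate is only as fine-grained as the heartbeat interval, and it assumes the clocks used on both sides are reasonably in sync.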

In high-availability systems that rely on standby databases for failover, replication lag can have serious consequences. When a lagging standby is promoted during failover, any writes it has not yet received from the primary are lost. Significant lag is therefore more than a theoretical concern; it is a practical problem for applications.

Replication lag may occur under various circumstances:

1. The primary database fails to transmit changes promptly to the replica.

2. The replica encounters delays in receiving the changes.

3. The replica experiences difficulties in applying the changes promptly.
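The three circumstances above correspond to three stages a change passes through: sent by the primary, received by the replica, and applied by the replica. The sketch below assumes we can observe a log position (for example a byte offset) at each stage and splits the total lag accordingly. All field and function names are hypothetical; they are not taken from any particular database.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReplicationPositions:
    """Log positions observed at one moment (hypothetical fields)."""
    primary_written: int  # last change written on the primary
    sent: int             # last change the primary has sent to the replica
    received: int         # last change the replica has received
    applied: int          # last change the replica has applied


def lag_breakdown(p: ReplicationPositions) -> Dict[str, int]:
    """Splits the total backlog into the three stages where lag can build up."""
    return {
        "send_backlog": p.primary_written - p.sent,   # cause 1: primary slow to send
        "receive_backlog": p.sent - p.received,       # cause 2: replica slow to receive
        "apply_backlog": p.received - p.applied,      # cause 3: replica slow to apply
        "total": p.primary_written - p.applied,
    }


def dominant_cause(p: ReplicationPositions) -> str:
    """Returns the stage with the largest backlog (first one wins on a tie)."""
    parts = lag_breakdown(p)
    del parts["total"]
    return max(parts, key=parts.get)


positions = ReplicationPositions(primary_written=1000, sent=990, received=980, applied=700)
print(lag_breakdown(positions)["total"])  # 300
print(dominant_cause(positions))          # apply_backlog
```

In this example most of the backlog sits in the apply stage, which would point the investigation at the replica rather than at the primary or the network.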

While we cannot completely eliminate replication lag, it is crucial to minimize it as far as possible. To achieve this, we should:

Check for configuration differences between the primary and replica instances.
Assess whether the primary instance is burdened with an excessive write workload.
Examine the duration of transactions.
Verify the correctness of parameter settings.
Review version change logs.

There are several consistency models that aid in determining the expected behavior of an application in the presence of replication lag:

1. Read-after-write consistency:

Users should always see the data they have submitted themselves. This guarantee ensures that when a user reloads the page, any updates they have submitted will appear. It makes no promise about updates made by other users, which may not be visible until later. Still, it reassures users that their own input has been saved correctly.

This guarantee is also known as “read-your-own-writes”. There are several ways to implement it:

a) Always read a user’s editable data from the leader.

On a social network, a user’s profile can typically be edited only by its owner. This allows a simple rule: always read the user’s own profile from the leader, and read any other user’s profile from a follower.

b) Read only from the leader for one minute following an update.

One possible approach is to monitor the timestamp of the most recent update and, within a one-minute timeframe following the update, direct all read operations to the leader.

c) When reading from a follower, make sure the follower has applied all updates at least up to the timestamp of that user’s most recent update.

If a replica is not sufficiently up-to-date, either the read can be handled by another replica or the query can wait until the replica has caught up.
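The three approaches above can be sketched as routing decisions. This is a minimal illustration, not a complete implementation: the `ReadRouter` class, its method names, and the node names are all hypothetical, and timestamps are plain floats in seconds (a real system might use logical timestamps or log positions instead). For approach (c), the sketch returns `None` when no follower is caught up, leaving it to the caller to wait and retry.

```python
from typing import Dict, Optional

LEADER = "leader"


class ReadRouter:
    """Decides which node should serve a read (hypothetical helper)."""

    def __init__(self, pin_seconds: float = 60.0) -> None:
        self.pin_seconds = pin_seconds
        self.last_write_at: Dict[str, float] = {}  # user_id -> time of last write

    def record_write(self, user_id: str, ts: float) -> None:
        self.last_write_at[user_id] = ts

    def route_profile_read(self, viewer_id: str, owner_id: str, follower: str) -> str:
        """(a) A user's own (editable) profile is read from the leader."""
        return LEADER if viewer_id == owner_id else follower

    def route_recent_writer(self, user_id: str, now: float, follower: str) -> str:
        """(b) Reads go to the leader for pin_seconds after the user's last write."""
        last = self.last_write_at.get(user_id)
        if last is not None and now - last < self.pin_seconds:
            return LEADER
        return follower

    def route_caught_up(self, user_id: str, applied_up_to: Dict[str, float]) -> Optional[str]:
        """(c) Picks a follower that has applied the user's last write, if any."""
        last = self.last_write_at.get(user_id)
        for name, applied_ts in applied_up_to.items():
            if last is None or applied_ts >= last:
                return name
        return None  # no follower is caught up yet: the caller waits and retries


router = ReadRouter()
router.record_write("alice", ts=100.0)
print(router.route_profile_read("alice", "alice", follower="f1"))  # leader
print(router.route_recent_writer("alice", now=130.0, follower="f1"))  # leader
print(router.route_caught_up("alice", {"f1": 90.0, "f2": 105.0}))  # f2
```

Approach (a) only helps when it is clear which data a user can edit, and approach (b) sends more traffic to the leader when users write often, so the right choice depends on the application.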

2. Monotonic reads:

Once users have seen data as of a certain point in time, they should not see older versions of that data later, i.e., they will not read older data after having read a newer value.

For example: suppose you read your email on the move, without modifying anything. Each time you connect to a new email server, that server first fetches all the updates from the server you visited previously, so you never see an older state of your mailbox.

To achieve monotonic reads, one approach is to ensure that each user consistently accesses data from the same replica. This replica can be selected based on a hash of the user’s ID. However, if the chosen replica becomes unavailable, the user’s queries will need to be redirected to another replica.
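A minimal sketch of this replica selection is shown below. The function name and the replica names are hypothetical; the hash comes from Python's standard `hashlib`, so the choice is stable across processes. If the preferred replica is unavailable, the sketch simply walks to the next one in the list.

```python
import hashlib
from typing import AbstractSet, List, Optional


def pick_replica(
    user_id: str,
    replicas: List[str],
    unavailable: Optional[AbstractSet[str]] = None,
) -> str:
    """Maps a user to a fixed replica, falling back to the next available one."""
    if not replicas:
        raise ValueError("at least one replica is required")
    down = unavailable or frozenset()
    # Stable hash of the user ID, so the same user always starts at the same replica.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(replicas)
    for offset in range(len(replicas)):
        candidate = replicas[(start + offset) % len(replicas)]
        if candidate not in down:
            return candidate
    raise RuntimeError("no replica is available")


replicas = ["replica-a", "replica-b", "replica-c"]
preferred = pick_replica("user-42", replicas)
assert pick_replica("user-42", replicas) == preferred  # same user, same replica
fallback = pick_replica("user-42", replicas, unavailable={preferred})
assert fallback != preferred                           # rerouted while it is down
```

Note that the fallback replica may be further behind than the preferred one, so the monotonic guarantee can be weakened during a reroute unless the application also checks how up to date the new replica is.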

3. Consistent prefix reads:

Users should see data in an order that respects causality. Consider this exchange:

Martha: Hi Michael, did you go visit Grandma yesterday?
Michael: Yes, I did visit her yesterday.

The order in which an observer receives this conversation matters. If the messages arrive in a different order, as shown below, the conversation no longer makes sense:

Michael: Yes, I did visit her yesterday.
Martha: Hi Michael, did you go visit Grandma yesterday?

To avoid such an anomaly, a guarantee known as consistent prefix reads is necessary: anyone reading a sequence of writes sees them in the order in which they were written. The problem arises particularly in partitioned (sharded) databases. In many distributed databases, partitions operate independently, so there is no global ordering of writes. Consequently, when a user reads from the database, they may see some parts of it in an older state and others in a newer state.

One proposed solution involves ensuring that any writes with a causal relationship are consistently stored within the same partition.
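One way to keep causally related writes together is to derive the partition from a key they share, such as a conversation ID. The sketch below is a simplified in-memory illustration under that assumption. The `PartitionedLog` class and its methods are hypothetical, and each partition is just a Python list that preserves append order.

```python
import hashlib
from typing import List, Tuple


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Maps a conversation to one partition using a stable hash."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedLog:
    """Stores messages so that one conversation always lives in one partition."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        self.num_partitions = num_partitions
        self.partitions: List[List[Tuple[str, str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def append(self, conversation_id: str, author: str, text: str) -> int:
        """Appends a message and returns the partition index it was written to."""
        idx = partition_for(conversation_id, self.num_partitions)
        self.partitions[idx].append((conversation_id, author, text))
        return idx

    def read_conversation(self, conversation_id: str) -> List[Tuple[str, str]]:
        """Reads one conversation from its single partition, in write order."""
        idx = partition_for(conversation_id, self.num_partitions)
        return [(a, t) for c, a, t in self.partitions[idx] if c == conversation_id]


log = PartitionedLog(num_partitions=4)
p1 = log.append("martha-michael", "Martha", "Hi Michael, did you go visit Grandma yesterday?")
p2 = log.append("martha-michael", "Michael", "Yes, I did visit her yesterday.")
assert p1 == p2  # both messages land in the same partition
print([author for author, _ in log.read_conversation("martha-michael")])
# ['Martha', 'Michael']
```

Because the question and the answer share a partition, a reader of that partition sees them in the order they were written. The trade-off is that a very active conversation cannot be spread across partitions.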

In conclusion, replication lag remains a critical concern in high-availability systems. Its potential consequences, such as data inconsistency and loss, highlight the importance of addressing and minimizing this issue.

You might be interested in this series, where I introduce several important concepts that new Data Engineers should be aware of. The other topics I have covered so far:

Replication

Sharding and Partitioning

Partitioning Data

Optimizing data

Enhanced Query Performance

Indexing

Scalability

Slowly Changing Dimension

Distinctions between CTEs, SubQueries and TempTables
