MySQL diary #1: database replication lag or how to (not) invalidate the cache

To scale databases in high-traffic applications, we usually set up a master node and a few slave nodes: writes are replicated from the master to the slaves, and the application uses the slaves to read data.

Maciej Brencz
Legacy Systems Diary
3 min read · May 25, 2018


While this approach solves scaling problems, it brings its own set of problems. One of them is The Infamous Replication Lag.

All writes (INSERT, UPDATE and DELETE queries, as well as schema ALTERs) go to the master node. Slaves are permanently connected to the master and receive updates from it, which are then applied to their copy of the data (with a slight delay compared to when the master received the update). Such a setup improves the performance of both writes (the master can now focus solely on them, instead of handling reads as well) and reads (more slaves can be added as the application gets more load). Any slave can be taken out of the pool for maintenance without causing problems for the application.
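This read/write split can be sketched in a few lines of Python. Everything here is made up for illustration (the `Node` class is a stand-in for a real database connection, not any actual driver API); the point is simply that writes always hit the master while reads rotate over the slave pool:

```python
import itertools

class Node:
    """Stand-in for a database connection (illustration only)."""
    def __init__(self, name):
        self.name = name
        self.queries = []

    def run(self, query):
        self.queries.append(query)
        return (self.name, query)

class ReplicatedDatabase:
    """Writes go to the master; reads rotate across slaves round-robin."""
    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def execute_write(self, query):
        # INSERT / UPDATE / DELETE / ALTER always hit the master
        return self.master.run(query)

    def execute_read(self, query):
        # SELECTs are spread over the slave pool
        return next(self._slaves).run(query)

db = ReplicatedDatabase(Node("master"), [Node("slave1"), Node("slave2")])
db.execute_write("INSERT INTO pages VALUES (...)")
print(db.execute_read("SELECT * FROM pages")[0])  # served by slave1
print(db.execute_read("SELECT * FROM pages")[0])  # served by slave2
```

Adding capacity for reads then amounts to adding another node to the slave list, which is exactly why this topology scales reads so well.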

Enough theory. Let’s take a look at some real life examples…

Replication lag example

Imagine the typical scenario of a user performing a page edit:

  • the user makes some changes and clicks the “Publish” button
  • an HTTP POST request is made, the master database node gets an INSERT query, and caches are invalidated so that the new data can be read
  • the web server redirects the user to the page they have just edited; a slave database node is queried, but it has not yet received the latest data from the master (replication in MySQL is asynchronous), so the code gets stale data from the slave
  • the user wonders: where are my changes?
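The steps above can be reproduced with a toy simulation of asynchronous replication (all names here are invented; a "binlog" list stands in for MySQL's real replication stream, and lag is modeled by only applying it when `replicate()` is called):

```python
class LaggedReplica:
    """Toy async replication: the slave applies the master's writes
    only when replicate() runs, modeling replication lag."""
    def __init__(self):
        self.master = {}
        self.slave = {}
        self._binlog = []

    def write(self, key, value):
        # The master gets the write immediately and records it
        self.master[key] = value
        self._binlog.append((key, value))

    def read_from_slave(self, key):
        return self.slave.get(key)

    def replicate(self):
        # The slave catches up "later", like an async MySQL replica
        for key, value in self._binlog:
            self.slave[key] = value
        self._binlog.clear()

db = LaggedReplica()
db.write("article:42", "new revision")   # POST: master gets the INSERT
print(db.read_from_slave("article:42"))  # redirect reads the slave: None (stale!)
db.replicate()                           # replication catches up
print(db.read_from_slave("article:42"))  # now "new revision"
```

The read between `write()` and `replicate()` is exactly the window in which the user sees their changes missing.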

Dealing with the replication lag

Reading from a slave shortly after altering the data we’re interested in is a tricky business: we may get data from before the update. So, should we read from the master then? No.

But what if we build the application in such a way that this tricky read query is not needed at all? How?

When handling the aforementioned request, we can do the following: instead of invalidating the cache, we update it with the latest data. We can use the data we got in the request and/or query the master node and cache the result. Then, when the user is redirected to the page with the updated content, we get the up-to-date entry from the warm cache instead of a cache miss (which would force us to fetch data from the database and risk hitting the replication lag).
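A minimal sketch of this write-through pattern, assuming a plain dict as the cache and a toy store whose slave deliberately lags behind (the `FakeDb` class and handler names are hypothetical, not the post author's actual code):

```python
class FakeDb:
    """Toy store whose slave never catches up, to model worst-case lag."""
    def __init__(self):
        self.master, self.slave = {}, {}

    def write(self, key, value):
        self.master[key] = value      # the slave lags behind on purpose

    def read_from_slave(self, key):
        return self.slave.get(key)

cache = {}

def handle_edit(db, page_id, new_content):
    db.write(page_id, new_content)
    # Don't invalidate: store the fresh value we already have in
    # the request, so the next read never touches the database
    cache[page_id] = new_content

def handle_view(db, page_id):
    if page_id in cache:                  # warm cache: no lag risk
        return cache[page_id]
    value = db.read_from_slave(page_id)   # cache miss: slave only
    cache[page_id] = value
    return value

db = FakeDb()
handle_edit(db, "article:42", "new revision")
print(handle_view(db, "article:42"))  # "new revision", despite the lagging slave
```

The view handler never needs to choose between a stale slave and an overloaded master: the edit path already put the right answer in the cache.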

Don’t invalidate the cache — update it on change

We used this approach to improve the behaviour of code that was responsible for 21% of the database queries made to the master node when handling GET requests (for instance, when viewing a wiki article). Caching was added there too.

The rate of master node queries from the method mentioned above dropped from 52k to 4.8k a minute (a 91% reduction).

A similar fix was applied to the code handling talk page edit notifications. Before the change, the code was making 800k SELECT queries an hour, always against the master. After the change, only slave nodes are used: we now update the cache when a new talk page edit arrives, and cache misses are handled by slave database nodes only. The number of queries dropped by 78%.

TL;DR

Moving read SQL queries to slave nodes (and adding caching around them) makes your application more resilient to master database problems. Your app can survive one slave node going down and still serve traffic. Having a strong dependency on the master node (which you cannot scale by simply adding more masters) for your read-only requests is a risky game.

Let’s make master database nodes and your DBA’s life (hi, Drozdo :) ) a bit less stressful, and your application more resilient to database issues. Remember that usually the problem is not the tool itself.

For all NoSQL fanboys ;)

Don’t blame MySQL for every database-related issue you need to fix.


A Poznań native through and through, passionate about his home city and the Far North / Enjoys investigating how software works under the hood