MySQL diary #5: Querying the database master and why you should consider asking a slave

Database performance optimisation is an endless journey.

Maciej Brencz
Fandom Engineering
Mar 19, 2019

Recently we worked on the ticket “Assess MediaWiki DB writes and migrate to background tasks where possible” and its sub-tasks, aiming to reduce the query rate on the database master nodes. Our goal was to review code that makes explicit queries to the database master while handling GET requests (which, according to the HTTP specification, should be safe, i.e. free of side effects). Why did we do it?

Slaves are way easier to scale.

We can survive one slave node being down; Consul will simply take it out of the pool. That does not apply to the master node, where there is no fallback, so a hard dependency on the master host should be carefully considered.

So, whenever you make an explicit query to the master just to fetch data (i.e. perform a read operation), ask yourself: do I really need to do this? The possibility of The Infamous Slave Lag is not an answer here, as you can easily guard against it in your code. How? Bear with us.

And, to repeat ourselves here: the best query is the one not made at all, or at least one deferred to an offline task. A sketch of what that deferral can look like follows.
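Here is a minimal sketch in Python, using Celery with a Redis broker purely as an illustration; the helper functions, table names and the page-view counter are made up for this example and are not MediaWiki's actual code.

    # Sketch only: move a non-critical write off the GET request path.
    from celery import Celery

    app = Celery("tasks", broker="redis://localhost:6379/0")

    def run_on_master(sql: str, args: tuple) -> None:
        # Placeholder for "execute this statement on the master node".
        print(f"[master] {sql} {args}")

    def run_on_slave(sql: str, args: tuple) -> list:
        # Placeholder for "run this query on any healthy slave node".
        print(f"[slave] {sql} {args}")
        return []

    @app.task
    def bump_page_views(page_id: int) -> None:
        # The write still reaches the master, but from a background worker,
        # not while a user is waiting for their GET response.
        run_on_master("UPDATE page_stats SET views = views + 1 WHERE page_id = %s", (page_id,))

    def handle_get_page(page_id: int) -> list:
        rows = run_on_slave("SELECT * FROM page WHERE page_id = %s", (page_id,))
        bump_page_views.delay(page_id)  # queued now, executed later by a worker
        return rows

The GET handler never blocks on the master; the counter update happens whenever a worker picks the task up, which is perfectly fine for data nobody needs to read back immediately.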

Caching to the rescue

The typical solution to database performance issues is caching. We can also use caching to guard ourselves against slave lag, which shows up in the following situation:

  1. A POST request is performed (e.g. an edit is saved) and the database master node gets an INSERT query.
  2. The server redirects you to the page you have just edited. A slave database node is queried, but it has not yet received the latest data from the master (MySQL replication is asynchronous). You get stale data.

How can we protect the code in point 2 without making an explicit query to the master? What if we add caching? Let’s see:

  1. Instead of invalidating the cache, we update it with the latest data (we can use the data we already have in the request, or query the master node once and cache the result).
  2. Instead of getting a miss from the cache (which would force us to fetch data from the database), we get the up-to-date entry straight from the cache (see the sketch below this list).
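Here is a minimal sketch of that write-through idea, assuming a memcached-style cache via pymemcache and reusing the run_on_master / run_on_slave placeholders from the earlier sketch; the key scheme, TTL and page table are our own illustrative choices, not the actual MediaWiki implementation.

    # Sketch only: refresh the cache on write instead of invalidating it, so the
    # read that follows a POST never needs the master and never sees slave lag.
    import json
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    TTL = 3600  # one hour

    def save_edit(page_id: int, new_text: str) -> None:
        # 1. The write goes to the master, as it must.
        run_on_master("UPDATE page SET text = %s WHERE page_id = %s", (new_text, page_id))
        # 2. Instead of deleting the cache entry, refresh it with the data we
        #    already have in hand; the redirect that follows will hit this entry.
        cache.set(f"page:{page_id}", json.dumps({"text": new_text}), expire=TTL)

    def load_page(page_id: int) -> dict:
        cached = cache.get(f"page:{page_id}")
        if cached is not None:
            # Cache hit: no database query at all, and no stale data right after
            # an edit, because save_edit() has just refreshed this entry.
            return json.loads(cached)
        # Cache miss: a slave is good enough; slight staleness is acceptable here.
        rows = run_on_slave("SELECT text FROM page WHERE page_id = %s", (page_id,))
        page = {"text": rows[0][0]} if rows else {}
        cache.set(f"page:{page_id}", json.dumps(page), expire=TTL)
        return page

The point is that the only code path that ever talks to the master is the write itself; every read is served either from the cache or from a slave.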

Results

Our blog posts simply cannot be published without a nice graph or two from Kibana, and this one is no exception.

Below is the rate of all queries made explicitly to the master node when handling GET requests: it dropped from 170k queries per minute to 54k (a 68% drop).

As for the first fix: the rate of master node queries coming from the Wikia\Service\User\Permissions\PermissionsServiceImpl::loadLocalGroups method dropped from 52k to 4.8k a minute (yes, that's 91%). And this does not include the second fix (the one that updates the cache instead of invalidating it and queries slaves only).

The number of User::checkNewtalk queries dropped significantly as well:

  • before the fix: 800k queries each hour, always hitting the master
  • after the fix: 180k queries each hour (they all hit slaves now), a 78% drop.

Let’s make life a bit less stressful for the master database nodes and the DBAs, and our site more resilient to database issues. Remember that, usually, the problem is not the tool itself.

Maciej Brencz
Fandom Engineering

Born and bred in Poznań, passionate about his hometown and the Far North / Enjoys investigating how software works under the hood