Replacing memcached proxy — a bumpy road from Twemproxy to Mcrouter

Jacek Wozniak
Fandom Engineering
Oct 7, 2022

This is an example of how to replace a critical component with zero downtime, how and why projects grow in size, and where risks can come from.

At Fandom, we make heavy use of Memcached as our primary caching solution for backends. We’ve been using Twemproxy as a layer of abstraction between our backends and Memcached, as it provides us with consistent hashing and other useful features. Twemproxy was serving us well, but recently we decided to replace it with Mcrouter. In this blog post, I’ll describe the project I took on — migrating our biggest service, the MediaWiki-based Community Platform, from Twemproxy to Mcrouter. Although executed successfully, it was not without its surprises.

The obvious approach is to set up Mcrouter and run some load testing. Then, based on the result, scale up the number of Mcrouter instances so they can handle all of the production traffic. Finally, switch MediaWiki to use the new proxy and we’re done.

Unfortunately, it turns out it’s not that simple. Although both proxies support consistent hashing, their algorithms aren’t compatible with each other. This means that Mcrouter and Twemproxy will pick different memcached instances to store the same key. So, after switching to Mcrouter, the application would start with a nearly empty cache, as most read operations would result in cache misses. This wasn’t a viable option because losing this cache layer would most probably bring our site down. Not only do we keep database results there, but also data that is very expensive to regenerate, like the parser cache, which holds wikitext that has already been parsed and rendered to HTML.
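
To see why this matters, here is a toy sketch in PHP. It does not reproduce the real algorithms (Twemproxy is typically configured with a ketama-style ring, while Mcrouter ships its own hashing scheme); it just uses two arbitrary hash functions over the same server list, which is enough to show that most keys land on a different instance once the hashing changes:

```php
<?php
// Toy illustration, not the real Twemproxy/Mcrouter algorithms: two
// different hash functions over the same server list send most keys
// to different memcached instances, so switching proxies effectively
// starts with a cold cache.

$servers = [ 'mc-01', 'mc-02', 'mc-03', 'mc-04' ];

function pickByCrc32( string $key, array $servers ): string {
    return $servers[ crc32( $key ) % count( $servers ) ];
}

function pickByMd5( string $key, array $servers ): string {
    // Use the first 8 hex characters of md5 as an integer.
    return $servers[ hexdec( substr( md5( $key ), 0, 8 ) ) % count( $servers ) ];
}

$moved = 0;
for ( $i = 0; $i < 10000; $i++ ) {
    $key = "parser-cache:$i";
    if ( pickByCrc32( $key, $servers ) !== pickByMd5( $key, $servers ) ) {
        $moved++;
    }
}

// With 4 servers, roughly 3 out of 4 keys end up on a different instance.
echo "keys that changed servers: $moved / 10000\n";
```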

To avoid starting with an empty cache, we could warm up Mcrouter before switching to it for reads:

  1. For some time, MediaWiki sends write requests to both memcached proxies while still reading from Twemproxy (see the configuration sketch after this list). This should populate the heavily used cache keys on the memcached instances that Mcrouter picks via its hashing implementation.
  2. Once the first stage is done, we can start reading from Mcrouter while still writing to both proxies. At this point some keys are probably still missing, which may result in a performance hiccup. If it gets really bad, we can switch back to stage 1. If it’s fine, we let it run for a while and monitor our production system.
  3. If everything looks fine, stop writing to Twemproxy. The migration is done.
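
Here is a minimal sketch of what stage 1 could look like in LocalSettings.php, assuming MediaWiki’s MultiWriteBagOStuff wrapper, which replicates writes to all of its child caches. The hostnames, ports and exact options are placeholders and depend on the MediaWiki version and memcached client in use; as mentioned later in this post, we eventually had to swap in our own multi-write implementation.

```php
// Stage 1: read from Twemproxy, replicate writes to Mcrouter to warm it up.
$wgObjectCaches['memcached-multiwrite'] = [
    'class' => MultiWriteBagOStuff::class,
    'caches' => [
        // Tier 0: Twemproxy (listed first, so it is preferred for reads
        // and receives all writes).
        [
            'class' => MemcachedPhpBagOStuff::class,
            'servers' => [ 'twemproxy.internal:11211' ], // placeholder host
        ],
        // Tier 1: Mcrouter, which receives the replicated writes and
        // gradually warms up its side of the cache.
        [
            'class' => MemcachedPhpBagOStuff::class,
            'servers' => [ 'mcrouter.internal:5000' ], // placeholder host
        ],
    ],
    // Replicate writes to the lower tier asynchronously so the extra
    // round-trip doesn't slow down web requests.
    'replication' => 'async',
];
$wgMainCacheType = 'memcached-multiwrite';
```

For stage 2, the tiers are swapped so that reads prefer Mcrouter while writes still go to both; stage 3 points the configuration at Mcrouter only.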

There are some issues we have to keep in mind:

  • We cannot warm up the whole cache
    We’re going to populate Mcrouter only with entries that get written to Twemproxy during the warm-up phase. The longer this phase is, the more writes we’re going to replicate. The data we store in memcached has different TTLs: some items are stored only for a few minutes or hours, and those are very likely to be refreshed. The parser cache, however, is stored for up to two weeks, so to repopulate a meaningful part of it the warm-up would have to last several days.
  • Cache evictions
    During the warm-up, each write goes to both proxies and, most likely, ends up on two separate memcached instances. But because read operations are not replicated, the copy written via Mcrouter is never read. So the longer the warm-up phase lasts, the bigger the chance that memcached evicts this unused value before we switch reads over.
  • Lower cache hit ratio
    The cache warm-up stores every key in two places. This is going to decrease our hit ratio, because the total cache size is limited by the capacity of the memcached cluster, so the keys used less frequently have a bigger chance of being evicted.
  • Many more writes
    Not only are we going to get more cache misses, but on each cache miss MediaWiki is going to make two write requests instead of one. That may slow it down a little.

Divide and conquer

The performance-related issues mentioned above could have a hard-to-predict and possibly disastrous effect on our platform if rolled out globally. Luckily, each of our 280k communities can be configured independently, so we can roll out changes to any number of wikis at once (a sketch of such a per-wiki switch follows the list below). This gives us more flexibility in terms of rolling the changes out with zero downtime.

This has two advantages that are worth emphasizing:

  • The migration is incremental
    As we can split the work, we can limit the performance impact on the whole platform. We can also start with the smallest wikis and observe how the system behaves. Based on our observations, we can prepare better for the next batches.
  • It’s bi-directional
    If things get out of control at any point, we can roll back the changes made to the last batch of wikis, address the issue, and try again.
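
As an illustration of how such a batched rollout can be wired up, here is a hypothetical per-wiki switch. The option name and lookup are mine, not Fandom’s actual configuration system; the point is that every wiki resolves its cache backend from its own setting, so a batch can be moved forward, or rolled back, independently of the rest.

```php
// Hypothetical per-wiki migration switch (illustrative names only).
// $perWikiSettings is a stand-in for the per-wiki configuration store,
// and each stage maps to a $wgObjectCaches entry such as the multiwrite
// one sketched earlier.
$stage = $perWikiSettings['memcachedMigrationStage'] ?? 'twemproxy';

$backendByStage = [
    'twemproxy'     => 'memcached-twemproxy',      // not migrated yet
    'warmup'        => 'memcached-multiwrite',     // stage 1: read Twemproxy, write both
    'read-mcrouter' => 'memcached-multiwrite-rev', // stage 2: read Mcrouter, write both
    'mcrouter'      => 'memcached-mcrouter',       // stage 3: migration finished
];

$wgMainCacheType = $backendByStage[$stage];
```

Rolling a batch back is then just a matter of flipping its stage to the previous value.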

This plan looked reasonable. I needed to set up the Mcrouter cluster, prepare the code and configuration changes, and perform the migration of 280k wikis. I estimated that the work I knew about would take me around 3 weeks.

I won’t go into the details of the actual migration. Instead, I’ll focus on the unknowns that made the project last 2 months, and on how, despite some bumps along the way, the project succeeded and the migration did not cause any outages.

Learning points

Have both: the plan and the backup

An extra few hours spent during the planning phase of a project can save you days or weeks later on. Things look simple from a high level and get complicated once you start looking at the details. Both proxies supported consistent hashing but, as described above, they weren’t compatible, and that required large adjustments. There were also smaller surprises that I missed. Like the fact that the MediaWiki class used for writing to both proxies escaped memcached keys differently than the class writing to a single proxy, so I had to provide my own implementation. Or that one of our MediaWiki extensions was using an infrequently used API to talk to memcached, which caused a huge spike of requests during the migration. It’s always a good idea to double-check that your assumptions are correct.

It is also crucial to have a backup plan in case things do not go as expected. Here I had an easy way of rolling back my changes and I made sure this process worked at the beginning of the project. The harder it is for you to take a step back, the more time you should spend preparing.

Metrics let you understand your system better

You may spend countless hours analyzing the code but you’ll still have trouble predicting the system’s behavior under load. The requests come in bizarre patterns. There are a lot of direct and indirect dependencies on other services, hardware, network and cache. Other changes are being rolled out by your coworkers. Make sure you’ve got all the graphs ready. Also, observe them for a while upfront to get a better understanding of “what’s normal” before introducing your changes.

Are you getting more requests to your cache servers? Check the number of external requests to your application. Maybe there’s a traffic spike? Slice the cache requests by type: reads, writes, misses. Check the database load. Look for graphs that changed and those that stayed the same. Gather as many clues as possible. The more blind spots you have, the harder it is to find the root cause of production issues.

Add new metrics and data points when needed

At the start of this project I set up a new tool, memkeys. It turned out to be invaluable, as it allowed me to monitor the requests received by the memcached proxies in real time. I could then group them by wiki or by key class to see whether the top offenders in terms of load came from a given community or from some part of our MediaWiki stack.

One thing to keep in mind: we’re inclined to look at metrics and add new data sources while rolling out new features or debugging issues. Be careful. You’ll be missing a baseline for those graphs, so take your initial conclusions with a grain of salt. Memkeys exposed serious issues with the system but also gave me some leads that were dead ends. A metric you’ve just introduced may look off, and maybe something really is broken there, but that doesn’t mean it’s the issue that caused the current outage. You may be chasing imperfections that have been there for a long time and not getting any closer to the root cause of the current issue.

If you ignore small issues, they’re going to grow and bite you back

A few months before this migration we upgraded our MediaWiki to version 1.37. After this upgrade the backend load was a little bit higher than usual but nothing extraordinary. Nothing obvious stood out, so we just assumed everything was ok. Maybe the new version was a little heavier on our backends. It happens.

At some point in the migration I was dealing with a huge number of memcached requests related to MessageBlobStore — a JSON object containing translated user interface messages. The number of cache misses was very high and it was adding a lot of extra load. But why would we be missing so many translations?

Eventually, it turned out that since MediaWiki version 1.37, each time the update.php script was launched, it purged the MessageBlobStore cache for all our communities. This script was executed every time a new wiki was created, so since our last MediaWiki upgrade, MessageBlobStore had been purged several times per hour for all our wikis. That’s what had been causing the slightly higher load.

Things went sideways when at some point another team started running update.php as quickly as possible to roll out their changes to all wikis. Luckily this happened while I was migrating wikis and observing the memcached usage. We were able to identify the root cause and pause the update script before it resulted in an outage.

If you haven’t fixed it, it’s probably still broken

Somewhere around the middle of the migration, our memcached backends were receiving over 800k requests per second instead of the usual ~250k. That was suspicious. I expected a higher load due to missing cache entries, but not to that extent.

As it was getting close to the weekend, I considered rolling back my changes. But eventually I decided I had just overdone the migration pace, so I slowed it down and soon the number of requests started going down. A little too slowly, which was strange, as I expected the cache to regenerate much faster. But I was glad things were starting to look normal again.

Well, 2 hours later all hell broke loose, and this time even stopping my migration did not help. The Mcrouter cluster was getting a whopping 1.5M requests per second. I added many new production instances to handle the load and started investigating, as the problem wasn’t going to resolve itself.

I’ve seen this pattern more than once. If a problem goes away by itself before you solve it, you might be lucky, but you should expect it to come back. But if you change something and the problem goes away, you assume you’ve fixed it. That can be a dangerous assumption. First, you want to understand what causes the problem and how to reproduce it. Then, you want to test your changes by triggering the issue again to verify that your solution helped. And then, ideally, you may even want to revert your changes for a while to confirm that the problem comes back when your fix is not there.

You have more technical debt than you think

Going back to the 1.5M requests per second. Both the performance graphs and the memkeys tool indicated that something was wrong with our category pages. That was strange, as we hadn’t touched this code recently.

It took us some time to figure out that the whole thing was broken. At some point, we changed the way the information about images on category pages was cached. It was suboptimal, to say the least. And there were a few extra edge cases that were making it even worse. But it was never bad enough for us to notice something was off. Until the Mcrouter migration, when the mix of increased cache misses, writing to both caches at the same time and Mcrouter behaving a little differently than Twemproxy caused the whole thing to collapse.

We ended up redesigning the caching API and using it in a few extra places that were also affected. After pushing it to production, the number of memcached requests dropped and the whole project was done:

Total number of Mcrouter requests stabilized after releasing the category page fixes

On top of that, our backend response time improved by around 10% compared to what we measured before the migration. It is often the case that small imperfections go unnoticed. Maybe their impact is negligible. Or we make the mistake of changing several things at once and then are unable to evaluate their individual impact. Either way, those things tend to accumulate over time, until they either reach a critical point or some external factor triggers them and you end up rewriting the whole thing.

Leave a large margin for the unknowns but explore them early

Unless you’re working on something very simple or you’ve done it before, things will pop up during the project. There is a fundamental difference between knowing something is simple and merely not seeing any obvious difficulties at first glance.

Initially, this project was about setting up Mcrouter instances and pointing the application at them. During the initial research, it turned out the consistent hashing implementations weren’t compatible, which largely increased the size of the project.

I also suspected there would be hidden complexities and unidentified risks. Some parts of the code I was about to touch were old and likely due for a rewrite. The existing set of metrics around caching was insufficient, mainly because caching had simply been working for years and we hadn’t invested much in this area.

Don’t beat yourself up just because the initial estimate is off — it usually is. Leave a large margin for the unknowns, but also try to explore risky areas at the start so you don’t get surprised during the later phases of your work. Here, I estimated the known work at 2–3 weeks, but I also assumed the whole project would take between 6 and 12 weeks in total because of unexpected difficulties. Eventually, it took me 8 weeks.

If possible, avoid going all-in before you’re confident it’s safe. There are several tools for this, such as canary releases, A/B testing, or blue-green deployments. In the case of this project, I split the migration so that the smallest unit was a single community. This was great for evaluating the effects of the proxy replacement without putting the whole production system at risk.

Create a roadmap. Split your project into smaller pieces and evaluate where you are after each phase. You cannot avoid surprises, so prepare yourself to handle them one by one.

Originally published at https://dev.fandom.com.
