From Memcache to Cloud MemoryStore

Mason Leung
Published in Tech @ Quizlet
5 min read · Nov 12, 2020

Quizlet manages memcache on virtual machines (VMs) in its infrastructure to keep user-facing services responsive. Memcache is ephemeral storage where web applications keep processed data after database retrievals. We refer to memcache operations within our infrastructure as “Poking a bear with a stick”. This blog post recounts our #NotSoFairyTale memcache migration to Cloud MemoryStore (CMS), a Google-managed cache service.

Once upon a time, not long after I joined Quizlet, I posted the question “What was the issue when a memcache VM was restarted?” to the SRE Slack channel. The responses boiled down to two main challenges.

  1. A thundering herd against the databases: a mad rush of web applications fetching data directly from the databases because it is no longer available in cache (sketched below).
  2. Increased web latency as a result of that extra database load.
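
To make the first problem concrete, here is a minimal cache-aside read path in the spirit of what our web code does (the names and the load_from_database helper are illustrative, not our actual code). When a cache node disappears, every request whose keys lived on that node falls through to the database branch at the same time.

```python
# Minimal cache-aside read path (illustrative, not Quizlet's actual code).
# When a cache node vanishes, get() misses for every key it held, and all
# of those requests fall through to the database at once.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # placeholder memcached endpoint

def load_from_database(key):
    # Hypothetical stand-in for an expensive query against the primary database.
    return f"row-for-{key}".encode()

def get_or_load(key, ttl=300):
    value = cache.get(key)
    if value is None:                      # cache miss -> the thundering-herd path
        value = load_from_database(key)
        cache.set(key, value, expire=ttl)  # repopulate so later reads stay cheap
    return value
```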

Memcache is a critical part of the Quizlet infrastructure. Put simply, the common understanding was: “If memcache goes down, Quizlet goes down with it.”

Quizlet encounters its highest traffic from August to October, when students return to school. We coined that period the Back to School (BTS) season. As I worked on sizing memcache for BTS 2019, I thought to myself, “It couldn’t be that bad, right?”

Turns out it was really that bad! It took four hours to add eight memcache VMs to our infrastructure. As I rotated the VMs into production, I saw a slew of 5xx errors reported from our web servers.

When the operations completed, we were running 24 virtual machines, each with 45GB of memory allocated for caching and an 8,000-connection limit.

A few days later, as BTS 2019 approached, SREs noticed memcache connections creeping close to that limit. Luckily, we identified the idle_timeout option, which disconnects idle connections. Enabling it meant another round of service restarts.
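
The connection count each node reports is exposed through memcached’s stats interface. Here is a rough sketch of that kind of check, not our actual tooling, with placeholder hostnames and the 8,000 limit from above:

```python
# Sketch: ask each memcache node for its "stats" output over the text protocol
# and flag nodes whose connection count is approaching the per-node limit.
import socket

NODES = [("memcache-01", 11211), ("memcache-02", 11211)]  # hypothetical hosts
CONNECTION_LIMIT = 8000

def node_stats(host, port):
    """Return the memcached STAT lines as a {name: value} dict."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    stats = {}
    for line in data.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = value
    return stats

for host, port in NODES:
    curr = int(node_stats(host, port)["curr_connections"])
    if curr > 0.9 * CONNECTION_LIMIT:
        print(f"{host}: {curr} connections, within 10% of the limit")
```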

Another wrench was thrown our way when web applications randomly produced 5xx errors. The team spent hours mitigating the problem and eventually identified that the 5xx errors came from the Out-of-Memory (OOM) killer terminating web application processes. Furthermore, we discovered a bug in which web services tried to sort a billion keys returned from memcache when we turned on a TV in the office. The full story is detailed in the “How turning on a TV turned off Quizlet” blog post.

The lessons I learned from sizing memcache were:

  1. Memcache operations shock nodes in the memcache ring.
  2. Only touch memcache when you absolutely need to!

The rest of BTS 2019 was quiet.

Fast forward to BTS 2020 preparation, and we pondered the same question: “How should we scale our memcache service?” This time there was an added touch of traffic uncertainty due to COVID-19.

In May of 2020, we ran across a blog post on scaling infrastructure that mentioned Memcache for CMS, a Google-managed service. This beta service had come out quietly in April 2020. The SRE team was excited to learn about the offering because “Move to managed services” has been an important theme for Quizlet. Moving to the Google-managed service would give us:

  1. An updated version of the memcached binary.
  2. Higher scalability and lower maintenance costs.
  3. Immediate cost reduction.
  4. Lower SRE anxiety.

The SRE team quickly drew up a load-testing and migration plan. A memcache ring is a logical ring: its nodes are not restricted by the underlying hardware as long as they understand the memcache protocol. This is an important realization because it allows us to run a memcache cluster with a mix of VMs and CMS nodes. We have been using ketama consistent hashing. (Spoiler: this turns out to be crucial later.)
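
The sketch below shows the idea in miniature as a simplified ketama-style ring, not the libmemcached implementation we actually use, and the node names are placeholders. Each node, VM or CMS, is hashed onto many points of a ring, and a key is served by the first node point at or after the key’s own hash, so both kinds of nodes can share one ring and swapping a node only remaps the keys that land on its points.

```python
# Simplified ketama-style consistent hashing ring (illustrative only).
# VM and CMS nodes can coexist on the ring because placement depends
# only on each node's name, not on the hardware behind it.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, points_per_node=100):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(points_per_node):
                self._ring.append((self._hash(f"{node}-{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest()[:8], 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# A ring can mix self-managed VMs and CMS nodes during a migration.
ring = HashRing(["memcache-vm-01:11211", "cms-node-01:11211"])  # hypothetical names
print(ring.node_for("user:42:studyset"))
```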

We ran a series of get and set latency tests against memcache on VMs and CMS and even put one CMS node into production to collect more data. Although memcache on VMs performed slightly faster, the difference was not significant enough for our use cases. With that knowledge, the SRE team braced to poke the bear again.
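
For a sense of what those latency tests looked like, here is a simplified sketch of the measurement; the target host is a placeholder, and the loop count and value size are illustrative rather than the parameters we actually used.

```python
# Rough sketch of a get/set latency test against a single memcache endpoint
# (VM or CMS). The host below is a placeholder.
import statistics
import time
from pymemcache.client.base import Client

client = Client(("memcache-test-node", 11211))  # hypothetical endpoint
payload = b"x" * 1024  # 1 KiB value

set_times, get_times = [], []
for i in range(10_000):
    key = f"latency-test-{i}"

    start = time.perf_counter()
    client.set(key, payload, expire=60)
    set_times.append(time.perf_counter() - start)

    start = time.perf_counter()
    client.get(key)
    get_times.append(time.perf_counter() - start)

for name, samples in (("set", set_times), ("get", get_times)):
    p99 = statistics.quantiles(samples, n=100)[98]
    print(f"{name}: median {statistics.median(samples) * 1e3:.2f} ms, "
          f"p99 {p99 * 1e3:.2f} ms")
```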

And the bear decided to poke back.

Quizlet was in the middle of moving to Kubernetes when we began the memcache migration. The 5xx errors we had seen on VMs became fatal in Kubernetes, where pods were killed by the liveness probe.

The team eventually identified a bug in libmemcached’s ketama hashing, reported in 2016, and patched it. We made three attempts to migrate to CMS in August and finally cut over the last memcache VM in September.

We noticed that memory usage differed between the memcache VMs and CMS. In CMS, memory usage was around 12%.

A memcache VM, on the other hand, ran at 90% memory capacity because that figure combined system and memcache memory usage. This was troublesome: if the OOM killer terminated the memcached process, it would be disruptive to Quizlet services.

Another observation was that each CMS node cached about 25 million items.

A memcached VM, on the other hand, held about 140 million items.

Even though each memcache VM held significantly more items than a CMS node, the item hit rate for both was around 65%.
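
For reference, the item hit rate we compare here is the usual memcached ratio of gets served from cache. A tiny illustration of the calculation, with made-up counter values rather than our real numbers:

```python
# Sketch: the item hit rate as memcached reports it.
def hit_rate(get_hits: int, get_misses: int) -> float:
    """get_hits / (get_hits + get_misses), the ratio both dashboards chart."""
    total = get_hits + get_misses
    return get_hits / total if total else 0.0

# e.g. a node answering 650k of every 1M gets from cache has a 65% hit rate
print(f"{hit_rate(650_000, 350_000):.0%}")  # -> 65%
```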

We have been running CMS since August 2020 and it has been stable. It was a long journey from managing our own memcache service to using caching as a service. Memcache operational work is still known as “Poking a bear with a stick”.

Photo by Andre Tan on Unsplash

With the work we put in, we hope the bear looks more like this the next time we decide to poke it.
