The Journey of Redis at UrbanClap

Aditya Chowdhry
Aug 8, 2018 · 7 min read


TL;DR: This post describes how our Redis caching architecture evolved over time, at both the infrastructure and code level, driven by growing scale and new use cases.

First phase

When we first started, we had only one Redis instance on a server with 7GB of RAM. Only a few of our APIs at that time used Redis for caching. On one of our marketing-heavy days, we got a notification that our Redis server had crashed.

Other teams' reaction

While debugging, we realised that Redis memory usage had gone up to 70%–80% of the server's RAM and the instance had crashed during a BGSAVE operation. We were using BGSAVE to persist data to RDB files.

The BGSAVE operation forks a child process which, in the worst case (because forked pages are copy-on-write), can require approximately the same amount of memory as the Redis instance is currently using.

As our server's memory consumption was around 80%, there wasn't enough memory left to execute the BGSAVE operation, and thus it crashed. One solution we tried was tuning the VM config:

vm.overcommit_memory = 1

Setting overcommit_memory to 1 makes the kernel pretend that there is always enough memory to satisfy an allocation like malloc. The underlying concept here is virtual memory: programs see a virtual address space that may, or may not, be mapped to actual physical memory.
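For reference, the setting can be applied at runtime and persisted across reboots; the file path below is the standard Linux location, so verify it on your distribution:

```
# Apply immediately:
#   sudo sysctl vm.overcommit_memory=1
#
# Persist across reboots by adding this line to /etc/sysctl.conf:
vm.overcommit_memory = 1
```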

But our usage was still hitting the total RAM available, so we needed to upgrade the RAM and find a solution that kept the above problem in mind.

As a temporary fix, we increased the RAM from 7GB to 15GB.

Second Phase

At this phase, our scale started to grow rapidly, which inevitably increased cache usage. Caching now needed to be reliable, as even a few minutes of caching outage could slow down our APIs, which in turn could impact app performance, user experience and critical functions like SEO.

A few key changes we incorporated in this phase:

a. Upgrading server RAM

Scaled up our Redis server from 15GB to 30GB of RAM.

b. 32-bit Redis Instance

A 32-bit compiled Redis instance uses significantly less memory per key because address pointers are smaller. The caveat is that a 32-bit instance is limited to 4GB of memory. You can read about further memory optimisations in the Redis documentation on memory optimisation.

c. Multiple Redis Instances

Rather than running a single instance, we now ran multiple Redis instances (6 instances, each capped at 4GB via the maxmemory option in the Redis conf). This helped us in:

  1. Solving the BGSAVE problem — each Redis instance was capped at 4GB of memory, so at any point in time our maximum usage could go up to 24GB (6 instances × 4GB).
    — We now triggered BGSAVE for one instance at a time, so at any point we had at least 6GB of memory free for BGSAVE to run (30GB RAM − 6 × 4GB of Redis instance usage).
    — Of those 6GB, up to 4GB could be consumed by the BGSAVE of a single instance, leaving 2GB as a buffer for other OS-related operations.
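Concretely, the cap is a single line in each instance's redis.conf; the port numbering and eviction policy below are assumptions for illustration (the post only specifies the 4GB cap):

```
# redis.conf for one of the six instances
port 6379                      # 6379–6384 across the six instances (assumed)
maxmemory 4gb
maxmemory-policy allkeys-lru   # eviction policy is an assumption — pick per use case
```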

d. Twemproxy

Now, as we had multiple Redis instances we needed a way to distribute data among them. After some research, we decided to use Twemproxy.

Twemproxy is a Twitter open-source project that shards data across multiple Redis nodes using consistent hashing.
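A minimal Twemproxy (nutcracker) configuration for this setup might look like the following sketch — the pool name, addresses, and timeout values are illustrative; `ketama` is Twemproxy's consistent-hashing distribution:

```yaml
# nutcracker.yml — one pool fronting the six local Redis instances
urbanclap_cache:
  listen: 127.0.0.1:22121
  hash: fnv1a_64            # hash function applied to keys
  distribution: ketama      # consistent hashing across servers
  redis: true               # speak the Redis protocol, not memcached
  auto_eject_hosts: true    # temporarily drop dead servers
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:
    - 127.0.0.1:6379:1
    - 127.0.0.1:6380:1
    - 127.0.0.1:6381:1
    - 127.0.0.1:6382:1
    - 127.0.0.1:6383:1
    - 127.0.0.1:6384:1
```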

Cache Infrastructure using Twemproxy

Third Phase

Third Phase development was triggered by two factors: the issues from Phase 2 and the advent of our micro-services architecture.

Phase 2 Issues & solutions

a. Out Of Memory issue

As the cache usage grew we started to get OOM issues in multiple Redis instances periodically.

After debugging, we realised that whenever the usage of any Redis instance reached around 3GB, it crashed. Further debugging led us to the conclusion that:

The recommended usable memory for a 32-bit Redis instance is 3GB, but it starts to hit OOM issues even at around 2.5GB of usage, due to memory consumed by other things like Redis's own internal buffers (as discussed on Stack Overflow).

Solution: To avoid OOM issues, we replaced the 32-bit Redis instances with the latest 64-bit Redis. A 64-bit instance can also take advantage of OS-level settings like vm.overcommit_memory and swap space, which a 32-bit instance could not.

b. Twemproxy — Loss of data & outdated

We used the consistent hashing algorithm in Twemproxy to shard data among multiple servers.

In consistent hashing, the addition or deletion of a node leads to the loss of k/N keys, where k is the number of keys and N is the total number of nodes.

Twemproxy doesn’t provide solutions to reshard data on addition or removal of Redis node.

Also, Twemproxy was no longer in active development; its last commit was more than a year ago.

Solution: Around this time, a stable version of Redis Cluster had become available. We shifted to Redis Cluster as it removed the third-party dependency (Twemproxy). It also provides a script, redis-trib, through which we can automate the resharding process.
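For reference, creating a cluster and kicking off a reshard with redis-trib look roughly like this — the node addresses are illustrative, and `reshard` then interactively asks how many slots to move and between which node IDs:

```
# Create a 6-node cluster (example addresses):
./redis-trib.rb create 10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379 \
    10.0.0.4:6379 10.0.0.5:6379 10.0.0.6:6379

# Reshard: the script prompts for the number of slots to move and the
# source/target node IDs:
./redis-trib.rb reshard 10.0.0.1:6379
```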

Micro-services Architecture

As we kickstarted our micro-services journey, we needed to keep a few things in mind:

  1. Code redundancy: the connection logic and basic cache functionality needed to be abstracted out, as multiple services would start using it.
Teams dumping all data in cache be like

The Final Architecture

Server-Side Architecture

On the server side, we launched two Redis Clusters: HA (High Availability) and LA (Low Availability).

Our cache infrastructure

a. HA cluster

We utilised the master-slave replication provided by Redis Cluster to achieve high availability. Currently, we have 7 masters in this cluster, and each master has one slave.

Performance-critical services use the HA cluster.

b. LA cluster

This cluster consists of only 4 master Redis instances. The services using it are those that won't be affected by an outage. That said, there hasn't been a single instance of the LA cluster going down so far.

Client-Side Architecture

On the client side, as our micro-services were to be built in NodeJs, we decided to build an internal NodeJs library named Flash. This solved the majority of our micro-services use cases and issues.

Flash

  1. Flash — a library built around the singleton pattern, which internally maintains a single connection to the Redis Cluster. This helped us keep the number of connections to the Redis Cluster under control.
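A minimal sketch of that singleton pattern — Flash is internal, so the names and API here are purely illustrative. The connection factory is injected so the sketch runs without a live cluster; the real library would create something like an ioredis `Redis.Cluster` connection exactly once:

```javascript
// flash.js (hypothetical): every module that requires this file shares
// one cluster connection instead of opening its own.
let client = null;

function getClient(connect) {
  if (!client) client = connect(); // first caller creates the connection
  return client;                   // every later caller reuses it
}

// Usage — two "services" asking for a client get the same object:
const serviceA = getClient(() => ({ id: 'cluster-connection' }));
const serviceB = getClient(() => ({ id: 'never-created' }));
console.log(serviceA === serviceB); // true — only one connection is made
```

Because Node caches required modules, every service importing this file shares the same `client`, which is what keeps the connection count to the cluster flat as services multiply.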

Tracking and Metrics

We have three sources through which we track metrics:

  • For server level metrics we use Nagios
Metrics of one of the Redis instances
  • Micro-service-level metrics like cache hit/miss, number of operations, and performance are derived from our ELK stack.
Service-level cache usage across all types of Redis operations. This graph helped us solve a bug in the media service.

This pretty much sums up our journey. The current architecture has been working quite well for us for some time now, and we hope this will be helpful for anyone who has started facing scale issues with their caching layer.

Thanks Mohit Agrawal for guiding me throughout the journey.

We are hiring! Drop your resume at mohitagrawal@urbanclap.com.

Thanks,

Aditya Chowdhry

UrbanClap Engineering

Team, Technology & Data Science behind UrbanClap

www.adityachowdhry.com