Distributed Cache Management

Ferhat Aykan
Trendyol Tech
Oct 3, 2023

Greetings! In this article, I’ll talk about the approaches we use to build a distributed cache.

Let’s start by talking about our applications at Trendyol. Our system consists of two applications: one serves the data, while the other consumes external data and feeds the database.

These consumer applications listen to data from a variety of domains, and we need to cache some of that data before feeding it into the documents. Every time a document is reindexed or invalidated, we may make many requests to other services, and there may be latency while fetching data from the source domains. These operations occasionally reach 1–2 million requests per minute.
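To absorb that request volume, a local in-process cache with a TTL can answer most lookups without a round trip to the source domain. Below is a minimal sketch of the idea; the class and function names are illustrative assumptions, not our actual code.

```python
import time

class TtlCache:
    """A minimal in-process TTL cache (illustrative sketch)."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        """Return the cached value, or call `loader` and cache the result."""
        entry = self.store.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader(key)
        self.store[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        self.store.pop(key, None)

# Usage: the loader stands in for a call to the category source domain.
calls = []
def fetch_category(category_id):
    calls.append(category_id)
    return {"id": category_id, "name": "Cep Telefonu"}

cache = TtlCache(ttl_seconds=60)
cache.get(594, fetch_category)
cache.get(594, fetch_category)  # served from the cache; no second remote call
```

The second `get` never reaches the loader, which is exactly the latency and request volume we are trying to avoid.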

We have a data example similar to the one below.

{
  "category": {
    "id": 594,
    "name": "Cep Telefonu",
    "installment": 12,
    "parentId": 1181,
    "passive": false,
    "version": 0
  },
  "content": {
    "id": 11896943,
    "product": {
      "id": 8615919
    },
    "name": "iPhone x"
  },
  "brand": {
    "id": 594,
    "name": "Apple",
    "passive": false,
    "version": 0
  }
}

When a product is created in Trendyol, an event is sent to Kafka. We consume this message and feed the category and brand information, along with the content, into the document.
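The consume-and-enrich step could look roughly like this; the event shape and function name are assumptions for illustration, mirroring the document example above.

```python
def enrich_product_event(event, category_cache, brand_cache):
    """Build the indexed document from a product-created event,
    attaching cached category and brand data (illustrative sketch)."""
    return {
        "category": category_cache[event["categoryId"]],
        "brand": brand_cache[event["brandId"]],
        "content": {
            "id": event["contentId"],
            "product": {"id": event["productId"]},
            "name": event["name"],
        },
    }

# Example event, mirroring the document shape above.
categories = {594: {"id": 594, "name": "Cep Telefonu", "installment": 12}}
brands = {594: {"id": 594, "name": "Apple", "passive": False}}
event = {"categoryId": 594, "brandId": 594, "contentId": 11896943,
         "productId": 8615919, "name": "iPhone x"}
doc = enrich_product_event(event, categories, brands)
```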

As in the example above, the ‘Cep Telefonu’ category is kept on millions of documents; likewise, ‘Apple’ brand information can be kept on millions of products.

Our applications run on Kubernetes, and we usually scale them under load up to the number of partitions of the Kafka topic. I shared a representative picture below.

Now, going back to our topic, when a category is updated, we must update our caches on pods.

To summarize our thinking: when we first designed the project, we considered using Redis pub/sub. Our applications would listen to a channel in Redis, and when we consumed a new message, we would publish it to that channel. Each application would handle the message and fetch the category from the source again.
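The pattern we considered can be sketched with an in-memory stand-in for the Redis channel (a real setup would use redis-py’s publish/subscribe; this toy bus only illustrates the flow):

```python
class FakePubSub:
    """In-memory stand-in for Redis pub/sub, for illustration only."""

    def __init__(self):
        self.subscribers = {}  # channel -> list of handlers

    def subscribe(self, channel, handler):
        self.subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel, message):
        for handler in self.subscribers.get(channel, []):
            handler(message)

# Each pod subscribes; on a message it drops the category from its
# local cache and would re-fetch it from the source domain.
local_cache = {594: {"name": "Cep Telefonu (stale)"}}
refetched = []

def on_invalidate(category_id):
    local_cache.pop(category_id, None)
    refetched.append(category_id)  # stand-in for the re-fetch call

bus = FakePubSub()
bus.subscribe("category-invalidation", on_invalidate)
bus.publish("category-invalidation", 594)
```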

Many problems may arise here.

  • First, when we tell Redis to invalidate the category, replication between the write and read databases may lag behind, so we can get stale category information during invalidation.
  • Second, if we hit a problem such as a timeout while fetching the category, this causes data inconsistency.
  • Third, when we send an invalidation message to Redis, latency means we don’t know when the invalidation will actually happen. For this reason, we cannot safely update the category information on the document, because some pods may still hold the old category information.

So how did we solve the problems I mentioned?

First, we register in the database all the applications that will listen for cache updates.

Within each application, a scheduler re-registers the pod every 5 seconds. When we add a new feature to the consumer and re-deploy, the previously registered pods are removed after 10 seconds without a heartbeat, and the new pods register themselves.
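The heartbeat-and-expiry idea can be sketched as follows; the intervals follow the article, while the class name and in-memory storage are simplifications of the real database-backed registry.

```python
class PodRegistry:
    """Sketch of the app-register-scheduler: each pod heartbeats every
    5 seconds; entries older than 10 seconds are considered gone."""

    HEARTBEAT_INTERVAL = 5
    EXPIRY_SECONDS = 10

    def __init__(self):
        self.pods = {}  # pod_name -> last heartbeat timestamp

    def register(self, pod_name, now):
        self.pods[pod_name] = now

    def alive_pods(self, now):
        return [name for name, ts in self.pods.items()
                if now - ts <= self.EXPIRY_SECONDS]

registry = PodRegistry()
registry.register("consumer-old-abc", now=0)
registry.register("consumer-new-xyz", now=12)  # new pod after a re-deploy
# At t=12 the old pod has missed its heartbeats and is no longer alive.
```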

Second, we save the incoming message to the database.

Third, we create a cache-invalidation-scheduler that runs on every pod.

Category information in the cache is updated on each pod, and the updated information is recorded. Since this scheduler runs in all our applications, each pod records its own status.

Here we use partial (sub-document) mutation, a feature of Couchbase. This allows multiple pods to update the same document asynchronously without conflicting.
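A toy version of a partial mutation shows why this helps: each pod writes only its own path inside the status document instead of rewriting the whole document. The real Couchbase Python SDK call would be `collection.mutate_in(...)` with sub-document operations; the dictionary-backed bucket below is just an illustration.

```python
def mutate_in(bucket, doc_id, path, value):
    """Toy sub-document (partial) mutation: only the given dotted path
    inside the document is written, so different pods can update their
    own field without touching the rest of the document."""
    doc = bucket.setdefault(doc_id, {})
    parts = path.split(".")
    for part in parts[:-1]:
        doc = doc.setdefault(part, {})
    doc[parts[-1]] = value

# Two pods record their invalidation status under separate paths.
bucket = {}
mutate_in(bucket, "category::594", "invalidated.pod-a", True)
mutate_in(bucket, "category::594", "invalidated.pod-b", True)
```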

At this point we know which pods have updated the category, and we have previously registered all the pods via the app-register-scheduler. We can check that every pod has invalidated its cache, and only then process the category-updated event.
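The completeness check itself is a simple set comparison between the registered pods and the pods that have recorded their invalidation (a simplified sketch; names are illustrative):

```python
def all_pods_invalidated(registered_pods, invalidated_pods):
    """The category-updated event is processed only once every live pod
    has recorded its cache invalidation."""
    return set(registered_pods) <= set(invalidated_pods)

# pod-b has not reported yet, so the event must wait.
ready = all_pods_invalidated(["pod-a", "pod-b"], ["pod-a"])
```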

Now we need to find and update the content documents of that category. We add a final category-updated-scheduler that finds these documents.

We set this category-updated-scheduler to run on a single pod. This scheduler also has several capabilities, for example:

  • Since the category name has changed, we may want the document changes to be reflected at night, because every change goes to other teams as an event and we do not want to add load there during the day.
  • When a category becomes inactive, we may want to make the changes instantly.
  • There may be more than one change to the same category; for example, the installment may change immediately after the name changes, and we can combine and process these changes at intervals.

After making sure that all caches are up to date, we send a message to the invalidate-topic. Then we consume the invalidate-topic, find the contents, and send a message to update the documents.

We then forward the updated information to other teams.

Conclusion

Data consistency is very important to us, so the cache we keep must always be up to date.

Thanks for reading.

If you want to be part of a team that tries new technologies and want to experience a new challenge every day, come to us.
