Intelligent caching with Apollo GraphQL

Roshan Alexander
Published in Team Pratilipi
Nov 12, 2020 · 10 min read

In this series of blogs, we will be sharing our learnings from our journey towards Apollo GraphQL as our gateway and BFF layer.

Apollo GraphQL has been a boon to us, acting as a serving layer for multiple clients (Web, Android and iOS) by seamlessly combining data from multiple downstream services.

In this post, we will share the ways in which we have leveraged caching layers in the Apollo GraphQL server to create a gateway that can respond in milliseconds, even to complex queries that fetch thousands of data fields.

About Pratilipi

Pratilipi is India’s largest digital platform connecting readers and writers in 12 Indian languages.

With around 2.5M+ stories published by 250K+ authors and 20M monthly active users, our services serve around 1.5 billion requests a day. Scalability is therefore an integral part of any architectural solution we build here at Pratilipi.

To better understand the caching implementation, here is a brief intro to where the GraphQL gateway fits into the Pratilipi ecosystem.

How a request flows through the system

The original requests from clients are directed towards our CDN+WAF layer. If the resources are cached in the CDN (images, JS files and so on), they are served directly from there. The remaining requests are filtered through the WAF layer and assigned a bot score to identify malicious patterns. Once a request is identified as coming from a genuine client, it is forwarded to our external load balancers (ELB).

All the requests identified by the external LB as GraphQL requests are forwarded to the Apollo GraphQL gateway. The gateway then resolves the query by fetching data from multiple downstream services and combining the results into the query response. The gateway also coordinates with the auth service for authenticating requests and authorising the operations performed. Calls to the downstream services go through an internal LB.

Before deep diving into the architecture, in case you are not familiar with resolvers, the N+1 problem in GraphQL and DataLoader, feel free to check out this blog where they are explained in detail. It will help in connecting the dots with the rest of this post, since these concepts played a key role in shaping our caching strategy.

N + 1 Problem: A Blessing in Disguise

When we look closely at the N + 1 problem, we realise that the issue arises from the way resolvers handle each resource: every resource is a standalone entity, fetched and consumed individually from a datasource.

At first glance, this might look problematic, as the number of requests to the downstream services multiplies with the number of resources fetched. But when this is combined with some more stats about the size of our data, it can be seen as a blessing in disguise.

Let’s take the same example of Books and Authors mentioned in the N+1 problem. For a single query fetching the top 100 books along with their author names, we first make one call to retrieve 100 book records from the Book service, and then, for each book in the list, another call to the Author service to fetch the author details for the given author ID.

In total, 101 requests are being made to the internal services to resolve a single query.
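To make this concrete, here is a minimal resolver sketch of the Books and Authors example (the service clients, field names and helper methods are illustrative, not our actual schema):

    // Illustrative Apollo resolvers: one call for the book list, then one
    // Author service call per book, which is the classic N+1 access pattern.
    const resolvers = {
      Query: {
        topBooks: async (_parent, _args, { bookService }) =>
          bookService.getTopBooks(100),               // 1 call to the Book service
      },
      Book: {
        author: async (book, _args, { authorService }) =>
          authorService.getAuthorById(book.authorId), // 1 call per book => 100 calls
      },
    };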

In the case of Pratilipi, we have around 2M daily active users. Assume that we get a different set of 100 books each time this query is executed. If each of these users ends up calling this query once, the query will be executed 2M times a day.

Considering 1 call to the Book service and 100 calls to the Author service for a single query, the numbers of requests to the downstream services work out as below:

Total downstream : 2M * 101 = 202 Million
Book service : 2M (Top 100 for each query, 1*2M)
Author service : 200M (Authors for 2M*100 books)
Author service (With Dataloader): 2M (Batched Authors for 2M sets of books)

Caching with DataLoaders

If we go by the DataLoader approach, the number of requests to the Author service comes down to 2M. Compared to 200M, this seems like a really good bargain. Unfortunately, as explored in the N+1 post linked above, having a usable caching layer is a distant possibility while using DataLoader, since its cache lives only for the duration of a single request.
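For reference, a DataLoader-based Author fetcher would look roughly like the sketch below. All the author IDs requested while resolving a single query get collected and sent as one batched call (the getAuthorsByIds batch endpoint is an assumed name, and the loader is created per request):

    import DataLoader from 'dataloader';

    // Hypothetical client for the downstream Author service, exposing a batch endpoint.
    declare const authorService: {
      getAuthorsByIds(ids: readonly string[]): Promise<object[]>;
    };

    // Created once per incoming request: DataLoader collects every author ID
    // requested while resolving the query and issues a single batched call,
    // so 100 Book.author resolutions become 1 downstream call.
    const authorLoader = new DataLoader(async (authorIds: readonly string[]) =>
      // The batch endpoint must return authors in the same order as the IDs.
      authorService.getAuthorsByIds(authorIds)
    );

    // The Book.author resolver then becomes:
    //   author: (book) => authorLoader.load(book.authorId)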

Why caching as separate entities makes sense

Now let’s bring the scale of Pratilipi into the picture. We have around 250K unique authors writing for us.

Going back to our calculations, there were around 200M calls going to the Author service to fetch author details, while there are only 250K unique authors in the system.

So what are these additional 199.75M requests doing?

Yeah, you guessed it right. They are duplicate requests, asking for the same author details that were previously fetched.

This means around 99.875% of the requests to the Author service are duplicate requests.

And if we take the 80–20 rule into the picture (80% of the requests access 20% of your data), the duplication becomes even more pronounced.

And here lies the magic of caching resources as separate entities. By caching just 250K author entities, we are able to serve the 200M author requests from the cache layer itself.

And the notable part is that even if the request count increases with our growing user base, this caching layer still would not be overwhelmed, as the growth in authors on the platform will remain orders of magnitude smaller than the growth in users in absolute numbers.

Let’s wrap up the boring number crunching and see how the caching layer is actually implemented.

Implementing the cache layer

Whenever the GraphQL server needs to access an author’s details, it first checks whether an author with that ID is present in the cache. If yes, the details are read from the cache itself, without going to the downstream Author service.

If there is a cache miss, the GraphQL server hits the Author service and fetches that particular author’s details. While serving the response to the user, it adds the details to the cache too.
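Below is a simplified sketch of that read path, assuming Redis (via ioredis) as the cache and a hypothetical authorService client; the key scheme and the TTL are illustrative:

    import Redis from 'ioredis';

    const redis = new Redis();

    // Hypothetical client for the downstream Author service.
    declare const authorService: { getAuthorById(id: string): Promise<object> };

    // Cache-aside read: try the cache first, fall back to the Author service
    // on a miss, then populate the cache so subsequent reads are served from it.
    async function getAuthor(authorId: string): Promise<object> {
      const cacheKey = `author:${authorId}`;

      const cached = await redis.get(cacheKey);
      if (cached) return JSON.parse(cached);                        // cache hit

      const author = await authorService.getAuthorById(authorId);   // cache miss
      await redis.set(cacheKey, JSON.stringify(author), 'EX', 24 * 60 * 60); // 24h TTL
      return author;
    }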

Now, this seems like a pretty straightforward solution.

But one of the most challenging parts in any caching solution is knowing when and how to remove the items from the cache.

Ideally, all the items added to a cache should have an expiry, or Time to Live (TTL). The longer the expiry time, the higher the probability that an item will be served from the cache. But this comes with its own set of challenges.

Data staleness

Data staleness is one of the major challenges when we have a longer expiry time. Suppose we are storing the author details in the cache for 24 hours, and the author changes his first name in between; the cache will not reflect the updated data. Hence all requests to fetch that particular author will keep returning the older details until the item expires. The longer the TTL, the longer it takes for the item to expire, and the more prominent the staleness becomes.

This can be solved to some extent by removing the item from the cache when the author updates his name through a GraphQL mutation.
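For example, the update mutation’s resolver can evict the entry right after persisting the change; a minimal sketch with assumed names:

    // Hypothetical updateAuthorName mutation resolver: persist the change
    // downstream, then evict the cached entry so the next read re-fetches it.
    const resolvers = {
      Mutation: {
        updateAuthorName: async (_parent, { authorId, name }, { authorService, redis }) => {
          const updated = await authorService.updateAuthorName(authorId, name);
          await redis.del(`author:${authorId}`);   // invalidate on write
          return updated;
        },
      },
    };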

But this might not always be the case. The author details could still be updated bypassing the GraphQL mutations, either directly through a CMS or via the older REST API gateway. In these cases, the GraphQL server will not be aware of the changes in the author details, and hence won’t be able to clear the cache for that author.

GraphQL Turbo

GraphQL Turbo was envisioned to solve the dilemma of staleness vs expiry.

If there is a way for the GraphQL server to know whenever the data of any of its cached items changes, then theoretically we can store items in the cache indefinitely. Staleness will be minimal too: as soon as the data changes at its source, we can remove it from the cache, and the next request for that resource will re-fetch the updated details from the downstream service.

And that is exactly what we did using Debezium CDC.

Change Data Capture (CDC)

Change data capture refers to the process or technology for identifying and capturing changes made to a database.

We added Debezium CDC connectors to the datasources owned by the downstream services. The Debezium connectors constantly watch for state changes in the data residing in the databases and propagate them through Kafka as CDC events.

A CDC event usually consists of the entity’s ID, the old data and the new updated data for each entity update. For the sake of simplicity, let’s look at a simplified version of such an event.
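The real Debezium payload carries more metadata, but an author-rename event boils down to roughly this shape (the field values here are illustrative):

    // Simplified, illustrative Debezium change event for an author rename.
    const authorCdcEvent = {
      op: "u",                                           // u = update
      before: { id: "001", first_name: "Roshan" },       // row before the change
      after:  { id: "001", first_name: "Roshan Alex" },  // row after the change
      source: { table: "authors" },                      // where the change happened
      ts_ms: 1605168000000,                              // when it happened
    };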

In this scenario, Debezium CDC connectors are attached to the Book DB and the Author DB, so whenever an update occurs in these databases, an event is generated and pushed to a corresponding topic in Kafka. Here, the Book service was updating book B001 with the new title `My New Title`, and a corresponding event was generated in the `Book CDC` topic of Kafka. The same happens when the name of author 001 is updated to `Roshan Alex`.

In all of these cases, the ID of the changed entity is also captured within the CDC event.

GraphQL Turbo: Cache Invalidation

Now that we have a basic understanding of CDC, let’s continue with how the cache invalidation happens in GraphQL Turbo.

Assume that the entries for authors with IDs 001 and 002 are already cached in the GraphQL Redis, and so are the books B001 and B002. Whenever a Book or Author entity changes, the change is passed on from the DB to Kafka as a CDC event by the Debezium connectors. GraphQL Turbo consumes all these change events and intelligently clears the entries for the corresponding entities in the cache.

In the above depiction, two events were captured by CDC. One was for the name change of author 001, delivered through the Author CDC topic; GraphQL Turbo picks up the ID of the updated author and removes the corresponding item from the cache. Similarly, it removes the cached entry for book B001 on receiving the event from the Book CDC topic.
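A stripped-down version of that invalidation loop could look like the sketch below; the topic names, cache key scheme and kafkajs as the consumer client are assumptions for illustration, not the exact Turbo internals:

    import { Kafka } from 'kafkajs';
    import Redis from 'ioredis';

    // Consume CDC events from Kafka and evict the matching cache entries.
    const kafka = new Kafka({ clientId: 'graphql-turbo', brokers: ['kafka:9092'] });
    const consumer = kafka.consumer({ groupId: 'graphql-turbo' });
    const redis = new Redis();

    async function run() {
      await consumer.connect();
      await consumer.subscribe({ topic: 'author-cdc' });   // illustrative topic names
      await consumer.subscribe({ topic: 'book-cdc' });

      await consumer.run({
        eachMessage: async ({ topic, message }) => {
          if (!message.value) return;
          const event = JSON.parse(message.value.toString());
          const id = event.after?.id ?? event.before?.id;  // ID of the changed entity
          const prefix = topic === 'author-cdc' ? 'author' : 'book';
          await redis.del(`${prefix}:${id}`);              // evict the stale entry
        },
      });
    }

    run().catch(console.error);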

This way, GraphQL Turbo makes sure that there are no stale items in the cache: as soon as an item changes at its data source, it is removed from the cache too.

As this solves the staleness vs expiry issue, we are free to cache items for a longer time. There are resources in Pratilipi that are cached with expiries ranging from 10 minutes to 24 hours.

Solving the millisecond staleness: Turbo lag

Even with GraphQL Turbo working tirelessly to keep the cache fresh, there is still a millisecond-level delay between the time the data is updated in the DB and the time it gets removed from the cache.

This won’t create problems in normal use cases. Imagine a scenario where an author is updating his name or changing the title of one of his books. He is not concerned whether his updates are seen by other users a few seconds or minutes later.

But when the author himself re-fetches his profile or his books, he expects to see the updated data immediately. In this case, there is a high chance that he gets stale data from the cache until Turbo catches up and evicts the item.

To solve this, we added a rule in the data fetchers for all resources: if the calling user is the owner of the resource, always bypass the cache. This way, even if Turbo lags in catching up on the updates and evicting cache entries, the author will still see his updated profile and books.
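Roughly, the rule looks like the sketch below, continuing the earlier getAuthor example; viewerId comes from the authenticated request context, and the names are illustrative:

    // Owner-bypass rule: if the caller owns the resource, skip the cache and go
    // straight to the downstream service so he always sees his own latest data.
    async function getAuthorForViewer(authorId: string, viewerId: string) {
      const isOwner = viewerId === authorId;
      if (isOwner) {
        return authorService.getAuthorById(authorId);  // always fresh for the owner
      }
      return getAuthor(authorId);                      // cached read path from the earlier sketch
    }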

Summing it all up

In a nutshell, below are the major steps that work in tandem to make the intelligent caching possible:

  • The query reaches the GraphQL server from the client.
  • The GraphQL resolvers figure out which downstream services to hit for fetching the resources needed by the query.
  • All the resources needed by the query that are present in the server cache are read from there.
  • The remaining resources are fetched from the downstream services and added to the cache before the response is assembled.
  • When a downstream resource changes its state in the DB, a CDC event is generated and pushed to Kafka.
  • GraphQL Turbo consumes the CDC event from Kafka and removes the mutated entities from the GraphQL cache.
  • To avoid the millisecond staleness caused by Turbo lag, the cache is always bypassed when the caller is the owner of the resource.

We will be covering more of our GraphQL journey and the CDC concepts mentioned here in upcoming blogs.

Signing off for now…

About the Author:
Roshan Alexander, Engineering Manager @Pratilipi

Passionate about anything related to using technology to make the world a better place to live in.
A hardcore programmer with several years of experience in architecting highly scalable and robust distributed systems for product startups.

You can reach out to me on LinkedIn.
