The Call Unification Cache: Scaling Airtel Thanks App’s Service Nugget with Smart Duplicate Call Mitigation.
Introduction
At Airtel, we have a customer base of over 350 million. A significant number of our customers use the Airtel Thanks App to manage their services. This app has 15 million daily active users (DAU) and is supported by a set of distributed microservices that act as middleware to the underlying databases and CRMs.
We have recently made significant improvements to our service nugget, which serves as the gateway to all of our services. This dynamic widget now adapts and loads based on users’ interactions with Airtel services, ensuring a seamless and personalised experience. This enhancement matters a great deal to us, as it is expected to substantially boost customer engagement with the nugget.
To achieve this, we must gather, for each user, information that is spread across multiple microservices. In doing so, we hit a problem common to distributed architectures that became a bottleneck to scaling the current system. In this article, we will delve into this issue in detail.
Problem Statement:
In today’s world, it’s no longer about having one big system that does everything. We’ve moved on to something better: smaller microservices that offer so many benefits! They’re cost-effective, scalable, and minimize downtime. But with these benefits come challenges, and a major one is the problem of duplicate calls between microservices. Imagine a single incoming request that ends up calling the same downstream service from multiple services at the same time! This creates a lot of redundant traffic and puts a heavy load on the downstream system. It can be frustrating, but thankfully there’s a solution: adding a cache layer.
The idea is simple: the first call goes to the downstream system, the response is written to the cache, and later calls are served from the cache until the TTL (Time to Live) expires. Conventional, and very effective in most cases.
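For context, here is a minimal sketch of that conventional cache-aside pattern in Java. The class, key names and TTL value are illustrative assumptions rather than our production code, and a real deployment would use a distributed cache instead of an in-process map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative cache-aside sketch: the first call hits downstream, later calls
// are served from the cache until the TTL expires. (Hypothetical names; a real
// deployment would use a distributed cache, not an in-process map.)
public class CacheAsideExample {

    private static final long TTL_MILLIS = 60_000; // assumed TTL of 60 seconds

    private record Entry(String value, long expiresAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public String getUsage(String msisdn) {
        Entry entry = cache.get(msisdn);
        if (entry != null && entry.expiresAt() > System.currentTimeMillis()) {
            return entry.value();                       // cache hit: served from cache
        }
        String response = callDownstream(msisdn);       // cache miss: hit the downstream/CRM
        cache.put(msisdn, new Entry(response, System.currentTimeMillis() + TTL_MILLIS));
        return response;
    }

    private String callDownstream(String msisdn) {
        // Placeholder for the real downstream/CRM call.
        return "usage-for-" + msisdn;
    }
}
```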
But what if the calls arrive in parallel, and not in hundreds or lakhs but in crores (tens of millions)? There is no time to update the cache before the next call lands, so:
- All of them land on the downstream system, unnecessarily increasing its TPS.
- Since every call is served by the downstream system, the channels see high latency.
Service 1 can have the same or different main logic in each path, but the important thing is to avoid the duplicate call between Service-3 and the CRM.
Solution:
Now that we’ve understood the problem statement and figured out that adding a cache layer won’t do the trick due to multiple parallel hits, we need to explore other options.
You might be thinking that adding a distributed lock to handle multiple parallel requests could be a solution. But a conventional blocking lock would make our requests wait on one another, and we deal with a huge amount of traffic, so this would affect the latency of our API.
So, what we need is a distributed lock that can work in parallel without impacting the API’s latency.
Keeping all these challenges in mind, let’s look at the overall solution using the diagram below.
Flow Diagram:
In simple terms, when we receive the R1 request, we first create a cache key with the value ‘WAITING’, and then send a request downstream for a response.
Any other parallel request (R2) will first check whether the cache key exists. If it finds the key with the value ‘WAITING’, it will keep polling the cache, up to the maximum polling time, to fetch the response.
If the key has not been created yet, R2 will contend with R1 to acquire the lock. Only one of them will acquire it; the other will start waiting and polling the cache, up to the maximum polling time, to fetch the response.
If R1 is successful, it will update the cache value from ‘WAITING’ to the actual response. When R2 polls again, it will return the actual response in almost the same response time as the R1 request, maintaining the API’s latency.
If R1 fails, the key is removed from the cache and R2 also returns a failure, since it is a request for the same number as R1.
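Below is a minimal sketch of this flow in Java. The DistributedCache interface, method names and timing constants are assumptions made for illustration; the key points are that an atomic “create if absent” write acts as the lock, the ‘WAITING’ sentinel tells parallel requests to poll, and a failure by the lock holder clears the key.

```java
import java.util.Optional;

// Illustrative sketch of the call-unification flow. The DistributedCache
// interface and the timing values are assumptions for this example.
public class CallUnificationSketch {

    /** Minimal view of the distributed-cache operations the flow needs. */
    public interface DistributedCache {
        /** Atomically creates the key; returns false if it already exists. */
        boolean putIfAbsent(String key, String value, int ttlSeconds);
        Optional<String> get(String key);
        void put(String key, String value, int ttlSeconds);
        void delete(String key);
    }

    private static final String WAITING = "WAITING";
    private static final long POLL_INTERVAL_MS = 85;   // assumed polling interval
    private static final long MAX_POLL_MS = 1_200;     // assumed max polling time
    private static final int TTL_SECONDS = 60;         // assumed cache TTL

    private final DistributedCache cache;

    public CallUnificationSketch(DistributedCache cache) {
        this.cache = cache;
    }

    public String fetch(String cacheKey) throws Exception {
        // R1 path: whichever request creates the key first owns the downstream call.
        if (cache.putIfAbsent(cacheKey, WAITING, TTL_SECONDS)) {
            try {
                String response = callDownstream(cacheKey);
                cache.put(cacheKey, response, TTL_SECONDS);  // WAITING -> actual response
                return response;
            } catch (Exception e) {
                cache.delete(cacheKey);                      // R1 failed: clear the key, R2 fails too
                throw e;
            }
        }

        // R2 path: the key already exists, so poll until the response arrives
        // or the maximum polling time elapses.
        long deadline = System.currentTimeMillis() + MAX_POLL_MS;
        while (System.currentTimeMillis() < deadline) {
            Optional<String> value = cache.get(cacheKey);
            if (value.isEmpty()) {
                throw new IllegalStateException("Owning request failed for " + cacheKey);
            }
            if (!WAITING.equals(value.get())) {
                return value.get();                          // served in roughly R1's latency
            }
            Thread.sleep(POLL_INTERVAL_MS);
        }
        throw new IllegalStateException("Timed out waiting for " + cacheKey);
    }

    private String callDownstream(String cacheKey) throws Exception {
        // Placeholder for the real downstream/CRM call.
        return "response-for-" + cacheKey;
    }
}
```

In a real deployment, the atomic create-if-absent step maps naturally to a create-only write in a distributed cache such as Aerospike, which is exactly the role it plays in the strategy described next.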
To implement the solution discussed above, we needed two major strategies:
- A time-efficient and scalable distributed lock strategy. To fulfil this need, we used Aerospike Cache for its low latency, high throughput, and scalable architecture.
- A well-thought-out and scalable poll-and-wait strategy. For this part, we were careful not to increase the API’s latency or overutilize the servers, so that we could scale as far as possible on the current infrastructure. The polling interval and maximum polling time therefore had to be configured with great care. Here’s how we calculated them:
Polling interval time: we took the average of the differences between the 50th, 75th, and 99th percentiles of the downstream response time.
Max polling time: number of calls (including retries) × timeout + polling interval + delta (100)
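To make these two formulas concrete, here is a worked example with purely hypothetical numbers: if the downstream response time is 80 ms at the 50th percentile, 120 ms at the 75th, and 250 ms at the 99th, the polling interval is the average of the gaps, ((120 − 80) + (250 − 120)) / 2 = 85 ms. With two calls (one retry), a 500 ms timeout per call, that polling interval, and a delta of 100, the maximum polling time comes out to 2 × 500 + 85 + 100 = 1185 ms.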
At Airtel, we strive to provide generic and easy-to-use solutions that can be implemented anywhere with the same problem statement. In this case, you only need to add the following annotation along with some configurations:
@OnlineDistributedLock(cacheName = UsageConstants.PREPAID_USAGE,
key = UsageConstants.MOBILITY_CACHE_KEY)
Yes! It’s that easy to get this solution up and running.
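The internals behind the annotation are not shown here, but to give a feel for how such an annotation can be wired up, here is a hypothetical sketch using Spring AOP: an @Around advice intercepts any method marked with @OnlineDistributedLock and routes it through the lock-and-poll flow sketched earlier. Everything other than the annotation’s name and attributes (which come from the usage above) is an illustrative assumption, not our actual implementation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Hypothetical shape of the annotation, inferred from its usage above.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface OnlineDistributedLock {
    String cacheName();
    String key();
}

// Hypothetical aspect: intercepts annotated methods and applies the
// acquire-or-poll behaviour around the original downstream call.
@Aspect
@Component
class OnlineDistributedLockAspect {

    @Around("@annotation(lock)")
    public Object unify(ProceedingJoinPoint joinPoint, OnlineDistributedLock lock) throws Throwable {
        String cacheKey = lock.cacheName() + ":" + resolveKey(lock.key(), joinPoint.getArgs());
        // Sketch of the flow from the earlier example:
        //   1. try an atomic "create if absent" with the value WAITING
        //   2. if acquired: proceed() to call downstream, then store the response
        //   3. if not: poll the cache until the response appears or the max polling time elapses
        return joinPoint.proceed(); // placeholder: the real logic would wrap this call
    }

    private String resolveKey(String keyExpression, Object[] args) {
        // Placeholder key resolution, e.g. evaluating the key against the method arguments.
        return keyExpression + ":" + (args.length > 0 ? args[0] : "");
    }
}
```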
Outcome:
Alright, now that we’ve got the above solution up and running, it’s time to dish out some seriously impressive results.
Currently, we are saving roughly 43 million downstream hits per day (about 32% of the calls), which adds up to around 303 million hits per week. Compared with our default cache implementation, the current implementation saves about 50% of the requests.
This was a crucial implementation for scaling the service nugget to 100% of users on the Airtel Thanks App.