AarogyaSetu: Architecting high-scale infra

AarogyaSetu is a contact-tracing app launched by the Govt of India to fight COVID-19. So far the app has been downloaded by more than 125 million users. You can find more details about it at https://aarogyasetu.gov.in/.

SAURAV YADAV
AarogyaSetu
5 min read · Jun 13, 2020


Architecting an application that scales to millions of users requires massive planning, all the more so when you are designing it for the Government of India with user data involved. It means building a system that is highly robust and ultra-secure at a mammoth scale.

It all started with a call on a Sunday morning (22nd March) around 11 AM, with a requirement to set up a test environment for a backend application. There was a sense of urgency, and I wasn’t told anything about the application or the project. Unaware of the situation, I set up a small POC infra for the application and went back to the usual stuff. Three days later, on 25th March, I got a similar call late in the evening, this time asking me to build a highly scalable and secure infrastructure for an app called AarogyaSetu, and to do it within a day, rather that night :). A team from the Govt. of India was to test the app on scalability and security parameters the next day.

Design Considerations

  1. A super simple data flow, with no multiple or complex API calls.
  2. Microservices and serverless infra for quick scaling.
  3. Leveraging client-side storage, to address privacy as well as reduce backend chatter.

The Scale Challenge

We knew roughly what kind of workload to expect, a spiky load, and kept that in mind throughout design and implementation, but real-life traffic patterns can be very unpredictable. We had tested the system for 1 million requests per minute, but when we went live we started seeing around 3–4 million requests per minute within a few days of launching.
An even bigger challenge came when the Prime Minister discussed the app on TV: we suddenly saw a spike of 15 million requests per minute, resulting in Lambda getting throttled and the SMS gateway throwing errors for a few minutes.

Number of Requests in millions

Scalability — To handle this tsunami of traffic, the easiest solution was to build the infra with headroom in place for the expected concurrency. But we might end up paying heavy infrastructure costs for unoptimized resources, so we decided to go the serverless and containerized way to tackle this optimally.

Serverless — We built the authentication service entirely on serverless architecture with API Gateway and Lambda. Lambda takes care of everything required to run and scale our code with high availability; we just need to upload the code. With Lambda, we don’t have to worry about scale-up time.
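As an illustration (not the actual AarogyaSetu code), a Lambda function behind an API Gateway proxy integration can be as small as a single handler. The `verify-otp` flow, the secret handling, and the token format below are all hypothetical:

```python
import json
import hmac
import hashlib

# Hypothetical secret; in production this would come from a secrets store, not source code.
SECRET = b"replace-with-a-real-secret"

def _sign(phone: str) -> str:
    """Derive a deterministic session token for a phone number (illustrative only)."""
    return hmac.new(SECRET, phone.encode(), hashlib.sha256).hexdigest()

def handler(event, context):
    """API Gateway (proxy integration) -> Lambda entry point."""
    body = json.loads(event.get("body") or "{}")
    phone, otp = body.get("phone"), body.get("otp")
    if not phone or not otp:
        return {"statusCode": 400, "body": json.dumps({"error": "phone and otp required"})}
    # A real OTP check would look the code up in a datastore; here any 6-digit OTP passes.
    if len(otp) == 6 and otp.isdigit():
        return {"statusCode": 200, "body": json.dumps({"token": _sign(phone)})}
    return {"statusCode": 401, "body": json.dumps({"error": "invalid otp"})}
```

The appeal for spiky traffic is that Lambda runs as many concurrent copies of this handler as the request rate demands, with no instances to pre-provision.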

Serverless was the way to go because we were dealing with spiky loads and needed practically instant auto-scale-up and scale-down.

Containerization — We chose the microservice route and avoided a monolithic architecture completely. All the backend services are containerized and can scale to millions of requests within a few minutes. How fast a containerized application scales depends on its autoscaling rules, but relying on autoscaling alone to handle traffic spikes is not a good idea because the scaling speed is limited.

The problem with tsunami-style traffic is scale-up time. When the application gets a sudden burst of requests and the scale-up time is high, users start seeing latency or errors.

Scale-up time = New Infra provisioning time + Application Boot-up time + scaling trigger time
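With made-up numbers (these are illustrative, not our measured values), the formula makes the problem concrete:

```python
# All numbers below are hypothetical, for illustration only.
provisioning_time_s = 120   # time for new instances/tasks to come up
boot_up_time_s = 45         # application boot time inside the new container
trigger_time_s = 60         # metric evaluation delay before scaling fires

scale_up_time_s = provisioning_time_s + boot_up_time_s + trigger_time_s
print(f"Scale-up time: {scale_up_time_s} s")  # 225 s
```

If traffic can multiply within that window, something else (headroom, or aggressive step scaling, below) has to absorb the burst until the new capacity is live.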

To scale up faster, we had a few options:

Headroom — We can run with, say, 20% more infra than needed. This gives us buffer time for the infra to autoscale.

Step Scaling — Instead of the recommended target-tracking policy configuration, we went ahead with step scaling to aggressively scale out when utilization reaches a certain level. Target-tracking scaling adjusts the service task count in proportion to how far the actual metric value is above the target, whereas a step scaling policy can add a predetermined number of tasks whenever utilization breaches the target value, allowing us to scale out the application aggressively.
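As a sketch (the thresholds, task counts, and resource names are hypothetical, not the production values), a step scaling policy for an ECS service is expressed through Application Auto Scaling; the commented-out boto3 call shows where it would be registered:

```python
# Hypothetical step scaling configuration for an ECS service.
# Step intervals are relative to the alarm threshold: with an alarm at 50% CPU,
# a lower bound of 0 means "at or above 50%".
step_scaling_policy = {
    "AdjustmentType": "ChangeInCapacity",  # add a fixed number of tasks per step
    "Cooldown": 60,                        # seconds between scaling activities
    "MetricAggregationType": "Average",
    "StepAdjustments": [
        # 0-20 above the threshold: add 10 tasks; beyond that: add 30 at once.
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 10},
        {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 30},
    ],
}

# Registration would use boto3's Application Auto Scaling client, e.g.:
# client = boto3.client("application-autoscaling")
# client.put_scaling_policy(
#     PolicyName="backend-step-scale-out",
#     ServiceNamespace="ecs",
#     ResourceId="service/backend-cluster/backend-service",
#     ScalableDimension="ecs:service:DesiredCount",
#     PolicyType="StepScaling",
#     StepScalingPolicyConfiguration=step_scaling_policy,
# )
```

The second step is what makes this aggressive: a large breach jumps straight to a big capacity addition instead of creeping up proportionally.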

Scaling policies — All the services expose metrics such as RPM, data processed, and CPU and memory utilization, and we configured autoscaling to trigger on multiple metrics.
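One way to wire this up (metric names, thresholds, and the policy ARN below are hypothetical placeholders) is to give each metric its own CloudWatch alarm, all pointing at the same scale-out policy:

```python
# Placeholder ARN; a real one is returned when the scaling policy is created.
SCALE_OUT_POLICY_ARN = "arn:aws:autoscaling:region:account:scalingPolicy/backend-step-scale-out"

def make_alarm(name, namespace, metric, threshold, unit):
    """Build a CloudWatch put_metric_alarm parameter set (illustrative values)."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",
        "Period": 60,             # evaluate every minute
        "EvaluationPeriods": 1,   # react after a single breach, for speed
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "Unit": unit,
        "AlarmActions": [SCALE_OUT_POLICY_ARN],
    }

alarms = [
    make_alarm("backend-cpu-high", "AWS/ECS", "CPUUtilization", 50, "Percent"),
    make_alarm("backend-mem-high", "AWS/ECS", "MemoryUtilization", 60, "Percent"),
    make_alarm("backend-rpm-high", "Custom/Backend", "RequestsPerMinute", 1_000_000, "Count"),
]

# Each alarm would be created with:
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Whichever signal breaches first (CPU, memory, or a custom request-rate metric) fires the same scale-out action, so no single lagging metric can delay the response to a spike.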

This lets us add or remove infra dynamically while the platform handles the millions of requests per minute it receives.

Scaling events for backend application

Did it scale?

YES, as a matter of fact :). And it was a great joy to see that even during peak traffic, the 99th-percentile response time stayed within 170 ms, with an average response time under 40 ms.

Response time of backend application

Thank You!!

I am honored and full of pride to have been part of the AarogyaSetu team. Thanks to the Government of India and the AarogyaSetu team for giving me the opportunity to work on a project that has an impact on human lives.

Also, I request you all to download the app and help the Govt of India fight COVID-19.

The app is also available for the KaiOS platform for all Jio phones.

मैं सुरक्षित, हम सुरक्षित, भारत सुरक्षित! (I am safe, we are safe, India is safe!)
