Observability to Power Resilient Infrastructure

Asadh Sheriff · Published in ShareChat TechByte
7 min read · Aug 2, 2021

At ShareChat, we are building a new microservice every month for our business and infrastructural requirements. With a growing user base on top of a growing number of microservices, we are more often than not running into situations where a few of our services and pods become unresponsive as traffic on the ShareChat platform increases.

Problem Statement

One of the primary reasons for a pod going down is increased incoming traffic on one of the source (contributing) services, which in turn increases the number of calls to the destination service. In the example below, Account Service (the destination service) faces an increased traffic inflow due to increased flow on Feed Service.

Service unresponsive due to overload

Potential Side Effect

We can infer from the above diagram that the Ads Service was impacted while fetching account information from Account Service, as one of the pods went down due to increased traffic on Feed Service. Although the Ads Service does not interact with Feed Service, it is still impacted due to a common interservice dependency bottleneck. It is unfair for Ads Service, or any service, to run into what is commonly known as a “Cascading Failure”.

How do we solve this? Service owners having a conversation 😂

Service Owners: “Hey, can you not call our service a lot of times please?”

Other Service Owners: “Ok, we’ll try to..”

Jokes apart, we have a few recommendations for this problem:

  1. Reach out to the owners of each microservice to review the heatmap of all outgoing calls and optimize them
  2. Have each microservice owner devise a solution to protect their service from being bombarded with an excessive number of requests
  3. Develop an edge solution that protects services from affecting each other

After evaluation, the first and second options are not scalable for a platform with hundreds of microservices. Edge solutions, on the other hand, are a common observability approach for gaining insight into traffic and network-related footprints.

Solution Proposal

In the world of microservices and high-throughput ecosystems, the majority of rate-limiting logic is implemented under the philosophy of Service Mesh architecture. If you google it, you will find multiple scenario-driven implementations of rate limiting in a Service Mesh architecture. At ShareChat, however, we wanted to hand control over to the owners of each microservice so that they can shape the inflow of traffic per contributing source service. The idea is to control the inflow using a distribution function over the range [0, 1] across all contributing source services. To understand this better, please have a look at the diagram below.

Distribution logic

How does it work?

A microservice or destination service like Account Service can have declarative control over its resources, allocating a certain percentage of them to a specific source service while serving requests. With the latest Observability Client (refer here), we are enabling microservices to declare the load distribution for a pod in the service descriptor as a distribution function on a 0–1 scale, just as shown in the image above.

Distribution condition:

The sum of all the distribution values provided by a service should not exceed 1. For distributions whose sum is less than or greater than 1, we normalize the values so that 100% of the pod's resources are utilized, as sketched below.
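For illustration, the normalization step can be thought of roughly as follows. This is a minimal sketch, not the actual Observability Client implementation, and the weights in main are hypothetical:

```go
package main

import "fmt"

// normalizeDistribution rescales the per-source weights so they sum to 1,
// ensuring 100% of the pod's resources are accounted for.
func normalizeDistribution(weights map[string]float64) map[string]float64 {
	var sum float64
	for _, w := range weights {
		sum += w
	}
	if sum == 0 {
		return weights // nothing to normalize
	}
	normalized := make(map[string]float64, len(weights))
	for src, w := range weights {
		normalized[src] = w / sum
	}
	return normalized
}

func main() {
	// Declared weights sum to 0.9, so they get scaled up to utilize the full pod.
	fmt.Println(normalizeDistribution(map[string]float64{
		"feed-service": 0.5,
		"ads-service":  0.3,
		"others":       0.1,
	}))
}
```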

Implementation deep dive

Most Service Mesh architectures deploy rate limiting as an edge server in a sidecar pattern. The reason is to abstract the application container away from the implementation details of the proxy and from the centralized rate-limiting computation between the application pod and the central rate-limiting orchestrator. On the other hand, the cost of deploying and maintaining a sidecar is steep, and it did not make sense for us to implement a fully functional, centrally orchestrated rate-limiting infrastructure right away. We phased our solution as below:

  1. Implement pod-level rate limiting; for now, this mitigates the issue of a pod being bombarded by redirecting the excess traffic, provided the load balancer uses a round-robin distribution algorithm, which is the case at ShareChat.
  2. Collect statistics from every pod of a service to get overall insight into application health per traffic-contributing service, and use these statistics for both predictive and dynamic auto-scaling.
  3. Centrally orchestrate rate limiting to enable accurate dynamic auto-scaling.

The Observability Client enables rate limiting at a pod level per microservice, based on the declared traffic distribution. Internally, the client library uses the leaky bucket approach and rejects overflowing requests with HTTP status 429, roughly as in the sketch below.
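To make the idea concrete, here is a simplified leaky-bucket middleware. It is an illustrative sketch; the handler names, capacity, and leak rate are assumptions, not the actual Observability Client code:

```go
package main

import (
	"net/http"
	"sync"
	"time"
)

// leakyBucket is a minimal leaky-bucket meter: each request adds one unit of
// "water", the bucket drains at a fixed rate, and anything that would
// overflow is rejected.
type leakyBucket struct {
	mu       sync.Mutex
	capacity float64   // maximum water the bucket can hold
	leakRate float64   // units drained per second
	water    float64   // current level
	last     time.Time // last time the level was updated
}

func (b *leakyBucket) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	// Drain whatever has leaked out since the last request.
	b.water -= now.Sub(b.last).Seconds() * b.leakRate
	if b.water < 0 {
		b.water = 0
	}
	b.last = now
	if b.water+1 > b.capacity {
		return false // bucket would overflow
	}
	b.water++
	return true
}

// rateLimit wraps a handler and rejects overflowing requests with HTTP 429.
func rateLimit(b *leakyBucket, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !b.allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical per-pod limit: drain 100 requests/second, queue up to 100.
	bucket := &leakyBucket{capacity: 100, leakRate: 100, last: time.Now()}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })
	http.ListenAndServe(":8080", rateLimit(bucket, ok))
}
```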

Complete Architecture

Explainers for better understanding

  1. Request per unit (RPU): We allow a client service to rate-limit at various granularities, such as n requests per second, per 30 seconds, per minute, per 5 minutes, per 10 minutes, etc.
  2. Resource Name: The name of the resource or service used internally at ShareChat
  3. Inbound Distribution Percentage: The traffic distribution a service defines for a list of contributing services
An example of a descriptor to the Observability Client
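For illustration, a descriptor along these lines could be declared as below. The field names and values here are hypothetical and do not reflect the exact client schema:

```go
package main

import "fmt"

// Descriptor is an illustrative shape of what a service might declare to the
// Observability Client.
type Descriptor struct {
	ResourceName        string             // name of the resource/service used internally
	RequestsPerUnit     int                // n requests allowed per unit of time
	Unit                string             // "second", "30s", "minute", "5m", "10m", ...
	InboundDistribution map[string]float64 // 0–1 share of the pod's capacity per source service
}

func main() {
	// Hypothetical declaration by Account Service for one of its pods.
	d := Descriptor{
		ResourceName:    "account-service",
		RequestsPerUnit: 600,
		Unit:            "second",
		InboundDistribution: map[string]float64{
			"feed-service": 0.5,
			"ads-service":  0.3,
			"others":       0.2,
		},
	}
	fmt.Printf("%+v\n", d)
}
```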

How are we building resilient infrastructure?

  • Periodically sync the state of the counters back to the Observability Orchestrator to generate a state machine with attributes such as the number of requests served and the number of requests rejected per RPU. This state machine is then used for scale prediction.
Pod-level statistics collected

These metrics are synced from each pod to the orchestrator so that the orchestrator can run aggregations on top of the above table to derive overall load distribution metrics at the service level.
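The per-pod counters that get synced could be modelled roughly as in the sketch below. The field names, the sync interval, and the example numbers are assumptions for illustration; in reality the push would be an RPC rather than a callback:

```go
package main

import (
	"fmt"
	"time"
)

// PodStats is an illustrative snapshot of the counters a pod reports per RPU
// window, broken down by traffic-contributing source service.
type PodStats struct {
	PodID            string
	Resource         string
	Window           time.Duration
	ServedBySource   map[string]int64
	RejectedBySource map[string]int64
}

// syncLoop periodically flushes the current snapshot to the orchestrator.
func syncLoop(snapshot func() PodStats, push func(PodStats), every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for range ticker.C {
		push(snapshot())
	}
}

func main() {
	snapshot := func() PodStats {
		return PodStats{
			PodID:            "account-service-pod-1",
			Resource:         "account-service",
			Window:           time.Second,
			ServedBySource:   map[string]int64{"feed-service": 950, "ads-service": 430},
			RejectedBySource: map[string]int64{"feed-service": 2, "ads-service": 5},
		}
	}
	push := func(s PodStats) { fmt.Printf("sync to orchestrator: %+v\n", s) }
	go syncLoop(snapshot, push, 10*time.Second)
	select {} // keep the sketch running
}
```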

  • Periodically aggregate RPU across all pods of a service to deduce the health and trends of the service under a given consolidated load
Traffic insight

From the above state, we can infer that Account Service accepted (A) 200,000 RPS while rejecting 45 requests from Feed Service, 320 from Ads Service, and 180 from other services, a total of 545 requests rejected per second. If we plot the number of requests accepted or rejected by a service on a 24-hour timeline, we get the data model for future prediction.
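At the orchestrator, the aggregation itself is essentially a sum over pod snapshots per source service, along the lines of this sketch (the type and sample counters are illustrative):

```go
package main

import "fmt"

// podCounters is a minimal per-pod view: requests accepted and rejected,
// keyed by the source service that sent them.
type podCounters struct {
	Accepted map[string]int64
	Rejected map[string]int64
}

// aggregate sums counters across all pods of a service so the orchestrator
// can see load distribution at the service level.
func aggregate(pods []podCounters) (accepted, rejected map[string]int64) {
	accepted, rejected = map[string]int64{}, map[string]int64{}
	for _, p := range pods {
		for src, n := range p.Accepted {
			accepted[src] += n
		}
		for src, n := range p.Rejected {
			rejected[src] += n
		}
	}
	return accepted, rejected
}

func main() {
	pods := []podCounters{
		{Accepted: map[string]int64{"feed-service": 600}, Rejected: map[string]int64{"feed-service": 1}},
		{Accepted: map[string]int64{"feed-service": 580, "ads-service": 320}, Rejected: map[string]int64{"ads-service": 4}},
	}
	a, r := aggregate(pods)
	fmt.Println("accepted:", a, "rejected:", r)
}
```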

  • Generate a dependency graph, with each node of the graph carrying data such as the number of pods running, the number of requests accepted or rejected, etc.
Example dependency graph

A dependency graph is a directed graph that helps us visualize the interservice dependencies of all the services at ShareChat, with the load distribution ratio as the weight of each edge.
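A minimal in-memory representation of such a graph could look like the following sketch; the node and edge fields, as well as the sample figures, are assumptions for illustration:

```go
package main

import "fmt"

// node carries the per-service health data attached to each vertex.
type node struct {
	Service  string
	Pods     int
	Accepted int64
	Rejected int64
}

// edge is a directed dependency: Source calls Target, and Ratio is the load
// distribution ratio declared for that link.
type edge struct {
	Source, Target string
	Ratio          float64
}

// depGraph is a directed graph of interservice dependencies.
type depGraph struct {
	Nodes map[string]*node
	Edges []edge
}

func main() {
	g := depGraph{
		Nodes: map[string]*node{
			"feed-service":    {Service: "feed-service", Pods: 120, Accepted: 350000},
			"account-service": {Service: "account-service", Pods: 333, Accepted: 200000, Rejected: 545},
		},
		Edges: []edge{
			{Source: "feed-service", Target: "account-service", Ratio: 0.5},
		},
	}
	fmt.Printf("%+v\n", g.Edges[0])
}
```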

  • Based on the incoming load, use a service's trend function to project it onto the current health state of the dependency graph and estimate which dependent services need to be scaled down or up.
Traffic prediction using dependency graph

The diagram shows that ~53% of Feed Service's incoming traffic indirectly calls Account Service, so a 15% change in load at Feed Service effectively increases the traffic on Account Service by 6%. Let's assume a single pod of Account Service can handle an average of 600 RPS; we then need to scale up by 20 more pods to handle the 6% increase in traffic on Account Service. If this information is available to us ahead of time, we can scale the servers in a predictive fashion.
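Taking the ~200,000 RPS accepted by Account Service from the traffic insight above together with the figures in this example, the back-of-the-envelope pod calculation works out as follows:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		currentRPS      = 200000.0 // RPS currently accepted by Account Service
		trafficIncrease = 0.06     // 6% effective increase propagated from Feed Service
		rpsPerPod       = 600.0    // average RPS a single Account Service pod can handle
	)
	extraRPS := currentRPS * trafficIncrease          // 12,000 additional RPS
	extraPods := int(math.Ceil(extraRPS / rpsPerPod)) // 12,000 / 600 = 20 pods
	fmt.Printf("scale up by %d pods ahead of the spike\n", extraPods)
}
```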

The figure below shows the trend we currently deal with at ShareChat's scale.

Auto-scaling without a predictive system

Trend Function for Prediction

Statistically, when we have the data points below, we can predict, or statistically foresee, the resource requirements needed to scale dynamically:

PsAonB: the percentage of Service A's traffic that contributes to the input load of Service B (as per the config file provided by service owners)

nPSA: the number of pods running for Service A

nPSB: the number of pods running for Service B

NRPSsB: the number of requests a single pod of Service B can handle

SaRPS: the overall RPS of Service A

CSbRPS: the RPS of Service B contributed by Service A overall

Scale up if this is true

CSbRPS/SaRPS <= PsAonB * nPSB * NRPSsB

From the above equation, it is evident that a scaling decision can be derived. With this statistical approach, we should be able to cover more than 95 percent of scale-up and scale-down scenarios, unless it is a festival day or a bot is attacking our services. For festivals, we can generate trends based on past festival dates and feed them as input to our system.
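Transcribing the variables and the scale-up condition exactly as stated above into a small helper gives a sketch like the one below; the example inputs in main are purely illustrative:

```go
package main

import "fmt"

// scaleInputs holds the data points defined above for a pair of services A -> B.
type scaleInputs struct {
	PsAonB float64 // share of Service A's traffic contributing to Service B (from the config file)
	NPSB   int     // number of pods running for Service B
	NRPSsB float64 // requests a single pod of Service B can handle
	SaRPS  float64 // overall RPS of Service A
	CSbRPS float64 // RPS of Service B contributed by Service A
}

// shouldScaleUp evaluates the scale-up condition as written above:
// CSbRPS/SaRPS <= PsAonB * nPSB * NRPSsB
func shouldScaleUp(in scaleInputs) bool {
	return in.CSbRPS/in.SaRPS <= in.PsAonB*float64(in.NPSB)*in.NRPSsB
}

func main() {
	in := scaleInputs{PsAonB: 0.5, NPSB: 333, NRPSsB: 600, SaRPS: 350000, CSbRPS: 175000}
	fmt.Println("scale up:", shouldScaleUp(in))
}
```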

Scale up with minimized latency

Challenges & Scope for Improvement

It looks fascinating, doesn't it? But it has many challenges in increasing accuracy for a real-time use case, due to fluctuating parameters such as the time it takes to spawn multiple pods and have them become available, and the growing delta factor in the trend function. We can address this with more accurate and advanced predictions of future traffic using ML/AI.

Supervised learning and data modelling are mathematically established ways to arrive at a higher level of prediction accuracy. We have built an extensive provisioning system to create models on top of data trends and get better predictions.

Co-authored by: Ankur Narain Verma
