Towards building a failure resilient system

Gandharv Srivastava
Capillary Technologies
Apr 11, 2022

150 million: that is the number of communications Capillary’s Engage+ product sent out within a two-hour window at its New Year’s peak. Even a minor failure at this scale can damage our clients’ capital and our product’s reputation.

Failures are like blasts: they can range from a small hand grenade to a nuclear-level explosion, and the devastation they cause depends on the blast radius. No matter how well a system is designed and built, there will always be failures. Even a small blast can escalate into major devastation if it is not identified and handled as early as possible.

‘Everything fails, all the time.’ (Werner Vogels)

Please note that the focus of this blog is robustness and failure recovery in a microservice design, with a particular emphasis on the communication between microservices.

Motivation

In a microservice architecture, failures in one service affect other services, resulting in multiple escalations on the product and a loss of trust in it. In Engage+ we follow a choreographic microservice architecture. I won’t go into the details of this pattern here. In a nutshell, it means:

In a choreographic microservice architecture, each component of the system participates in the decision-making process about the workflow of a business transaction, instead of relying on a central point of control.

Simple choreographic microservice architecture

As we can see from the figure, multiple services are involved in the decision-making process, so handling failures in this architecture can be as hard as finding a needle in a haystack. So how do we detect these failures and control the radius of the blast before it burns down the whole haystack?

Failures & Recovery

These can predominantly be segregated into the following two types:

  • Failures between services: failures in communication with other microservices running within Capillary
  • Infrastructure-level communication failures: failures involving infrastructure components like databases (MySQL), queues (RabbitMQ), etc.

Let’s get into the details.

Failures between services:

A downstream service may become unresponsive for a variety of reasons, resulting in failures: high CPU usage leading to a large number of unresponsive calls, application thread exhaustion, service memory issues, and so on.

According to industry standards, a service should have 99.999% availability to be considered highly available. Let’s take an example where service “A” depends on 5 other services. If each downstream service has an availability of 99%, then service “A” can have a maximum availability of about 95%; even at 99.9% per dependency, the ceiling is roughly 99.5%.

(0.99) ^ 5 ≈ 0.951

(0.999) ^ 5 ≈ 0.995
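
If it helps to see the arithmetic, here is a tiny sketch of the compounding effect (assuming the dependencies fail independently):

```python
# End-to-end availability is the product of each dependency's availability,
# assuming the dependencies fail independently of each other.
def max_availability(per_dependency: float, dependencies: int = 5) -> float:
    return per_dependency ** dependencies

print(max_availability(0.99))   # ~0.951 -> roughly 95%
print(max_availability(0.999))  # ~0.995 -> roughly 99.5%
```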

So how do we deal with this?

Identifying the issue:

The first step in any recovery is to understand the failure. Knowing whether there is a problem, where it is, and what it is becomes critical information for the engineers handling mitigation. Monitoring tools such as AppDynamics and New Relic can give engineers a basic overview of the application as well as crucial metrics like requests per minute, Apdex, and resource usage.

Resiliency before failure recovery:

If one of a service’s instances is down, the service’s responsibilities must still be met. Microservices should be horizontally scaled to run several instances, ensuring that if one instance goes down, the others can take over and respond to the caller service. This removes a single point of failure from the architecture.

Asynchronous communication to mitigate short-term outages:

Short-lived outages can be mitigated by switching from synchronous to asynchronous communication: once the affected services are active again, the queued requests will be processed. This can be accomplished by placing a highly available, queue-based communication layer between the communicating parties. The drawback of this strategy is that it does not work for purely synchronous, real-time flows, so the developer must be meticulous when employing it.
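
As a rough illustration, here is a minimal sketch of queue-based asynchronous communication, assuming RabbitMQ with the pika Python client; the queue name "engage.requests" and the handle_request() handler are purely illustrative:

```python
import json
import pika

# --- Caller service (producer) side ---
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="engage.requests", durable=True)

def send_async(payload: dict) -> None:
    """Publish the request to the queue instead of calling the downstream service directly."""
    channel.basic_publish(
        exchange="",
        routing_key="engage.requests",
        body=json.dumps(payload),
    )

# --- Downstream service (consumer) side ---
def on_message(ch, method, properties, body):
    handle_request(json.loads(body))  # handle_request() is a hypothetical business handler
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="engage.requests", on_message_callback=on_message)
channel.start_consuming()  # if the consumer was briefly down, pending messages are processed now
```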

Auto Recovery

Assuming the engineer was notified promptly and the issue with the broken service was resolved, all other services waiting for responses should retry their calls and receive valid responses from then on. Idempotency has to be enforced across all retried calls. This approach also helps in cases of network blips between services.
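
A minimal retry-with-backoff sketch is shown below. It is not Engage+’s actual implementation, and the endpoint plus the "Idempotency-Key" header are assumptions for illustration; the downstream service must deduplicate on whatever key is agreed.

```python
import time
import uuid
import requests

def call_with_retries(url: str, payload: dict, attempts: int = 5) -> requests.Response:
    """Retry a downstream call with exponential backoff, reusing one idempotency key."""
    idempotency_key = str(uuid.uuid4())  # the same key is sent on every retry
    for attempt in range(attempts):
        try:
            response = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},  # assumed dedup contract
                timeout=5,
            )
            if response.status_code < 500:
                return response  # success, or a client error that retrying won't fix
        except requests.RequestException:
            pass  # network blip or downstream outage; fall through to backoff
        time.sleep(2 ** attempt)  # exponential backoff before the next attempt
    raise RuntimeError(f"Downstream call to {url} failed after {attempts} attempts")
```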

Manual Recovery

Sometimes recovering a service can take a long time and the system’s automatic retries may get exhausted. While this is the least recommended approach, the engineer may then have to attempt manual recovery. This often involves a list of API calls and data-manipulation steps to bring the system back to a consistent state. Be warned that a tedious manual-recovery TODO list wears down an engineer’s morale and confidence.

Infra Level Communication Failures:

Infra failures are like nukes on a system. Issues like unresponsive databases, queue crashes, etc. fall under this category. These kinds of failures are not very common, but they have the potential to break the whole system, and recovery from them is more challenging because in many cases you may lose data.

Database failures

Failures with databases can absolutely paralyze the system; let’s see what we can do here:

Alerting

Both the service and the database should alert engineers of the mishap. Real-time monitoring and alerting on database resource usage pays off in the long run and can pull you out before the situation gets extremely sticky.

Recovery

One can choose to leverage third-party cloud-managed databases for automatic recovery. Managed databases like Amazon Aurora for SQL and MongoDB Atlas for document stores come with built-in backup and recovery mechanisms. For self-maintained databases, you can refer to this blog. Recovery here is about avoiding data loss; once the database is back, retries take over and the microservice returns to working as normal.

Queue failures

Possible failures here include queues becoming unresponsive for some reason or, in extreme cases, resource-related queue crashes, and these can directly lead to data loss.

Write First Approach

For queue failures, the important thing is to avoid data loss. Best practice is the write-first approach: messages entering the queue should be persisted to disk so that stuck messages can be recovered after a crash. For queues like RabbitMQ and AWS SQS, persistence comes out of the box and is configuration based.

Use of write first approach for faster recovery

Specifically for RabbitMQ, one can use features like lazy queues and persistent messages to be more resilient in the event of a crash, allowing engineers to employ the write-first strategy and keep data on disk in case something goes wrong.
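
For reference, a sketch of these RabbitMQ settings using the pika client might look like the following; the queue name and payload are illustrative, while durable queues, lazy mode, persistent messages, and publisher confirms are standard RabbitMQ features:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# A durable queue survives a broker restart; "lazy" mode keeps messages on disk
# rather than in RAM wherever possible.
channel.queue_declare(
    queue="engage.outbound",
    durable=True,
    arguments={"x-queue-mode": "lazy"},
)

# Publisher confirms: a publish only succeeds once the broker has taken
# responsibility for the message.
channel.confirm_delivery()

channel.basic_publish(
    exchange="",
    routing_key="engage.outbound",
    body=b"communication payload",
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
)
```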

More Approaches For Achieving Resiliency

Added check-pointing in a simple choreographic microservice architecture

In a choreographic microservice architecture, one can use check-pointing. Internally we call this process Campaign-Checker. For messages moving from one microservice to another, check-pointing enables real-time monitoring of the flow and helps the team identify the point of the issue.

Total campaign numbers in Engage+ check-pointing dashboard
Error notifications in case of campaign failures
Campaign trends and heartbeats for campaign failures

SLA-based alerts on the checkpoints help mitigate issues further. If the time threshold between checkpoints is breached, alerts can be triggered that help engineers pinpoint the exact location of the failure and speed up mitigation.
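
To make the idea concrete, here is a small, purely illustrative sketch of check-pointing with an SLA watcher; it is not the actual Campaign-Checker implementation, and the threshold and alert hook are assumptions:

```python
import time

SLA_SECONDS = 300  # assumed threshold between two consecutive checkpoints

checkpoints: dict[str, list[tuple[str, float]]] = {}  # campaign_id -> [(stage, timestamp)]

def record_checkpoint(campaign_id: str, stage: str) -> None:
    """Called by each service as it hands a campaign message onward."""
    checkpoints.setdefault(campaign_id, []).append((stage, time.time()))

def check_sla(campaign_id: str) -> None:
    """Alert if the campaign has been stuck at its last checkpoint for too long."""
    stages = checkpoints.get(campaign_id, [])
    if stages:
        last_stage, last_ts = stages[-1]
        if time.time() - last_ts > SLA_SECONDS:
            alert(f"Campaign {campaign_id} stuck after stage '{last_stage}'")

def alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for a real alerting hook (pager, chat, etc.)
```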

Impact

Here is the impact we have had:

Product stability

The biggest motivation for all these changes was product stability. Every failure caused a product escalation and dented our product’s reputation. After implementing these resiliency measures, we observed a significant drop in product escalations.

Engage+ Product escalations in a span of 2 years

Engineering bandwidth

Before we implemented these changes, our development team had to manually recover each and every failed campaign. This involved a long and tough manual process in which we had to pull all of the failed campaigns’ data and reconcile them one by one.

With the introduction of retries and auto recovery, thousands of calls to downstream services have been recovered without any manual intervention. This can be seen on the dashboards we have in place, where we track the performance of all the calls made from one service to another. Previously these failures translated into overall product failures; now, with retries in place, they are auto-recovered.

Cross service dashboard with communication numbers across a week

Introducing this resiliency helped the engineering team recover from failures much more quickly and reduced the effort spent firefighting issues, freeing up time for what developers love best: developing.

Conclusion

Capillary’s Engage+ product sent out about 3 billion messages in FY 2021–22, and thanks to the strategies mentioned above, even at this scale our systems functioned without a hitch. These modifications have not only improved the stability of our product, but they have also made our jobs as engineers a lot easier and more productive.

Hope this helps and thank you for taking the time to read this.
