“Nothing is completely reliable. Everything fails at some point!”
In this post, I describe what it takes to make your customer-facing web application resilient. The advice is based on my experience building highly scalable e-commerce applications for some of the top retailers in the world.
Resiliency is the ability of the system to continue serving incoming requests in an acceptable way even if one or more of its components or dependent systems are not functional.
In a microservices world, a single microservice going down can easily trigger a cascading failure. Public clouds, where you have less control over the underlying resources, make such failures more likely. For example, let’s say Service A depends on Service B, and Service B depends on Service C. If Service C goes down, Service B fails, and Service A fails along with it. This is why designing resiliency into your microservices is so important.
Resiliency design should be done along with the application design and should not be an afterthought. Work closely with the teams that own your dependencies, your DevOps team and enterprise architects during this phase.
Resiliency should be designed at three levels:
Design to have servers in multiple regions:
This ensures you can still serve traffic even if a region goes down. For example, you could deploy to the EastUS, CentralUS and WestUS regions. You can use a public cloud to build this setup.
Design to have a set of back-up servers within each region:
In addition to having servers in multiple regions, plan for multiple sets of servers within each region. This way, the failure of a few servers does not force traffic all the way to a distant region. For example, two Kubernetes clusters per region would work well.
Ensure every resource is distributed:
Whether it is a database, cache, load balancer or key vault, ensure that every resource in the request-processing path is built on a redundant architecture.
Activate auto-scaling where possible:
Most cloud providers offer auto-scaling for servers, databases and so on. Enable it so that your application can absorb a surge in load without manual intervention.
With stateless microservices, each request can go to any healthy server instance, which protects the application from instance reboots and failures.
Have a back-up:
See if you can serve requests from back-up resources. Can the application serve requests from the cache if the DB is unresponsive? If you cannot reach a service, can you post the request to a message queue and process it later?
Configure a deep health probe that checks all must-have dependencies. For example, the probe should hit an endpoint of your application that interacts with the DB and with a service you depend on heavily.
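Here is a minimal Python sketch of both ideas, assuming an in-process dictionary as the cache and a standard-library queue standing in for a real message queue. The names `PrimaryUnavailable`, `read_with_backup` and `call_or_defer` are hypothetical and only illustrate the shape of the logic.

```python
import queue


class PrimaryUnavailable(Exception):
    """Raised by the (hypothetical) primary resource when it cannot respond."""


def read_with_backup(key, read_db, cache):
    """Read from the database; serve the last cached value if the DB fails."""
    try:
        value = read_db(key)
    except PrimaryUnavailable:
        # DB is down: fall back to the cached copy (may be stale, or None if absent).
        return cache.get(key)
    cache[key] = value  # keep the back-up copy fresh for the next outage
    return value


def call_or_defer(payload, call_service, pending):
    """Call the service; if it is unreachable, queue the payload for later."""
    try:
        call_service(payload)
        return "sent"
    except PrimaryUnavailable:
        pending.put(payload)  # a worker can drain this queue once the service recovers
        return "queued"
```

Deferring to a queue only suits calls whose result the caller does not need immediately, such as sending a notification.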
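A deep health probe can be sketched roughly as follows. This assumes each must-have dependency exposes a cheap check that raises on failure; `deep_health` and the dependency names are hypothetical. The function returns the HTTP status your health endpoint would send.

```python
def deep_health(checks):
    """Run every must-have dependency check and report overall health.

    `checks` maps a dependency name to a zero-argument callable that
    returns normally when healthy and raises when not.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(status == "ok" for status in results.values())
    # 200 keeps the instance in rotation; 503 signals it should be taken out.
    return (200 if healthy else 503), results
```

For example, `deep_health({"db": ping_db, "pricing": ping_pricing})` would return 503 as soon as either check raises, where `ping_db` and `ping_pricing` are checks you supply.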
3. Inter-service communication:
Configure an appropriate timeout on every service call. Choose the value carefully: too small a value causes too many timeouts, while too large a value wastes time waiting.
Retry failed service calls, since a call can fail because of a transient network glitch. But don’t retry too many times either: retrying repeatedly against services with long timeouts makes your own service extremely slow. Make sure the operations you retry are idempotent.
Define fallbacks for service call failures and timeouts. A fallback could fetch the value from a cache, pick a default value or use internal logic to calculate an approximate value.
The best way to avoid service call failures is to avoid service calls. See if you can use a cache or reuse values from previous calls. Put a TTL in place to limit the impact of stale data. The service you depend on can provide the TTL value if it knows when the next update is planned, or you can choose one based on what you know about how volatile the data is.
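As a rough illustration, the sketch below puts a time limit on any blocking call using Python's standard thread pool. `call_with_timeout` is a hypothetical helper, and the pool size of 4 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

_pool = ThreadPoolExecutor(max_workers=4)  # pool size is an arbitrary assumption


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) and give up waiting after timeout_s seconds."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except CallTimeout:
        # We stop waiting, but a call that already started keeps running
        # in its worker thread; cancel() only prevents calls not yet started.
        future.cancel()
        raise
```

Most HTTP clients accept a timeout setting directly, and that is preferable when available because it also frees the underlying connection.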
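A bounded retry with exponential backoff might look like the sketch below. `TransientError` and `retry` are hypothetical names, and the defaults (3 attempts, 0.1 s base delay) are placeholders to tune, not recommendations.

```python
import time


class TransientError(Exception):
    """Stands in for failures worth retrying, such as a network glitch."""


def retry(fn, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call fn() up to max_attempts times, doubling the wait between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))  # 0.1 s, 0.2 s, 0.4 s, ...
```

The worst-case latency adds up quickly: with 3 attempts against a call that times out after 2 seconds, the caller can wait 3 × 2 s plus 0.1 s and 0.2 s of backoff, or 6.3 seconds in total.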
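One simple way to express a chain of fallbacks is shown in the sketch below, where each source is a zero-argument callable. `with_fallbacks` is a hypothetical helper, not part of any library.

```python
def with_fallbacks(primary, *fallbacks):
    """Return the first result that succeeds, trying `primary` first."""
    last_error = None
    for source in (primary, *fallbacks):
        try:
            return source()
        except Exception as exc:
            last_error = exc  # remember why this source failed, then try the next
    raise last_error  # every source failed: surface the last error
```

A call such as `with_fallbacks(fetch_live_price, read_cached_price, lambda: DEFAULT_PRICE)` would try the live service, then the cache, then a default, where all three names are placeholders for your own functions and values.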
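A small in-memory cache with a TTL can be sketched as follows. `TTLCache` is a hypothetical class for illustration; in production you would more likely use a shared cache, but the expiry logic is the same idea.

```python
import time


class TTLCache:
    """Cache values for ttl_s seconds, then reload them on the next read."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (value, expires_at)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no service call needed
        value = loader(key)  # missing or expired: call the service once
        self._entries[key] = (value, now + self._ttl_s)
        return value
```

With `TTLCache(ttl_s=60)`, repeated reads of the same key within a minute result in a single call to the loader.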
Design patterns and Tools:
Implement design patterns such as Circuit Breaker and Bulkhead, and use tools like Resilience4j or the resilience features of a service mesh to implement timeouts, retries, fallbacks and connection pooling.
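To show what the two patterns do, here is a simplified Python sketch. It is not how Resilience4j or any service mesh implements them; the class names, thresholds and the single-threaded circuit breaker are all assumptions made for brevity.

```python
import threading
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot for another call."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call later.

    Simplified: not thread-safe, and it counts consecutive failures only.
    """

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("failing fast; dependency recently unhealthy")
            # Cool-down elapsed (half-open): let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust all threads."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many concurrent calls to this dependency")
        try:
            return fn()
        finally:
            self._slots.release()
```

The circuit breaker protects the caller from waiting on a dependency that is known to be failing, while the bulkhead keeps one slow dependency from consuming the capacity needed to serve everything else.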
Test your resiliency:
Test your resiliency design using the resiliency-testing tools available online. You can do this either as part of the deployment pipeline or by conducting a drill in which you test and monitor your application for availability.
Finally, designing for resiliency is an iterative process in most cases. Monitor your application in production and tune settings such as retries, timeouts and cache TTLs until they work well for your traffic.
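If you want a feel for what such a drill measures before adopting a dedicated tool, the sketch below injects failures into a dependency and reports the share of requests that still succeed. `inject_faults` and `availability` are hypothetical helpers, not the API of any particular tool.

```python
import random


def inject_faults(fn, failure_rate, rng):
    """Wrap fn so that roughly failure_rate of its calls raise an error."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return fn(*args, **kwargs)
    return wrapper


def availability(handler, requests):
    """Return the fraction of `requests` calls to handler() that succeed."""
    ok = 0
    for _ in range(requests):
        try:
            handler()
            ok += 1
        except Exception:
            pass  # a failed request counts against availability
    return ok / requests
```

Running the same drill against a handler with and without its fallback shows how much of the injected failure the fallback actually absorbs.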
In this article I have talked mostly about “what” needs to be done to make your application resilient. If you are interested in knowing more about “how” to do it, leave a response and I would be happy to share my thoughts.