Resilience Engineering Strategies
Resiliency is the ability to recover from failures and continue to function. It isn’t about avoiding failures but about accepting that failures will happen and responding to them in a way that avoids downtime or data loss.
Whether you are designing a cloud-native application or a microservice in a distributed system, you often depend on other systems and services. They may be unavailable (offline, under load, in maintenance…) or unreachable (network problem, timeout…). In both cases, you need a good strategy for dealing with failed remote operations and keeping your service stable.
An unavailable database server, a timed-out identity provider call, or an API that is unreachable because of a broken network are all examples of transient faults you will have to deal with in a distributed system. The Retry and Circuit Breaker patterns can help here.
If the called service or network resource causes a transient error, retrying the failed operation may be enough to recover. However, your service should react differently based on the available information (the reported error or handled exception):
- The service may cancel the call if the returned error is permanent (a deprecated API) or cannot be self-corrected (an expired token or invalid credentials).
- The service may retry the operation immediately (up to x times) if the reported fault is unusual (a corrupted packet in the transport layer).
- The service may retry after a delay (x milliseconds) if the called service or resource is likely to become available again shortly (an overloaded database or caching service).
- Choosing proper values for the number of retries and the delay between them is a critical decision. You do not want to overload your own service (threads and resources held per request) and take it out of service!
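The decision logic above can be sketched as a small retry helper. This is a minimal illustration, not a production implementation; the names `retry` and `PermanentError` are my own, and the exponential backoff is one of several reasonable delay strategies:

```python
import time

class PermanentError(Exception):
    """A fault that retrying cannot fix (e.g. expired token, deprecated API)."""

def retry(operation, max_retries=3, base_delay=0.1):
    """Retry `operation` up to `max_retries` times with exponential backoff.

    Permanent errors are re-raised immediately (no point retrying them);
    transient errors are retried after an increasing delay.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except PermanentError:
            raise  # cancel the call: the fault cannot self-correct
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, propagate the last error
            time.sleep(base_delay * (2 ** attempt))  # back off before next try
```

A real library would also add jitter to the delay so that many clients do not retry in lockstep against the same recovering service.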
If the failed system will take a long time to recover, it makes no sense to keep retrying the failed operation! The Circuit Breaker pattern, popularized by Michael Nygard, prevents an application from repeatedly trying to execute an operation that’s likely to fail. A circuit breaker acts as a proxy for operations that might fail: it monitors the number of recent failures and uses this information to decide whether to allow the operation to proceed, or simply return an exception immediately.
Think of this pattern as an electrical circuit breaker with three states:
- Closed: incoming requests are routed to the operation. The proxy maintains a count of recent failures, and if a call to the operation is unsuccessful it increments this count. If the number of recent failures exceeds a specified threshold, the proxy switches to the Open state.
- Open: all incoming requests fail immediately, and the error is returned to the client.
- Half-open: after a defined period of time, a limited number of incoming requests is allowed through to call the operation. If these requests succeed, it’s assumed that the fault has been fixed and the circuit breaker switches back to the Closed state. If any request fails, the circuit breaker assumes the fault is still present and reverts to the Open state.
The patterns above react only once a failure is reported. What if the remote service never responds? What if your client does not want to wait that long? A preemptive approach is needed for these scenarios. The Timeout and Bulkhead patterns are proactive strategies that control the system load and thus improve stability.
How long should your service wait when connecting to a downstream service? Take a database connection, for example: you can set the number of milliseconds before a timeout exception is reported. A timeout that is too long ties up memory and CPU and puts load on your service; one that is too short may throw false exceptions on transient network faults. Either way, the timeout is a good pattern for improving resilience and user experience.
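As an illustration, a timeout can be bolted onto any blocking call by running it in a worker and bounding the wait. The helper name `call_with_timeout` is my own; note the caveat in the docstring, which is exactly why libraries and drivers expose native timeout settings:

```python
import concurrent.futures

def call_with_timeout(operation, timeout_seconds):
    """Run `operation` in a worker thread and give up after `timeout_seconds`.

    Caveat: the worker thread keeps running in the background after a
    timeout; a real implementation must make the operation cancellable
    (or use the library's own timeout parameter where one exists).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"downstream call exceeded {timeout_seconds}s")
    finally:
        pool.shutdown(wait=False)  # do not block the caller on the worker
```

In practice, prefer the timeout knobs your client library already provides (connection and command timeouts on database drivers, HTTP client timeouts, and so on) over wrapping calls yourself.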
A bulkhead is a section of a ship which can be sealed off from others. If the ship is holed in one place, that bulkhead section can be sealed, other sections will not flood, and the whole ship will not sink.
In a resilient, fault-tolerant architecture, elements of an application are isolated into pools so that if one fails, the others continue to function. The bulkhead pattern manages resource consumption directly (a parallelism throttle) by either queuing or rejecting excess requests.
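A minimal bulkhead is a capped pool of execution slots. The sketch below (names `Bulkhead` and `BulkheadFullError` are my own) takes the rejecting variant: once the partition is saturated, further callers fail fast instead of queuing:

```python
import threading

class BulkheadFullError(Exception):
    """Raised when the bulkhead rejects a request instead of queuing it."""

class Bulkhead:
    """Cap concurrent executions; reject callers once the pool is saturated."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def execute(self, operation):
        # Non-blocking acquire: reject immediately rather than queue.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity left in this partition")
        try:
            return operation()
        finally:
            self._slots.release()  # free the slot for the next caller
```

Using one `Bulkhead` per downstream dependency keeps a slow or failing dependency from consuming every thread in your service, just as a sealed section keeps the rest of the ship dry.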
Caching data on the client side (local storage) or saving snapshots of downstream service data (on the host, or centralized in an in-memory system) can further improve the user experience and the stability of your service: the user can work offline and push her changes whenever your service or the remote API becomes available again.
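The snapshot idea can be sketched as a small fallback wrapper around a remote call. The class name `CachedFallback` and its interface are illustrative assumptions; the second element of the returned pair tells the caller whether the data is fresh or a stale snapshot:

```python
class CachedFallback:
    """Serve fresh data when the remote call succeeds; fall back to the
    last known good snapshot when it fails."""

    def __init__(self, fetch):
        self._fetch = fetch      # the remote call, assumed to return data
        self._snapshot = None    # last successfully fetched value

    def get(self):
        try:
            self._snapshot = self._fetch()
            return self._snapshot, True   # fresh data
        except Exception:
            if self._snapshot is None:
                raise                     # nothing cached to fall back to
            return self._snapshot, False  # stale snapshot, but usable
```

A production version would also record the snapshot's age and expose it, so the UI can tell the user how old the data she is looking at actually is.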
Polly is an open-source .NET resilience and transient-fault-handling library that allows you to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner.
It is a great solution for guaranteeing resiliency in database connections and RabbitMQ/Azure Service Bus connections. However, in production scenarios based on Kubernetes, using a service mesh is a better option for providing resiliency in your synchronous microservice-to-microservice communication.
When you use a mesh for resiliency, nothing special is needed in your code. The mesh is a pure infrastructure concern, so your Kubernetes files will be affected, but your code won’t. Both Linkerd and Consul are great solutions for such a scenario.