Creating Fault Tolerant Services Using Resilience4j

Gopalkrushna Pattanaik
Dec 24, 2019


In a distributed system, whenever a service makes a synchronous request to another service, there is an ever-present risk of partial failure. Because the client and the service are separate processes, a service may not be able to respond in a timely way to a client’s request. The service could be down because of a failure or for maintenance. Or the service might be overloaded and responding extremely slowly to requests. Because the client is blocked waiting for a response, the danger is that the failure could cascade to the client’s clients and so on and cause an outage.

Several design patterns provide fault tolerance against these partial failures. They are beautifully described in the Netflix Tech Blog here.

In summary, whenever a service invokes other services synchronously, it should protect itself using a combination of the mechanisms below:

  • Network timeouts (never block indefinitely; always set a timeout when waiting for a response)
  • Retry (retry after a specified wait so that transient failures, such as a network glitch, do not fail the whole transaction)
  • Bulkhead (limit the number of outstanding requests from a client to a service)
  • Circuit breaker (track the number of successful and failed requests; if the error rate exceeds some threshold, trip the circuit breaker so that further attempts fail immediately)
  • Rate limiter (as a service, limit the number of requests processed from each client in a given period)
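To make the first four mechanisms concrete, here is a minimal plain-Java sketch that uses only the JDK. The class and method names (ResilienceSketch, withTimeout, retry, Bulkhead, CircuitBreaker) are hypothetical and are not the Hystrix or Resilience4j API. The circuit breaker is also simplified: it trips on consecutive failures rather than on an error rate. Rate limiting is left out for brevity.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical, simplified guards for illustration; NOT the Hystrix or Resilience4j API.
final class ResilienceSketch {

    // Timeout: wait at most timeoutMillis for a result instead of blocking indefinitely.
    static <T> T withTimeout(Supplier<T> call, long timeoutMillis) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // We stop waiting here; the underlying task may still run to completion.
            throw new IllegalStateException("timed out after " + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("call failed", e.getCause());
        }
    }

    // Retry: invoke the call up to maxAttempts times, pausing waitMillis between attempts.
    static <T> T retry(Supplier<T> call, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break;
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // give up early if the thread was interrupted
                }
            }
        }
        throw last;
    }

    // Bulkhead: cap the number of calls in flight; reject extra calls instead of queueing them.
    static final class Bulkhead {
        private final Semaphore permits;

        Bulkhead(int maxConcurrentCalls) { this.permits = new Semaphore(maxConcurrentCalls); }

        <T> T execute(Supplier<T> call) {
            if (!permits.tryAcquire()) throw new IllegalStateException("bulkhead full");
            try {
                return call.get();
            } finally {
                permits.release();
            }
        }
    }

    // Circuit breaker (simplified): opens after N consecutive failures, then fails fast
    // until openMillis has elapsed, after which the next call is let through as a trial.
    static final class CircuitBreaker {
        private final int failureThreshold;
        private final long openNanos;
        private int consecutiveFailures = 0;
        private boolean open = false;
        private long openedAtNanos;

        CircuitBreaker(int failureThreshold, long openMillis) {
            this.failureThreshold = failureThreshold;
            this.openNanos = TimeUnit.MILLISECONDS.toNanos(openMillis);
        }

        synchronized boolean isOpen() {
            return open && System.nanoTime() - openedAtNanos < openNanos;
        }

        // Synchronized for simplicity, so calls through one breaker run one at a time.
        synchronized <T> T execute(Supplier<T> call) {
            if (isOpen()) throw new IllegalStateException("circuit open");
            try {
                T result = call.get();
                consecutiveFailures = 0; // a success closes the circuit again
                open = false;
                return result;
            } catch (RuntimeException e) {
                if (++consecutiveFailures >= failureThreshold) {
                    open = true;
                    openedAtNanos = System.nanoTime();
                }
                throw e;
            }
        }
    }
}
```

These guards compose by nesting, for example wrapping a remote call in withTimeout, passing that to retry, and running the whole thing through a CircuitBreaker. Libraries such as Hystrix and Resilience4j package the same ideas with sliding-window failure rates, metrics and configuration.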

A few of these mechanisms were provided by Netflix's Hystrix library, which is still widely used.

Introduction to Resilience4j

In Nov 2018, Netflix announced that Hystrix was entering maintenance mode and would receive no new development. In Dec 2018, the Spring Cloud team followed by placing its Hystrix integration in maintenance mode as well, so new applications should not use it. Instead, Resilience4j is the new option for Spring developers who want to implement the circuit breaker pattern.

Resilience4j is also more lightweight than Hystrix, as its only dependency is the Vavr library. Hystrix, by contrast, depends on Archaius, which pulls in several other external libraries such as Guava and Apache Commons.

Alongside the circuit breaker pattern, Resilience4j comes with features such as Rate Limiter, Retry and Bulkhead. It works well with Spring Boot and, through the Micrometer libraries, can emit metrics for monitoring.

Spring has not introduced a replacement for the Hystrix Dashboard, so users need to use a tool such as Prometheus or New Relic for monitoring.

A sample proof-of-concept application using Spring Boot and Resilience4j, with monitoring enabled through Prometheus and Grafana, can be found here.

Below are a few screen grabs from the Grafana circuit breaker dashboard.
