Achieving Fault Tolerance With Resilience4j

Aug 11, 2017

This is the first article of a short series about the Resilience4j library. It introduces the library's functionality, its unique features, and the motivation behind it. The remaining articles in the series will share some insights into library internals such as data structures, algorithms, and other tricks.

Intro

Resilience4j is a fault tolerance library designed for Java 8 and functional programming. It is lightweight, modular, and really fast. We will talk about its modules and functionality later, but first, let’s briefly discuss why you should even bother with fault tolerance.

Fault Tolerance

Fault tolerance is the ability of a system to keep operating properly when some of its components fail. It sounds simple, but it is hard to achieve: to make a system fault tolerant, you have to design for it at every level and in every subsystem. And it is not only about proper error handling; you should also keep your failure domains as small as possible and work on fault isolation and self-stabilization. Error handling may seem like the easiest problem here, but I’ll have to disappoint you.

Error Handling Is Still a Thing

There is a very interesting paper called “What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems.” In this study, error-handling bugs are the second largest category (18%) after logic bugs. The authors break down error-handling bugs into three classes of problems:

Error/Failure Detection: errors are often ignored or incorrectly detected;
Error Propagation: this class of problems arises in layered systems, where error detection and error handling code live on different layers and errors get lost or distorted as they propagate between them;
Error Handling: sometimes it’s not clear how to handle rare corner cases, and the lack of such specifications leads to error-prone code.

Example

Let’s continue with a small example of a subsystem that could potentially fail:
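As a minimal sketch of such a subsystem, assume a hypothetical `BackendService` interface whose only method can throw. The names `BackendService`, `FlakyBackendService`, and `doSomething` are illustrative assumptions, not part of Resilience4j.

```java
// Hypothetical remote dependency; the name and method are illustrative only.
interface BackendService {
    String doSomething();
}

// Stand-in implementation that fails on every odd-numbered call,
// simulating an unreliable remote system.
class FlakyBackendService implements BackendService {
    private int calls = 0;

    @Override
    public String doSomething() {
        calls++;
        if (calls % 2 == 1) {
            throw new IllegalStateException("backend unavailable (call " + calls + ")");
        }
        return "response " + calls;
    }
}
```

Any caller of `doSomething()` has to decide what to do when the exception surfaces, which is the question the rest of the article deals with.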

We Can Do Better

So we all know that things break down from time to time, and we often turn a blind eye to it, like this:
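A typical shape of this habit, sketched in plain Java. The class and method names here are hypothetical; the point is only the catch block that logs and moves on.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

class LogAndIgnore {
    private static final Logger LOG = Logger.getLogger(LogAndIgnore.class.getName());

    // Calls the operation, logs any failure, and carries on with null.
    static String callAndLog(Supplier<String> operation) {
        try {
            return operation.get();
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "Call failed, ignoring", e);
            return null; // the caller now has to cope with a missing result
        }
    }

    public static void main(String[] args) {
        String result = callAndLog(() -> {
            throw new IllegalStateException("backend unavailable");
        });
        System.out.println("result = " + result);
    }
}
```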

Yes, logging is a very important aspect of failure detection, but we can be a little bit smarter about it, and this is where Resilience4j can help you. The key word here is “help.” The library won’t automatically fix possible bugs; all the important work and choices are still on you. It can only make this hard path a little easier. Besides logging, there is no limit to the additional actions we can take when a failure occurs. Here are a few options just off the top of my head:

Define “fallback” operations that can go to another host, query a backup DB, or reuse the latest successful response. The example uses Vavr’s Try monad to recover from an exception and invoke another lambda expression as a fallback:
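The fallback idea itself can be sketched with the JDK alone. This is a plain-Java stand-in for the same pattern, not the Vavr `Try` version; `Fallbacks.withFallback` is a hypothetical helper introduced only for illustration.

```java
import java.util.function.Function;
import java.util.function.Supplier;

class Fallbacks {
    // Returns a supplier that tries the primary operation first and, if it
    // throws, computes a replacement value from the exception.
    static <T> Supplier<T> withFallback(Supplier<T> primary,
                                        Function<RuntimeException, T> fallback) {
        return () -> {
            try {
                return primary.get();
            } catch (RuntimeException e) {
                return fallback.apply(e);
            }
        };
    }

    public static void main(String[] args) {
        Supplier<String> failing = () -> {
            throw new IllegalStateException("primary host down");
        };
        // Fall back to a cached value when the primary call fails.
        Supplier<String> guarded = withFallback(failing, e -> "cached response");
        System.out.println(guarded.get()); // prints "cached response"
    }
}
```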

Apply automatic retries, and configure the maximum number of attempts and the wait duration between them:
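To make the two settings concrete, here is a minimal plain-Java sketch of a retry decorator. It is an illustration of the technique under simplifying assumptions (fixed wait, every `RuntimeException` is retried), not the Resilience4j Retry module; `SimpleRetry` and its parameters are hypothetical names.

```java
import java.time.Duration;
import java.util.function.Supplier;

class SimpleRetry {
    // Decorates an operation so that it is attempted up to maxAttempts times,
    // sleeping for waitDuration between attempts. The last failure is rethrown.
    static <T> Supplier<T> decorate(Supplier<T> operation, int maxAttempts, Duration waitDuration) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return operation.get();
                } catch (RuntimeException e) {
                    last = e;
                    if (attempt < maxAttempts) {
                        sleep(waitDuration);
                    }
                }
            }
            throw last; // non-null here because maxAttempts >= 1
        };
    }

    private static void sleep(Duration duration) {
        try {
            Thread.sleep(duration.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

A call such as `SimpleRetry.decorate(service::doSomething, 3, Duration.ofMillis(500))` would then make at most three attempts with half a second between them, where `service` is whatever object exposes the guarded method.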

Use circuit breaking, where you can track error rates of some service/component and, in case of problems, stop all operations with it to help it recover:
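The mechanics can be sketched in a few lines of plain Java. This is a deliberately simplified model that counts consecutive failures instead of tracking an error rate, and it is not the Resilience4j CircuitBreaker; `SimpleCircuitBreaker`, its constructor parameters, and the injected clock are assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    // Thrown instead of calling the guarded operation while the circuit is open.
    static class CircuitOpenException extends RuntimeException {
        CircuitOpenException() { super("circuit breaker is open"); }
    }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;   // injected so the wait can be tested
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    <T> T call(Supplier<T> operation) {
        if (!isCallPermitted()) {
            throw new CircuitOpenException();
        }
        try {
            T result = operation.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onError();
            throw e;
        }
    }

    private synchronized boolean isCallPermitted() {
        if (openedAt < 0) {
            return true;
        }
        // After the wait has elapsed, let calls through again as a trial.
        return clock.getAsLong() - openedAt >= openMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onError() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open (or re-open) the circuit
        }
    }
}
```

While the circuit is open, callers fail fast with `CircuitOpenException` and the struggling component receives no traffic, which gives it room to recover.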

Send an event directly to the monitoring system to speed up problem detection:
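One way to picture this is a decorator that pushes every failure to a sink before rethrowing it. The sketch below uses only the JDK; `ErrorNotifier` and the `Consumer<Throwable>` sink are hypothetical stand-ins for whatever client your monitoring system provides, and this is not the library's own event API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

class ErrorNotifier {
    // Wraps an operation so that each failure is handed to the sink
    // (for example a monitoring client) and then rethrown to the caller.
    static <T> Supplier<T> notifyOnError(Supplier<T> operation, Consumer<Throwable> sink) {
        return () -> {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                sink.accept(e);
                throw e;
            }
        };
    }

    public static void main(String[] args) {
        List<Throwable> events = new ArrayList<>(); // stands in for a monitoring system
        Supplier<String> guarded = notifyOnError(() -> {
            throw new IllegalStateException("backend unavailable");
        }, events::add);
        try {
            guarded.get();
        } catch (IllegalStateException e) {
            System.out.println("events recorded: " + events.size()); // prints "events recorded: 1"
        }
    }
}
```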

Instant event-based notifications are really great, but in general, you should always have a separate monitoring system that polls the health checks of all your nodes and watches for anomalies in your metrics. Resilience4j has add-on modules for integration with Prometheus and Dropwizard Metrics, so you can easily publish your metrics to these systems. For example:

Now you can see what makes Resilience4j unique from the API standpoint: it is just the decoration of method references or any functional interface using higher-order functions. For those of you who love FP, it should be very appealing. If you like OOP more, just use the Proxy pattern, which is especially good with AOP interceptors.
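For the OOP route, the JDK's dynamic proxies are one way to apply the Proxy pattern without touching the guarded class. The sketch below is an assumption-laden illustration: `GreetingService` and `GuardedProxy` are hypothetical names, and the handler only reports failures, where a real interceptor might apply retries or circuit breaking.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.Consumer;

// Hypothetical business interface used only for this example.
interface GreetingService {
    String greet(String name);
}

class GuardedProxy {
    // Returns a proxy that forwards every call to the target and reports
    // any exception thrown by the target before rethrowing it.
    static <T> T guard(Class<T> iface, T target, Consumer<Throwable> onError) {
        Object proxy = Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                (self, method, args) -> {
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        // Unwrap the reflection wrapper to get the real failure.
                        onError.accept(e.getCause());
                        throw e.getCause();
                    }
                });
        return iface.cast(proxy);
    }

    public static void main(String[] args) {
        GreetingService real = name -> "Hello, " + name;
        GreetingService guarded = guard(GreetingService.class, real,
                error -> System.err.println("call failed: " + error));
        System.out.println(guarded.greet("world")); // prints "Hello, world"
    }
}
```

Callers keep working against `GreetingService` and never see the guard, which is the same separation an AOP interceptor would give you.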

There are no library interfaces that you have to implement in order to guard an operation. You aren’t forced to delegate all operations to some separate thread pool. From this standpoint, Resilience4j is very flexible and can be used with any programming paradigm or concurrency model.

Main Modules

Resilience4j can help you apply a range of fault tolerance ideas. It also has some bug prevention capabilities: you can restrict the call rate of a method to at most N requests per time unit, or limit the number of concurrent executions. Everything is highly configurable and there are metrics in place (where it makes sense). All features have very low overhead, and CircuitBreaker, RateLimiter, and Bulkhead can be configured to be completely garbage free. By using the internal event system, you can react immediately to any problem or failure and be notified about it.
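Limiting concurrent executions is the easiest of these ideas to show in isolation. The following is a minimal plain-Java sketch built on `java.util.concurrent.Semaphore`; `SimpleBulkhead` is a hypothetical class for illustration and does not reflect how the Resilience4j Bulkhead is implemented or configured.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the operation only if a permit is free; otherwise rejects
    // immediately instead of queueing the caller.
    <T> T call(Supplier<T> operation) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead is full");
        }
        try {
            return operation.get();
        } finally {
            permits.release(); // always hand the permit back
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting instead of blocking keeps a slow dependency from tying up every caller thread, at the cost of the caller having to handle the rejection.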

Here is a full list of core and add-on modules:

Core modules:

resilience4j-circuitbreaker
resilience4j-ratelimiter
resilience4j-bulkhead
resilience4j-retry
resilience4j-cache

Add-on modules:

resilience4j-metrics: Dropwizard Metrics exporter
resilience4j-prometheus: Prometheus Metrics exporter
resilience4j-spring-boot: Spring Boot Starter
resilience4j-ratpack: Ratpack Starter
resilience4j-retrofit: Retrofit Call Adapter Factories
resilience4j-vertx: Vertx Future decorator
resilience4j-consumer: Circular Buffer Event consumer
resilience4j-rxjava2: integration of internal event system with rxjava2

Additional Resources

If you are interested, please visit our GitHub page or take a look at the User Guide.

For Spring Boot users, we have a starter module and a small demo project.

Originally posted for DZone.