Goresilience, a Go library to improve application resiliency
When we start developing an application, lots of us forget one of the most important facts about our programs: the application will run in production, and nothing is guaranteed there.
We know this is true, but we forget about it. It could be as easy as asking ourselves some of these questions…
- What if the network isn’t working?
- Is one of my dependency services not responding? Or worse, is it responding slowly?
- Can I become DoSed by my clients? (internal clients?!)
- Can my own concurrency bring me down, given my currently assigned resources?
I could continue… but you know what I mean.
Why does this happen to us? Maybe the answer is that these things don’t happen on the happy path. But they do happen, especially in high-scale environments.
That’s why companies like Netflix, which work at this huge scale, develop things like Hystrix.
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
Libraries like this let them control these kinds of problems, and Goresilience is one of those libraries.
Goresilience is a library to improve the resilience of Go applications in an easy and flexible way. It can be extended and tries to be idiomatic Go.
- Flexibility: Easy to extend by anyone.
- Testable: Easy to test.
- Go idiomatic: Embrace Go language “standards”.
- Observability: Give the user the ability to expose metrics about what’s happening inside the application execution (white box monitoring).
- Easy to use: Clean design and safe defaults.
The library architecture is simple, and if you are already a Gopher it will be easy to understand how it works because it follows the same ideas of the
The core of the library is based on one component, goresilience.Runner, which is the executor.
Runner is an interface, and all the resiliency components in Goresilience implement it. With this interface, the library uses the power of the decorator pattern and decouples the logic of the different resiliency components. In Go, it is common to see this pattern with the middlewares of the
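To make the interface and the decorator idea concrete, here is a minimal, self-contained sketch. It does not import the library: `Func`, `Runner`, `RunnerFunc` and `withTrace` are stand-ins defined here for illustration, modeled on the description above, and the library’s real definitions may differ.

```go
package main

import (
	"context"
	"fmt"
)

// Func is the unit of work a Runner executes (local stand-in type).
type Func func(ctx context.Context) error

// Runner is the single interface every resilience component implements
// (local stand-in, not the library's own definition).
type Runner interface {
	Run(ctx context.Context, f Func) error
}

// RunnerFunc lets a plain function satisfy Runner, in the spirit of
// http.HandlerFunc.
type RunnerFunc func(ctx context.Context, f Func) error

// Run calls the underlying function.
func (r RunnerFunc) Run(ctx context.Context, f Func) error { return r(ctx, f) }

// plain is the innermost Runner: it just executes f.
var plain Runner = RunnerFunc(func(ctx context.Context, f Func) error { return f(ctx) })

// withTrace decorates next: it records a step before and after delegating.
// It is a hypothetical decorator used only to show the wrapping order.
func withTrace(name string, trace *[]string, next Runner) Runner {
	return RunnerFunc(func(ctx context.Context, f Func) error {
		*trace = append(*trace, name+":before")
		err := next.Run(ctx, f)
		*trace = append(*trace, name+":after")
		return err
	})
}

// runTraced wraps the plain runner twice and returns the recorded steps.
func runTraced() []string {
	var trace []string
	r := withTrace("outer", &trace, withTrace("inner", &trace, plain))
	_ = r.Run(context.Background(), func(ctx context.Context) error {
		trace = append(trace, "work")
		return nil
	})
	return trace
}

func main() {
	fmt.Println(runTraced())
}
```

Each decorator only knows about the Runner it wraps, which is what keeps the components decoupled from one another.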
Let’s explain how the library works with a small example.
As seen in the snippet, we have a service that knows how to grab music information from our users.
But this is not very resilient. What if it gets slow and the dependencies get stuck for more than 1 second while we are waiting for their response? We don’t want to make our users wait for everything just because the user’s favorite genres list is slow. Let’s fix that with a goresilience timeout.
On the first line we create our runner using goresilience with the default configuration; with that runner we can execute any function. Next, we set a fallback for the results and then execute using the runner.
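The timeout-plus-fallback flow described above can be sketched without the library. This is only an illustration under assumptions: `timeoutRunner`, `errTimeout`, `favoriteGenres`, `demo` and the genre values are names invented here, and the real goresilience timeout runner may behave differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the work does not finish in time
// (hypothetical name for this sketch).
var errTimeout = errors.New("timeout while executing")

// Func is the unit of work to execute (local stand-in type).
type Func func(ctx context.Context) error

// timeoutRunner fails the execution if it runs longer than d.
type timeoutRunner struct{ d time.Duration }

// Run executes f in a goroutine and waits for its result or the deadline.
func (t timeoutRunner) Run(ctx context.Context, f Func) error {
	ctx, cancel := context.WithTimeout(ctx, t.d)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() { done <- f(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done(): // deadline hit (or the parent context was cancelled)
		return errTimeout
	}
}

// favoriteGenres asks a simulated dependency for the user's genres and
// falls back to a default list when the runner reports an error.
func favoriteGenres(latency, limit time.Duration) []string {
	runner := timeoutRunner{d: limit}
	var genres []string

	err := runner.Run(context.Background(), func(ctx context.Context) error {
		select {
		case <-time.After(latency): // simulated dependency call
			genres = []string{"jazz", "rock"}
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	if err != nil {
		return []string{"pop"} // fallback value
	}
	return genres
}

// demo calls the service once with a fast dependency and once with a slow one.
func demo() (fast, slow []string) {
	fast = favoriteGenres(time.Millisecond, 200*time.Millisecond)
	slow = favoriteGenres(500*time.Millisecond, 10*time.Millisecond)
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Println("fast dependency:", fast)
	fmt.Println("slow dependency:", slow)
}
```

The caller gets an answer within the time limit in both cases; only the quality of the answer degrades when the dependency is slow.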
As you can see, it is clean and easy to follow. But the best part of the library design comes with the middlewares.
The Runners can be used as middlewares in a chain, so you can combine the different resiliency components in one execution chain.
Following the previous example, imagine that we want to customize the timeout and also add retries with exponential backoff.
We just change the runner code. Now we create a Chain of runner middlewares. This runner execution would be represented with something like this:
And like this chain, you can create the flow that you want with any number and type of Runners (including yours! it just needs to implement interface
As you can see in the example, every Runner from the library comes with two factory methods: New is used in standalone mode and NewMiddleware is used in a chain of runners.
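The chaining idea and the two factory forms can be sketched in a self-contained way. Everything below is a stand-in written for this illustration (`Middleware`, `chain`, `newTimeoutMiddleware`, `newRetryMiddleware`, `newRetry`, `flakyCalls`); it is not the library’s code, and the retry here doubles its wait without jitter to stay short.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Func, Runner, RunnerFunc and Middleware are local stand-ins for this sketch.
type Func func(ctx context.Context) error

type Runner interface {
	Run(ctx context.Context, f Func) error
}

type RunnerFunc func(ctx context.Context, f Func) error

func (r RunnerFunc) Run(ctx context.Context, f Func) error { return r(ctx, f) }

// Middleware wraps a Runner and returns the decorated Runner.
type Middleware func(next Runner) Runner

// chain composes middlewares; the first one becomes the outermost layer.
func chain(mws ...Middleware) Runner {
	var r Runner = RunnerFunc(func(ctx context.Context, f Func) error { return f(ctx) })
	for i := len(mws) - 1; i >= 0; i-- {
		r = mws[i](r)
	}
	return r
}

// newTimeoutMiddleware bounds everything beneath it in the chain by d.
func newTimeoutMiddleware(d time.Duration) Middleware {
	return func(next Runner) Runner {
		return RunnerFunc(func(ctx context.Context, f Func) error {
			ctx, cancel := context.WithTimeout(ctx, d)
			defer cancel()
			done := make(chan error, 1)
			go func() { done <- next.Run(ctx, f) }()
			select {
			case err := <-done:
				return err
			case <-ctx.Done():
				return ctx.Err()
			}
		})
	}
}

// newRetryMiddleware re-runs failed executions up to retries extra times,
// doubling the wait between attempts (no jitter in this sketch).
func newRetryMiddleware(retries int, wait time.Duration) Middleware {
	return func(next Runner) Runner {
		return RunnerFunc(func(ctx context.Context, f Func) error {
			backoff := wait
			for attempt := 0; ; attempt++ {
				err := next.Run(ctx, f)
				if err == nil || attempt == retries {
					return err
				}
				select {
				case <-time.After(backoff):
					backoff *= 2 // exponential backoff
				case <-ctx.Done():
					return ctx.Err()
				}
			}
		})
	}
}

// newRetry is the standalone form: the middleware around a plain runner.
func newRetry(retries int, wait time.Duration) Runner {
	return chain(newRetryMiddleware(retries, wait))
}

// flakyCalls runs a function that fails twice before succeeding and
// returns how many times it was called plus the final error.
func flakyCalls() (int, error) {
	runner := chain(
		newTimeoutMiddleware(time.Second),
		newRetryMiddleware(3, time.Millisecond),
	)
	calls := 0
	err := runner.Run(context.Background(), func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := flakyCalls()
	fmt.Println(calls, err)

	// Standalone use: one retry, then the last error is returned.
	err = newRetry(1, time.Millisecond).Run(context.Background(),
		func(ctx context.Context) error { return errors.New("always fails") })
	fmt.Println(err)
}
```

Because the timeout is the outermost layer in this sketch, its limit covers all retry attempts together; swapping the order would give each attempt its own time budget.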
As a final example, imagine you want to create a more `Hystrix-like` runner chain. You could:
At the time of writing, the library comes with these runners (for now only reactive and static ones, but I want to add adaptive ones).
- Timeout: Times out the execution based on its duration.
- Retry: Retries with exponential backoff and jitter if the execution fails.
- Bulkhead: Controls the concurrency of the execution.
- Circuitbreaker: Circuit breaker pattern based on the metrics of the execution.
- Chaos (failure injection): Injects errors and latency on demand.
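As one example of what such a component does internally, here is a rough sketch of the bulkhead idea: a fixed number of slots limits how many executions run at the same time. It is an assumption-laden illustration, not the library’s implementation: `bulkhead`, `newBulkhead`, `errRejected` and `demo` are invented here, and this version rejects work immediately when full, which is a simplification chosen for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errRejected is returned when no slot is free (hypothetical name).
var errRejected = errors.New("bulkhead full: execution rejected")

// Func is the unit of work to execute (local stand-in type).
type Func func(ctx context.Context) error

// bulkhead allows at most cap(slots) executions at the same time.
type bulkhead struct{ slots chan struct{} }

// newBulkhead creates a bulkhead with the given number of slots.
func newBulkhead(workers int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, workers)}
}

// Run executes f if a slot is free and rejects it immediately otherwise.
func (b *bulkhead) Run(ctx context.Context, f Func) error {
	select {
	case b.slots <- struct{}{}: // take a slot
		defer func() { <-b.slots }() // give it back when f returns
		return f(ctx)
	default:
		return errRejected
	}
}

// demo fills both slots of a 2-slot bulkhead, tries a third call while
// they are held, then tries again after they have been released.
func demo() (rejectedWhenFull, acceptedAfterRelease bool) {
	b := newBulkhead(2)
	started := make(chan struct{})
	release := make(chan struct{})

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = b.Run(context.Background(), func(ctx context.Context) error {
				started <- struct{}{} // signal that a slot is now held
				<-release             // keep holding it until released
				return nil
			})
		}()
	}
	<-started
	<-started

	noop := func(ctx context.Context) error { return nil }
	errFull := b.Run(context.Background(), noop)

	close(release)
	wg.Wait()
	errFree := b.Run(context.Background(), noop)

	return errors.Is(errFull, errRejected), errFree == nil
}

func main() {
	rejected, accepted := demo()
	fmt.Println("rejected while full:", rejected)
	fmt.Println("accepted after release:", accepted)
}
```

Rejecting early keeps a burst of callers from piling up and exhausting the resources assigned to the application.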
Why this library?
Maybe you are wondering: why another circuit breaker library?
Well… this is not a circuit breaker library; it is a toolkit to increase the resilience of your applications, although it implements the circuit breaker pattern.
When I looked at what was out there, I felt the lack of a toolkit that could adapt to my execution flows without using multiple libraries (one for retries, one for circuit breakers…) and creating glue code to wire everything together, cluttering my application with code unrelated to my business logic.
Apart from that, I like the idea of creating chains of middlewares in the http.Handler style, but for execution flows.
So… I implemented…
The way the library is designed, it can be extended and its reliability customized for each particular case, because the responsibility of each resilience component is kept separate.
All the Runners know how to measure their actions using a metrics recorder; at the moment only a Prometheus recorder is implemented. Following the previous example, we could do this to have Prometheus metrics.
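The general shape of a measured runner can be sketched as follows. This is not the library’s recorder API: the `Recorder` interface, `memRecorder` and `newMeasuredMiddleware` are hypothetical names for this example, and an in-memory counter stands in for a Prometheus-backed recorder so the sketch stays dependency-free.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Func, Runner, RunnerFunc and Middleware are local stand-ins for this sketch.
type Func func(ctx context.Context) error

type Runner interface {
	Run(ctx context.Context, f Func) error
}

type RunnerFunc func(ctx context.Context, f Func) error

func (r RunnerFunc) Run(ctx context.Context, f Func) error { return r(ctx, f) }

type Middleware func(next Runner) Runner

// Recorder is a hypothetical metrics backend interface.
type Recorder interface {
	ObserveRun(success bool, d time.Duration)
}

// memRecorder counts results in memory. A Prometheus-backed recorder
// would update counters and a duration histogram here instead.
type memRecorder struct {
	mu         sync.Mutex
	ok, failed int
}

// ObserveRun records one execution result; the duration is ignored here.
func (m *memRecorder) ObserveRun(success bool, _ time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if success {
		m.ok++
	} else {
		m.failed++
	}
}

// newMeasuredMiddleware reports the result and duration of each execution.
func newMeasuredMiddleware(rec Recorder) Middleware {
	return func(next Runner) Runner {
		return RunnerFunc(func(ctx context.Context, f Func) error {
			start := time.Now()
			err := next.Run(ctx, f)
			rec.ObserveRun(err == nil, time.Since(start))
			return err
		})
	}
}

// demo runs two successful executions and one failing execution,
// then returns the recorded counts.
func demo() (ok, failed int) {
	rec := &memRecorder{}
	plain := RunnerFunc(func(ctx context.Context, f Func) error { return f(ctx) })
	runner := newMeasuredMiddleware(rec)(plain)

	ctx := context.Background()
	_ = runner.Run(ctx, func(ctx context.Context) error { return nil })
	_ = runner.Run(ctx, func(ctx context.Context) error { return nil })
	_ = runner.Run(ctx, func(ctx context.Context) error { return errors.New("boom") })

	return rec.ok, rec.failed
}

func main() {
	ok, failed := demo()
	fmt.Println("ok:", ok, "failed:", failed)
}
```

Keeping the recorder behind an interface means the measuring code does not change when the metrics backend does.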
As you can see, it is another
I hope you like the library, and if you do, start using it!
If you want, you can go to GitHub to give feedback about errors and pain points. And if you would like to contribute by improving the current Runners (or adding new ones), you will be welcome!
Thanks for reading.
P.S. A second article has been published, based on the next version of the library, which implements adaptive resilience using concurrency limits.