Circuit breaker pattern — What and why?
A big difference between in-memory calls and remote calls is that remote calls can hang, possibly leading to cascading failures in your distributed system. In this article we are going to break down the so-called circuit breaker pattern, which prevents such issues by making sure an erratic resource is not called.
Reasons to use the pattern
Further down in this article you’ll find a scenario where the pattern is useful, but let’s start by listing some benefits and reasons to use the circuit breaker pattern:
- Fail fast (better response times and no hanging requests).
- No unnecessary burden on the system/service that is failing.
- Perfect for implementing fallback logic.
- Good way to track service health for dashboards and alarms.
- Thanks to libraries it is relatively easy to implement.
What exactly is it?
The pattern is inspired by the electrical circuit breaker: a switch designed to protect electrical equipment from excess current.
A circuit breaker’s three states
- Closed: Allowing current to pass through.
- Half Open: Allowing current to pass through (ready to quickly go back into the “Open” state).
- Open: Not allowing current to pass through.
In the context of software, you add a circuit breaker in front of, for example, an external system. The circuit breaker monitors request failures and, based on conditions you set, switches between its three states.
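As a minimal sketch of that idea in TypeScript (the names here are illustrative and not tied to any specific library), the breaker guards every outgoing call and only lets it through when its state allows:

```typescript
// The circuit breaker sits between our code and the remote dependency.
type CircuitState = "CLOSED" | "HALF_OPEN" | "OPEN";

async function callThroughBreaker<T>(
  state: CircuitState,
  remoteCall: () => Promise<T>
): Promise<T> {
  if (state === "OPEN") {
    // Fail fast instead of letting the request hang until it times out.
    throw new Error("Circuit is open: not calling the remote service");
  }
  // In "Closed" and "Half Open" the request is allowed through.
  // Its outcome is what drives the state transitions (see "How does it work?" below).
  return remoteCall();
}
```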
Scenario
Let’s paint a picture: a payment service is currently overburdened. It cannot keep up with incoming requests, so they are added to a request queue until they are eventually dropped, resulting in timeouts for consuming services.
Apart from the obvious fact that long response times and timeouts are something we want to avoid, continuing to send requests (which are inevitably just added to the request queue) ties up resources both in the overburdened system and in our own service due to hanging requests. In a distributed systems architecture this problem can easily cascade to other services, with the following consequences:
- The overburdened system is not given a chance to recover and might end up in an unrecoverable state, becoming unresponsive or unavailable.
- Our service (and others that depend on it) starts seeing errors unrelated to the actual problem, in the worst case even crashing.
So what can we do?
By adding a circuit breaker that controls the influx of requests, we allow the payment service to recover while making sure that the problems do not cascade to other services. We can also fail fast or even implement fallback logic, greatly improving the end-user experience.
How does it work?
When the payment service starts failing and behaving unexpectedly to the point of exceeding a failure threshold that we set, the circuit breaker is tripped and enters the “Open” state. Once that happens we stop sending requests to that specific service. This gives us more options, such as failing fast or using a fallback. In this case we could use another payment service until the original one has recovered.
After a period of time that we specify, the circuit breaker enters the “Half Open” state, in which it lets a request through to the original service again. If that request succeeds, the circuit breaker resets to the “Closed” state; if not, it immediately re-enters the “Open” state.
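To make the transitions concrete, here is a minimal TypeScript sketch of such a breaker. The names (failureThreshold, resetTimeoutMs) and numbers are illustrative assumptions, not taken from any particular library:

```typescript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker<T> {
  private state: State = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly action: () => Promise<T>,   // the remote call we protect
    private readonly failureThreshold = 5,       // consecutive failures before tripping
    private readonly resetTimeoutMs = 30_000     // how long to stay "Open"
  ) {}

  async fire(): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt >= this.resetTimeoutMs) {
        // Time to probe the service again with a single trial request.
        this.state = "HALF_OPEN";
      } else {
        throw new Error("Circuit breaker is open: failing fast");
      }
    }

    try {
      const result = await this.action();
      // A successful call resets the breaker to "Closed".
      this.state = "CLOSED";
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      // A failed trial request, or too many consecutive failures, trips the breaker.
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Wrapping the payment call would then look something like new CircuitBreaker(() => paymentClient.charge(order)).fire(), where paymentClient.charge stands in for whichever client method you actually use.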
Things worth keeping in mind
If you decide to implement the pattern in your code, here are some things that you might want to consider:
- No need to re-invent the wheel. There are probably libraries for this in your language; Opossum, for example, is a great npm package for handling it (see the sketch after this list).
- What errors do you actually want to count towards the error threshold? For example, you might not want to trip the circuit breaker because of 404 errors.
- While the general concept translates well to software, the terms “closed” and “open” mean the opposite of what many people intuitively expect: a closed circuit lets requests through, while an open circuit blocks them. This is worth keeping in mind for error responses and logging/metrics, to make sure no misunderstandings occur.
- Consider where the circuit breaker is best located. For example, do you need independent circuit breakers for specific endpoints of an API, or do you want all requests to that API to share the same circuit breaker? Note that the latter means that if the API has one or more frequently failing endpoints, the circuit breaker might be tripped even though other endpoints are working properly.
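As an example of how little code this requires, here is a sketch using Opossum. The options (timeout, errorThresholdPercentage, resetTimeout) and events shown here should match Opossum’s documented API, but verify them against the version you use; the chargeCustomer function and URL are made up for the sake of the example:

```typescript
import CircuitBreaker from "opossum";

// Hypothetical client call for the payment service in the scenario above.
async function chargeCustomer(orderId: string): Promise<string> {
  const res = await fetch(`https://payments.example.com/charge/${orderId}`);
  if (!res.ok) throw new Error(`Payment failed with status ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(chargeCustomer, {
  timeout: 3000,                 // a call taking longer than 3 s counts as a failure
  errorThresholdPercentage: 50,  // trip when half of the requests fail
  resetTimeout: 30000,           // after 30 s, move to "Half Open" and try again
});

// Fallback logic, e.g. a friendly message, a queue, or another provider.
breaker.fallback(() => "Payment temporarily unavailable, please try again later");

// Events are useful for dashboards, alarms and logging.
breaker.on("open", () => console.warn("Payment circuit opened"));
breaker.on("halfOpen", () => console.info("Payment circuit half open, trialling a request"));
breaker.on("close", () => console.info("Payment circuit closed again"));

// All calls now go through the breaker instead of hitting the service directly.
breaker.fire("order-123").then(console.log).catch(console.error);
```

If you want independent circuit breakers per endpoint, as discussed in the last point above, you would create one breaker instance per wrapped function instead of sharing a single one.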
Conclusion
Hopefully this article has made you realize why the circuit breaker pattern is such a powerful tool in a service-oriented architecture. A common truth is that if something can fail, it eventually will, and it is always good practice to help other teams’ services recover by relieving back-pressure.
Given how easy it is to implement and the benefits it brings, the circuit breaker pattern is truly a win-win for everyone.