Circuit Breaker Design Pattern 🔩
(Microservice Design Patterns — Part 02)
If your house is powered by electricity, you will find a circuit breaker in it. The power you get from the main grid always comes through that circuit breaker. If the grid behaves abnormally, or if excess power tries to flow in, the circuit breaker trips and the internal power system of the house is protected. This scenario is very similar to the “Circuit Breaker Design Pattern”.
Why do we need the Circuit Breaker Design Pattern? 🧐
In a distributed system, many services interact with each other, so the possibility of a service going down is high. Therefore, it’s better to know the status of a service before sending requests to it.
Circuit Breaker Design Pattern
“The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.” — Martin Fowler
As we discussed above, we use the Circuit Breaker design pattern to stop sending or processing requests when a particular service is down. Assume a consumer sends a request that needs data from multiple services, and one of those services is down for some reason. As a result, you’ll face two issues:
▪️ As the consumer does not know that the service is down, it will keep sending requests to that particular service.
▪️ System and network resources will be exhausted, and performance will drop.
✔️ So, in order to avoid these issues, we can use the Circuit Breaker Design Pattern.
📍 Usually, we say 99.999% guaranteed availability whenever we talk about the availability of services. Consider the calculation below.
24 hours per day and 365 days per year = 8760 hours per year
8760 * 60 = 525600 minutes per year
99.999% uptime means the accepted downtime = 0.001%
525600 * 0.001% = 5.256 minutes
Therefore, one service can be down for 5.256 minutes per year.
✹ This seems okay for a monolithic architecture, but it’s not fine in a microservices architecture. If there were 100 services, the total allowed downtime would add up to 8.76 hours per year ((5.256 * 100) / 60 = 8.76 hours), as the quick calculation below shows. Hence, this is not acceptable.
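To make the arithmetic concrete, here is a tiny Java sketch of the same calculation (the class name is just for illustration):

```java
public class DowntimeMath {
    public static void main(String[] args) {
        double minutesPerYear = 365 * 24 * 60;      // 525,600 minutes per year
        double allowedDowntime = 0.001 / 100;       // 99.999% uptime -> 0.001% downtime
        double perService = minutesPerYear * allowedDowntime;

        System.out.printf("One service: %.3f minutes per year%n", perService);           // 5.256
        System.out.printf("100 services: %.2f hours per year%n", perService * 100 / 60); // ~8.76
    }
}
```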
✹ There are three states in the Circuit Breaker design pattern: Closed, Open, and Half-Open.
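A minimal way to picture these three states in code is something like the enum below (an illustrative sketch, not a complete implementation):

```java
/** The three states a circuit breaker moves between. */
enum CircuitState {
    /** Normal operation: calls pass through, and failures are counted. */
    CLOSED,
    /** Tripped: calls fail immediately without reaching the remote service. */
    OPEN,
    /** After a waiting period: a few trial calls are allowed through.
     *  If they succeed, the breaker closes again; otherwise it re-opens. */
    HALF_OPEN
}
```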
What will happen if a service is down? 🙄
▪️ Let’s understand it with the example below.
▪️ Assume there are 5 different services. After receiving a request, the server allocates one thread to call the particular service. (as shown in figure 1 below)
▪️ Now service ① is a little delayed, so the thread is waiting.
▪️ It is fine if only one thread is waiting for that particular service. But what if this is a high-demand service? 🧐 If it is, it will keep getting more requests, and the threads will be waiting and blocked in the queue. (as shown in figure 2 below)
▪️ So even when the service comes back, your web server may never recover, because while it is processing the queue, more and more requests keep arriving. This kind of scenario eventually destroys your service, and that is what causes a Cascading Failure. (See the sketch just after this list.)
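Here is a small sketch of that situation, assuming a hypothetical server with 10 worker threads and a queue of 50 requests, all stuck behind one slow service:

```java
import java.util.concurrent.*;

public class ThreadExhaustionDemo {
    // Stand-in for the slow service ①: every call blocks for 30 seconds.
    static String callSlowService() throws InterruptedException {
        Thread.sleep(30_000);
        return "response";
    }

    public static void main(String[] args) {
        // Hypothetical limits: 10 worker threads, room for 50 queued requests.
        ExecutorService pool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(50));

        for (int i = 1; i <= 100; i++) {
            try {
                pool.submit(ThreadExhaustionDemo::callSlowService);
            } catch (RejectedExecutionException e) {
                // All workers are blocked on the slow service and the queue is full,
                // so new requests are simply rejected -- the server cannot keep up.
                System.out.println("Request " + i + " rejected: server is saturated");
            }
        }
        pool.shutdownNow();
    }
}
```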
Cascading Failure
Let’s understand this with the example below:
▪️ Here, service A calls B, B calls C, and C calls D (W, X, Y, Z are other services). If service D fails to respond on time, service C has to wait. While C is waiting, service B also has to wait, and while B is waiting, service A has to wait too. This is called a Cascading Failure.
▪️ So even though the failure started somewhere downstream, your own service ends up going offline as well, as the sketch below illustrates.
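A tiny sketch of this chain (with made-up method names and a 2-second delay in D) shows how D’s slowness surfaces all the way up in A:

```java
public class CascadingDelayDemo {
    static String serviceD() throws InterruptedException {
        Thread.sleep(2_000);                // D is slow to respond
        return "data from D";
    }
    static String serviceC() throws InterruptedException { return serviceD(); } // C waits on D
    static String serviceB() throws InterruptedException { return serviceC(); } // B waits on C
    static String serviceA() throws InterruptedException { return serviceB(); } // A waits on B

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        serviceA();
        // The caller of A experiences the full delay, even though A itself is healthy.
        System.out.println("A responded after " + (System.currentTimeMillis() - start) + " ms");
    }
}
```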
Is Cascading Failure Bad?
▪️ Yes! The major issue caused by a cascading failure is that it can bring the entire application or system down.
How to overcome it?
▪️ Let’s take the same scenario as before. Now you have defined a threshold: service A should respond within 200ms.
▪️ So according to the Circuit Breaker pattern, if a large share of requests (say 75% of them) have response times approaching the upper threshold (that is, between 150ms and 200ms), it means the service is slowly starting to fail.
▪️ If responses keep exceeding the maximum threshold (200ms) of the service, the Proxy will identify that the service is not responding properly anymore.
▪️ So it will fail fast on the next request that comes to access service A, which means it breaks the connection between the Proxy and service A. (Now the proxy will not go to service A, so it will not wait anymore. A simplified sketch of such a proxy follows this list.)
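Below is a simplified sketch of what such a proxy could look like. The 200ms response-time threshold and the 75% failure rate come from the scenario above; the window size, the 10-second open period, and all class and field names are assumptions made purely for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Simplified circuit-breaker proxy: trips when too many recent calls are slow or failing. */
public class CircuitBreakerProxy {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private static final long RESPONSE_THRESHOLD_MS = 200;     // service A should answer within 200 ms
    private static final double FAILURE_RATE_THRESHOLD = 0.75; // trip when 75% of recent calls are bad
    private static final int WINDOW_SIZE = 20;                 // assumption: look at the last 20 calls
    private static final long OPEN_WAIT_MS = 10_000;           // assumption: stay open 10 s before a trial

    private State state = State.CLOSED;
    private long openedAt;
    private final Deque<Boolean> recentCalls = new ArrayDeque<>(); // true = slow or failed

    public String call(Supplier<String> serviceA) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < OPEN_WAIT_MS) {
                // Fail fast: do not touch service A, and do not make the consumer wait.
                throw new IllegalStateException("Service A is not available");
            }
            state = State.HALF_OPEN; // waiting period is over, allow a trial call through
        }

        long start = System.currentTimeMillis();
        try {
            String result = serviceA.get();
            boolean slow = System.currentTimeMillis() - start > RESPONSE_THRESHOLD_MS;
            if (state == State.HALF_OPEN && !slow) {
                state = State.CLOSED;      // trial call came back in time: close the breaker
                recentCalls.clear();
            } else {
                record(slow);
            }
            return result;
        } catch (RuntimeException e) {
            record(true);                  // the call itself failed
            throw e;
        }
    }

    private void record(boolean failedOrSlow) {
        if (recentCalls.size() == WINDOW_SIZE) recentCalls.removeFirst();
        recentCalls.addLast(failedOrSlow);
        long bad = recentCalls.stream().filter(b -> b).count();
        boolean overThreshold = recentCalls.size() >= 5
                && (double) bad / recentCalls.size() >= FAILURE_RATE_THRESHOLD;
        if ((state == State.HALF_OPEN && failedOrSlow) || (state == State.CLOSED && overThreshold)) {
            state = State.OPEN;            // trip: break the connection to service A
            openedAt = System.currentTimeMillis();
        }
    }
}
```

In a real system you would normally rely on a battle-tested library such as Resilience4j or Hystrix rather than hand-rolling this logic.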
What did the Circuit Breaker do here?
▪️ Assume there’s a 30-second timeout, and every request tries to hit service A without considering that it has failed. As a result, all the requests coming from consumers will wait 30 seconds and then fail. Also, during those 30 seconds, the remaining requests that come to consume service A will pile up in the waiting queue.
▪️ So if service A is failing more often than the given threshold, the circuit breaker will not even try to hit it; it will fail fast and tell the consumers “Service A is not available”.
📝 Note: When the response time is back within its normal threshold, the circuit breaker will close again and new traffic will pass through. (This is where the Half-Open state comes in: after a waiting period, a few trial requests are let through, and if they succeed, the breaker closes again.)
Does that mean we fail certain consumers’ requests? 🙄
YES 😰 Because if we let them go through to a service that is down, the whole system could fail.
Summary 😎
📝 By using the Circuit Breaker design pattern, we can handle faults in an elegant way. The pattern does not make consumers wait on internal service failures, and that gives them a better user experience.