Circuit Breaker — A scalable client side solution to connect to non-scalable services
Here at Xandr, our front-end applications connect to a wide variety of services to render our pages. Most new services sit behind GraphQL for service orchestration, but in some cases we have to call the endpoint services directly. Some of these services serve data with real business value but lack the scalability of our core services. Even so, there are scenarios in which they have to handle bursts of requests at a rate similar to that of the core services. When a new product feature, one aspect of which was implemented as a periodic background request to one such non-scalable service, was rolled out for testing, the service slowed down under the excessive load and eventually started responding with errors. As multiple front-end clients kept sending requests, the situation worsened and finally brought down the service.
A post-mortem of the incident revealed that the front-end was in effect DDoS’ing¹ the non-scalable service on which it depends for its data. While the service was arguably at fault for not protecting its resources, the front-end could have avoided escalating the issue by exhibiting “Service Sympathy”².
A few criteria informed the solution:
- Client should intelligently adapt to service failures.
- Client should eventually be able to call the service if/when it has recovered.
- Solution should be reusable for any other client-server communication, regardless of whether the server is scalable or not.
- Solution should be implementable on the client side.
Let's look at the approaches we considered.
One way to solve this problem is to stop calling the service altogether as soon as we get an error response from it. This is simple and effective, but it is static: we have no way to automatically resume the calls once the service recovers from the heavy load and starts serving requests again.
Another option is to make the service scalable, either horizontally or vertically. Weighing the cost against the benefit within our timeline, we decided against it. We needed a client-side solution generic enough to be used for any other such non-scalable service, or even as an extra protection mechanism when calling our GraphQL or other orchestration services.
While doing some research, I came across a stability design pattern that seemed to be a perfect fit for this problem. This pattern is called Circuit Breaker³ (CB).
Original function that calls a service every 5 seconds
Same but wrapped with Circuit Breaker
The circuit breaker function wraps the original function. When it is called, it invokes the original function only if its state is CONNECTED, and returns that function's response. If the state is DISCONNECTED, it simply returns the previously cached error response from the original function.
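As a rough illustration of the wrapping described above, here is a minimal sketch in its simplest form, where the breaker never reconnects. The names `getNotifications` and `circuitBreaker`, the URL, and the assumption that a response carries a fetch-like `ok` flag are all hypothetical and chosen only for this example.

```javascript
// Hypothetical stand-in for the original function; the URL is illustrative.
async function getNotifications() {
  return fetch("/api/notifications");
}

// Wraps fn so that it is only invoked while the breaker is CONNECTED.
// Assumes fn resolves to an object with a boolean `ok` flag.
function circuitBreaker(fn) {
  let state = "CONNECTED";
  let cachedError = null;

  return async function wrapped(...args) {
    if (state === "DISCONNECTED") {
      return cachedError; // the service is not called at all
    }
    const response = await fn(...args);
    if (!response.ok) {
      state = "DISCONNECTED";
      cachedError = response;
    }
    return response;
  };
}

// Original polling:  setInterval(getNotifications, 5000);
// Wrapped polling:   setInterval(guardedGetNotifications, 5000);
const guardedGetNotifications = circuitBreaker(getNotifications);
```

The caller's polling loop does not change; only the function it invokes is swapped for the wrapped one.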
We will look at how it behaves internally.
The Naive Solution with Circuit Breaker
The circuit breaker function stops invoking the client function once its state changes to DISCONNECTED, which happens when it receives a failure response. Once its state is DISCONNECTED, it never changes back to CONNECTED. This is the naive solution referred to above.
Practical Solution with Circuit Breaker
To turn the above into a practical solution, we need some way to reset the state back to CONNECTED once it is DISCONNECTED. One way to do that is to sleep for a few seconds and, on waking up, reset the state back to CONNECTED⁴. Any call to the CB while it sleeps is ignored.
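The fixed-sleep variant could be sketched as follows. This is only one way to do it: instead of a timer, it records a wake-up time and checks it on the next call. The names `circuitBreakerWithSleep`, `sleepMs` and `now` are assumptions for this example, as is the fetch-like `ok` flag on the response.

```javascript
// Sketch of a breaker that reconnects after a fixed sleep.
// `now` is injectable only so the clock can be controlled in tests.
function circuitBreakerWithSleep(fn, sleepMs = 5000, now = Date.now) {
  let state = "CONNECTED";
  let wakeAt = 0;
  let cachedError = null;

  return async function wrapped(...args) {
    if (state === "DISCONNECTED") {
      if (now() < wakeAt) {
        return cachedError; // still sleeping: ignore the call
      }
      state = "CONNECTED"; // woke up: try the service again
    }
    const response = await fn(...args);
    if (!response.ok) {
      state = "DISCONNECTED";
      cachedError = response;
      wakeAt = now() + sleepMs;
    }
    return response;
  };
}
```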
This sounds promising, but is there a better way? What if we call the client function again after the sleep and it still fails? Also, when the service is being hit by such calls from many client browsers, a static sleep timeout does not help much: there is a high probability that all these clients wake up together and the service receives another burst of requests at the same time. While this is better than the static solution, a fixed timeout is limiting and might not give the service enough room in practice to recover from load.
This is where Exponential Backoff (EB) with Jitter⁵ comes to the rescue. It addresses the specific problem that occurs when multiple clients start sending requests to the server at the same time. The paper⁵ tackles this in the context of Optimistic Concurrency Control, where multiple service calls were trying to update a row in the DB, but we can take the concept and reuse it for our use case.
The EB algorithm solves two problems:
- How to avoid having every client wake up at the same time to send a request.
- How long the CB should sleep in the DISCONNECTED state before trying again, so that retries are neither too frequent for the service to recover nor too slow to notice that it has.
EB solves the first problem with some form of random jitter: a random number of seconds is added to the sleep time, which makes it less probable that any two machines wake up at the same time to make a request. It solves the second problem by increasing the wait time exponentially⁶ from its current value each time the service fails again.
Finally, we also want to tolerate a few repeated failures, up to some maximum number, before we kick off the EB algorithm. This way the CB does not have to sleep at all if the service errors are transient and the service recovers after the first few failures.
Considering all three requirements together, the pseudocode looks like this:
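To make the arithmetic concrete, a minimal sketch of the sleep calculation might look like this. The function name `backoffDelayMs` and the parameters `baseMs`, `capMs` and `jitterMs` (with their defaults) are assumptions for illustration; capping the exponential part is also an addition here, so that the sleep cannot grow without bound.

```javascript
// Sleep time after `failedAttempts` consecutive failed wake-ups:
// an exponentially growing delay (capped) plus a random jitter.
// `random` is injectable only so the jitter can be fixed in tests.
function backoffDelayMs(
  failedAttempts,
  { baseMs = 1000, capMs = 60000, jitterMs = 1000, random = Math.random } = {}
) {
  const exponential = Math.min(capMs, baseMs * 2 ** failedAttempts);
  const jitter = random() * jitterMs; // somewhere in [0, jitterMs)
  return exponential + jitter;
}
```

With these defaults the exponential part grows as 1s, 2s, 4s, 8s and so on until it reaches the 60s cap, while the jitter spreads clients that failed at the same moment across up to one extra second.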
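As one possible rendering in JavaScript, here is a hedged sketch that combines the failure threshold, the exponentially growing sleep and the jitter. Every name and default value (`createCircuitBreaker`, `maxFailures`, `baseMs`, `capMs`, `jitterMs`, the `ok` flag on responses) is an assumption made for illustration and not our production code; `now` and `random` are parameters only so that the clock and the jitter can be controlled in tests.

```javascript
// Sketch only: names, defaults and the response shape are assumptions.
function createCircuitBreaker(fn, options = {}) {
  const {
    maxFailures = 3,   // consecutive failures tolerated before sleeping
    baseMs = 1000,     // first sleep duration
    capMs = 60000,     // upper bound on the exponential part
    jitterMs = 1000,   // maximum random extra sleep
    now = Date.now,
    random = Math.random,
  } = options;

  let state = "CONNECTED";
  let failures = 0;    // consecutive failures since the last success
  let wakeAt = 0;
  let cachedError = null;

  return async function wrapped(...args) {
    if (state === "DISCONNECTED") {
      if (now() < wakeAt) return cachedError; // sleeping: skip the service
      state = "CONNECTED";                    // woke up: allow one trial call
    }

    let response;
    try {
      response = await fn(...args);
    } catch (error) {
      response = { ok: false, error };        // treat a rejection as a failure
    }

    if (response.ok) {
      failures = 0;                           // recovered: reset everything
      return response;
    }

    cachedError = response;
    failures += 1;
    if (failures >= maxFailures) {
      // Each failed wake-up doubles the sleep, starting from baseMs.
      const exponent = failures - maxFailures;
      const delay = Math.min(capMs, baseMs * 2 ** exponent) + random() * jitterMs;
      state = "DISCONNECTED";
      wakeAt = now() + delay;
    }
    return response;
  };
}
```

One design choice worth calling out: the failure count is not reset on wake-up, so a single failed trial call sends the breaker straight back to sleep with a doubled delay. Only a successful response restores the full allowance of `maxFailures`.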
Well, that’s it!
Below you can check out a complete implementation (close to what we have in production) in JS.
We saw how the front-end can avoid DDoS’ing its internal services by using a Circuit Breaker (CB), and how, in the event of service failure, it can dynamically adapt the frequency of further requests using EB.
This pattern is most effective when you need to call non-scalable services within your organization that still provide some business value.
As a final note, since client-side code is generally prone to tampering, the solution described in this article should in the long run be complemented with DDoS prevention techniques on the service side.
1. DDoS: Distributed Denial-of-Service.
2. Sympathetic in terms of how much load it puts on the service. A nod to the analogous “Mechanical Sympathy”: https://mechanical-sympathy.blogspot.com/
3. https://martinfowler.com/bliki/CircuitBreaker.html. The original design is proposed for the middleware layer, but the concept can also be implemented on the client side.
4. Assume that the Circuit Breaker rejects any call made to it while it is sleeping.