Showing Metrics With Error Rate Below 1% with Circuit Breaker

Sinem Elif Haseki
Insider Engineering
3 min read · Feb 13, 2023

In a system made up of multiple services, or one that makes remote calls, those services are likely to affect each other. A remote call can always fail, or simply hang while waiting for a response until a timeout is reached. If one microservice fails, or a single remote request ends in an error, the effect can spread to the whole system. As software engineers, we have to minimize the impact of such failures, and the Circuit Breaker design pattern is one way to build resilient microservices.

When a dependency keeps failing, you should stop sending further calls to it. This is done by opening the circuit, much like a circuit breaker in an electrical system: when it detects an anomaly, it cuts the current flow to protect the rest of the circuit.

States of Circuit Breaker

3 Distinct States of the Circuit Breaker Pattern
  • Closed — When everything is normal, the circuit breaker remains in the Closed state and all calls pass through to the services. When the failure rate exceeds a predefined threshold rate, it goes into the Open state.
  • Open — The circuit breaker rejects calls immediately with an error, without executing the underlying function.
  • Half-Open — After a configured timeout period, the circuit switches to a Half-Open state to check whether the underlying problem still exists in the upstream service. If even a single call fails in this state, the breaker trips back to the Open state. If the call succeeds, the circuit breaker resets to the Closed (normal) state.
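The three states above can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than the implementation we run in production: the class name `CircuitBreaker`, the sliding-window size, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # All parameter names and defaults are hypothetical, chosen for illustration.
    def __init__(self, failure_threshold=0.5, window=10,
                 reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failure rate that trips the breaker
        self.window = window                        # number of recent calls considered
        self.reset_timeout = reset_timeout          # seconds to stay Open before probing
        self.clock = clock
        self.state = State.CLOSED
        self.results = []                           # recent outcomes, True means failure
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without executing the function.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let one trial call through.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            # The trial call succeeded: reset to the normal state.
            self.state = State.CLOSED
            self.results = []
        else:
            self._record(False)

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            # A single failed trial call trips the breaker again.
            self._trip()
            return
        self._record(True)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window > self.failure_threshold:
            self._trip()

    def _record(self, failed):
        # Keep only the most recent `window` outcomes.
        self.results = (self.results + [failed])[-self.window:]

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.results = []
```

A caller would wrap each remote request, for example `breaker.call(fetch_stats)`, and handle `CircuitOpenError` by falling back to another source. A production version would also need locking for concurrent callers, which this sketch leaves out.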

How did we utilize Circuit Breaker in our systems?

At Insider, we use different sources to gather and retrieve our event statistics across different channels. On the Email channel, we use our email service provider (ESP) to gather our users' events, and we write these events to our ClickHouse DB via APIs in our internal systems.

For each of these transactions, we send HTTP requests from our microservices to external services. These requests are likely to fail from time to time, and the services can even go down completely for an indefinite amount of time. Our customers shouldn’t be affected by such cases, and we must be able to provide their statistics regardless of failures in either of the external services. To this end, we implemented the Circuit Breaker pattern in our systems.

Our Utilization of Circuit Breaker
  • Here, we first try to fetch our statistics from the ClickHouse DB source. If it’s successful, we cache the response to our storage in Amazon S3 and we are in a Closed state.
  • If the request is not successful, the circuit is Half-open, and we send a request to our ESP. If this request is not successful either, we show our statistics from our cached response to our customers, and the circuit is Open.
    - In parallel, we keep sending requests to ClickHouse DB. If this request is successful, we are back to the Closed state.
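The flow above can be sketched roughly as follows. This is a simplified illustration under several assumptions: `fetch_primary` stands in for the ClickHouse DB request, `fetch_fallback` for the ESP request, and a plain dictionary stands in for the cached response in Amazon S3. All of these names are hypothetical, and the parallel retries against ClickHouse DB are approximated by simply trying the primary source first on every call.

```python
from typing import Callable, Dict, Tuple

Stats = Dict[str, int]


def get_statistics(
    fetch_primary: Callable[[], Stats],    # hypothetical: reads from ClickHouse DB
    fetch_fallback: Callable[[], Stats],   # hypothetical: reads from the ESP
    cache: Dict[str, Stats],               # stand-in for the cached response in S3
    cache_key: str = "email-stats",        # hypothetical cache key
) -> Tuple[Stats, str]:
    """Return the statistics together with the resulting circuit state."""
    try:
        stats = fetch_primary()
    except Exception:
        pass  # primary source failed, continue with the fallback
    else:
        # Closed: the primary source works, so refresh the cached copy.
        cache[cache_key] = stats
        return stats, "closed"

    try:
        # Half-open: the primary source failed, so ask the ESP instead.
        return fetch_fallback(), "half-open"
    except Exception:
        pass  # fallback failed too, continue with the cache

    if cache_key in cache:
        # Open: both sources failed, so serve the last cached response.
        return cache[cache_key], "open"
    raise RuntimeError("no source available and no cached statistics")
```

Because the primary source is tried first on each call, the first successful ClickHouse DB response brings the flow back to the Closed state and refreshes the cache.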

Conclusion

Before the Circuit Breaker implementation, our customers often could not access the statistics page because of external service issues. For almost 3 months, we were not able to show the metrics reliably, with an error rate of 60%. After this release, the error rate on our statistics pages is below 1%, and our customers can see their engagement with their own customers regardless of the status of our external dependencies.

If you liked this article, you can also check this one or our Insider Engineering blog.
