The Joys of Circuit Breaking

Zendesk’s product suite is built on an evolving service-oriented architecture, comprising hundreds of microservices and a growing web of interdependencies. Operating and orchestrating such a large-scale distributed system can be challenging. This post will explore the importance of circuit breaking and related concepts.

A few weeks ago, we experienced a cascading failure of two of our services, which impacted all downstream consumers. The following chart illustrates the traffic from one service to the other: successful responses are green, failures are red, and rate-limited requests are teal.

HTTP requests to a service broken down by status code

So what exactly is going on here? Turns out, a few different features of a service-based architecture are at play: circuit breakers, rate limits and retries. Let’s look at those and see how they interact with each other.

Our services, let’s call them foo and bar, have a fairly typical setup. Let’s assume we run two instances of the foo service and three instances of the bar service. Requests from the foo service are load-balanced round-robin to the bar service instances.
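
To make the setup concrete, here is a minimal sketch of what the foo-to-bar client could look like with Finagle. The host names, port and label are hypothetical, and imports vary slightly between Finagle versions.

```scala
import com.twitter.finagle.Http
import com.twitter.finagle.http.{Request, Response}
import com.twitter.finagle.loadbalancer.Balancers
import com.twitter.util.Future

// Hypothetical destination: the three bar instances behind one logical client.
// Requests are load-balanced round-robin across the resolved hosts.
val bar = Http.client
  .withLoadBalancer(Balancers.roundRobin())
  .newService("bar-1:9000,bar-2:9000,bar-3:9000", "bar")

// A request from foo to bar is then just a call on the shared service.
val rep: Future[Response] = bar(Request("/widgets"))
```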

Circuit Breakers

Circuit breakers are useful when services are unable to process requests, due to defects or capacity constraints. A circuit breaker will detect failures and take the misbehaving host out of rotation. The load-balancer will no longer route requests to the affected server and instead redistribute traffic to the remaining healthy servers.

Circuit breakers commonly implement one of two policies: a consecutive-failures policy trips the breaker once more than a fixed number of failures in a row are encountered, while a success-rate policy trips it once the success rate dips below a set threshold. Finagle, the Scala client we use, offers both and defaults to 5 consecutive failures.
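
In Finagle these policies are configured on the client. Here is a hedged sketch of the success-rate flavor; the threshold, window and destination are made up for illustration, and defaults differ between versions.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http

// Trip the breaker for a host once its success rate over a 1-minute window
// drops below 95% (illustrative numbers). Without this, Finagle's default
// failure-accrual policy of 5 consecutive failures applies.
val barWithSuccessRateBreaker = Http.client
  .withSessionQualifier
  .successRateFailureAccrual(0.95, 1.minute)
  .newService("bar")
```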

Once the circuit is open, clients will periodically retest the failed connection. Many different backoff strategies are available; by default, Finagle uses a backoff of 5–300 seconds, depending on how many times the connection has failed in the past. Jitter is added to prevent a large number of clients from retrying at the same time, aptly named a thundering herd.
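
The backoff used before retesting a dead host is configurable as well. Wiring up a consecutive-failures policy with a jittered 5–300 second backoff could look roughly like this; package names and the exact `Backoff` factory differ across Finagle versions, so treat it as a sketch.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http
import com.twitter.finagle.liveness.{FailureAccrualFactory, FailureAccrualPolicy}
import com.twitter.finagle.service.Backoff

// Mark a host dead after 5 consecutive failures and retest it after a
// jittered backoff that grows from 5 seconds up to 300 seconds.
val barWithJitteredRetest = Http.client
  .configured(FailureAccrualFactory.Param(() =>
    FailureAccrualPolicy.consecutiveFailures(
      5, Backoff.equalJittered(5.seconds, 300.seconds))))
  .newService("bar")
```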

We have found the defaults to work well for now, but might tweak them in the future.

If we had circuit breakers in place and they were configured correctly, how come we still took down the bar service, you ask? The explanation can be found by looking at the next concept: rate limits. Every server has a maximum load it can handle, and unless you can scale servers elastically, you’d rather fail requests fast when nearing that limit. All our services have rate limits. During the incident, two bar instances were marked as dead, funneling all traffic to the sole survivor. The combined traffic exceeded that server’s rate limit, which caused it to issue “429 Too Many Requests” responses in addition to “500 Internal Server Error”.
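
The mechanics of a rate limit are simple: count requests per window and shed everything above the budget. The sketch below is an illustrative Finagle server-side filter, not our actual implementation; the window size and budget are hypothetical.

```scala
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.Future

// Illustrative fixed-window rate limit: once `maxPerWindow` requests have been
// seen in the current window, answer 429 instead of doing the work.
class RateLimitFilter(maxPerWindow: Int, windowMillis: Long = 1000L)
    extends SimpleFilter[Request, Response] {

  private[this] var windowStart = System.currentTimeMillis()
  private[this] var count = 0

  private[this] def allow(): Boolean = synchronized {
    val now = System.currentTimeMillis()
    if (now - windowStart >= windowMillis) { windowStart = now; count = 0 }
    count += 1
    count <= maxPerWindow
  }

  def apply(req: Request, service: Service[Request, Response]): Future[Response] =
    if (allow()) service(req)
    else {
      val rep = Response()
      rep.status = Status.TooManyRequests // 429 Too Many Requests
      Future.value(rep)
    }
}
```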

Turns out, our circuit breaker treated 429s as successes, which prevented the breaker from triggering even though there was a large number of failures. We have since adjusted our circuit breaker to classify 429s as failures, which will not only prevent the scenario above, but also preempt failures when a host exceeds its rate limit and thereby its capacity.
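
With Finagle, that classification can be expressed as a response classifier. The sketch below shows how such an adjustment could look; the client name is hypothetical, and whether 429s should also be retryable is a separate decision.

```scala
import com.twitter.finagle.Http
import com.twitter.finagle.http.Response
import com.twitter.finagle.http.service.HttpResponseClassifier
import com.twitter.finagle.service.{ReqRep, ResponseClass, ResponseClassifier}
import com.twitter.util.Return

// Treat HTTP 429 as a failure so that failure accrual (the circuit breaker)
// counts it, in addition to the usual 5xx-as-failure classification.
val rateLimitedAsFailure: ResponseClassifier =
  ResponseClassifier.named("429AsFailure") {
    case ReqRep(_, Return(rep: Response)) if rep.statusCode == 429 =>
      ResponseClass.NonRetryableFailure
  }

val barClient = Http.client
  .withResponseClassifier(
    rateLimitedAsFailure.orElse(HttpResponseClassifier.ServerErrorsAsFailures))
  .newService("bar")
```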

Retries

One more fun feature: retries. We use retries in many places to mitigate the impact of temporary failures, with an exponential backoff policy to avoid hammering servers on retries. While this works well for requests made in background tasks, it is pretty silly when handling synchronous requests. Consider this trace:

Awesome to see retries working, more awesome to see exponential backoff in action. But note the scale. We made the requester wait 10 minutes. Nobody waits that long. Failing fast would have been the better option in this case.
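
One way to keep synchronous callers from waiting on a long retry ladder is to cap the total time budget for a call. With Finagle’s MethodBuilder that could look roughly like this; the destination and timeouts are illustrative.

```scala
import com.twitter.conversions.DurationOps._
import com.twitter.finagle.Http

// Each attempt gets a short per-request timeout, and retries must fit within
// a small total budget, so the caller is never kept waiting for minutes.
val getFromBar = Http.client
  .methodBuilder("inet!bar:9000")
  .withTimeoutPerRequest(500.milliseconds)
  .withTimeoutTotal(2.seconds)
  .newService("get_from_bar")
```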

Let’s take a last look at the chart from the beginning. This time with a few annotations.

  1. Regular traffic (green) is load-balanced across all three servers. Occasional 429 responses are returned (teal), which we dutifully ignored.
  2. Servers start failing and return 500 responses (red). Pauses in traffic to the first server indicate a triggered circuit breaker.
  3. Load increases due to retries of 500 responses, adding to the misery of the bar service. Most retries eventually result in a 429 response.
  4. Finally, two servers get marked as dead by the circuit breaker, routing all the remaining traffic to the first server. That server exceeds its rate limit for most of the requests, which prevents the circuit breaker from triggering since 429 responses are treated as successes.

We have since made several adjustments to our services, including measures to reduce requests from the foo service to the bar service, changes to our circuit breakers, and a more nuanced retry policy. We learned a lot from the incident, and it provided a great opportunity to observe the interplay of common strategies for making a service-oriented architecture more resilient.