
Preventing Systemic Failure: Circuit Breaking — What it is and How it Works, Part 2

Yu-Han Lin
Published in Glasnostic · Mar 28, 2019

This is the second of a two-part series on circuit breaking. In part one, we covered the pattern and how it is approached differently by developers and operators. In this post, we’ll explore its typical use cases and how it is implemented in modern service middleware.

Typical Microservice Use Cases

Developers and operators typically use circuit breaking for different purposes. Being primarily concerned with protecting their code, developers look to circuit breaking as a way to compensate for upstream failures. Operators, on the other hand, are responsible for the stability and availability of the entire service landscape and thus use circuit breaking primarily to monitor and remediate.

Developers: Compensating for Upstream Failures

Besides merely “breaking the circuit” and moving on, developers care mainly about three benefits of circuit breakers. First, because circuit breakers allow developers to deal with service failures, clients can adapt dynamically and gracefully to changes in service availability over time. Second, circuit breakers that share their state across a service architecture provide network effects that can significantly improve responsiveness in the face of failures. Third, circuit breakers coupled with intelligent routing and load balancing can be used to automatically substitute healthy service instances for failed ones, thus promoting self-healing.

Operators: Monitoring and Remediation

Circuit breakers are a great way for operations teams to spot trouble before it cascades into bigger problems. When a circuit breaker is tripped, operators might decide to divert some or most traffic away from a service while the responsible engineering team investigates the relevant logs and metrics. Because diverting traffic or shedding load in this way relieves systems of acute stress, this is the most popular use of circuit breaking among operators.

Another, closely related, variant is to define circuit breakers as predetermined breaking points in the architecture. Ideally, such breakers are set up in places that are known to bear load in direct proportion to critical systems. In essence, they act as canaries in the architecture that, again, lead to remediation through load shedding.

Advanced Circuit Breaking

As circuit breakers evolved from client-side libraries to middleware, shared-state breakers and platforms, their definition became increasingly diverse. The developer and operator use cases diverged, and definitions came to involve an ever larger number of parameters. Circuit breaking as provided today by cloud traffic controllers such as Glasnostic can be applied to traffic links defined by arbitrary sets of endpoints and combined with a number of complementary patterns such as timeouts, backpressure or brownouts. These combinations of patterns are then refined over time in conjunction with parameters such as request rate, concurrency, bandwidth or latency.

Circuit Breaking with Hystrix

Netflix’s Hystrix was the first service middleware dedicated exclusively to circuit breaking. When it was released to the public in 2012 to provide microservice architectures with “greater tolerance of latency and failure,” it had already been used extensively at Netflix for over a year. Hystrix continued to serve as a fundamental part of Netflix’s service middleware until it entered maintenance mode in late 2018, marking, according to the project, a “shift [in focus] towards more adaptive implementations that react to an application’s real-time performance rather than pre-configured settings.”

Hystrix is a Java library that developers can use to wrap service calls with circuit breaking logic. It is based on thresholds and can fail calls immediately and perform fallback logic as shown in part 1. Besides providing timeouts and concurrency limits, it can also publish metrics to monitoring tools. Finally, when used in conjunction with the Archaius library, it can also support dynamic configuration changes.
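As a minimal sketch of what such a wrapped call looks like, the hypothetical command below protects a call to a “recommendations” service; the class name, URL and threshold values are illustrative, not taken from any real system:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Hypothetical command that wraps a call to a "recommendations" service.
public class RecommendationsCommand extends HystrixCommand<String> {

    private final String userId;

    public RecommendationsCommand(String userId) {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("Recommendations"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                .withExecutionTimeoutInMilliseconds(500)             // fail the call after 500 ms
                .withCircuitBreakerRequestVolumeThreshold(20)        // need 20 calls per window before tripping
                .withCircuitBreakerErrorThresholdPercentage(50)      // trip when 50% or more of calls fail
                .withCircuitBreakerSleepWindowInMilliseconds(5000)));// allow a probe call after 5 s
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // The protected call. Exceptions and timeouts here feed the
        // breaker's rolling failure statistics.
        return fetchOverHttp("http://recommendations/users/" + userId);
    }

    @Override
    protected String getFallback() {
        // Invoked when the call fails, times out or the breaker is open.
        return "[]"; // e.g. an empty recommendation list
    }

    private String fetchOverHttp(String url) {
        // Stand-in for a real HTTP client call.
        return "[\"placeholder\"]";
    }
}

// Usage: String result = new RecommendationsCommand("42").execute();
```

Calls made through execute() return either the result of run() or the fallback, and they fail fast once the breaker has opened.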

Figure 1. Hystrix dashboard showing call volumes, various interaction metrics and breaker status. Circles represent call volumes and sparklines show how volumes have evolved over the past two minutes.

Although Hystrix supported refinements such as combining circuit breaking with timeouts and concurrency pools, it ultimately proved not flexible enough for the increasingly dynamic interaction behaviors in modern organic architectures. The ability to set thresholds and client-side concurrency pools gives service developers sufficient control to isolate their code from upstream failures, but ceases to be useful where systemic, operational concerns gain importance. As such, the decline of Hystrix is a direct consequence of the limitations of circuit breaking as a developer pattern.

Circuit Breaking in Service Meshes

Istio

Istio is a service mesh that supports circuit breaking based on connection pool, requests per connection, and failure detection parameters. It does this with the help of so-called “destination rules”, which tell each Envoy sidecar proxy which policy to apply to traffic, and how. This step happens after routing has occurred, which is not always ideal. Destination rules may specify limits on load balancing, the connection pool size, and the parameters for what ends up qualifying as an “outlier” so that unhealthy hosts can be removed from the load balancing pool. This type of circuit breaking is great at insulating clients from service failures, but because destination rules are always applied cluster-wide, it lacks a way of limiting breakers to only a subset of clients. To achieve combinations of circuit breakers with e.g. quality-of-service patterns, multiple client-specific routing rules must be created, each with its own destination rule.

Figure 2. Sample Istio circuit breaker configuration, ready to be passed to the kubectl command. This configuration specifies circuit breaking based on both connection pool and failure detection (“outlierDetection”) parameters. On the connection side, more than 10 open TCP connections, more than 20 pending HTTP requests or more than three requests per connection will trip the breaker. On the failure detection side, any two consecutive errors occurring within two seconds of each other will cause the target host to be ejected from the load balancing pool for a minimum of 30 seconds.
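The screenshot itself is not reproduced here, but a destination rule along the following lines would express the limits described in the caption; the rule name and target host are placeholders:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-circuit-breaker     # hypothetical rule name
spec:
  host: reviews                     # hypothetical target service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 10            # trip at more than 10 open TCP connections
      http:
        http1MaxPendingRequests: 20   # trip at more than 20 pending HTTP requests
        maxRequestsPerConnection: 3   # trip at more than 3 requests per connection
    outlierDetection:
      consecutiveErrors: 2            # two consecutive errors ...
      interval: 2s                    # ... within a 2-second scan interval ...
      baseEjectionTime: 30s           # ... eject the host for at least 30 seconds
      maxEjectionPercent: 100         # any share of the hosts may be ejected
```

Because the rule is applied with kubectl to the destination host as a whole, it affects all of that host’s clients, which is exactly the cluster-wide behavior noted above.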

Linkerd

Circuit breaking in Linkerd is somewhat complicated, reflecting the generally conflicted state of circuit breaking as a developer pattern. While Linkerd 1 continues to support robust circuit breaking courtesy of the original Finagle code, Linkerd 2, a complete, lightweight rewrite in Rust and Go, does not support it directly. Instead, it offers related functionality in its data-plane proxy, which originated in the Conduit project that has since been merged into Linkerd 2, albeit initially without support for retries and timeouts.

To implement retry and timeout support, Linkerd 2.1 introduced the concept of “service profiles,” custom Kubernetes resources that provide Linkerd with extra information about a service. Using service profiles, operators can now define routes as being “retryable” or as having a specific timeout. While this provides some of the essential related functionality, full circuit breaking in Linkerd 2 is still a ways off.
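As a rough sketch, a service profile that marks one route as retryable and gives it a timeout might look as follows; the service, namespace, route and values are hypothetical:

```yaml
apiVersion: linkerd.io/v1alpha1
kind: ServiceProfile
metadata:
  name: books.default.svc.cluster.local   # hypothetical service
  namespace: default
spec:
  routes:
  - name: GET /books/{id}
    condition:
      method: GET
      pathRegex: /books/[^/]*
    isRetryable: true   # failed calls on this route may be retried
    timeout: 300ms      # give up on the call after 300 ms
```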

Circuit Breaking with Glasnostic

Glasnostic is a cloud traffic controller that enables operations teams to control the complex emergent behaviors that their organic architectures exhibit. This enables companies to run diverse architectures in an agile manner, without costly revalidation on every change. As a result, development and operations are ideally positioned to adapt to their company’s rapidly changing business needs.

Unlike Hystrix and service meshes, which implement circuit breaking from a developer’s perspective, Glasnostic implements circuit breaking as an operational pattern, designed for operators.

Glasnostic’s control plane provides high-level visibility of large-scale, complex and dynamic interaction behaviors that enables operators to remediate issues quickly. Operators are able to apply tried-and-tested, predictable operational patterns such as circuit breaking by exerting fine-grained control over interactions across arbitrary sets of service endpoints. Because operational patterns may be readily combined to form highly refined, compound patterns, circuit breakers can likewise be easily refined by combining them with e.g. backpressure based on request rate, bandwidth or concurrency.

For example, figure 3 shows a channel set up to monitor and control intermittently recurring latency spikes across a set of otherwise unrelated services. Without looking for a putative root cause, operators decide to first control the situation by circuit-breaking the more extreme long-running requests. They achieve this by first defining a new channel covering the services in question, as well as any potential clients, and then imposing a suitable latency limit on the interactions governed by the channel. This allows the operations team to control the situation until engineering is able to provide a fix.

Figure 3. Glasnostic console showing a channel (1) set up to monitor and control intermittently recurring latency spikes across a series of services. Configuring the circuit breaker to trip when latencies reach 1,800 ms (2) serves as an early warning system to the operations team while at the same time controlling the situation. Once the engineering teams responsible for the services have identified a fix, the circuit breaker may be removed.

Of course, initial policies are often just that: first attempts to remediate a situation, which need to remain open to adjustment. Adjusting or complementing policies in Glasnostic is both fast and easy. For instance, the operations team may find that the initial channel policy can be refined further by circuit-breaking non-mission-critical clients first, leaving mission-critical clients unaffected for as long as possible. To accomplish this, they could define a refinement channel covering only non-mission-critical clients and add a policy that circuit-breaks them based on connection and request allowances. Figure 4 shows such an auxiliary refinement channel, set up with both concurrency and request policies so that non-mission-critical clients are circuit-broken before the original latency breaker trips, thus increasing availability for mission-critical systems.

Figure 4. Glasnostic console showing a refinement channel (1) for the channel set up previously (2), configured to break non-mission-critical clients based on connection pool and request rate parameters first, thus delaying circuit breaking for critical clients (3). The request breaker is currently active.

Unlike the circuit breakers typically offered by service middleware such as API gateways and service meshes, Glasnostic supports circuit breaking as an operational pattern, applied between arbitrary sets of endpoints and in real time rather than through static deployment descriptors. This allows operators to specify circuit breakers that are not just tactical adjustments to local interactions but steps towards improving stability and availability that are meaningful for the entire service landscape. For instance, while Istio implements circuit breaking based on destination rules, Glasnostic can apply circuit breaking to any set of interactions, clients or services, past, present or future. As a result, operators can set separate policies for different traffic classes.

Summary

Circuit breaking is a fundamental pattern designed to minimize the impact of failures, to prevent them from cascading and compounding, and to ensure end-to-end performance. Because it can be leveraged both as a developer pattern and as an operational pattern, it can be applied broadly, which often causes confusion.

As a developer pattern, it is predominantly used as a fairly rudimentary compensation strategy that is difficult to refine without considering each specific call. As an operational pattern, on the other hand, circuit breaking aims to relieve distressed systems of pressure in order to manage both systemic stability and performance. Its behavior is often refined further by combining it with other stability patterns such as timeouts or backpressure. Operational circuit breakers used to depend on separately deployed service middleware such as API gateways or service meshes. However, because service meshes address primarily developer concerns, their support for circuit breaking as an operational pattern is limited and inconsistent across implementations. As a result, operational circuit breaking is best done using a cloud traffic controller like Glasnostic.

Glasnostic is a cloud traffic controller that enables operations teams to control the complex emergent behaviors that their organic architectures exhibit. The Glasnostic control plane allows operators to apply a number of operational patterns such as circuit breaking to arbitrary sets of interactions, and to refine operational goals by combining patterns. This lets developers and operators deploy to production faster and helps the business adapt to ever faster-changing needs.

Originally published at glasnostic.com.
