London police exerting backpressure against Beatles fans, 1965.

Preventing Systemic Failure: Backpressure — What It Is and How It Works

Pei-Ming Wu
Published in Glasnostic
5 min read · Mar 19, 2019


In this blog post, we’ll take a deep-dive into understanding backpressure as an operational pattern, and how it can be applied to a service landscape that is controlled by a cloud traffic controller like Glasnostic.

A cloud traffic controller lets operators detect and remediate the complex, emergent interaction behaviors that dynamic service landscapes exhibit.

What Is Backpressure?

In the world of software, “backpressure” refers to actions taken by systems to “push back” against downstream forces. As such, it is a defensive action that a system takes unilaterally when it is under duress, or when the aggregate call pattern is too bursty or exhibits too many spikes.

As an operational pattern, backpressure helps even out traffic characteristics to protect systems from overload. Without a mechanism like backpressure, a multitude of adverse effects can occur, from intermittent systemic failures to system collapse.

There are many prominent examples of backpressure in software systems. The most notable is a web server responding with a status of 503 (Service Unavailable). When this happens, the server is, in essence, exerting backpressure against incoming requests. Similarly, actions taken to mitigate a denial-of-service (DoS) attack consist primarily of applying backpressure to incoming requests.
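To make the 503 example concrete, here is a minimal sketch of load shedding in Python. It is not taken from any particular web server: the `LoadShedder` class, its `max_in_flight` limit and the one-second `Retry-After` value are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class LoadShedder:
    """Hypothetical admission check: answer 503 once too many requests are in flight."""

    max_in_flight: int            # assumed capacity limit, not a real server setting
    retry_after_seconds: int = 1  # assumed hint sent back to the client
    in_flight: int = 0

    def try_admit(self) -> Tuple[int, Dict[str, str]]:
        """Return (status, headers) for a newly arriving request."""
        if self.in_flight >= self.max_in_flight:
            # Push back: refuse the request and suggest when to retry.
            return 503, {"Retry-After": str(self.retry_after_seconds)}
        self.in_flight += 1
        return 200, {}

    def finish(self) -> None:
        """Mark one admitted request as completed, freeing capacity."""
        if self.in_flight > 0:
            self.in_flight -= 1


shedder = LoadShedder(max_in_flight=2)
statuses = [shedder.try_admit()[0] for _ in range(3)]  # third request is refused
shedder.finish()                                       # one request completes
status_after_release = shedder.try_admit()[0]          # capacity is available again
```

The check is deliberately unilateral: the service decides on its own state alone, which is what makes this form of backpressure simple but also crude.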

Backpressure can also be exerted against certain classes of requests. For instance, operators of applications offered under a “freemium” business model often discover that their large number of free users tends to drown out their paying premium users, and thus need to exert backpressure against those free users. In other cases, less essential services may compound the demand on business-critical systems and therefore need to be pushed back against.
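One way to picture class-based backpressure is a shared pool of request slots in which free users may only ever hold a fraction. The sketch below assumes exactly two tiers, and the names `TieredAdmission`, `total_slots` and `free_slots` are invented for illustration.

```python
class TieredAdmission:
    """Hypothetical slot allocator that caps how much capacity free users can hold."""

    def __init__(self, total_slots: int, free_slots: int) -> None:
        if not 0 <= free_slots <= total_slots:
            raise ValueError("free_slots must be between 0 and total_slots")
        self.total_slots = total_slots
        self.free_slots = free_slots  # upper bound on slots held by the free tier
        self.in_use = {"free": 0, "premium": 0}

    def try_acquire(self, tier: str) -> bool:
        """Admit a request for the given tier, or push back by returning False."""
        busy = self.in_use["free"] + self.in_use["premium"]
        if busy >= self.total_slots:
            return False  # everything is full, regardless of tier
        if tier == "free" and self.in_use["free"] >= self.free_slots:
            return False  # free tier has used up its share
        self.in_use[tier] += 1
        return True

    def release(self, tier: str) -> None:
        """Give back a slot once a request of this tier has finished."""
        if self.in_use[tier] > 0:
            self.in_use[tier] -= 1


gate = TieredAdmission(total_slots=3, free_slots=1)
decisions = [
    gate.try_acquire("free"),     # admitted
    gate.try_acquire("free"),     # pushed back: free share exhausted
    gate.try_acquire("premium"),  # admitted
    gate.try_acquire("premium"),  # admitted
    gate.try_acquire("premium"),  # pushed back: all slots busy
]
```

Premium requests can use any slot, while free requests are refused as soon as their share is taken, so a surge of free traffic cannot starve paying users.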

While backpressure is a fundamental and essential operational pattern, it can be a crude tool if applied without regard to downstream concerns. It can be refined if clients (senders) and services (recipients) can agree on a cooperative protocol. For instance, TCP’s flow control mechanism uses a variable, sliding window to relay to clients how much room is left in the receiving queue. This information allows clients to adjust their sending rate. Of course, such cooperation increases dependency and coupling between services, which brings its own set of problems.
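The cooperative idea behind a sliding window can be sketched in a few lines. This is a heavily simplified model and not TCP itself: it counts whole messages instead of bytes, has no acknowledgements or retransmission, and the `Receiver` and `send_all` names are made up for this example.

```python
from collections import deque
from typing import Deque, Iterable, List


class Receiver:
    """Toy receiver with a bounded queue that advertises its free space."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.buffer: Deque[str] = deque()

    def window(self) -> int:
        """Room left in the receiving queue, i.e. the advertised window."""
        return self.capacity - len(self.buffer)

    def receive(self, items: List[str]) -> None:
        if len(items) > self.window():
            raise OverflowError("sender exceeded the advertised window")
        self.buffer.extend(items)

    def consume(self, count: int) -> int:
        """Process up to `count` queued items and return the new window."""
        for _ in range(min(count, len(self.buffer))):
            self.buffer.popleft()
        return self.window()


def send_all(messages: Iterable[str], receiver: Receiver, drain_per_round: int) -> List[int]:
    """Send in rounds, never exceeding the last advertised window.

    Returns the size of each batch so the adaptation is visible.
    """
    if drain_per_round < 1:
        raise ValueError("receiver must drain at least one item per round")
    pending = deque(messages)
    window = receiver.window()
    batch_sizes: List[int] = []
    while pending:
        batch = [pending.popleft() for _ in range(min(window, len(pending)))]
        receiver.receive(batch)
        batch_sizes.append(len(batch))
        # The receiver works off part of its queue and relays the new window.
        window = receiver.consume(drain_per_round)
    return batch_sizes


batch_sizes = send_all([f"m{i}" for i in range(10)], Receiver(capacity=4), drain_per_round=2)
```

After an initial burst that fills the queue, the sender settles at the rate the receiver can drain, which is the behavior the window is meant to produce.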

Backpressure in Dynamic Environments

Backpressure is even more important in dynamic environments such as an organic architecture. Because an organic architecture exists to be as responsive as possible to a business’s needs, the demand on every service in the landscape may change at any time and without warning. And because companies that employ an organic architecture have typically organized into many teams, each running their own independent deployment pipelines in parallel, changes tend to occur independently as well. As a result, the overall service landscape is continually evolving, causing continuous and unforeseeable shifts in demand and load patterns between services. This volatility can build up quickly and lead to complex emergent behaviors with dire consequences for individual services or entire zones in the architecture.

Figure 1. Extracting services from a monolith to modernize the existing architecture often allows them to service downstream client requests significantly faster, leading to overloading of upstream services whose owners are typically blindsided by such changes in demand. Operators need to detect these changes rapidly and be able to remediate immediately to prevent systemic failure.

These emergent behaviors heighten the need for operators to protect their systems and architecture. Traditionally, protection against overload was often (but not always) built into the systems themselves. For instance, web servers would limit the number of workers available to service requests and databases would make use of a connection pool to decouple query execution from incoming requests. In both cases, the system would simply stop responding to additional requests until capacity was freed up again to service them.
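The connection pool variant of this built-in protection can be sketched with Python's standard `queue` module. The `ConnectionPool` class and the `conn-N` strings are stand-ins invented for illustration; a real pool would hold actual database connections, and the timeout values are arbitrary.

```python
import queue
from contextlib import contextmanager
from typing import Iterator


class ConnectionPool:
    """Hypothetical fixed-size pool: callers wait until a connection is free."""

    def __init__(self, size: int) -> None:
        self._idle: "queue.Queue[str]" = queue.Queue(maxsize=size)
        for i in range(size):
            self._idle.put(f"conn-{i}")  # placeholder for a real connection

    @contextmanager
    def connection(self, timeout: float) -> Iterator[str]:
        """Borrow a connection, waiting up to `timeout` seconds for one."""
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            # No capacity freed up in time: the caller is pushed back.
            raise TimeoutError("connection pool exhausted") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return the connection to the pool


pool = ConnectionPool(size=1)
with pool.connection(timeout=0.05) as first:
    try:
        with pool.connection(timeout=0.05):
            second_caller_blocked = False
    except TimeoutError:
        second_caller_blocked = True  # the only connection is still in use

with pool.connection(timeout=0.05) as again:
    reused = again  # capacity was freed, so the request is served
```

The pool size bounds how many queries run at once, no matter how many requests arrive, which protects the one system but does nothing for the rest of the landscape.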

However, with the cloud, containers, large-scale service architectures and their intrinsic emergent behaviors, protecting individual systems is no longer an effective strategy to ensure system stability. Such large-scale and dynamic environments require operators to manage stability and prevent failure from a systemic perspective. In doing so, exerting backpressure at the level of service interaction and between arbitrary sets of endpoints becomes a fundamental and essential operational capability.

Also, because operators of organic architectures cannot predict how much demand compounding load shifts will create, or where, when and how quickly that demand will materialize, they need to be able to see demand changes instantly and remediate them quickly to protect arbitrary systems.

Applying Backpressure with Glasnostic

Glasnostic is a realtime operations solution that enables enterprises to control the complex emergent behaviors that their organic architectures exhibit. It allows operators to detect such behaviors in real time and remediate them instantly by applying predictable, best-practice operational patterns such as backpressure to arbitrary sets of endpoints.

For instance, in the modernization example given in Figure 1, operators need to be able to quickly spot where increased demand is not being met with a commensurate upstream response. Once it is identified, they’ll need to react immediately by exerting backpressure to protect the overloaded parts of the architecture and allow upstream activity to recover. This can be accomplished in the Glasnostic console by visually identifying high-activity interactions that are not being “passed on” upstream. A brief look at the metric history can then be used to confirm whether systems are already degraded and a remediation policy can be injected immediately (Figure 2).

Figure 2. Using Glasnostic, an operator spots a set of services experiencing increased demand (1) while upstream service activity appears suppressed (2). Examining the upstream history confirms that activity has all but collapsed under the load (3).

Similarly, when a new deployment causes widespread slowness in a seemingly unrelated part of the organic architecture, operators need to quickly identify whether and where services are being negatively impacted. If such services are found, operators need the ability to immediately exert backpressure by injecting suitable policies (Figure 3).

Figure 3. A data access layer service has been impacted by a large number of clients and higher demand than usual (1). This led to a continuous rise in load, ultimately exhausting the connection pool. Operators decided to apply backpressure at both the concurrency and bandwidth levels (2) until additional capacity became available.
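To illustrate what limiting at the bandwidth level can mean in general terms, here is a token bucket sketch. It does not represent how Glasnostic implements its policies; the `TokenBucket` class and the rate and burst figures are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Hypothetical bandwidth limiter: bytes may pass only while tokens remain."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = now

    def allow(self, nbytes: int, now: float) -> bool:
        """Return True if `nbytes` may be sent at time `now` (in seconds)."""
        elapsed = max(0.0, now - self.last)
        # Refill at the configured rate, but never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over budget: the caller must wait or drop


bucket = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
results = [
    bucket.allow(150, now=0.0),  # fits in the initial burst
    bucket.allow(100, now=0.0),  # only 50 tokens left, refused
    bucket.allow(100, now=0.5),  # 50 tokens refilled, exactly enough
    bucket.allow(1, now=0.5),    # bucket is empty again, refused
]
```

A concurrency limit caps how many requests are in flight, while a bucket like this caps how much data flows per unit of time, so the two complement each other.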

As a final example, consider the case of a suspected denial-of-service attack (Figure 4). In this scenario, operators need to quickly identify patterns of increased load and then examine the balance of clients to determine which have been compromised. Setting the question of malice aside, operators group these clients in a channel and then exert backpressure against them to contain the attack.

Figure 4. Thwarting a DoS situation consists of three steps: discovering the pattern of increased traffic (1), identifying and quantifying its sources (2), and containing them in a channel to apply suitable backpressure (3).
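The three steps can be mimicked on a plain list of request sources. This sketch only illustrates the reasoning: the `heavy_sources` and `contain` helpers, the 50% share threshold and the per-source cap are hypothetical, and the "channel" here is just a Python set rather than the Glasnostic construct.

```python
from collections import Counter
from typing import Iterable, List, Set


def heavy_sources(request_log: Iterable[str], share_threshold: float) -> Set[str]:
    """Steps 1 and 2: quantify traffic per source and pick the dominant ones."""
    counts = Counter(request_log)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {src for src, n in counts.items() if n / total >= share_threshold}


def contain(request_log: Iterable[str], channel: Set[str], per_source_limit: int) -> List[str]:
    """Step 3: cap requests from sources in the channel, pass everything else."""
    seen: Counter = Counter()
    admitted: List[str] = []
    for src in request_log:
        if src in channel:
            seen[src] += 1
            if seen[src] > per_source_limit:
                continue  # backpressure: this request is not passed on
        admitted.append(src)
    return admitted


log = ["a"] * 8 + ["b", "c"]  # source "a" sends 80% of all requests
channel = heavy_sources(log, share_threshold=0.5)
admitted = contain(log, channel, per_source_limit=2)
```

Because the decision rests on observed share of traffic rather than on intent, it works the same whether the offending clients are malicious or merely misbehaving.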

Backpressure is a fundamental operational pattern used to protect the continued availability and stability of systems that have come under duress from heightened demand.

Backpressure takes on a particularly essential role in dynamic environments such as an organic architecture. Why? Because such environments exhibit a high rate of unpredictable changes that are rarely related to each other, mismatches between demand and capacity are a common yet dangerous occurrence. Operators therefore need to be able to exert backpressure at any time and between arbitrary sets of endpoints.

Glasnostic is a cloud traffic control plane that allows operators to quickly detect complex emergent behaviors and to promptly remediate any fallout by applying tried-and-tested operational patterns such as backpressure.

Originally published at glasnostic.com.
