SRE: Resiliency: Bulkheads in Action — Partitioning to Minimize Failure Impact

dm03514 · Dm03514 Tech Blog · Dec 27, 2018

Bulkheads are a core resiliency pattern that provides hard limits to bound resource usage. They are a common pattern that many libraries and components regularly employ. This post walks through a series of test cases to illustrate how bulkheads allow applications to enhance availability, by strictly controlling resources, even in the face of overload. Links to detailed technical overviews of bulkheads are provided in the references section.

What are Bulkheads?

Bulkheads are a way to partition an application. They provide a way to bound concurrency and to limit the number of concurrent actions. The term bulkhead originates from shipbuilding and refers to partitioning off parts of a ship. The Polly resiliency library explains:

A bulkhead is a wall within a ship which separates one compartment from another, such that damage to one compartment does not cause the whole ship to sink.

Ship Hull Bulkheads

Bulkheads enable isolation through concrete enforceable resource limits. Bulkheads are everywhere in software. Semaphores, Worker pools, Thread pools, Service/Process isolation and network isolation are all examples of bulkheads.
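
As a minimal sketch (not any particular library's API; the names here are made up for illustration), a promise-based semaphore captures the core idea: a hard cap on in-flight work, with anything beyond the cap rejected immediately:

// Illustrative semaphore-style bulkhead.
class SimpleBulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  // Run fn only if capacity is available; otherwise reject immediately.
  run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      return Promise.reject(new Error('bulkhead exhausted'));
    }
    this.inFlight += 1;
    return Promise.resolve()
      .then(fn)
      .finally(() => { this.inFlight -= 1; });
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const bulkhead = new SimpleBulkhead(2);

// The first two calls fit inside the bulkhead; the third is shed immediately.
bulkhead.run(() => sleep(500)).then(() => console.log('done 1'));
bulkhead.run(() => sleep(500)).then(() => console.log('done 2'));
bulkhead.run(() => sleep(500)).catch((err) => console.log(err.message));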

Consider a web server like NGINX. NGINX is configured with a worker pool (worker_processes) denoting the maximum number of worker processes. No matter how many requests are applied, NGINX has a hard limit on the number of processes it will create. When there is more load than NGINX can handle, it begins to queue connections and eventually drops them (load-shedding) instead of spawning more processes.

Why use bulkheads

Bulkheads effectively isolate components and protect from cascading failures through the enforcement of limits. In the NGINX example above the worker pool isolates the rest of the system from failure and provides hard limits on the amount of resources (processes/connections) that NGINX will consume.

Finally, consider a web application that makes a database query per http request. Each database query creates a new connection, performs a query, and then closes the connection. Without a bulkhead the number of database connections is directly correlated with the number of http requests. Traffic peaks could easily overwhelm the database. If the site got featured and traffic grew 1000x, it would result in 1000x the number of database connections. If this causes the database to become slow, it could cascade into the service, which may in turn cause its callers to slow down! And if its callers call the service unbounded, this could cascade through the entire system! Using a bulkhead, each service instance could maintain a pool of, say, 10 connections. No matter how popular the site becomes, the service instance will never use more than 10 connections.
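
As a sketch of what that looks like in node.js (using node-postgres; the query and handler here are only illustrative), the pool's max option is the bulkhead:

const { Pool } = require('pg');

// The pool never opens more than 10 connections, no matter how many
// HTTP requests arrive; excess queries wait for a free connection.
const pool = new Pool({ max: 10 });

const handler = (req, res) => {
  pool.query('SELECT now()')
    .then((result) => res.send(result.rows))
    .catch((err) => res.status(500).send({ error: err.message }));
};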

How?

In order to illustrate how bulkheads perform in action this post uses an example application (available here) and the resilience4js (WIP) resiliency library. The diagram below shows the test service. It is a simple node.js express service that exposes a single HTTP handler. In addition it exposes prometheus metrics and hypothetically exposes health checks.

In order to protect the prometheus and health check functionality the http handling will be executed inside of a bulkhead (denoted by the black border in the image above). The bulkhead will be configured to limit the number of allowed http requests to 100 and then three load tests will be executed:

  • 50% bulkhead capacity
  • 100% bulkhead capacity
  • 150% bulkhead capacity

Each load test will graph the application and the state of the bulkhead during the test.

Note: the resilience4js library operates by decorating promises. It takes its inspiration from resilience4j (which is itself inspired by Netflix's Hystrix) and provides a semaphore based bulkhead. One important consideration is that the resilience4js bulkhead currently does not queue. If the bulkhead is saturated an error is immediately raised. In a bulkhead that supports queueing (such as many database connection pools) the action may block, which is why bulkheads should be combined with timeouts.
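
A timeout can be layered on top with something as simple as Promise.race-style racing. The withTimeout helper below is hypothetical (it is not part of resilience4js), and someBulkheadedCall stands in for any action that may queue inside a bulkhead:

// Hypothetical helper: settle with the action's result, or reject after ms.
const withTimeout = (promise, ms) => new Promise((resolve, reject) => {
  const timer = setTimeout(() => reject(new Error('timed out')), ms);
  promise.then(
    (val) => { clearTimeout(timer); resolve(val); },
    (err) => { clearTimeout(timer); reject(err); }
  );
});

// Bound how long a caller can wait on a queueing bulkhead.
withTimeout(someBulkheadedCall(), 1000)
  .catch((err) => console.error(err.message));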

The first step is to create a bulkhead by assigning it a name, a capacity, and optionally metrics to publish to:

const bulkhead = resilience4js.Bulkhead.New('http', 1000, metrics);

The next step is to decorate a promise to enable partitioning through the bulkhead:

const wrappedGet = bulkhead.decoratePromise(get);

The final step is to handle the promise's outcome. The example below returns 429 (Too Many Requests) when the bulkhead rejects the call:

wrappedGet(req, res)
  .then((val) => {
    res.send(val);
  })
  .catch((err) => {
    if (err instanceof BulkheadExhaustedError) {
      // The bulkhead is full: shed load with 429 Too Many Requests.
      res.status(429);
      res.send({ error: err.message });
    } else {
      throw err;
    }
  });

Each of the tests uses a dummy express handler which sleeps for 500ms and then returns:

const get = (req, res) => {
  return new Promise((resolve, _) => {
    // Simulate 500ms of work before responding.
    setTimeout(() => {
      resolve('hi');
    }, 500);
  });
};
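
For reference, the pieces above wire into express roughly as follows. This is only a sketch: the package name in the require and the metrics object are assumptions, and error handling is abbreviated compared to the handler shown earlier:

const express = require('express');
const resilience4js = require('resilience4js'); // assumed package name

const app = express();
// metrics is assumed to be built with the library's metrics helpers (omitted here).
const bulkhead = resilience4js.Bulkhead.New('http', 100, metrics);
const wrappedGet = bulkhead.decoratePromise(get);

app.get('/', (req, res) => {
  wrappedGet(req, res)
    .then((val) => res.send(val))
    .catch((err) => {
      // Shed load when the bulkhead is exhausted (see the
      // BulkheadExhaustedError check in the handler above).
      res.status(429).send({ error: err.message });
    });
});

app.listen(3000);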

50% Capacity

The first test configures a bulkhead with 100 allowed concurrent http requests and then applies 100 requests per second. Since each request takes roughly 500ms, 100 requests per second works out to about 50 in-flight requests at any time, or 50% of the bulkhead's capacity.

const bulkhead = resilience4js.Bulkhead.New('http', 100, metrics);

100 requests/second is then applied to the service:

$ echo "GET http://localhost:3000/" | vegeta attack -rate 100 -duration=0 | tee results.bin | vegeta report

The results are then surfaced to prometheus using the resilience4js prometheus surfacer:

Bulkhead Half Capacity

The top left chart “Utilization Percentage” shows that the bulkhead is 50% utilized during the test. The upper right hand chart reflects the same thing showing that the bulkhead is configured to allow 100 maximum concurrent calls (yellow line) with ~50 available during the test (green line). The bottom left hand graph shows the application processing the expected 100 requests per second.

Bulkhead 1/2 Capacity ~500ms median

Finally the chart above shows the latency of the HTTP requests during this time. The requests hover around 500ms, as expected.

100% Capacity

This test applies 200 requests per second. At roughly 500ms per request, that corresponds to about 100 in-flight requests, i.e. the full capacity of the bulkhead:

$ echo "GET http://localhost:3000/" | vegeta attack -rate 200 -duration=0 | tee results.bin | vegeta report

The upper left hand chart above shows the jump to 100% bulkhead saturation. A different view is reflected in the upper right hand chart showing a dip in available calls (green line) down to 0, and a maximum allowed concurrent HTTP calls at 100 (yellow line).

Once again the latency of the 200 requests per second hovers around 500ms:

The application is performing as expected since all actions are within the constraints of the bulkhead.

150% Capacity

The final test illustrates what happens when the bulkhead becomes saturated because its capacity is exceeded. 300 requests per second are applied (roughly 150 in-flight requests against a capacity of 100):

$ echo "GET http://localhost:3000/" | vegeta attack -rate 300 -duration=0 | tee results.bin | vegeta report

The upper left hand “Utilization Percentage” chart shows that the bulkhead is saturated at ~100%. The capacity chart in the upper right reflects this: once again it shows 100 max concurrent calls (yellow line) and 0 available calls (green line). The bottom left hand chart shows the application handling 300 requests per second. 200 of those calls result in a 200 status code while 100 result in a 429! The bulkhead is shedding 100 calls per second because it is saturated! The graph below shows that the 200 calls per second that are getting through are completed in the expected latencies (median ~500ms).

Saturated — 200 status code latency

The final graph below shows that the application is rejecting the shed requests almost immediately (median ~3ms).

Saturated — 429 status code latency

Another thing to note is that the use of a bulkhead doesn’t prevent overload; at some point the service will exhaust its resources and be overwhelmed. The guarantee the bulkhead provides is that, while that is happening, the maximum number of concurrent calls (100) will still not be exceeded.

When to Use?

In my experience unbounded resource usage is one of the largest risks to availability. Because of this I would put every external call inside of a bulkhead. Without bulkheads, resource usage can grow unbounded in relation to the amount of work being done, which can easily overwhelm a system and cause cascading failures.

Many tools already include bulkheads in the form of libraries that expose pools. Additionally, circuit breakers (like Hystrix) combine circuit breaking with bulkheading by bounding the number of concurrent calls allowed. Envoy provides a bolt-on network level bulkhead that doesn’t require any application level changes.

Bulkheads enable enforceable resource controls and help to maximize availability and resiliency, which directly benefits clients.

Bulkheads are an effective way to bound resource usage. They ensure isolation and help to mitigate cascading failures by preventing overload. Bulkheads are not a silver bullet, but they are one of many primitive patterns for creating resilient, highly available applications.
