Building Resilient Systems: Exploring the Bulkhead and Circuit Breaker Patterns

Aggelos Bellos
Dec 4, 2023


The Bulkhead and Circuit Breaker patterns

In the era of microservices, it’s common for services to either depend on others or act as dependencies. These situations call for a method to robustly safeguard our application’s components when a dependency acts up or doesn’t respond at all.

The Bulkhead Pattern

Borrowed from ship design, where bulkheads divide a hull into sections so that flooding stays contained, this pattern does something similar for microservices. It focuses on isolating resources so that a problem in one part cannot turn into a system-wide failure. The isolation is achieved through resource pools, which can cover anything that can be exhausted (CPU, GPU, memory, etc.).

Real world example: Read vs Update

Moving beyond theory, let’s dive into the real world. Suppose we have two endpoints:

  • GET /products/{id} : Returns the data of a single product. It is a lightweight action whose response can easily be served directly from the cache.
  • PATCH /products/{id} : Updates the data of a single product. This ‘heavy’ action may involve updating data across multiple storage systems and invalidating cache entries.

We can easily see that PATCH is the more resource-intensive request and can easily hurt the performance of the application. Consider the scenario where this endpoint receives many simultaneous requests. A lot of write requests will hit the database and start clogging the available connections. The issue does not stay isolated: it also affects the users who just want to view the product’s data.

A typical solution is to set a maximum number of threads that can handle this endpoint. This works much like a rate limiter and caps the resources the endpoint can consume.
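The thread cap described above can be sketched with a semaphore. This is a minimal illustration in Python, not a production implementation: the names `Bulkhead`, `BulkheadFullError` and `max_concurrent` are hypothetical, and rejecting the caller immediately when the pool is full (instead of queueing) is an assumption made to keep the example short.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when every slot of the bulkhead is already in use."""


class Bulkhead:
    """Caps how many callers may run a protected operation at once."""

    def __init__(self, max_concurrent: int) -> None:
        # Hypothetical parameter: the size of the isolated pool.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Fail fast instead of waiting when all slots are taken, so
        # callers do not pile up behind a slow operation.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this operation")
        try:
            return func(*args, **kwargs)
        finally:
            # Always give the slot back, even if func raised.
            self._slots.release()
```

In this sketch the PATCH handler would run inside something like `patch_bulkhead.run(update_product, product_id)`, while GET requests use no bulkhead or a separate one, so a burst of updates cannot consume every worker.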

Real world example: Service as a dependency

Going straight to the point, we once again have two endpoints:

  • GET /movies : Returns a list of movies. Its response may or may not be served from the cache.
  • GET /movies/{id} : Returns the data of a specific movie. To construct the necessary data it has 2 dependencies:
    1. RatingsService: It returns the ratings for the movie.
    2. CommentsService: It returns the comments for the movie.

The first endpoint is straightforward, with minimal dependencies. Caching its results could significantly boost performance.

On the other hand, movies/{id} depends on the performance of its dependencies. Even if we have isolated each service’s resources, degraded performance in any dependency will affect the “parent” service and probably the other dependencies as well. To protect the resources of the “parent” service, a rate-limiting approach might be insufficient. The reason is that requests left waiting on a slow dependency still hold on to resources, which can put more stress on other services or exhaust the resources of the parent service itself. By using time-outs and fallbacks we can still return a partially correct response and prevent the failure of the whole service. For example, even if the RatingsService is down or its performance is degraded, we can still return the movie’s main data and its comments.
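A rough sketch of the time-out and fallback idea, in Python. Everything named here is an assumption for illustration: `get_movie`, `call_with_fallback`, the pool sizes, the 0.2 second time-out, and the `fetch_ratings` / `fetch_comments` callables that stand in for calls to RatingsService and CommentsService.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# One small pool per dependency (a bulkhead), so a slow RatingsService
# cannot use up the threads reserved for CommentsService.
ratings_pool = ThreadPoolExecutor(max_workers=2)
comments_pool = ThreadPoolExecutor(max_workers=2)


def call_with_fallback(pool, func, timeout_s, fallback):
    """Run func in the given pool; return fallback on time-out or error."""
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a task
        # that is already running keeps its thread until it finishes.
        future.cancel()
        return fallback
    except Exception:
        return fallback


def get_movie(movie_id, fetch_ratings, fetch_comments, timeout_s=0.2):
    """Build a movie response that degrades when a dependency misbehaves."""
    return {
        "id": movie_id,
        "ratings": call_with_fallback(
            ratings_pool, lambda: fetch_ratings(movie_id), timeout_s, None
        ),
        "comments": call_with_fallback(
            comments_pool, lambda: fetch_comments(movie_id), timeout_s, []
        ),
    }
```

With this shape, a stalled ratings call costs the caller at most the time-out and yields `"ratings": None`, while the comments are still returned. The sketch waits on the two dependencies one after the other; a real handler would more likely submit both calls first and then collect the results.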

We have increased the isolation even more, but what if a service starts acting up? How do we stop ourselves from making things worse?

The Circuit Breaker pattern

In a distributed environment each service is autonomous, so from the point of view of its callers every service has to be treated as unreliable. A server can be on fire, or simple network errors might occur. Because of this uncertainty, there needs to be a way to signal that a service has stopped accepting requests.

It sounds good, but why do we need an external service to know when another service is unresponsive? We can just ping it directly, right?
The answer of every good engineer: it depends. For simple applications this might be sufficient and you might never have a problem. For bigger applications things are a little different. When a service becomes unavailable, requests to it start to pile up, and eventually there may be too many for the service to restart and recover. The result can be an endless cycle of pinging and restarting.

This is where the Circuit Breaker pattern comes to the rescue. We basically build another service on top of our components that is responsible for monitoring their health and proxying traffic to them. If the performance of a component degrades, the Circuit Breaker stops proxying traffic to it. Using a controlled pinging system (where only one client checks the health status), the Circuit Breaker can later resume directing traffic to the component. More complex criteria can be set for when the switch is activated, but the main idea stays the same.
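The behaviour described above is often modelled as three states: closed (traffic flows), open (traffic is rejected) and half-open (a trial call is allowed through to check health). Below is a minimal in-process sketch in Python rather than a separate proxy service. The class name, `failure_threshold`, `reset_timeout_s` and the injectable `clock` are all assumptions chosen for illustration, and the sketch is not thread-safe: a real implementation would need a lock to guarantee that only one trial call runs in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock        # injectable so the timing can be tested
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"     # cool-down elapsed: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency is marked unhealthy")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed,
            # (re)opens the circuit and restarts the cool-down.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of adding load to a struggling component, which gives it room to recover before the next trial call.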

Summary

In our tour of resilient systems we have discussed two pivotal patterns: the Bulkhead and the Circuit Breaker. The Bulkhead pattern, drawing inspiration from naval architecture, emphasizes the importance of resource isolation. It ensures that the issues of one segment do not escalate into a systemic failure.

On the other hand, the Circuit Breaker addresses the challenges of inter-service dependencies in a distributed environment. It keeps an eye on the health of services and carefully manages the flow of requests to stop the system from getting overwhelmed. This pattern is really important for dealing with unexpected problems like network issues or server failures.

In conclusion, both patterns play a critical role in building robust and resilient microservices architectures. By isolating services and smartly managing traffic, they collectively enhance system stability and reliability. Applying them can significantly improve the resilience of distributed systems and keep operations smoother and more reliable in dynamic and challenging environments.
