Bulkhead Pattern for Microservices

Feb 2, 2020

Cascading effects of failure

Recently, while working on a project that required a large number of external service interactions, I realized that even one buggy service can bring the whole system down.

Scenario: Suppose one of the external services becomes unresponsive, so that it blocks the requesting thread for the full timeout period and then throws an error.

If we are making a lot of parallel calls to this service, a large number of threads end up blocked, which in turn leads to resource exhaustion, and our own service stops responding as well. The failure can then cascade and bring the whole system down.

The bulkhead pattern

The bulkhead pattern is named after the bulkheads of a ship, the partitions that divide the hull into watertight compartments. If the hull is breached, only the damaged compartment fills with water, which keeps the ship from sinking.

In a bulkhead architecture, we allocate a bounded set of resources to each component so that no single component can exhaust them all. An issue affecting one consumer or service stays isolated within its own bulkhead instead of taking down the entire system.

In our case, we allocated a fixed-size thread pool to each external service (using Hystrix thread pools), so a misbehaving service can block at most that fixed number of threads. We combined this with the retry, circuit breaker, and throttling patterns to provide more sophisticated fault handling.
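To make the idea concrete without any framework, here is a minimal sketch of thread-pool isolation using only the JDK. It is not how Hystrix is implemented internally; the class name Bulkheads, the service names "slow" and "fast", and the pool size of 2 are all hypothetical and chosen only for illustration. Each service gets its own fixed pool with no queue, so once a pool is saturated further calls to that service are rejected immediately while other services keep working.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: one small, bounded thread pool per downstream service. */
final class Bulkheads {
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();
    private final int threadsPerService;

    Bulkheads(int threadsPerService) {
        this.threadsPerService = threadsPerService;
    }

    /** Runs the call on the pool owned by serviceName; throws RejectedExecutionException if that pool is saturated. */
    <T> Future<T> submit(String serviceName, Callable<T> call) {
        ExecutorService pool = pools.computeIfAbsent(serviceName, name ->
                new ThreadPoolExecutor(threadsPerService, threadsPerService,
                        0L, TimeUnit.MILLISECONDS,
                        new SynchronousQueue<>(),               // no queueing: excess calls fail fast
                        new ThreadPoolExecutor.AbortPolicy())); // reject instead of blocking the caller
        return pool.submit(call);
    }

    void shutdownNow() {
        pools.values().forEach(ExecutorService::shutdownNow);
    }

    /** Saturates the "slow" bulkhead and shows that "fast" is unaffected. */
    static String demo() {
        Bulkheads bulkheads = new Bulkheads(2);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Two calls that hang, standing in for a non-responsive service.
            for (int i = 0; i < 2; i++) {
                bulkheads.submit("slow", () -> { release.await(); return "late"; });
            }
            boolean rejected = false;
            try {
                bulkheads.submit("slow", () -> "never runs");
            } catch (RejectedExecutionException e) {
                rejected = true; // the slow service cannot take a third thread
            }
            String fast = bulkheads.submit("fast", () -> "ok").get(1, TimeUnit.SECONDS);
            return "slow rejected=" + rejected + ", fast=" + fast;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            bulkheads.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "slow rejected=true, fast=ok"
    }
}
```

The design choice worth noting is the absence of a queue: a queue in front of a hung service only delays the failure, whereas an immediate rejection lets the caller fall back or fail fast.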

Sample Implementation

We used Feign proxies for REST interactions and Ribbon as the client-side load balancer.

Spring configuration for Ribbon:

external-proxy:
  ribbon:
    ConnectTimeout: 1000
    ReadTimeout: 6000
    MaxAutoRetries: 3
    MaxAutoRetriesNextServer: 0
    retryableStatusCodes: 404,502,504

Note: external-proxy is the name of the Feign proxy for the external service.
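It can help to work out how long a single logical call may take with these settings. The sketch below is plain arithmetic, not a Ribbon API: RetryBudget and its method names are hypothetical. It assumes the worst case, where every attempt runs into both the connect and the read timeout and Ribbon actually retries the call (whether it does depends on the HTTP method and the kind of error).

```java
/** Hypothetical helper that mirrors the Ribbon settings shown above. */
final class RetryBudget {

    /** Total attempts = (retries on the same server + 1) * (retries on other servers + 1). */
    static int maxAttempts(int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (maxAutoRetries + 1) * (maxAutoRetriesNextServer + 1);
    }

    /** Upper bound on elapsed time if every attempt exhausts both timeouts. */
    static long worstCaseMillis(long connectTimeoutMillis, long readTimeoutMillis,
                                int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (connectTimeoutMillis + readTimeoutMillis)
                * maxAttempts(maxAutoRetries, maxAutoRetriesNextServer);
    }

    public static void main(String[] args) {
        System.out.println(maxAttempts(3, 0));                 // 4
        System.out.println(worstCaseMillis(1000, 6000, 3, 0)); // 28000
    }
}
```

With the values above, one call can hold a thread for up to 28 seconds in the worst case, unless an enclosing timeout cuts it short first. That is the number to keep in mind when sizing thread pools and any outer timeout.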

Spring configuration for Hystrix:

hystrix:
  threadpool.default.coreSize: 10
  command:
    ExternalProxy#method(arg1, arg2):
      execution:
        isolation.thread.timeoutInMilliseconds: 6100
      circuitBreaker:
        requestVolumeThreshold: 10
        errorThresholdPercentage: 50
        sleepWindowInMilliseconds: 60000
      metrics:
        rollingStats:
          timeInMilliseconds: 60000

Explanation

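Hystrix gives each thread pool 10 threads (threadpool.default.coreSize), so calls to the external service can occupy at most 10 threads at a time. The circuit breaker settings mean that, within a rolling window of 60 seconds, once at least 10 requests have been made (requestVolumeThreshold) and 50 percent or more of them have failed (errorThresholdPercentage), the circuit opens. For the next 60 seconds (sleepWindowInMilliseconds) calls fail immediately without reaching the external service, so they do not tie up threads waiting on it. After that, Hystrix lets a single trial request through: if it succeeds the circuit closes again, otherwise it stays open for another 60 seconds. Note that the command timeout of 6100 ms sits just above Ribbon's ReadTimeout of 6000 ms, so a retry that follows a read timeout is likely to be cut off by Hystrix; letting all retries complete would need a larger command timeout.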
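The interplay of the three circuit breaker settings is easier to see in code. The following is a deliberately simplified, single-threaded sketch and not Hystrix's actual implementation: SimpleCircuitBreaker and its methods are hypothetical names, outcomes are kept in a plain list rather than rolling buckets, and the clock is injected so the behaviour can be checked without waiting.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Simplified illustration of requestVolumeThreshold, errorThresholdPercentage and sleepWindow. */
final class SimpleCircuitBreaker {
    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMillis;
    private final long rollingWindowMillis;
    private final LongSupplier clock; // injected time source, in milliseconds
    private final Deque<long[]> outcomes = new ArrayDeque<>(); // {timestamp, 1 if failed else 0}
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMillis, long rollingWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.rollingWindowMillis = rollingWindowMillis;
        this.clock = clock;
    }

    /** True if a call may go out: circuit closed, or open but the sleep window has elapsed (trial call). */
    boolean allowRequest() {
        return !open || clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    void recordSuccess() {
        if (open) {          // trial call succeeded: close the circuit and start fresh
            open = false;
            outcomes.clear();
            return;
        }
        record(false);
    }

    void recordFailure() {
        if (open) {          // trial call failed: stay open for another sleep window
            openedAt = clock.getAsLong();
            return;
        }
        record(true);
    }

    private void record(boolean failed) {
        long now = clock.getAsLong();
        outcomes.addLast(new long[] {now, failed ? 1 : 0});
        // Drop outcomes that have fallen out of the rolling window.
        while (!outcomes.isEmpty() && now - outcomes.peekFirst()[0] >= rollingWindowMillis) {
            outcomes.removeFirst();
        }
        long failures = outcomes.stream().filter(o -> o[1] == 1).count();
        int total = outcomes.size();
        // Trip only with enough traffic AND a high enough failure percentage.
        if (total >= requestVolumeThreshold
                && failures * 100 >= (long) errorThresholdPercentage * total) {
            open = true;
            openedAt = now;
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        SimpleCircuitBreaker cb = new SimpleCircuitBreaker(10, 50, 60_000, 60_000, () -> now[0]);
        for (int i = 0; i < 5; i++) cb.recordSuccess();
        for (int i = 0; i < 5; i++) cb.recordFailure();
        System.out.println(cb.allowRequest()); // false: 10 requests seen, 50% failed
        now[0] = 60_000;
        System.out.println(cb.allowRequest()); // true: sleep window elapsed, trial call allowed
    }
}
```

Nine requests with four failures would leave this breaker closed, because the volume threshold of 10 has not been reached; the tenth request is what makes the failure percentage count.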

Conclusion

The goal of the bulkhead pattern is to keep a fault in one part of a system from taking the entire system down. Resources are partitioned so that the resources used to call one service are not affected by calls to another service. A bulkhead architecture protects us from cascading failures and helps in building fault-tolerant systems.