Oosterscheldekering storm surge bulkheads (Image credit: delta.nl)

Preventing Systemic Failure: Bulkheads in Microservice Architectures

Yu-Han Lin
Glasnostic

--

More microservices and a proliferation of applications lead to more complex interactions between applications and, ultimately, to less control for operators. Such large-scale, dynamic service landscapes are great for business agility, but they exhibit emergent behaviors that are difficult to manage and secure. These behaviors are inherently complex, non-linear phenomena that typically occur at scale and without warning. They are the result of aggregate service interactions and must be managed if the business is not to lose its agility.

One popular strategy for containing complex emergent behaviors is to implement the bulkhead operational pattern where possible.

Patterns in microservice architectures usually address either developer or operator concerns. While developer concerns focus on design and deployment, operator concerns emphasize performance, security, availability and scalability: things that matter to those who run software rather than develop it.

What are Bulkheads?

So, what’s with the name? The bulkhead pattern gets its name from naval ship design. In a ship (or submarine), a bulkhead is a dividing wall or barrier between compartments. If, for example, a section of the hull hits a rock (or is hit by a torpedo), the breached compartment may fill with water, but the partitioning limits the damage to that section and keeps the ship from sinking.

As a pattern in microservice architectures, a bulkhead manifests itself as a firewall or some form of network segmentation or gateway. The most significant benefit of a bulkhead is that it prevents failures, attacks or performance degradations from spreading to other portions of the system, because those portions are essentially partitioned off.

The Shortcomings of Bulkheads

Of course, there are situations where you might want to “open a hatch” (the door in a bulkhead) to allow communication between two otherwise partitioned services. Why? Perhaps during normal operations you don’t want Services A and B, located in different availability zones, to communicate with each other. But what if Service A becomes unavailable and you want to fail over temporarily to Service B? This can be challenging to achieve with firewalls, gateways or segmentation software, as they would have to be deployed ahead of time and combined with special-purpose rate limiters or additional tools that automate the required temporary configuration adjustments. The more sustainable approach is to use smarter, more adaptive policies around when and how services can communicate with each other.

The Developer Perspective

As previously mentioned, there are two perspectives to consider when it comes to the implementation of patterns in microservice architectures. In this section, we’ll look at the bulkhead pattern from the perspective of the developer and the typical software they might employ to implement it.

Let’s take a scenario where an enterprise has begun executing on a microservices strategy and two separate teams are developing their own sets of microservices. In the beginning, it is decided that there is no reason the two applications should communicate with each other. As development progresses and business requirements change, however, it becomes clear that they will need to. This communication should be minimal and controlled, so as to avoid any undesirable side effects of “opening the hatch” to the other application. So, before deploying the applications to production, the teams agree on rate limits and on the circumstances under which communication between the applications should be allowed. Implementing a bulkhead between the applications is the likely pattern of choice in this scenario.

Support in Kubernetes

Seeing that containers are the most popular choice for deploying microservices, developers will naturally start by investigating what capabilities native to Kubernetes might help them implement a bulkhead pattern.

To implement bulkheads in Kubernetes, we can leverage its network policy functionality. Setting network policies for ingress traffic has been stable since Kubernetes 1.7, with egress rules added in 1.8. However, network policies in Kubernetes don’t work out of the box: the cluster’s network provider must support them. (The Kubernetes documentation lists the network providers that do.) Now, using the scenario described above, let’s examine how a bulkhead might be implemented.

First, let’s specify that the pods of the first app are selected by the pod selector app=app1 and the pods of the second application are selected by using app=app2. Next, we need to establish a policy for each application to allow only traffic from the same application. This will automatically disallow traffic coming from the other application (see Figure 1).

Figure 1: Kubernetes network policy allowing only traffic from the same app. Apply this policy for each app (once by using the app=app1 selector and once by using app=app2).
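
As a rough sketch, assuming both applications run in a single, shared namespace and their pods carry the app labels mentioned above (the policy name and namespace are placeholders), such a policy might look like this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app1-isolate        # placeholder name; create a second policy with app=app2
  namespace: default        # assumes both apps share this namespace
spec:
  podSelector:
    matchLabels:
      app: app1             # applies to all pods of the first application
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: app1         # allow ingress only from pods of the same app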

The next step is to apply a network policy for each traffic flow between the applications that we want to whitelist. Let’s say, for example, that the search and the web services of app2 both need to access the users service of app1, because they share the same user data (see Figure 2).

Figure 2: Kubernetes network policy whitelisting traffic from the search and web services of app2 to the users service of app1.
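
Continuing the sketch from above, a whitelisting policy for these two flows might look like the following. The service label used to distinguish the users, web and search pods is an assumption, as is the resource name:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app2-web-search-to-users   # placeholder name
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: app1
      service: users        # the policy protects app1's users pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: app2
          service: web      # allow app2's web pods...
    - podSelector:
        matchLabels:
          app: app2
          service: search   # ...and app2's search pods

Because Kubernetes network policies are additive, this policy simply widens the set of allowed peers established by the per-app isolation policy in Figure 1.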

As you can see, network policies are enforced at the connection level. In our example, this means the search and web services are permitted to initiate connections to the users service, but not the other way around. It is also worth noting that although we have allowed traffic to the users service, no rate limiting is possible with Kubernetes network policies alone. So, although we have locked down the directional flow of traffic, we have at the same time given app2 uncontrolled access that could potentially crash app1. You’ll see below how this problem manifests again with Istio. For more information about how network policies work in Kubernetes, we recommend reading “Securing Kubernetes Cluster Networking — The Unofficial Guide to Kubernetes Network Policies” by Ahmet Alp Balkan.

Support in Linkerd

Linkerd is an open source project sponsored by Buoyant and arguably “the original” service mesh. Initially written in Scala on top of Twitter’s Finagle library, from which it evolved, it has since merged with the lightweight Conduit project and relaunched as Linkerd 2.0.

Both Linkerd v1 and Linkerd v2 focus more on routing than on security or enforcing additional policies and as such do not support the creation of bulkheads. In fact, rate limiting is not supported in either version, although there is a plugin available for Linkerd v1 and a request to have this feature mainlined in a future release.

Support in Istio

Istio is arguably the most popular service mesh at the moment, so developers often look to see how they can leverage it to implement developer or operational patterns such as bulkheads.

To add rate limiting to Istio, policy enforcement needs to be enabled and a quota adapter configured to store quotas, typically backed by Redis in production. If you only need to test configurations, the in-memory memquota adapter can be used instead. Next, for Istio to apply rate limiting, a VirtualService definition needs to be added for each participating service. Finally, rate limiting can be applied to the traffic segments.
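
In Istio 1.x, policy enforcement is disabled by default. One documented way to enable it is at installation time, sketched here assuming a Helm template-based install:

helm template install/kubernetes/helm/istio \
  --name istio --namespace istio-system \
  --set global.disablePolicyChecks=false \
  | kubectl apply -f -    # re-render and apply with Mixer policy checks enabled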

Figure 3 shows an example of an Istio configuration that limits the web service to no more than 500 requests per second against the users service and the search service to no more than 200 requests per second. If either service exceeds its allowed budget of requests per second, Istio’s data plane simply returns HTTP status code 429 (“Too Many Requests”) instead of proxying the request to the service.

Figure 3: Configuration stanzas required to implement two different rate limits for traffic from the “web” and “search” services of “app2” to the “users” service of “app1”.
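
As a rough sketch, a Mixer-era (Istio 1.x) quota configuration along these lines might look as follows. All resource names are placeholders, and the source dimension assumes the pods carry the same service labels as in the earlier Kubernetes example:

apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: quotahandler
  namespace: istio-system
spec:
  compiledAdapter: memquota       # in-memory store for testing; use redisquota in production
  params:
    quotas:
    - name: requestcountquota.instance.istio-system
      maxAmount: 500              # default budget (applies to web): 500 requests...
      validDuration: 1s           # ...per second
      overrides:
      - dimensions:
          source: search          # the search service gets a smaller budget
        maxAmount: 200
        validDuration: 1s
---
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcountquota
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      source: source.labels["service"] | "unknown"   # assumes a "service" pod label
---
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: quota
  namespace: istio-system
spec:
  actions:
  - handler: quotahandler
    instances:
    - requestcountquota
---
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpec
metadata:
  name: request-count
  namespace: istio-system
spec:
  rules:
  - quotas:
    - charge: 1                   # each request consumes one quota unit
      quota: requestcountquota
---
apiVersion: config.istio.io/v1alpha2
kind: QuotaSpecBinding
metadata:
  name: request-count
  namespace: istio-system
spec:
  quotaSpecs:
  - name: request-count
    namespace: istio-system
  services:
  - name: users                   # charge quota for calls to the users service
    namespace: app1               # assumes app1 runs in its own namespace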

Two Perspectives on the Bulkhead

In our scenario, two development teams built siloed applications that are allowed to communicate with each other only under very specific circumstances. Their primary concern has been, “Will my application run as designed?” Although they have obviously given some thought to traffic limits, they have not anticipated the potential problems they’ll run into when these two applications are deployed to production, nor the possible side effects once they start to interact with other services and applications.

If we look at the same scenario from the perspective of the operators, they are going to be more concerned with questions like, “Will any of the services that comprise these two applications cause the entire service landscape to fail?” and “How can I prevent this scenario in real time, without having to rely on static deployment descriptors?” Operators also need the ability to create independent bulkheads that won’t interfere with other pre-existing and potentially overlapping policies. In a nutshell, operations teams want the ability to layer policies around the bulkhead to protect the surrounding architecture.

Glasnostic’s Operational Perspective

Glasnostic is a control plane in the form of a virtual router that is built from the ground up to support operational patterns. It is analogous to a sound engineer’s mixing board. As such, it is designed around the concept of grouping service interactions logically in channels, with each channel acting as a point of control for the interactions it applies to. Glasnostic supports the creation of any number of channels, for arbitrary sets of interactions. Once a channel is defined, operations teams can then control its underlying interactions by applying policies and operations. Channels are also independent of each other and thus can be layered arbitrarily.

Glasnostic is a control plane for operations teams that controls the complex interactions among microservice applications in order to detect and remediate issues, prevent cascading failures and avert security breaches.

Let’s look at two examples of bulkheads from the operational perspective. First, we’ll implement the same example as in the developer perspective above.

Building on that example, we’ll then look at a scenario where the shared users service puts too much load on the existing master data management system used by the whole enterprise.

Example 1: Simple Bulkhead

In this first example, we’ll revisit our earlier scenario, where two microservice applications are generally segmented from each other, but the teams want to allow a limited number of requests from the web and search services of app2 to reach the shared users service of app1.

Unlike Istio, which requires lengthy and tediously complex YAML configurations to implement such rate limiting, Glasnostic users need to create only two simple channels:

  • One channel covering traffic from the web service to the users service, with a limit of 500 requests per second, and
  • Another channel covering traffic from the search service to the users service, with a limit of 200 requests per second.

Figure 4: Simple bulkhead operational pattern (1) involving two channels, “App2 web → users rate limit” (2) and “App2 search → users rate limit” (3), set to 500 and 200 requests per second, respectively.

Example 2: Layered Bulkheads

In this next example, let’s say that the development teams did not take into account the load (more than 100 concurrent requests) that their shared users service would eventually put on an upstream master data management system, which happens to be critical for the operation of the entire service landscape.

To remediate the situation, the operations group quickly creates a bulkhead channel that captures requests from all users services to the master-data services and applies backpressure by setting a concurrency limit of 20 concurrent requests. It also informs the relevant development teams of the newly created bulkhead so they can explore alternative designs that can help avoid the load issue in the future.

Figure 5: Bulkhead pattern with additional “Master Data Bulkhead” channel (2) exerting backpressure against the previous pair of channels (1).

Summary

Connecting microservice applications in a larger service landscape leads to complex interaction behaviors that present a fundamentally new challenge to operations teams. One popular strategy to contain these behaviors is to strategically insert bulkheads into the landscape. Bulkheads should support access limits between otherwise separated groups of services so that services can fail over in a controlled fashion.

Bulkheads can be implemented in a way that focuses on either the concerns of developers or the concerns of operations teams. The key difference between these two implementations is that developers are more concerned with whether or not their application will run as designed, while the operations team is more concerned with whether or not they can effectively secure and scale the services when they begin to interact with additional services in an organic manner.

While Kubernetes and Linkerd provide very limited to no support for bulkheads, Istio can be coerced into implementing bulkheads if configured correctly. Bulkheads become significantly easier with Glasnostic, in particular from the operational perspective. This is because, unlike service meshes, Glasnostic takes a layered approach to policy definition that enables rate limiting in conjunction with the bulkhead operational pattern.

Originally published at https://glasnostic.com on May 14, 2019.
