Improving the Resilience of Your Software: A Practical Approach

Mario Bittencourt
Published in SSENSE-TECH
Oct 9, 2020

At SSENSE, the software we develop is expected to always be available. Our customers from all over the world can interact with our services at their convenience. This makes aspects such as scalability and availability top of mind when we develop our features. These requirements are far from exclusive to us, but I have noticed that this is where several companies’ efforts stop, leaving at least one aspect without the attention it deserves: resilience.

In this article, I will discuss why developing resilient applications is needed and propose a practical approach to incorporating resilience into the design and development of your next project.

Traditional Approach

When developing your application, a well-established approach involves thinking about how to make it always capable of servicing your customers by looking at two factors:

  1. Scalability: The saying “The true enemy of success is success” applies here. Your application receives an influx of new customers at a rate higher than expected, and now it is overwhelmed with requests, which effectively makes it unavailable to most or all customers. The solutions usually come in the form of designing your application to scale horizontally, together with some form of scaling for the resources it uses: web services, databases, and so on.
  2. Availability: Here the common strategy is to avoid single points of failure (SPOF) by adding redundancy to as many components as possible. Need a web server? Add two and balance the traffic between them while monitoring their health. Is one misbehaving? You can take it out of rotation and still be available to do business. Repeat this process throughout the entire infrastructure. With cloud computing, most of the heavy lifting is taken care of behind the scenes.

Anyone doing serious business online has likely already implemented actions to cover both of these factors. Unfortunately, this is not enough, especially for applications based on microservices, as it does not handle an important aspect: the services you talk to will fail, no matter what you do.

Resilience

Resilience is defined as the ability to recover quickly from difficulties. In the context of this article, we are looking for ways to make our microservice capable of resisting the failures of its downstream services.

We already established that the services we communicate with will fail from time to time. Then how can we improve the resilience of the application if the items it depends on will fail?

The obvious answer is to improve the downstream services to make sure they are always (as in 100% of the time) available. The problem is that chasing this goal will be cost-prohibitive and is likely to add too much complexity to the development and operation of your application. If you factor in that some dependencies are external to your company, and hence out of your control, this looks like an impossible goal.

The solution, then, is not only to try to improve the services themselves, but also to factor their failure into the design of the application. This is commonly referred to as establishing a graceful degradation strategy.

Graceful Degradation in Action

Adopting a graceful degradation strategy postulates that it is possible to provide the service for your customers even if certain parts of your application do not behave as expected. Let’s see how this works in a fictitious example.

Imagine we have an e-commerce solution. In normal operation mode, our customer visits the website and is offered a landing page that displays recommendations based on their preferences, the status of their recent orders, and their notifications, such as unread messages from our customer care agents or personal stylists.

Figure 1. Example application with contents coming from different sections.

Imagine that serving this landing page involves orchestrating 4 services:

  • Catalog: provides the information (images, description, prices) on any product
  • Personalization: provides an updated list of products that were picked based on user preferences
  • Order History: provides access to past orders, their status, and other details
  • Notification: provides a list of messages sent to the user

Figure 2. Decomposition of the application into the individual services.

Looking at the previous list, we can conclude that we can still serve our customers, albeit in a degraded mode, even if every service except the Catalog is out of commission. If we design our application to detect and tolerate those failures, our page would behave like the one illustrated in Figure 3.

Figure 3. The same application with reduced functionality.

The rationale is simple. In our scenario, if the Personalization service fails, we switch to displaying products from our catalog based on non-personalized criteria (for example, new releases). Our customers can still see and browse the catalog, which can still lead to new sales.

It is important to understand that in this case the customer is certainly losing functionality, and the business loses part of the conversion, but at least the customer can still interact with our application. The same principle applies to the other services, but with a different outcome. If the Notification or the Order History service is unavailable, don’t show an error page. Instead, craft a message indicating that we are experiencing an issue and prompt the user to try again.
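The degraded landing page can be sketched in a few lines. This is only an illustration under stated assumptions: the four fetch functions, the `LandingPage` type, and the wording of the warning message are hypothetical names introduced here, not part of any real SSENSE service. The Catalog is treated as mandatory, so its failure is allowed to propagate, while the other three services degrade.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LandingPage:
    products: List[str]
    personalized: bool
    orders: Optional[List[str]] = None         # None means "temporarily unavailable"
    notifications: Optional[List[str]] = None  # None means "temporarily unavailable"
    warnings: List[str] = field(default_factory=list)


def build_landing_page(
    get_recommendations: Callable[[], List[str]],  # Personalization (optional)
    get_new_releases: Callable[[], List[str]],     # Catalog (mandatory)
    get_orders: Callable[[], List[str]],           # Order History (optional)
    get_notifications: Callable[[], List[str]],    # Notification (optional)
) -> LandingPage:
    warnings: List[str] = []
    try:
        products, personalized = get_recommendations(), True
    except Exception:
        # Personalization is down: fall back to non-personalized catalog picks.
        # If the catalog call fails too, the error propagates to the caller.
        products, personalized = get_new_releases(), False

    orders = _optional(get_orders, "order history", warnings)
    notifications = _optional(get_notifications, "notifications", warnings)
    return LandingPage(products, personalized, orders, notifications, warnings)


def _optional(fetch, label, warnings):
    """Call an optional dependency; on failure record a friendly message."""
    try:
        return fetch()
    except Exception:
        warnings.append(
            f"We are having trouble loading your {label}. Please try again."
        )
        return None
```

The page template can then render `warnings` in place of the missing sections instead of showing an error page.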

Now that we know how graceful degradation can help, you should ask yourself: how can I benefit from it in my application? The answer is two-fold: first, review the design process to incorporate failures; second, apply industry patterns, such as circuit breakers, to detect and handle failures.

Incorporating Failures into the Design Process

To devise a graceful degradation strategy, you need to understand how your application works, focusing on four aspects: identifying the external dependencies, determining which ones could be made optional in the event of a failure, deciding what the fallback response should be for each optional dependency, and defining what actions should be taken once a failed dependency is reestablished.

Please keep in mind that this is a business and engineering collaborative endeavor. While engineering can help with identifying the external dependencies and the technical aspects of introducing the circuit breakers and triggering any compensating actions, it is the business’ responsibility to decide which dependencies could be optional and which fallback responses should be used.

The best way I have found to facilitate this collaboration is to produce a visual representation of the application flow. Lately, this has meant generating a Business Process Model and Notation (BPMN) diagram of the use cases of the application. One of the benefits of BPMN is that it is a non-technical artifact that, at least in my experience, can be easily created and understood by both business and engineering, and that helps to visually express the application’s expected behavior.

Let’s take our e-commerce example. Imagine we built the BPMN that describes the creation of our landing page.

Figure 4. The simplified flow of operations involved.

In the simplified version of the flow, seen in Figure 4, start by looking at the gateways (the diamond-shaped elements) and the communication that happens between swimlanes. Those will give you the decision points and the dependencies you have.

Figure 5. Dependencies and decisions highlighted in green.

The next step is to discuss with business stakeholders which dependencies could be optional and what to do in case of issues, and then to incorporate those decisions back into the diagram.

Figure 6. Adding a circuit breaker with a fallback mechanism.
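Once those decisions are made, it can help to record them next to the code as well as in the diagram. The sketch below is one possible way to do so, not a prescribed format: the `DependencyPolicy` type, the fallback descriptions, and especially the `on_recovery` values are hypothetical placeholders for what business and engineering would agree on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DependencyPolicy:
    name: str
    optional: bool                     # can we serve the page without it?
    fallback: Optional[str] = None     # agreed fallback response, if optional
    on_recovery: Optional[str] = None  # action once the dependency is back


# Illustrative values only; the real ones are a business decision.
POLICIES = [
    DependencyPolicy("catalog", optional=False),
    DependencyPolicy("personalization", True,
                     fallback="show new releases from the catalog",
                     on_recovery="resume personalized recommendations"),
    DependencyPolicy("order_history", True,
                     fallback="show a 'temporarily unavailable' message"),
    DependencyPolicy("notification", True,
                     fallback="show a 'temporarily unavailable' message"),
]


def mandatory(policies: List[DependencyPolicy]) -> List[str]:
    """Dependencies whose failure cannot be degraded gracefully."""
    return [p.name for p in policies if not p.optional]


def missing_fallbacks(policies: List[DependencyPolicy]) -> List[str]:
    """Optional dependencies that still have no agreed fallback."""
    return [p.name for p in policies if p.optional and not p.fallback]
```

A check such as `missing_fallbacks(POLICIES)` could run in a test suite so that a new optional dependency cannot be added without a fallback decision.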

If you are using Domain-Driven Design and have adopted Event Storming, you can use a similar approach, focusing instead on the policies and bounded context boundaries to help identify which parts can benefit from a fallback.

Circuit Breakers

After devising your graceful degradation strategy you have to incorporate it in your application. As seen in the previous section, every call to a service can fail and it is upon failure that you will enact your fallback solution to handle it gracefully.

The circuit breaker is a known pattern that can be used to help with the detection of problems with your dependencies. The way a circuit breaker works can be seen below.

Figure 7. Circuit breaker returns error immediately after the failure threshold is reached.

Figure 7 shows that when using the circuit breaker, all calls to a service are monitored, and if the failure threshold is reached, the circuit breaker opens and returns a failure immediately, without even trying to reach the service. On its own, this helps to avoid overwhelming a service that may already be struggling and makes your application react faster, but it does very little for graceful degradation per se.

A common approach to add this to your application code is to wrap, or decorate, the call to the dependency with a circuit breaker component. This decorator catches the errors reported (or thrown, depending on the language) when accessing the dependency and tracks the number of occurrences over time. If they exceed a predetermined threshold, the circuit breaker opens and stays open for a while.
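To make the mechanics concrete, here is a minimal, single-threaded sketch of such a component. It is a simplification under stated assumptions: it counts consecutive failures rather than a failure rate over a time window, and the class name, parameter names, and default values are made up for this example. In production you would more likely adopt an existing implementation.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,    # consecutive failures before opening
        reset_timeout: float = 30.0,   # seconds the circuit stays open
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After the timeout, the next call is let through as a trial.
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, func: Callable[[], T]) -> T:
        if self.is_open:
            # Fail fast without touching the struggling dependency.
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Injecting the `clock` is a design choice that keeps the breaker testable without sleeping; a caller would use it as `breaker.call(lambda: client.fetch(...))`, where `client.fetch` stands for whatever dependency call is being protected.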

An open circuit breaker is both a good signal and a good place to trigger the alternate behavior you want to provide. Following the same decorator approach, in our e-commerce example, the solution would look like Figure 8.

Figure 8. Wrapping the client for a service with a decorator to make it transparent to the upper layers of your application.
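A rough sketch of that wrapping, assuming a hypothetical `recommendations_for` method shared by the Personalization client and a Catalog-backed fallback, could look like the following. The class names and product identifiers are invented for illustration. Because the decorator exposes the same interface as the client it wraps, the upper layers do not need to know whether they received personalized or fallback results.

```python
from typing import List, Protocol


class RecommendationSource(Protocol):
    def recommendations_for(self, user_id: str) -> List[str]: ...


class FallbackRecommendations:
    """Decorator with the same interface as the client it wraps."""

    def __init__(self, primary: RecommendationSource, fallback: RecommendationSource):
        self._primary = primary
        self._fallback = fallback

    def recommendations_for(self, user_id: str) -> List[str]:
        try:
            return self._primary.recommendations_for(user_id)
        except Exception:
            # Covers a real failure as well as an error raised by an
            # open circuit breaker placed in front of the primary client.
            return self._fallback.recommendations_for(user_id)


# Stand-ins for real clients, used only to exercise the decorator.
class UnavailablePersonalization:
    def recommendations_for(self, user_id: str) -> List[str]:
        raise ConnectionError("personalization service is unreachable")


class NewReleases:
    def recommendations_for(self, user_id: str) -> List[str]:
        return ["new-release-1", "new-release-2"]


client = FallbackRecommendations(UnavailablePersonalization(), NewReleases())
products = client.recommendations_for("user-42")  # served from the fallback
```

Catching every `Exception` keeps the sketch short; a real decorator would probably narrow this to the errors that signal an unavailable dependency.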

If you are using a service mesh, your application is already relying on proxies to establish the communication between the services needed to deliver the expected functionality.

Figure 9. Service mesh illustrating the communication happening between the services via the proxy.

In this scenario, you can add graceful degradation in a transparent way, at least as far as the application code is concerned. Launch a sidecar containing the circuit breaker and fallback logic, and configure your application to use this sidecar.

Figure 10. A sidecar is deployed and it handles the circuit breaker and fallback without the application having to be updated.

Summing things up

As we have seen, looking at your application from the traditional availability and scalability angles is not enough to cope with the fact that your dependencies will fail; we need to evolve our approach towards the always-on goal. In a scenario where microservice dependencies are only growing, a graceful degradation strategy will only increase in importance and potential impact.

Creating your graceful degradation strategy is not necessarily a complex endeavor, but it requires some mindset adjustment, as it is best done at design time and needs the involvement of both business and engineering.

Circuit breakers are your friends, and using them provides the added benefit of giving faulty microservices a greater chance to recover more quickly. The best part is that there are many implementations out there that can reduce the effort needed to adopt them.

Now it is up to you to start incorporating graceful degradation into your next project.

Editorial reviews by Deanna Chow, Liela Touré, & Mikhail Levkovsky.
