Improve availability and resilience of your Microservices using these seven cloud design patterns

Amine Benhallouk
Published in ASOS Tech Blog
10 min read · May 12, 2018

As this is my first post, I want to share with you seven design patterns used today in large, modern, scalable systems. Each pattern will improve the availability and resilience of your components. You can also find more details and in-depth samples on my GitHub, so let's get started!

1. Circuit Breaker

Modern apps have many dependencies, some of which are external to the application. The application itself may also have many components, each of which may use further external dependencies.

This can increase the chance of your application suffering from faults, either:

  • Transient faults, where the fault resolves itself in a matter of seconds
  • Non-transient faults, where the app will suffer for minutes, if not hours

Having this component or caller keep retrying the call to the 'do' operation, or the API request, will waste your cloud resources; waiting for a timeout and handling it wastes even more. This is where the Circuit Breaker Pattern comes into play: it resolves the problem by shielding your application against non-transient faults. Here's how it works.

You create an intermediate component, let's call it CircuitBreaker, that sits between the Caller and the Endpoint and acts in the following ways:

  • In the stable state, the Caller sends a request to the CircuitBreaker, which delegates it to the Endpoint, and the response is returned all the way back to the Caller. At this point the circuit is in the closed state.
  • In the faulty state, the Caller sends a request to the CircuitBreaker, which delegates it to the Endpoint; when the call fails with a timeout exception, that failure is returned all the way back to the Caller and the circuit switches from closed to open. While open, the CircuitBreaker fails fast rather than calling the Endpoint.
  • During the recovery state, after some time has passed, the circuit switches to half open and the CircuitBreaker lets the next request from the Caller through to the Endpoint as a trial; the response is returned all the way back to the Caller.
  • Finally, when the Endpoint is stable again, the trial request succeeds, the response is returned all the way back to the Caller and the circuit switches back to the closed state.

This pattern helps ensure that resources aren't wasted on calls that are likely to fail. Note that this pattern is often used with the Retry Pattern and can be implemented so that it switches state based on the percentage of faulty requests, the number of failures, or the types of errors. Stability can also be determined by checking a specific 'health check endpoint' using the pub/sub pattern or the event emitter pattern. You can therefore make this pattern as smart as you want; however, here are a few things to consider:

  • The Caller must know what to do with each exception thrown by the CircuitBreaker
  • The CircuitBreaker must be designed so that it examines failures and changes strategy based on them. It should log these errors and allow a manual state change in case admins want to force it. It should also be thread safe, support async operations and act as a facade for the Endpoint.
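To make the state machine above concrete, here is a minimal sketch of such a CircuitBreaker in Python. The class, the thresholds and the wrapped endpoint call are illustrative assumptions rather than a production implementation; as noted above, a real one would also need logging, thread safety and async support.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and calls are short-circuited."""

class CircuitBreaker:
    def __init__(self, endpoint_call, failure_threshold=3, recovery_timeout=30):
        self.endpoint_call = endpoint_call          # the protected operation
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds before half-open
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, *args, **kwargs):
        if self.state == "open":
            # After the recovery timeout, allow one trial request (half-open).
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = self.endpoint_call(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success: close the circuit and reset the failure count.
        self.failure_count = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self.failure_count += 1
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

A Caller would wrap its Endpoint calls in breaker.call(...) and handle CircuitOpenError explicitly, as the first consideration above requires.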

2. Compensating Transaction Pattern

Large applications will have distributed processes or microservices. Each of these components does a specific job, but what if a fault happens in one of them? You would have to undo the work already done to keep the entire application consistent, and the Compensating Transaction Pattern helps you do just that.

Let’s take this scenario: imagine that we have a distributed application process that helps users purchase a domain name. You can imagine the process in this way:

(1) Register User --->
(2) Acquire a Domain --->
(3) Choose a Plan --->
(4) Issue the payment --->
(5) Receive confirmation email

The process cannot be completed if any of these steps fail. For example, if the user is not registered, they have no identity and cannot log in to the admin panel to use their domain, and the same goes for every step. When such a situation happens, it is important to revert the state so that the process can work properly. In this example, you would either retry the missing step, or revert all the other steps, so that you do not end up with a domain acquired by a user who cannot use it, or a payment issued for a service that has not been provided. Reverting the steps can be very hard, but the Compensating Transaction Pattern makes it more manageable. To implement this pattern, you will need:

  • A global, shared storage that stores the process state
  • To create a compensating transaction for each step to revert it (for example, a 'delete the user' step to compensate for registering a user)
  • To inform admins as some steps cannot be reverted automatically.

The compensating transaction will use the store to determine what state/step needs to be reverted. Here is the compensating transaction for our example:

(1) Delete User --->
(2) Free up a Domain --->
(3) Delete User/Plan association --->
(4) Refund the payment --->
(5) Inform Admin
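As a rough illustration of the idea, the sketch below records each completed step in a state store and runs the matching compensating transaction in reverse order when a later step fails. The step and compensation functions are hypothetical placeholders; in a real system the state store would be shared, durable storage and some compensations would simply notify admins.

```python
def run_process(steps, compensations, state_store):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
            state_store[name] = "done"          # shared store tracks progress
    except Exception:
        for name in reversed(completed):
            try:
                compensations[name]()
                state_store[name] = "compensated"
            except Exception:
                # Some steps cannot be reverted automatically: flag for admins.
                state_store[name] = "needs-manual-compensation"
        raise

# Hypothetical steps for the domain-purchase example; payment fails on purpose.
def issue_payment():
    raise RuntimeError("payment provider rejected the card")

steps = [
    ("register_user", lambda: print("register user")),
    ("acquire_domain", lambda: print("acquire domain")),
    ("choose_plan", lambda: print("choose plan")),
    ("issue_payment", issue_payment),
    ("send_confirmation", lambda: print("send confirmation email")),
]
compensations = {
    "register_user": lambda: print("delete user"),
    "acquire_domain": lambda: print("free up domain"),
    "choose_plan": lambda: print("delete user/plan association"),
    "issue_payment": lambda: print("refund payment"),
    "send_confirmation": lambda: print("inform admin"),
}

state = {}
try:
    run_process(steps, compensations, state)
except RuntimeError:
    print(state)   # shows which steps were compensated
```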

3. Health Endpoint Monitoring Pattern

Applications that contain many sub-systems require advanced monitoring of the health of the application and its sub-systems. This design pattern provides an elegant way to achieve that.

Suppose you have a website that uses an SQL database, a search server, a cache store and a content delivery network. If you only have regular monitoring of the front-end website, the application may always return a status of 200, which indicates the web is fine. Behind the scenes, however, the cache may be failing so that stored items are no longer retrieved, or the CSS, images and files may all be retrieved directly from the server instead of being served by the CDN. In these cases regular monitoring tells us very little about the sub-systems. The Health Endpoint Monitoring Pattern will help you detect some of these internal issues, and it contains two parts:

  1. An endpoint exposed by the website (in this example) that checks all the sub-systems and returns their status:
  • Status of the website itself (it may also run functional tests)
  • Database status
  • Search status
  • Cache status
  • CDN status

2. A monitoring tool that checks:

  • The website itself, by calling this endpoint
  • SSL certificate validity and whether the site is secured
  • Latency and performance
  • Availability from different geographic locations (geo pinging)

This way, you can gather a lot of data and get a realistic view of the application's health. When designing this pattern, think about what information needs to be exposed by the endpoint and how it should be presented. Attackers may use this information, so it must be secured. Also think about what the monitoring tool will analyse, for example latency from different geo-locations.
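As a rough sketch, the health endpoint could aggregate the sub-system checks into a single response like this. The check functions are hypothetical placeholders for real probes (a cheap SQL query, a search ping, a cache round trip, a CDN fetch), and the endpoint would normally be wired into your web framework and secured.

```python
def check_database():
    # Hypothetical probe: run a cheap query such as SELECT 1 against SQL.
    return True

def check_search():
    return True

def check_cache():
    return True

def check_cdn():
    return True

def health_endpoint():
    """Aggregate sub-system checks into a single health report."""
    checks = {
        "website": True,          # could also run quick functional tests here
        "database": check_database(),
        "search": check_search(),
        "cache": check_cache(),
        "cdn": check_cdn(),
    }
    healthy = all(checks.values())
    status_code = 200 if healthy else 503
    return status_code, {"healthy": healthy, "checks": checks}

print(health_endpoint())
```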

4. Queue-based Load Levelling Pattern

When designing applications around concurrency, it's important to distribute the load on the services/processes that the application uses so that it remains available. The Queue-based Load Levelling Pattern helps a lot here by adding an intermediate queue that holds the messages until the right application services can process them.

Suppose you have a web application with an API backend. It would be very hard to guess the number of requests per second that the application will make to the backend API, as this depends on the users and what they try to do. Even worse, a burst of requests may overload the backend and make the entire application unavailable. Scaling doesn't really help either: we can't predict the right numbers to scale to, and always scaling to the maximum wastes resources. Either way, this is not the right direction to go in.

Luckily, the Queue-based Load Levelling Pattern will allow us to solve all these issues by placing a queue between the application and the backend API.

Application ----->   Queue  -----> API

This changes the application so that it sends messages to a Queue, which are then consumed by the API. The API communicates back to the application asynchronously when needed, depending on when the message is picked up. Some things to note:

  • Applications would have to create messages that fit in the Queue
  • Applications would have to encrypt the messages
  • When scaling there will be multiple queues, which means the application will need a reliable strategy to determine the right queue to use
  • To get data back asynchronously from the API, you can use a pub/sub mechanism such as webhooks, or an async task that reads from the same data source
  • Services have to consume the Queue either by subscribing to it or by fetching messages manually.
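Here is a minimal in-process sketch of the idea in Python, using the standard library queue as a stand-in for a real message broker (in the cloud this would typically be a managed queue service). The message shape and the worker are assumptions for illustration.

```python
import queue
import threading
import time

work_queue = queue.Queue()   # stand-in for a durable message queue

def application(request_id):
    # The application only enqueues a message; it never calls the API directly.
    work_queue.put({"request_id": request_id, "payload": "do-something"})

def api_worker():
    # The API consumes messages at its own pace, levelling the load.
    while True:
        message = work_queue.get()
        if message is None:      # sentinel to stop the worker
            break
        time.sleep(0.1)          # simulate processing
        print(f"processed {message['request_id']}")
        work_queue.task_done()

worker = threading.Thread(target=api_worker)
worker.start()

for i in range(20):              # burst of traffic from the application
    application(i)

work_queue.join()                # wait for the backlog to drain
work_queue.put(None)
worker.join()
```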

5. Retry Pattern

Applications often have dependencies that come in many forms: backend services, APIs or databases. Transient issues may happen for many reasons, such as a temporary network failure, which results in an exception at the time but resolves itself once the network is back. The problem is that the request has already failed, and if this happens frequently, or the application times out, it becomes unstable very quickly. The Retry Pattern addresses these transient (temporary) errors.

The pattern simply addresses the issue by retrying the transient failure a number of times, often with a back-off strategy that retries after an increasing amount of time. Things to note:

  • You should only retry transient failures
  • Design the retry policy to match the functionality
  • Be careful with chaining retries (i.e. a retried call that itself triggers further retries downstream), as the delays and load can multiply
  • Log all the retry attempts for further investigations.
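A minimal sketch of such a retry helper, with exponential back-off and a hypothetical TransientError type to make sure only transient failures are retried; the attempt limits and delays are illustrative assumptions.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""

def retry(operation, max_attempts=3, base_delay=0.5):
    """Retry an operation on transient failures, with exponential back-off."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as err:
            print(f"attempt {attempt} failed: {err}")   # log every attempt
            if attempt == max_attempts:
                raise                                    # give up, surface the error
            # Exponential back-off with a little jitter between attempts.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Usage (hypothetical): retry(lambda: call_backend_api())
```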

6. Throttling Pattern

Unpredictable traffic is always a problem for applications. Suppose your application's traffic suddenly increases: its resources will become insufficient, if not exhausted, over time. This results in users having a very poor experience, which you clearly want to avoid. Luckily, the Throttling Pattern helps you avoid that by introducing a Throttling layer.

The Throttling layer sits between the traffic and the application. Its main responsibility is to monitor the traffic and throttle it to a reasonable amount. Usually it's a separate layer, but its functionality can also be implemented as part of the application layer.

The Throttling layer has a strategy that it acts upon when certain users/tenants exceed their traffic allowance. It can deny the traffic and return an error code (which may be used to implement the Retry Pattern), filter certain IP addresses or traffic coming from certain clients, or reject requests whose payload exceeds a specified size. Things you need to remember:

  • Throttling needs to happen very fast
  • You need to monitor traffic constantly and in detail
  • Return specific error details that can be used to implement the Retry Pattern.
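As a rough sketch, a throttling layer could apply a per-client rolling-window check like the one below and return a 429 error code when the limit is exceeded. The limits and client identifiers are assumptions, and in a real system the counters would live in shared storage (such as a cache) rather than in process memory.

```python
import time
from collections import defaultdict

class Throttler:
    """Allow at most `limit` requests per client within a rolling window."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.requests = defaultdict(list)   # client_id -> request timestamps

    def allow(self, client_id):
        now = time.monotonic()
        # Drop timestamps that fall outside the rolling window.
        recent = [t for t in self.requests[client_id] if now - t < self.window]
        self.requests[client_id] = recent
        if len(recent) >= self.limit:
            return False, 429   # 429 Too Many Requests; client may retry later
        recent.append(now)
        return True, 200

throttler = Throttler(limit=5, window_seconds=1)
for i in range(8):
    print(i, throttler.allow("client-a"))
```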

7. External Configuration System Pattern

Cloud-based applications depend on many services. These all live in the cloud, so the application needs to know how to connect to and deal with them: connection strings, cache settings, search endpoints and so on. In .NET applications these are often stored in the web.config and deployed with the application. The problem starts when you have multiple instances of the same application or containers: it becomes hard to maintain each configuration when they use the same resources. For example, if a connection string changes, every web.config has to be updated. Ideally, we want the operations team or DevOps engineers to take care of secrets and everything that makes the configuration complex, while we only worry about application-level settings.

The External Configuration System Pattern separates the application from the configuration system and provides configuration as a service external to the application.

The configuration system contains a configuration store alongside a set of logic and strategies to deal with empty or non-existing configuration. It then exposes these settings as an API endpoint consumed by multiple applications, and it may return different settings for different applications. It's also recommended to have alternative storage, like an SQL database plus a cache layer, to enhance performance, as the system serves multiple applications. Other things to note:

  • Configuration systems must always be available
  • It needs to be performant
  • It must be secured and support authentication/authorisation as it stores secrets
  • It must have good error handling and strategies for non-existing configurations
  • It must be flexible
  • It must expose an endpoint to get settings by scope, organisation or environment
  • It must support typed formats (XML, JSON, key/value)
  • Only operations should have write access.
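Below is a minimal sketch of a client for such a configuration service, with a short-lived local cache and a fallback for unavailable or missing settings. The service URL, the /settings/{scope} endpoint shape and the default handling are purely illustrative assumptions.

```python
import json
import time
import urllib.request

class ExternalConfigClient:
    """Fetch settings from an external configuration service and cache them."""

    def __init__(self, base_url, ttl_seconds=60):
        self.base_url = base_url      # e.g. a hypothetical config service URL
        self.ttl = ttl_seconds
        self._cache = {}
        self._fetched_at = {}

    def get(self, scope, key, default=None):
        stale = (scope not in self._cache
                 or time.monotonic() - self._fetched_at[scope] > self.ttl)
        if stale:
            self._refresh(scope)
        return self._cache.get(scope, {}).get(key, default)

    def _refresh(self, scope):
        url = f"{self.base_url}/settings/{scope}"
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                self._cache[scope] = json.loads(response.read())
        except OSError:
            # Service unavailable: keep whatever was cached rather than failing hard.
            self._cache.setdefault(scope, {})
        self._fetched_at[scope] = time.monotonic()

# Hypothetical usage:
# config = ExternalConfigClient("https://config.example.com")
# connection_string = config.get("orders-api/production", "SqlConnectionString")
```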

So, these were the seven practices that I wanted to share with you. Let me know in the comments below which patterns you use the most, if you have any questions or if you need in-depth examples. I’m happy to dedicate future posts to the topic of your choice.
