Building Resilient Applications: Introduction


What is resilience? If we look up the definition in the dictionary (Google), we find:

In physics: the ability of a substance or object to spring back into shape; elasticity.
Figuratively: the capacity to recover quickly from difficulties; toughness.

In this series of posts we will deal with resilience as:

the ability of the application to recover from failures and keep running.

We say that:

  • In the best scenario: the application recovers without the user or client noticing; and
  • In the worst scenario: the application offers its services in a limited way (what we call graceful degradation).


Faults in distributed systems can be classified by their duration as:

  • Transient: occur once and disappear quickly. If you retry the operation, it will probably succeed;
  • Intermittent: transient failures that occur repeatedly, appearing and disappearing seemingly at will; and
  • Permanent: failures that remain until some action is taken, such as replacing equipment or fixing the software.
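Transient faults are the ones worth retrying: by definition, repeating the operation is likely to succeed. The sketch below is a minimal illustration of retry with exponential backoff; the `flaky` operation and all names are hypothetical, invented just to simulate a fault that disappears after two attempts.

```python
import random
import time

def retry(operation, attempts=3, base_delay=0.1):
    """Retry an operation that may hit a transient fault.

    Retrying only makes sense for transient faults; a permanent
    fault will exhaust the attempts and the error is re-raised.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except IOError:
            if attempt == attempts - 1:
                raise  # still failing after all attempts: give up
            # Exponential backoff with jitter, so many clients do not
            # hammer the struggling dependency in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.random())

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient network error")
    return "ok"

print(retry(flaky))  # prints "ok" after two transient failures
```

Note that the same retry loop applied to a permanent fault only delays the inevitable error, which is why classifying the fault matters before choosing how to treat it.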

We can also classify these failures according to their behavior:

  • Crash: the resource has stopped working due to a crash or loss of internal state;
  • Omission failure: the resource does not respond to requests;
  • Timing failure: responses arrive outside the expected timing; and
  • Byzantine or arbitrary: the resource responds in a completely arbitrary way.
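A timing failure is easy to reproduce: the dependency does answer, but later than the caller can afford to wait. A minimal sketch of containing it with a timeout, using Python's standard `concurrent.futures` (the `slow_service` function and the 0.5 s budget are hypothetical):

```python
import concurrent.futures
import time

def slow_service():
    # Simulates a dependency with a timing failure: it answers,
    # but far later than the caller is willing to wait.
    time.sleep(2)
    return "late answer"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_service)
    try:
        # Enforce our own timing budget instead of waiting forever.
        result = future.result(timeout=0.5)
    except concurrent.futures.TimeoutError:
        # Graceful degradation: serve a limited answer rather than hang.
        result = "fallback"

print(result)  # prints "fallback"
```

Without the timeout, the caller would inherit the dependency's timing failure; with it, the failure is turned into a controlled, degraded response.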

Sometimes when we design and develop our systems, we do not take into account that they can fail. Peter Deutsch and James Gosling listed eight assumptions that are often made in distributed systems projects and that, in the long run, prove wrong and cause problems (the 8 Fallacies of Distributed Computing):

  • The network is reliable;
  • Latency is zero;
  • Bandwidth is infinite;
  • The network is secure;
  • Topology doesn’t change;
  • There is one administrator;
  • Transport cost is zero; and
  • The network is homogeneous.


It is increasingly common to build applications on a microservices architecture, or to migrate monoliths to one. The microservices architecture makes interesting promises, and among them is favoring high availability.

However, a microservices architecture with modeling mistakes, in which many HTTP calls are chained synchronously, can lead to the opposite scenario.

Look at the following picture and scenarios; they are simplistic, but they serve to illustrate the point.

  • Monolithic: a single server, with an estimated availability of 99.5%, hosts the monolithic system; and
  • Microservices: 5 servers, each also with 99.5% availability. In this architecture, however, one service makes chained, synchronous calls to the other services, adding points of failure, so the availability of the system is the product of the availabilities of all the services involved: 0.995^5 ≈ 97.5%, lower than in the monolithic scenario.
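The arithmetic behind the second scenario: when a request only succeeds if every service in the chain responds, the availabilities multiply. A quick check:

```python
# Availability of a synchronous chain is the product of the
# availabilities of every service the request passes through.
per_service = 0.995          # 99.5% for each server

monolith = per_service       # one server in the request path
chained = per_service ** 5   # five chained services

print(f"monolith:      {monolith:.1%}")   # prints "monolith:      99.5%"
print(f"microservices: {chained:.2%}")    # prints "microservices: 97.52%"
```

Each extra synchronous hop compounds the loss, which is why long chains of blocking calls erode availability even when every individual service looks highly available.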

In microservices, failures can be of many types, from hardware failures to containers being moved between nodes.

The question some may ask is: even with high availability like in our scenarios, can failures really happen? Should I worry about that?

Murphy’s Law: Anything that can go wrong will go wrong!

So, it’s not a question of “whether failures will happen” but rather “when they will happen.”

The book “.NET Microservices: Architecture for Containerized .NET Applications” states:

Intermittent failure is guaranteed in a distributed and cloud-based system, even if every dependency itself has excellent availability. It’s a fact you need to consider.

In conclusion, being resilient means accepting that failures will happen and handling them in the best possible way.

In the following articles we’ll look at software patterns to build resilient applications.

Leave your feedback and follow the series.