Using Chaos Engineering to Generate Resilient Solutions

Published in gft-engineering · 4 min read · Nov 29, 2021

Most companies are realizing the benefits of digital transformation by adopting cloud solutions, which offer high scalability and room for fast growth. However, this advantage can become a disadvantage if resources start to fail at scale and the damage spreads to everyone who depends on them.

Chaos is a state of complete disorder and confusion. How can this concept, combined with engineering, offer ideas and validations for preventing the failure of our solutions?

Nowadays, during the solution architecture process, especially in the cloud, it is extremely common for the final design to include resources and techniques that ensure the resilience of applications against possible failures. By definition, resilience is the ability to recover from failures and keep everything working: its main goal is to return an application to a fully functional state after a failure.

Imagine that there is a way to deliberately expose an application or a solution to failures that may occur in the real world and, at the same time, validate its resilience. This practice is known as Chaos Engineering. Its purpose is to test a system’s resilience by conducting experiments that inject faults and simulate real-world conditions.

The concept is similar to that of a vaccine: injecting a harmless part of a virus provides immunity without causing harm, and injecting small, controlled failures does the same for your system. One of the main benefits of this method is identifying single points of failure in the architecture. Because the experiments are proactive, they reveal failures and also suggest how to resolve them.

Analyzing the effects of failures in the real world is another great benefit. We don’t know exactly how a solution behaves in specific situations, and running controlled simulations helps us understand that behavior in different scenarios.
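A controlled simulation of this kind can be sketched in a few lines of Python. This is only an illustration, not how any particular chaos tool works: the names `with_fault_injection`, `call_with_retry`, and `run_experiment` are hypothetical, the dependency is a stand-in function, and retrying is assumed to be the resilience mechanism under test.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of its calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call func up to `attempts` times; return None if every attempt fails."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return None


def run_experiment(failure_rate, attempts, calls=1000, seed=42):
    """Return the fraction of calls that succeeded while faults were injected."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_fault_injection(lambda: "ok", failure_rate, rng)
    successes = sum(call_with_retry(flaky, attempts) == "ok" for _ in range(calls))
    return successes / calls
```

Comparing `run_experiment(failure_rate=0.3, attempts=1)` with `run_experiment(failure_rate=0.3, attempts=3)` shows how much the retry policy improves the success rate under the same injected fault.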

Every system must have excellent observability by default. Chaos Engineering requires a monitoring and alerting mechanism to be in place, since an experiment is only useful if its effects can be measured, and building that mechanism improves observability in turn.

Even microservices depend heavily on other systems to deliver a result. Most of the time we can isolate a single system to verify its resilience. However, understanding how failures propagate between components requires a thorough understanding of all possible behaviors, and Chaos Engineering experiments provide a better view of these scenarios.
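To make the idea of failure propagation concrete, here is a rough sketch that computes which services could be affected when one component fails. The service names and the `DEPENDS_ON` map are invented for illustration, and the model pessimistically assumes that no service has a fallback.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["orders", "catalog"],
    "orders": ["payments", "database"],
    "catalog": ["database"],
    "payments": [],
    "database": [],
}


def blast_radius(depends_on, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the edges so we can walk from a dependency to its callers.
    callers = {}
    for service, deps in depends_on.items():
        callers.setdefault(service, [])
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

In this toy map, `blast_radius(DEPENDS_ON, "database")` reaches `orders`, `catalog`, and `web`. A chaos experiment is what tells you whether the real system behaves like this worst-case model or whether timeouts, caches, and fallbacks actually contain the failure.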

Chaos Engineering is a way to make the behavior of solutions more predictable, and it proactively fosters a culture that accepts error as part of the learning process.

Microsoft recently announced the public preview of Azure Chaos Studio, a fully managed chaos engineering experimentation platform to accelerate the discovery of unique failure situations.

On paper, at least, we can design extremely resilient environments, but the truth is that we can only verify this resilience when a failure actually happens. With Azure Chaos Studio, we can create various simulations such as high customer traffic, machines at 100% CPU usage, memory shortage, and services that are interrupted or degraded by a problem, among many others. These experiments provide results and analysis that can inform strategies and actions to mitigate failures or validate the reliability of the solutions.

With Azure Chaos Studio, experiments are simple to customize: the intuitive interface helps with organization and provides a continuously expanding library of failure scenarios.

At each step of the experiment, you can use two types of fault injection: agent-based faults, which require installing an agent inside a virtual machine, and service-direct faults, which act on a resource through direct service calls such as API calls. Another appealing feature is that each experiment is stored in JSON format, which makes it much easier to manage, understand, and update systematically.
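To show why a JSON representation is convenient, the sketch below builds a two-step experiment as a plain Python dictionary and serializes it. The field names (`steps`, `actions`, `injection`, `duration_minutes`), the fault names, and the target identifier are illustrative assumptions, not the actual Azure Chaos Studio schema, which should be taken from the official documentation.

```python
import json


def make_experiment(name, target_id):
    """Build an illustrative experiment definition (hypothetical schema)."""
    return {
        "name": name,
        "steps": [
            {
                "name": "step-1-cpu-pressure",
                "actions": [
                    {
                        "fault": "cpu-pressure",   # hypothetical fault name
                        "injection": "agent",      # runs via an agent inside the VM
                        "target": target_id,
                        "duration_minutes": 10,
                        "parameters": {"pressure_percent": 100},
                    }
                ],
            },
            {
                "name": "step-2-shutdown",
                "actions": [
                    {
                        "fault": "vm-shutdown",         # hypothetical fault name
                        "injection": "service-direct",  # acts through a service call
                        "target": target_id,
                        "duration_minutes": 5,
                        "parameters": {},
                    }
                ],
            },
        ],
    }


def total_action_minutes(experiment):
    """Sum the duration of every action across all steps."""
    return sum(
        action["duration_minutes"]
        for step in experiment["steps"]
        for action in step["actions"]
    )


def to_json(experiment):
    """Serialize the experiment so it can be versioned and reviewed like code."""
    return json.dumps(experiment, indent=2, sort_keys=True)
```

Because the definition is ordinary data, it can be stored in source control, compared in code review, and inspected by small helpers such as `total_action_minutes`, which returns 15 for the experiment above.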

Every experiment can be attached to a CI/CD pipeline to validate integration or stress tests. Injecting failures during this process can help locate problems that the usual tests would not normally find.
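One way a pipeline might act on an experiment is with a simple gate that compares the measurements collected during fault injection against agreed limits. The sketch below assumes the metrics arrive as a dictionary with `error_rate` and `p95_latency_ms` keys; those keys, the default thresholds, and the function names are hypothetical choices for illustration.

```python
def evaluate(results, max_error_rate=0.01, max_p95_ms=500):
    """Return a list of violations; an empty list means the gate passes."""
    violations = []
    if results["error_rate"] > max_error_rate:
        violations.append(
            f"error rate {results['error_rate']:.2%} exceeds {max_error_rate:.2%}"
        )
    if results["p95_latency_ms"] > max_p95_ms:
        violations.append(
            f"p95 latency {results['p95_latency_ms']} ms exceeds {max_p95_ms} ms"
        )
    return violations


def gate_exit_code(results):
    """Print any violations and return a process exit code (0 = pass, 1 = fail)."""
    violations = evaluate(results)
    for violation in violations:
        print(f"FAIL: {violation}")
    return 1 if violations else 0
```

A pipeline step could pass the value of `gate_exit_code(...)` to `sys.exit`, so that a build which degrades too much under injected faults is stopped before it reaches production.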

Another advantage at the time of writing is that experiments in Azure Chaos Studio are free to run until April 2022. After that date, you will be charged based on how long your experiment actions run.

It is fundamental to the quality of our solutions that possible failures and unexpected behaviors are identified and resolved before they manifest. Investing in the resilience of the architecture is what allows us to deliver value to customers efficiently and effectively.

The more scalable and complex the solution, the greater the responsibility for keeping it stable regardless of the external conditions that may affect it. Chaos Engineering helps make solutions more predictable and proactively improves their reliability.

References

[Azure Chaos Studio]

https://azure.microsoft.com/services/chaos-studio/

Written by Marcelo Goberto Azevedo