How chaos engineering increases your competitiveness

Resilient software design lets you build high-quality applications

Proofdock.io
Proofdock
3 min read · Jul 9, 2020


Photo by Brian Mann on Unsplash

Our Proofdock team is enthusiastic about it. Google, Amazon, Facebook, Apple, and Microsoft are doing it. We’re referring to chaos engineering — a technique designed to detect failures in software and increase its reliability.

The goal of software reliability is not to make money but to avoid losing it. Outages, network errors or buggy services can easily introduce faults into your application. If not handled correctly, these faults may make your application unavailable, causing customer dissatisfaction at the very least.

Recent events related to the COVID-19 outbreak showed that such turbulent conditions are not just an engineer’s bad dream. Xbox experienced server outages, and YouTube and other streaming services were forced to decrease video resolution to save global bandwidth and prevent a complete network collapse.

COVID-19 has taught the whole world a lesson. Your aim should be to design and build resilient applications that are able to withstand unexpected and turbulent conditions, and continue to function at an acceptable service level.

Everything fails all the time — Werner Vogels

Nowadays, software is not just a simple standalone piece of code. Instead, we’re dealing with complex meshes deployed on distributed environments that interact with each other. This type of complexity originates from novel architectural approaches and easy-to-provision cloud infrastructures. Human brains are no longer able to comprehend every single part of the mesh. To tackle this issue, engineers must start thinking about distributed services differently. Instead of taking “when service A, then service B” for granted, engineers should rather consider “when service A, then maybe service B”.
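To make the “maybe service B” mindset concrete, here is a minimal Python sketch. The functions `service_a` and `call_service_b` are hypothetical names invented for this illustration, and the outage is simulated rather than real. The point is only that service A treats a missing answer from service B as an expected case.

```python
def call_service_b(request):
    """Hypothetical downstream call; it always fails here to simulate an outage."""
    raise ConnectionError("service B is unreachable")


def service_a(request, downstream=call_service_b):
    """Handle a request in service A without assuming that service B answers."""
    try:
        details = downstream(request)
        return {"status": "ok", "details": details}
    except (ConnectionError, TimeoutError):
        # Service B did not answer: degrade gracefully instead of failing.
        return {"status": "degraded", "details": None}


result = service_a("order-42")
print(result["status"])
```

With the simulated outage, service A still returns a response, only a degraded one. Whether a degraded answer is acceptable depends on your application.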

How to win the game

Software reliability measures the percentage uptime of a specified system when only downtime caused by faults is counted. Availability also measures percentage uptime, but it counts downtime from both faults and intended actions, such as planned maintenance.

It is possible for one application, say R, to be more reliable but less available than another, say A. This would happen, for example, if the more reliable application R needed several maintenance upgrades that caused significant downtime. Hidden behind this simple scenario is a serious question: how can applications achieve both high reliability and high availability?
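The scenario can be checked with a little arithmetic. The Python sketch below follows the two definitions given here; the observation window and all downtime figures are made up purely for illustration and do not describe any real system.

```python
def uptime_percentage(total_hours, downtime_hours):
    """Share of the observation window during which the system was up."""
    return 100.0 * (total_hours - downtime_hours) / total_hours


def reliability(total_hours, fault_hours):
    """Uptime when only downtime caused by faults is counted."""
    return uptime_percentage(total_hours, fault_hours)


def availability(total_hours, fault_hours, maintenance_hours):
    """Uptime when faults and planned maintenance both count as downtime."""
    return uptime_percentage(total_hours, fault_hours + maintenance_hours)


WINDOW = 1000  # hours observed (illustrative figure)

# Application R: few faults, but a lot of planned maintenance.
r_reliability = reliability(WINDOW, fault_hours=1)
r_availability = availability(WINDOW, fault_hours=1, maintenance_hours=20)

# Application A: more faults, but no planned maintenance.
a_reliability = reliability(WINDOW, fault_hours=5)
a_availability = availability(WINDOW, fault_hours=5, maintenance_hours=0)

print(r_reliability > a_reliability)    # R is more reliable than A
print(r_availability < a_availability)  # but R is less available than A
```

With these invented numbers, R loses less time to faults than A, yet its maintenance windows push its overall uptime below A's.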

Resiliency is about being able to adapt to stress or faults in order to prevent failures. Being resilient is important, because no matter how well a system is engineered, entropy will sooner or later conspire to disrupt the system. Residual defects in software or hardware will eventually cause the system to fail while trying to perform its required function.

Resilient design supports detection, response or recovery. An application may therefore be resilient in some ways, but not in others. System A might be more resilient in detecting certain adverse events than system B, and vice versa. The process of shifting applications from reliable to resilient has recently been trending in software design.

Don’t repeat yourself. Repeat yourself!

A simple example of resilient design is the redundant deployment of your application. Redundancy duplicates your application’s components in order to increase the overall availability. If one component breaks, the redundant component of your application takes over. Your application will continue to serve your customer workloads and keep your business running without loss of money or reputation.
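The takeover behaviour can be sketched in a few lines of Python. This is a simplified illustration, not a production failover mechanism: `serve_with_failover`, `broken_replica` and `healthy_replica` are hypothetical names, and the replicas are plain functions standing in for deployed instances.

```python
def serve_with_failover(request, replicas):
    """Try each redundant replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")


def broken_replica(request):
    """Stand-in for a deployment that is currently down."""
    raise ConnectionError("primary is down")


def healthy_replica(request):
    """Stand-in for the redundant deployment that takes over."""
    return f"handled {request}"


response = serve_with_failover("order-42", [broken_replica, healthy_replica])
print(response)
```

The request is still served even though the first replica is broken. In a real deployment this job is usually done by a load balancer or the platform, and redundancy only helps if the replicas do not all fail for the same reason.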

Conclusion

  • Quality: Chaos engineering improves your application’s reliability in order to maintain quality during turbulent situations.
  • Value: Resiliency is essential for product supremacy. Focusing solely on features is not enough these days, as low-quality products are easily rejected by the market.
  • Design: Resiliency does not come for free. Resilient systems are an essential component of reliable services.
  • No money loss: Resilient software design helps your application withstand turbulent situations and continue serving demanding customer workloads.

We are Proofdock, a software tech company located in Germany helping engineers build more resilient and reliable software products. Check out the Chaos Engineering Platform for Microsoft Azure and explore your system.
