Chaos Engineering like Sherlock Holmes

Dudu Hazal OK
Published in Skills Matter, Dec 21, 2018

Chaos engineering is a discipline that Netflix engineers introduced to the DevOps world. In short, it aims to build confidence in distributed systems, with a focus on increasing resilience and keeping services available. Many companies are migrating to cloud architectures because business needs are shifting: as new technologies become part of daily life, architectures have to be scalable, reliable and redundant.

Your code as a crime scene

DFDS is known as a shipping and logistics company, but it has large IT and digital departments and invests significantly in new technologies and innovation. DFDS is preparing for a future of autonomous trucks and vessels and is putting significant effort into becoming more data-driven. I am a fresh chaos engineer at DFDS, a company with a futuristic vision for technology.

We live in a time when information is very easy to access and almost free. In an established field, you find core information in books, digital files, online courses and discussions with experienced colleagues, but chaos engineering is so young that relevant information can be hard to find. There are some introductory blogs and books about the principles of chaos engineering, but practical examples and tutorials are scarce. When you start a new role in such a new discipline, there is often no one you can ask for advice. There are even misunderstandings to grapple with if your title includes the term ‘chaos’. The challenges therefore go beyond simply being new to a company. Migrating to the cloud already brings many cultural changes to software development, and while chaos engineering and embracing failure are among the most significant improvements in how we understand distributed systems, incorporating them into a business can be a challenge.

As a fresh chaos engineer without a computer science background, I needed a course where I could verify my understanding of chaos engineering and also pick up practical knowledge, so that I could apply the discipline and share its benefits with colleagues. I found the course “Fast Track to Chaos Engineering with Russ Miles” on the Skills Matter website and enrolled without thinking twice.

On the first day of the course, Russ introduced chaos engineering, its importance and its motivations. I found that my understanding of chaos engineering was correct, and I realised how important it is to interact with a teacher in order to validate what you have learned. Russ was very friendly and put a lot of emphasis on the social side of chaos engineering. He also encouraged us to be active in the chaos engineering community. By the end of the day, we had ideas on how to structure, design and apply our experiments. We understood the importance of running experiments in a production environment, the concept of ‘Blast Radius’, and the need to act responsibly when experimenting in production. In addition, we learned that ‘Chaos Engineer’ is more than a job title: chaos engineering is a culture that should be considered from the early phases of software development.

Day 2 and Day 3 were more hands-on. We presented our system architectures in detail and discussed how to design chaos experiments at the platform level, at the application level and across multiple levels (such as game days, where we see human bugs). We used Chaos Toolkit and learned more about its probes and about automating experiments. I also learned that chaos engineering can be applied to monolithic architectures if needed, which surprised me.
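To make the idea of probes and automated experiments a little more concrete, here is a minimal Python sketch of how such an experiment can be structured: check a steady state, inject a failure, check again, and always roll back. This is my own illustration and not the Chaos Toolkit API. The `Experiment` class, the in-memory `replicas` dictionary and all function names are hypothetical stand-ins for a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical minimal chaos experiment (not the Chaos Toolkit API)."""
    title: str
    probe: Callable[[], bool]      # returns True when the system looks healthy
    action: Callable[[], None]     # injects the failure
    rollback: Callable[[], None]   # undoes the failure
    journal: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Verify the steady state before touching anything.
        if not self.probe():
            self.journal.append("steady state not met; aborted")
            return False
        try:
            self.action()
            self.journal.append("action applied")
            # Re-check the steady state while the failure is active.
            held = self.probe()
            self.journal.append(
                "steady state held" if held else "steady state deviated"
            )
            return held
        finally:
            # Always restore the system, whatever the outcome.
            self.rollback()
            self.journal.append("rollback applied")


# Toy "system": two replicas of one service, kept in memory (assumption).
replicas = {"a": True, "b": True}


def service_available() -> bool:
    # Probe: the service answers as long as one replica is up.
    return any(replicas.values())


def stop_replica_a() -> None:
    # Action: simulate losing one replica (a small blast radius).
    replicas["a"] = False


def restart_replica_a() -> None:
    # Rollback: bring the replica back.
    replicas["a"] = True


experiment = Experiment(
    title="Service stays available when one replica stops",
    probe=service_available,
    action=stop_replica_a,
    rollback=restart_replica_a,
)
result = experiment.run()
```

In this toy setup the hypothesis holds because the second replica keeps the service available. Stopping both replicas in the action would make the probe fail, and the journal would record the deviation before the rollback runs.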

We discussed new cloud technologies and their use cases, and brainstormed a business-oriented approach to designing experiments. Understanding business needs and user behaviour helps us ask the right questions when designing experiments and discover unknown areas. We also talked about the need for agility and how chaos engineering can help introduce continuous learning and improvement into software development. Learning from a system's past failures to find areas for improvement is a sensible route to a resilient system, so converting chaos experiments into chaos tests afterwards helps us towards this goal. Finally, we discussed the negative impact of fast development on reliability, the importance of non-functional complexities, and how chaos engineering contributes to the safety of the system.

Sherlocking in Chaos

I had the opportunity to listen to Adam Tornhill’s talk “Your code as a crime scene” at the DFDS IT Conference. He also spoke a lot about the social side of development processes and, as the title suggests, he describes code as a crime scene. In the course, Russ called chaos engineers the ‘Sherlock Holmes of the distributed systems’. I really like these expressions because they stress the importance of asking the right questions and finding root causes. It is important to understand that resilience is the result of continuous learning and that no system is 100 percent resilient. By practising chaos engineering, we can achieve highly resilient systems while recognising that there is always a ‘better’ to strive towards.
