Creating Chaos — The Importance of Chaos Engineering

Published in SEEK blog, Sep 20, 2020

Thanks to one Maxwell Smart, “Kaos” was all I knew in my childhood. Back then “Kaos” was the villainous organisation responsible for upsetting world order and our hero, Agent 86, bumbled his way to thwarting those doomsday efforts.

Now that I have shown my age…

Chaos means a very different thing to me and what I do for a living nowadays. Principally, the aim is not too different from the famed 1960s sitcom alluded to earlier. Infiltrate, disrupt and destroy.

Chaos Engineering is the practice of deliberately attempting to access, potentially alter and bring to a grinding halt services and applications, in order to surface specific vulnerabilities or architectural design flaws in a production environment.

No one is perfect. Because humans are responsible for the ideation, creation and implementation of applications, infrastructure and networks, those systems are inherently vulnerable to mistakes and misconfigurations. Some are obvious, others not so. With Chaos Engineering, the idea is to highlight the obvious vulnerabilities but, more importantly, to bring to light the not-so-obvious shortcomings.

Netflix — Chaos Engineering pioneer

You cannot utter the term “Chaos Engineering” without acknowledging the pioneer of such a practice: Netflix.

The toolset called “Chaos Monkey” was born. A “simian army” of unruly primates entering your production system, throwing bananas and creating havoc. This was the premise for the practice of Chaos Engineering at Netflix, introduced by Greg Orzell. The aim was to flip on its head the idea that developers assume the circumstances in which an application or service fails, and instead put it in the hands of the monkeys to highlight what fails.

Kill. Destroy. Maim. Violent terms, but this is exactly the intent of the monkeys and Chaos Engineering.

Mean motor scooters… (still courtesy of XpertThief)

Knowing the unknowns

Channeling my Dr. Seuss: you know the knowns, but you don’t know the unknowns. Pretty straightforward, but unknowns can be the difference between a business thriving or tanking. The competitive advantage is to know as many of those unknowns as possible.

Pre-flight testing, smoke testing, unit testing. We make assumptions on all sorts of scenarios about how users will attempt to interact with our applications and services. You expect user-entered input to be validated. You expect the user process to flow to the tune of your design. You expect systems to pause and recover when they can no longer facilitate a process. But what if we add the randomness of a coupled system no longer being there when you expect it? And the underlying network link gone AWOL too? Or a queue that has decided it’s taking annual leave a day early? Ideally, you want to be able to identify as many of these unknowns as possible by effectively killing what you can in production.
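To make the idea of a coupled system vanishing at random a little more concrete, here is a minimal sketch of fault injection in Python. It is not how Chaos Monkey or any other tool works internally; the names `with_chaos`, `DependencyUnavailable`, `fetch_profile` and `load_profile` are all hypothetical and exist only for illustration.

```python
import random


class DependencyUnavailable(Exception):
    """Raised when the injected fault simulates a missing dependency."""


def with_chaos(call, failure_rate, rng=None):
    """Wrap `call` so it fails at random with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always does.
        if rng.random() < failure_rate:
            raise DependencyUnavailable("injected fault: dependency is gone")
        return call(*args, **kwargs)

    return wrapped


def fetch_profile():
    # Hypothetical downstream call; stands in for a real service or queue.
    return {"name": "Agent 86"}


def load_profile(fetch):
    """Caller under test: degrade to an empty profile rather than crash."""
    try:
        return fetch()
    except DependencyUnavailable:
        return {}
```

Wrapping `fetch_profile` with a `failure_rate` of, say, 0.2 and exercising `load_profile` shows quickly whether the caller degrades gracefully or falls over when the dependency disappears.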

But wait, I don’t want to take production down intentionally

Great, so now this has you and your developers thinking more about building resilience into the applications and services they build and deploy. The principle of Chaos Engineering is to drive better quality: more resilient systems that will pass the test of screaming monkey carnage. It will also deliver a better quality service for your customers and give your organisation peace of mind about the redundancy status of your architecture.

Developers don’t want to be attending post-production incident reviews where the “pucker” factor is high. They don’t want that finger of shame pointed at them because they haven’t factored in a retry-wait somewhere, or because they couldn’t be bothered to consider that there are nasty souls out there ready to inject code and probe for vulnerabilities in your infrastructure at every form post.
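For anyone wondering what a retry-wait looks like in practice, here is one possible sketch in Python, using exponential backoff. The function name `call_with_retry`, its parameters and the choice of `ConnectionError` as the retryable failure are assumptions made for illustration; real services will need their own limits and error types.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, waiting and retrying on failure with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so that the waiting behaviour can be checked in a test without actually pausing.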

Can you take production systems down with this practice? Heck yeah, but a chaos program should always be planned. The tools used should make it easy to roll the program back, restoring production services and returning the monkeys to their cages. The trade-off is that the business (and its engineers) will have considerably more understanding of their shortcomings in terms of reliability, along with solid, actionable targets for addressing those problems. The end result is better service deliverability.
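The planned, reversible nature of a chaos program can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the design of any particular tool: `chaos_experiment`, `stop_queue`, `start_queue` and the in-memory `services` dictionary are all hypothetical stand-ins for real infrastructure.

```python
from contextlib import contextmanager


@contextmanager
def chaos_experiment(inject, rollback):
    """Run a planned fault and always roll it back, even if a check fails."""
    inject()
    try:
        yield
    finally:
        rollback()


# Hypothetical in-memory "production" state, for illustration only.
services = {"queue": "up"}


def stop_queue():
    services["queue"] = "down"


def start_queue():
    services["queue"] = "up"


def run_experiment(check_steady_state):
    """Return the result of the health check taken while the fault is active."""
    with chaos_experiment(stop_queue, start_queue):
        return check_steady_state()
```

Because the rollback sits in a `finally` block, the queue is restored whether the steady-state check passes, fails or raises an exception.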

Making better engineers

I am a huge advocate of Chaos Engineering because it’s a great opportunity for DevOps and Software Engineers alike. They can think more holistically about the redundancy and resilience mechanisms factored into their infrastructure, applications and services. They earn ownership of these through their practical efforts, which makes them more well-rounded engineers.

Culturally, it’s hugely advantageous for any organisation to have engineers who incorporate and plan for failure. Life is not perfect, after all. With this principle front of mind, it can have really positive impacts on technical project outcomes. And let’s look at reality: the more unknowns you discover, the more resilient you can be, and the fewer 4am wake-up calls you need to contend with. No one likes the hazy 4am page, not knowing whether you are still in a dream, or it’s real and a network layer has been saturated due to an endless loop in code somewhere. Yeah, that happens and it’s all too common.

What are your options?

Thankfully, nowadays there is a plethora of Chaos Engineering toolsets that you can use in anger against your money-making production environments.

Chaos Monkey is the King Kong (pun intended) of frameworks and was the original, built by Netflix for their own production environments back in 2011. Since then, many others have joined the party to provide nuanced experiences, such as targeting cloud provider network zones, certain applications or known framework vulnerabilities.

One of the increasingly interesting ones is a platform known as Gremlin. Gremlin targets Chaos Engineering principles in containerised environments, essentially Kubernetes. And this is pretty good timing too.

ZDNet reports that in a recent survey, 84% of companies were utilising containerisation for their business. 78% of them were using Kubernetes to support their containerisation initiatives.

So Chaos Engineering targeting Kubernetes w̶i̶l̶l̶ should be a huge focal point for businesses deploying Kubernetes workloads going forward.

There are dozens of Chaos Engineering frameworks and all of them are relevant. They carry out their duty of identifying and validating engineering flaws and vulnerabilities. So you really can’t go wrong employing one or more of them.