Creating Chaos — The Importance of Chaos Engineering

Brandon Kenna
Sep 20, 2020 · 5 min read

Thanks to one Maxwell Smart, “Kaos” was all I knew in my childhood. Back then “Kaos” was the villainous organisation responsible for upsetting world order and our hero, Agent 86, bumbled his way to thwarting those doomsday efforts.

Now that I have shown my age…

Chaos means something very different to me and to what I do for a living nowadays. In principle, though, the aim is not too different from that famed 1960s sitcom: infiltrate, disrupt and destroy.

Chaos Engineering is the practice of purposefully disrupting, degrading or even bringing to a grinding halt services and applications, in order to surface point vulnerabilities and architectural design flaws in a production environment.

No one is perfect, and since humans are responsible for the ideation, creation and implementation of applications, infrastructure and networks, those systems are inherently vulnerable to mistakes and misconfigurations. Some are obvious, others not so much. Chaos Engineering aims to highlight the obvious vulnerabilities but, more importantly, to bring to light the not-so-obvious shortcomings.

Netflix — Chaos Engineering pioneer

This was the premise for the practice of Chaos Engineering at Netflix, where the toolset called “Chaos Monkey” was born: a “simian army” of unruly primates let loose in your production system, throwing bananas and creating havoc. Introduced by Greg Orzell, it flipped the usual approach on its head. Instead of developers assuming the circumstances in which an application or service fails, the monkeys are put in charge of demonstrating what actually fails.

Kill. Destroy. Maim. Violent terms, but this is exactly the intent of the monkeys and Chaos Engineering.


Knowing the unknowns

Pre-flight testing, smoke testing, unit testing. We make assumptions about all sorts of scenarios for how users will interact with our applications and services. You expect user-entered input to be validated. You expect the user journey to flow the way you designed it. You expect systems to pause and recover when they can no longer complete a process. But what if a coupled system is suddenly no longer there when you expect it? And the underlying network link has gone AWOL too. Or a queue has decided it’s taking annual leave a day early. Ideally, you want to identify as many of these unknowns as possible by deliberately killing what you can in production.
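One way to rehearse that randomness before letting it loose in production is to wrap a dependency call in a fault injector that sometimes pretends the dependency is gone. This is a minimal sketch, not any particular tool’s API; the function names and the stand-in queue call are entirely illustrative:

```python
import random

class DependencyUnavailable(Exception):
    """Raised when the simulated downstream dependency has 'gone AWOL'."""

def flaky(failure_rate, rng=random.random):
    """Wrap a call so it randomly fails, simulating a missing dependency."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if rng() < failure_rate:
                raise DependencyUnavailable(fn.__name__)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@flaky(failure_rate=0.25)
def fetch_queue_depth():
    # Stand-in for a real call to a queue or downstream service.
    return 42

def resilient_fetch(default=0):
    """Caller degrades gracefully to a default instead of crashing."""
    try:
        return fetch_queue_depth()
    except DependencyUnavailable:
        return default
```

Running `resilient_fetch()` in a loop shows the caller surviving the injected outages; the interesting part is what your real services do when you apply the same treatment to them.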

But wait, I don’t want to take production down intentionally

Developers don’t want to be attending post-incident reviews where the “pucker” factor is high. They don’t want the finger of shame pointed at them because they haven’t factored in a retry-wait somewhere, or couldn’t be bothered accounting for the nasty souls out there ready to inject code into every form post to probe for vulnerabilities in your infrastructure.
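That “retry-wait somewhere” is usually a retry with exponential backoff. A minimal sketch, with an injectable sleep so the delays can be observed (the helper name is mine, not from any particular library):

```python
import time

def retry_with_backoff(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on failure with exponentially growing waits.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 0.5s, 1s, 2s, ... between attempts
            sleep(base_delay * (2 ** attempt))
```

A chaos experiment that briefly removes a dependency is exactly what tells you whether this wrapper exists where it should, and whether the waits are long enough to matter.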

Can you take production systems down with this practice? Heck yeah, which is why a chaos program should always be planned. The tooling should make it easy to roll the program back, restore production services and return the monkeys to their cages. The trade-off is that the business (and its engineers) gain a considerably better understanding of their reliability shortcomings, along with solid, actionable targets for addressing them. The benefit to service deliverability is obvious.


Making better engineers

Culturally, it’s hugely advantageous for any organisation to have engineers who incorporate and plan for failure. Life is not perfect, after all. With this principle front of mind, technical projects see really positive outcomes. And let’s face reality: the more unknowns you discover, the more resilient you can be, and the fewer 4am wake-up calls you need to contend with. No one likes the hazy 4am page, not knowing whether you are still in a dream or a network layer really has been saturated by an endless loop in code somewhere. Yeah, that happens, and it’s all too common.

What are your options?

Chaos Monkey is the King Kong (pun intended) of frameworks, the original built by Netflix for its own production environments back in 2011. Since then, many others have joined the party to provide more nuanced experiences, such as targeting cloud provider network zones, specific applications or known framework vulnerabilities.
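The core idea behind a Chaos Monkey-style tool fits in a few lines: walk your fleet and terminate a random fraction of it. This is a toy sketch of the concept, not Netflix’s implementation; the terminate callback stands in for whatever your platform uses to kill an instance:

```python
import random

def unleash_monkey(instances, terminate, probability=0.1, rng=None):
    """Terminate each instance with the given probability, monkey-style.

    instances  : iterable of instance identifiers
    terminate  : callable invoked with each victim (platform-specific)
    probability: per-instance chance of termination
    rng        : injectable random source, for repeatable experiments
    """
    rng = rng or random.Random()
    killed = []
    for instance in instances:
        if rng.random() < probability:
            terminate(instance)
            killed.append(instance)
    return killed
```

Real tools add the guard rails discussed earlier: scheduling inside business hours, opt-in tagging of targets, and an easy way to switch the monkey off.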

One of the more interesting ones is a platform known as Gremlin. Gremlin applies Chaos Engineering principles to containerised environments, essentially Kubernetes. And the timing is pretty good, too.

ZDNet reports that in a recent survey, 84% of companies were utilising containerisation for their business. 78% of them were using Kubernetes to support their containerisation initiatives.

So Chaos Engineering targeting Kubernetes w̶i̶l̶l̶ should be a huge focal point for businesses deploying Kubernetes workloads going forward.
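Gremlin itself is a managed platform, but the simplest Kubernetes chaos experiment, killing a random pod and watching whether the service notices, can be sketched with nothing more than kubectl. The namespace and label below are hypothetical, and the command runner is injectable purely so the sketch can be exercised without a live cluster:

```python
import random
import subprocess

def kill_random_pod(namespace, label, run=subprocess.run, rng=None):
    """Delete one random pod matching a label selector via kubectl.

    Assumes kubectl is installed and pointed at a cluster; the
    namespace/label values you pass are your own.
    """
    rng = rng or random.Random()
    result = run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", label, "-o", "name"],
        capture_output=True, text=True, check=True)
    pods = result.stdout.split()
    if not pods:
        return None  # nothing matched; no chaos today
    victim = rng.choice(pods)
    run(["kubectl", "delete", "-n", namespace, victim], check=True)
    return victim
```

If the deployment is healthy, Kubernetes reschedules the pod and users never notice; if they do notice, the experiment has done its job and found you an unknown.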

There are dozens of Chaos Engineering frameworks, and all of them are relevant. They carry out their duty of identifying and validating engineering flaws and vulnerabilities, so you really can’t go wrong employing one or more of them.

SEEK blog

At SEEK we’ve created a community of valued, talented, diverse individuals that really know their stuff. Enjoy our Product & Technology insights…

Written by Brandon Kenna

Spruiker of fitness jargon. Attempting to make sense of the technology world, one production incident at a time. Senior Principal Cloud Architect @ PointsBet.
