Embrace the chaos … engineering

Waldemar Jankowski
Nov 3 · 6 min read

What is it?

According to Principles of Chaos:

Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.

That’s a lot of important words, but I like to think of Chaos Engineering as the “secret shopper” of application resiliency. Just like a secret shopper is used by businesses to uncover issues with customer service, Chaos Engineering is used to expose gaps and weaknesses in your applications. For example, a secret shopper is sent into a store to act in unpredictable ways to see how a customer service representative will deal with an irate customer. In a chaos experiment, you deliberately inject unpredictability into your system to see how it responds to disruptions.

There are many different types of chaos experiments that can be unleashed on your unsuspecting applications and infrastructure, but they typically fall into three broad categories:

  1. Resource: How does your system handle resource spikes? For example — CPU, Memory, I/O, Disk space
  2. State: What happens if your infrastructure’s state changes? For example — shutdown of a server, killing of a docker container, change in system time
  3. Network: How does your application deal with network issues? For example — packet loss, network latency
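To make the network category a bit more concrete, here is a minimal sketch of in-process fault injection. It is an illustration under assumptions, not how any particular tool does it: `inject_network_chaos`, `InjectedFault`, and `fetch_price` are hypothetical names, and real tools usually inject faults at the network or infrastructure layer rather than with a decorator.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real network error during an experiment."""


def inject_network_chaos(failure_rate=0.0, max_delay_s=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so that calls are randomly delayed or failed.

    failure_rate: probability (0..1) that a call raises InjectedFault.
    max_delay_s:  upper bound for the random delay added before each call.
    rng, sleep:   injectable so an experiment can be seeded and run without waiting.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay_s > 0:
                sleep(rng.uniform(0, max_delay_s))  # simulated network latency
            if rng.random() < failure_rate:
                # simulated dropped request
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


def fetch_price(item):
    # Hypothetical downstream call; a real one would go over the network.
    return {"item": item, "price": 42}


# Roughly 30% of calls fail and each call waits up to 200 ms.
flaky_fetch = inject_network_chaos(failure_rate=0.3, max_delay_s=0.2)(fetch_price)
```

The interesting part is not the wrapper itself but what the callers of `flaky_fetch` do when it misbehaves: do they retry, time out, fall back, or fall over?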

Difference between testing and experimentation

You may be asking yourself, how is this any different from testing? The main focus of Chaos Engineering is to use experimentation instead of tests. The difference is that “testing” validates predicted behavior against a specific condition, while “experimentation” injects faults into a system to generate new knowledge about how it could potentially misbehave.

Injecting failure into a system can ripple out in unpredictable ways, so think about your components and processes as one system. You’re not testing a specific feature; you’re observing how the collection of components responds together to an unpredicted change. This way of thinking requires a new mindset, but it will build confidence in your ability to handle disruptions.
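One way to picture an experiment, as opposed to a test, is as a small loop: measure how the whole system behaves normally, inject a fault, measure again, and always clean up. The sketch below assumes a single numeric health metric and uses hypothetical names (`run_experiment`, `ToyService`); a real experiment would read the metric from your monitoring.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline: float
    during_fault: float
    hypothesis_held: bool


def run_experiment(measure, inject, rollback, tolerance):
    """Run one experiment around a steady-state metric.

    measure():  returns a number describing system health (e.g. success rate).
    inject():   introduces the fault.
    rollback(): removes the fault; called even if measuring raises.
    tolerance:  how far the metric may drop before the hypothesis is rejected.
    """
    baseline = measure()
    inject()
    try:
        during_fault = measure()
    finally:
        rollback()
    held = (baseline - during_fault) <= tolerance
    return ExperimentResult(baseline, during_fault, held)


class ToyService:
    """Hypothetical system: a primary dependency plus an optional fallback."""

    def __init__(self, has_fallback):
        self.primary_up = True
        self.has_fallback = has_fallback

    def success_rate(self):
        if self.primary_up:
            return 1.0
        return 0.95 if self.has_fallback else 0.0

    def kill_primary(self):
        self.primary_up = False

    def restore_primary(self):
        self.primary_up = True


service = ToyService(has_fallback=True)
result = run_experiment(
    service.success_rate, service.kill_primary, service.restore_primary, tolerance=0.1
)
print(result)
```

Notice that nothing here asserts on a specific feature. The only question is whether the system as a whole stayed within the tolerance you were willing to accept, and either answer teaches you something.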

Why you need chaos engineering

It’s easy to brag about your system’s stability when you have 11 users (almost half of them family). I’ve been there. Trust me. If your site is down, it’s not a huge deal. Nobody really cares. Once you start getting more users, things happen: they start interacting with your site in ways you didn’t know were possible. Your 100% test coverage only tests things you know to test for, but what about all the things that are lurking in the dark? Those things that only show up when component A calls component B, which calls component C on Feb 29th during a full moon?

Applications can get complicated quickly and it takes a lot of work to finally release something to production. Intentionally trying to break the system may seem like the last thing you want to do once you’ve gone through your build, test, and deployment. It seems scary to inject problems into your live applications on purpose, but that’s exactly what you should do.

Testing gets us part of the way there, but complexity makes it impossible to validate every scenario. That’s where chaos experiments can help you uncover the unknown unknowns.

Agile approach to resiliency

Chaos Engineering is an Agile approach to resiliency that complements traditional testing. In Agile software development, you create a product, listen to feedback from your users, and make improvements iteratively. Chaos Engineering at its core follows the same process. It’s not the big bang approach of an entire Amazon region becoming unavailable (although that has happened and you should be ready for it). More likely, it will be a common issue like your disk running out of space. Start by injecting small failures, observe how your system reacts, and fix any issues that are uncovered. As you increase the scope and run more chaos experiments, your system will iteratively become more resilient.
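As a small illustration of the disk-space example, the sketch below swaps a healthy sink for one that behaves like a full disk and checks whether a component degrades gracefully. `FullDisk` and `AuditLog` are hypothetical stand-ins I made up for this example; a real experiment would fill an actual volume in a non-production environment.

```python
import errno
import io


class FullDisk:
    """Stand-in for a file-like object on a disk that has run out of space."""

    def write(self, data):
        raise OSError(errno.ENOSPC, "No space left on device (injected)")


class AuditLog:
    """Hypothetical component: writes to a sink and buffers in memory if the disk is full."""

    def __init__(self, sink):
        self.sink = sink
        self.buffer = []
        self.degraded = False

    def record(self, line):
        try:
            self.sink.write(line + "\n")
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise  # only the injected condition is handled here
            self.degraded = True
            self.buffer.append(line)


# Normal run: records land in the sink.
healthy_log = AuditLog(io.StringIO())
healthy_log.record("login ok")

# Small experiment: same component, but the disk is "full".
chaos_log = AuditLog(FullDisk())
chaos_log.record("login ok")
print(chaos_log.degraded, chaos_log.buffer)
```

If the component had crashed instead of buffering, that is exactly the kind of small, cheap finding you want to fix before widening the scope.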

Can it work in a regulated industry?

You may be thinking that Chaos Engineering is only for companies like Netflix, where the worst outcome is a frustrated teenager unable to watch Riverdale. But what about applications that affect people’s jobs and lives? I think it’s possible, and in fact necessary. To be fair, it does require companies to be more deliberate and to evolve their DevOps/SRE practices, but that is a good thing (especially in a regulated industry). Chaos experiments help address failures before they become outages.

How to get started?

The holy grail is to run chaos experiments in production, but we can’t get there immediately. I typically would recommend an adoption framework that looks similar to this:

  1. Small one-off experiments in a non-production environment. Pick something low impact and measure/observe. The theme here is to learn, fix, and iterate.
  2. Increase scope incrementally. Use this as an opportunity to improve your application resiliency and mature your engineering and operational practices. How is your monitoring? Do you understand the recovery steps if something goes wrong? This is a good time to polish your incident management as well.
  3. Continuous chaos experiments in non-prod. Now you’re ready to incorporate the experiments into your CI/CD (which you hopefully are using, otherwise do that first before you even think about Chaos Engineering). Build chaos experiments into your development cycle along with other forms of testing (e.g. end-to-end, performance, load). The goal should be to use what you learn to improve your system resilience and further mature your operational and engineering practices.
  4. Continuous chaos experiments in prod. As confidence in your solution’s ability to handle failure rises, you will begin to shift your perspective on disruptions. Now you should slowly start incorporating chaos experiments into production environments. Here you start small again and observe. Learn from how your system reacts and adapt accordingly. By now you should have deep monitoring enabled and real-time alerting to notify you of any issues caused by users or your chaos experiments. Make sure you have a way to roll back in case something unexpected goes wrong. I repeat, always have a backup plan. If you’ve gotten this far, Chaos Engineering is part of your fiber. Well done!
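The "start small, observe, and always have a way to roll back" advice from the steps above can be sketched as a guardrail loop. This is only an illustration under assumptions: `run_with_guardrail` and `FakeFleet` are hypothetical, the error rate is a made-up function of scope, and in practice the observed number would come from your real-time monitoring.

```python
def run_with_guardrail(steps, apply_fault, error_rate, rollback, abort_threshold):
    """Widen an experiment step by step, aborting when errors cross a threshold.

    steps:           increasing scopes to affect, e.g. fractions of instances.
    apply_fault(s):  injects the fault at scope s.
    error_rate():    returns the currently observed error rate.
    rollback():      removes the fault; always called, even on abort or exception.
    Returns (completed_steps, aborted).
    """
    completed = []
    try:
        for scope in steps:
            apply_fault(scope)
            if error_rate() > abort_threshold:
                return completed, True  # stop widening; rollback runs in finally
            completed.append(scope)
        return completed, False
    finally:
        rollback()


class FakeFleet:
    """Hypothetical fleet where errors grow with the share of instances under fault."""

    def __init__(self):
        self.scope = 0.0

    def apply_fault(self, scope):
        self.scope = scope

    def error_rate(self):
        return self.scope * 0.2  # assumed relationship, for illustration only

    def rollback(self):
        self.scope = 0.0


fleet = FakeFleet()
completed, aborted = run_with_guardrail(
    steps=[0.01, 0.05, 0.25, 0.5],
    apply_fault=fleet.apply_fault,
    error_rate=fleet.error_rate,
    rollback=fleet.rollback,
    abort_threshold=0.04,
)
print(completed, aborted)
```

Putting the rollback in a `finally` block is the code version of "I repeat, always have a backup plan": the fault is removed whether the experiment finishes, aborts, or blows up halfway through.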

What does the current landscape offer to give you a head start?

Now that you have a framework to adopt Chaos Engineering into your applications, I recommend taking a look at the existing landscape. There are a number of open source products that you can get started with right away as well as some interesting services to leverage. Here are a few examples that I like:

  • Simian Army (Chaos Monkey, Chaos Kong): suite of tools from Netflix to improve resiliency against random instance/region failures
  • Gremlin: Failure-as-a-service (yes it’s a thing) started by Amazon/Netflix “chaos alumni”
  • PowerfulSeal: Bloomberg open source tool that adds chaos to Kubernetes clusters

Conclusion

I am interested in the potential innovation in this space. There are opportunities to take Chaos Engineering to the next level by incorporating industry trends like Machine Learning. Hopefully I’ve at least piqued your curiosity enough to learn more, and perhaps even to think about how you can bring Chaos Engineering into your own applications or company. Many companies already test, but I believe adding chaos experiments to their development practices will greatly improve the overall resiliency of their systems.

It may seem scary at first, but the key is to do it one step at a time. Please don’t run chaos experiments on systems that aren’t ready. I would encourage you and your teams to make Chaos Engineering a part of your delivery practices, but remember to start small and increase scope as your comfort level grows. Nobody wants that call at 2 a.m. because of a system failure, but it’s even worse if the failure is self-inflicted through a chaos experiment gone wrong.


Disclaimer

The ideas in this article are my own thoughts and do not reflect or represent those of my employers or customers.

Written by Waldemar Jankowski

Entrepreneur, dev, reader, BJJ practitioner. Architect at Capital One focusing on DevOps and Cloud.
