Embrace the Chaos … Engineering

An Introduction to Chaos Engineering

Waldemar Jankowski
Capital One Tech
7 min read · May 7, 2020



What is Chaos Engineering?

As an Architect at Capital One, I deal with complexity on a daily basis. We’re consistently maturing our software delivery best practices, and Chaos Engineering has been a focus of mine.

According to Principles of Chaos:

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

That’s a lot of important words, but I like to think of Chaos Engineering as the “secret shopper” of application resiliency. Just as businesses use secret shoppers to uncover issues with customer service, Chaos Engineering is used to expose gaps and weaknesses in your applications. For example, a secret shopper might be sent into a store to play an irate customer and see how a customer service representative handles the disruption. In a chaos engineering test — more commonly called a chaos experiment — an engineer deliberately injects unpredictability into their system to see how it responds to disruptions.

Chaos Engineering is an Agile approach to resiliency that complements traditional testing. Just as Agile software development has you create a product, listen to feedback from your users, and make improvements iteratively, Chaos Engineering follows the same loop at its core: run an experiment, learn from the results, and improve.

Chaos Engineering is not about introducing big bang failures like an entire AWS region becoming unavailable (although that can happen and you should be ready for it). It more often focuses on common issues like your disk running out of space or your application running out of memory. Start by injecting small failures, observe how your system reacts, and fix any issues that are uncovered. As you increase the scope and run more chaos experiments, your system will iteratively become more resilient.

There are many different types of chaos experiments that can be unleashed on your applications and infrastructure, but they typically fall into three broad categories (a small sketch follows the list):

  1. Resource: How does your system handle resource spikes? For example — CPU, memory, I/O, disk space, etc.
  2. State: What happens if your infrastructure’s state changes? For example — shutdown of a server, killing of a Docker container, change in system time, etc.
  3. Network: How does your application deal with network issues? For example — packet loss, network latency, etc.
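
To make these categories concrete, here is a minimal sketch of one fault injection per category, driven from Python by shelling out to common Linux and Docker tooling (stress-ng, docker, and tc). The container name, network interface, and durations are placeholders for illustration; only run something like this against a disposable environment you own.

    # One illustrative fault per category (a sketch, not a hardened tool).
    # Assumes stress-ng, docker, and tc are installed and that you have the
    # privileges to use them; names and durations are placeholders.
    import subprocess

    def resource_spike():
        # Resource: saturate 4 CPU cores for 30 seconds.
        subprocess.run(["stress-ng", "--cpu", "4", "--timeout", "30s"], check=True)

    def state_change():
        # State: abruptly kill one container and watch how the system reacts.
        subprocess.run(["docker", "kill", "orders-service"], check=True)

    def network_latency():
        # Network: add 200 ms of latency to eth0, then remove it again.
        subprocess.run(["tc", "qdisc", "add", "dev", "eth0", "root",
                        "netem", "delay", "200ms"], check=True)
        input("Observe the system, then press Enter to restore the network...")
        subprocess.run(["tc", "qdisc", "del", "dev", "eth0", "root"], check=True)

Each function injects exactly one small fault, which keeps the blast radius easy to reason about and easy to undo.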

Difference Between Testing and Chaos Experimentation

You may be asking yourself, how is this any different than traditional testing? In Chaos Engineering, the main focus is on the use of experimentation over tests. But what is the difference between an experiment and a test? Testing focuses on validating predicted behavior against a specific condition, whereas experimentation focuses on injecting faults into a system to generate new knowledge about how it could potentially fail.

Injecting failure into a system can ripple out in unpredictable ways, so think about your components and processes as one system. You’re not testing a specific feature, but instead observing how the collection of components responds together to an unexpected change. This way of thinking requires a new mindset, but it will build confidence in your product’s ability to handle disruptions.

Testing gets us part of the way there, but complexity makes it impossible to validate every scenario. That’s where chaos experiments can help you uncover the unknown unknowns.

For example, a test will check that a request for a CPU-intensive operation responds in under 5 ms, whereas a chaos experiment is a hypothesis that, when the system is flooded with a mix of high- and low-CPU requests, it will appropriately direct the workloads to executors with enough CPU.
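
The distinction is easier to see side by side. Below is a minimal sketch in Python; the service URL, the /report endpoint, and the thresholds are hypothetical, and inject_load/stop_load stand in for whatever flooding mechanism you would actually use.

    # A test validates a predicted outcome for one known condition; an
    # experiment perturbs the system and checks whether a steady-state
    # hypothesis survives. All names and thresholds are illustrative.
    import time
    import urllib.request

    SERVICE = "http://localhost:8080"  # hypothetical service under study

    def response_time(path):
        start = time.time()
        urllib.request.urlopen(SERVICE + path, timeout=5)
        return time.time() - start

    def test_cpu_intensive_request():
        # Test: one known condition, one predicted result.
        assert response_time("/report") < 0.005  # 5 ms

    def experiment_mixed_cpu_load(inject_load, stop_load):
        # Experiment: state a hypothesis about steady state, inject a fault,
        # and observe whether the hypothesis holds.
        steady = response_time("/report")
        inject_load()  # flood the system with high- and low-CPU requests
        try:
            # Hypothesis: heavy work is routed to executors with spare CPU,
            # so latency stays within 2x of steady state.
            assert response_time("/report") < steady * 2
        finally:
            stop_load()  # always restore the system afterwards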

Why You Need Chaos Engineering

So you’ve created a toothbrush exchange, congrats! It’s easy to brag about your system’s stability when you have only eleven users (and almost half are family). I’ve been there, trust me. If your site is down, it’s not a huge deal because nobody is really using it anyway.

But who knew the sharing economy was so robust! You now have hundreds of thousands of users optimizing their toothbrush usage across the globe. With your increased scale comes a whole new set of enterprise problems for your app.

Since you care about user experience, you set up your system to operate in both east and west regions to reduce latency. Everything is working perfectly and the site is steadily growing. Then one day, there is a regional outage and all of your traffic is automatically routed to one region. You have auto-scaling set up (which you’ve tested), but you weren’t prepared for one region to absorb traffic that used to be spread across two. The application comes to a grinding halt because you’ve hit your auto-scaling limit. All your best-laid plans are a mess.
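
In hindsight, even a simple capacity check would have hinted at the gap, and a chaos experiment that simulated the failover would have exposed it outright. Here is a minimal sketch of the capacity check using boto3, with the group names and regions made up for illustration, and using desired capacity as a rough stand-in for peak load.

    # Capacity sanity check (sketch): could one region's Auto Scaling group
    # absorb the combined desired capacity of both regions? Group names and
    # regions are placeholders; requires boto3 and AWS credentials.
    import boto3

    def capacity(region, group_name):
        client = boto3.client("autoscaling", region_name=region)
        group = client.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name])["AutoScalingGroups"][0]
        return group["DesiredCapacity"], group["MaxSize"]

    east_desired, east_max = capacity("us-east-1", "toothbrush-asg")
    west_desired, _ = capacity("us-west-2", "toothbrush-asg")

    # If the west region disappears, can east scale far enough to absorb it?
    assert east_max >= east_desired + west_desired, (
        "east cannot absorb a west-region outage at current load")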

Your 100% test coverage only tested for things you knew to test for, but what about all the unknowns that were lurking in the dark until your users showed up and started pressing buttons? Unknowns that only show up when component A calls component B that calls component C on Feb 29th during a full moon?


Applications can get complicated quickly, especially at enterprise scale. It takes a lot of work to finally release something to production, and intentionally trying to “break” it after going through build, test, and deployment may seem counterintuitive.

Large enterprises, especially those in regulated industries like healthcare and finance, need to be even more outage-proof than your average site. They should therefore include Chaos Engineering as part of their resiliency efforts to help address failures before they become outages.

Injecting problems into your live applications may initially seem scary, but that’s exactly what you should do. The key is to do it one step at a time. I would encourage you and your teams to make Chaos Engineering a part of your delivery practices, but remember to start small and increase scope as your comfort level grows. Please don’t run chaos experiments on systems that aren’t ready.

How to Get Started with Chaos Engineering?

The holy grail of Chaos Engineering is to run chaos experiments in production. But don’t expect to get there immediately.

I typically recommend an adoption framework like this:

  1. Small one-off experiments in non-production environments. Pick something low impact and measure/observe. The theme here is to learn, fix, and iterate (a minimal example is sketched after this list).
  2. Increase scope incrementally. Use this as an opportunity to improve your application resiliency and mature your engineering and operational practices. How is your monitoring? Do you understand the recovery steps if something goes wrong? This is a good time to polish your incident management as well.
  3. Continuous chaos experiments in non-prod. Now you’re ready to incorporate the experiments into your CI/CD processes (which you hopefully are already using; if not, do that first before you even think about Chaos Engineering). Build chaos experiments into your development cycle along with other forms of testing (e.g. end-to-end, performance, load, etc.). The goal should be to use what you learn to improve your system resilience and further mature your operational and engineering practices.
  4. Continuous chaos experiments in prod. As confidence in your solution’s ability to handle failure rises, you will begin to shift your perspective on disruptions. Do not move on to chaos experiments in production until you’ve successfully run them in non-prod. This could take a while, but eventually the goal should be to start slowly incorporating chaos experiments into production environments. Just like in non-prod, start with small experiments and observe. Learn from how your system reacts and adapt accordingly. By now you should have deep monitoring enabled and real-time alerting to notify you of any issues caused by either users or your chaos experiments. Make sure you have a way to roll back in case something unexpected goes wrong. I repeat, always have a backup plan.
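
As a starting point for step 1 (and something you could later wire into a CI job for step 3), here is a minimal sketch of a one-off experiment in Python: verify steady state, kill a single container, and check that the system recovers on its own. The health URL, container name, and recovery window are placeholders.

    # A small one-off chaos experiment (sketch). Assumes Docker is available
    # and the service exposes a health endpoint; all names are placeholders.
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/health"   # hypothetical endpoint
    TARGET_CONTAINER = "orders-service"           # hypothetical container
    RECOVERY_SECONDS = 60

    def healthy():
        try:
            return urllib.request.urlopen(HEALTH_URL, timeout=5).status == 200
        except OSError:
            return False

    # 1. Verify steady state before touching anything.
    assert healthy(), "system is not healthy; do not start the experiment"

    # 2. Inject a single, well-understood fault.
    subprocess.run(["docker", "kill", TARGET_CONTAINER], check=True)

    # 3. Observe: does the system return to steady state within the window?
    deadline = time.time() + RECOVERY_SECONDS
    while time.time() < deadline and not healthy():
        time.sleep(5)

    assert healthy(), "system did not recover; fix this before widening scope"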

If you’ve gotten this far, Chaos Engineering is part of your fiber. Well done!

What Tools Can Give You a Head Start on Chaos Engineering?

Now that you have a framework to adopt Chaos Engineering into your applications, I recommend taking a look at the existing landscape. There are a number of open source products that you can get started with right away, as well as some interesting services to leverage.

  • Simian Army (Chaos Monkey, Chaos Kong) — Suite of tools from Netflix to improve resiliency against random instance/region failures.
  • Gremlin — Failure-as-a-service (yes, it’s a thing) started by Netflix/Google “chaos alumni”.
  • PowerfulSeal — Bloomberg open source tool that adds chaos to Kubernetes clusters.

DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.


Waldemar Jankowski
Capital One Tech

Entrepreneur, dev, reader, BJJ practitioner. Architect at Capital One focusing on DevOps and Cloud.