Embarking on a Chaos Engineering Journey

James Gordon
cloud native: the gathering
Oct 22, 2019

How to begin thinking about Chaos Engineering

“Chaos Engineering is the discipline of experimenting on a system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.” — https://principlesofchaos.org/

How do you turn a chaotic mess into a manageable system? Photo by Rick Mason on Unsplash

Today’s computer systems are complex and widely distributed. Microservice architectures can quickly become giant piles of Lego bricks on the floor, leading to difficulty troubleshooting failures, extended downtime when incidents occur, and increased anxiety about making changes to production. Chaos Engineering can bring your platform to the next level by increasing the resiliency of the platform itself, the preparedness of developers and operators to respond to production incidents, and the confidence developers have in making changes to production.

Oftentimes, production environments are treated like a house of cards: a fragile work of art. Please don’t change anything; we don’t want it to fall over.

It’s time to clean up the mess and build a sturdy production environment that developers have confidence in. So how do you begin a chaos engineering program in your organization, especially if you are worried about knocking down your house of cards? There are many important ideas to keep in mind when starting off with chaos engineering. The idea of breaking things can seem intimidating at first, but trust me, you are already breaking things on purpose every time you develop new code.

Test Driven Development was the first thing I ever learned in my introduction to software construction class in college. Never write a line of code for your program before having a functional test suite. Your test suites must include test vectors designed to break your code. No test suite is complete without testing what happens when you enter invalid inputs or supply the program with a barrage of edge-case anomalies. No one expects these issues to happen frequently, but any good programmer understands that a user will do something unexpected, and if you haven’t tested for it, your code will break or, at the very least, behave unexpectedly.
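As a small illustration of test vectors designed to break code, here is a minimal sketch in Python. The `parse_port` function and its rules are hypothetical, invented only for this example. In a test-first workflow the test class would be written before the function it exercises.

```python
import unittest


def parse_port(text):
    """Parse a TCP port number from a string, accepting only 1..65535.

    Hypothetical function used only to illustrate the tests below.
    """
    if not isinstance(text, str):
        raise TypeError("port must be given as a string")
    stripped = text.strip()
    if not (stripped.isascii() and stripped.isdigit()):
        raise ValueError(f"not a port number: {text!r}")
    port = int(stripped)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTests(unittest.TestCase):
    def test_valid_input(self):
        self.assertEqual(parse_port("8080"), 8080)
        self.assertEqual(parse_port(" 443 "), 443)

    def test_inputs_designed_to_break_it(self):
        # Empty, non-numeric, negative, out-of-range, and fractional input.
        for bad in ["", "   ", "abc", "-1", "0", "65536", "80.0"]:
            with self.subTest(bad=bad):
                with self.assertRaises(ValueError):
                    parse_port(bad)

    def test_wrong_type(self):
        with self.assertRaises(TypeError):
            parse_port(None)
```

Saved as a module, the suite can be run with `python -m unittest`. Most of the test cases are inputs nobody expects a well-behaved user to type, which is the point.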

A map of the Internet connections, circa 2005. Your production environment feels like its own universe sometimes.

If you are going to test your code, why not test how it operates in the production environment? Production environments are now so large and complex that no one individual can be certain how anything will behave in all circumstances. Do you know what will happen to your service if the database it relies on is no longer reachable? What if the queries suddenly take longer to complete? What happens to your service when a dependency of one of its dependencies breaks? No one person is going to have these answers, but you can discover them and begin to develop an understanding of the emergent properties of your production environment through chaos engineering and testing for failure scenarios.
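The first two questions (an unreachable database, and queries that suddenly slow down) can be rehearsed on a small scale before touching production. The sketch below is only an illustration under stated assumptions: `FaultyDatabase`, `get_profile`, the cache fallback, and the 0.2 second timeout are all hypothetical names and values, and the latency is simulated rather than slept so the example stays deterministic.

```python
class FaultyDatabase:
    """Stand-in for a database client with injectable faults (hypothetical)."""

    def __init__(self, rows, unreachable=False, injected_latency_s=0.0):
        self.rows = rows
        self.unreachable = unreachable
        self.injected_latency_s = injected_latency_s

    def query(self, key, timeout_s):
        if self.unreachable:
            raise ConnectionError("database unreachable")
        if self.injected_latency_s > timeout_s:
            # Simulated: the query would not finish within the caller's timeout.
            raise TimeoutError("query exceeded timeout")
        return self.rows[key]


def get_profile(db, cache, user_id, timeout_s=0.2):
    """Return (profile, source), degrading to cached data when the database fails."""
    try:
        profile = db.query(user_id, timeout_s)
    except (ConnectionError, TimeoutError):
        if user_id in cache:
            return cache[user_id], "stale-cache"
        return None, "unavailable"
    cache[user_id] = profile  # refresh the cache on every successful read
    return profile, "database"


if __name__ == "__main__":
    rows = {"u1": {"name": "Ada"}}
    cache = {}
    print(get_profile(FaultyDatabase(rows), cache, "u1"))
    print(get_profile(FaultyDatabase(rows, unreachable=True), cache, "u1"))
    print(get_profile(FaultyDatabase(rows, injected_latency_s=1.0), cache, "u1"))
    print(get_profile(FaultyDatabase(rows, unreachable=True), {}, "u1"))
```

Each call answers one "what happens if" question: a healthy read, an unreachable database with a warm cache, a slow query, and an unreachable database with nothing cached. Whether serving stale data is acceptable is a design decision for your service. The experiment only makes the current behavior visible.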

Testing for failures is an essential step in the development cycle. Avoiding failure tests only leaves you with more uncertainty in your code, less confidence in making changes, and a longer time to resolution for incidents you were unprepared to respond to. The next question becomes: if it is so important to break things, how do I begin breaking things responsibly?

It takes time to become resilient. Start small. Plant the seeds, and nurture them as they grow.

When starting off with Chaos Engineering, start small. Many Chaos Engineers call this idea limiting the blast radius. The idea is simple: minimize the impact of your failure test. When designing a failure scenario to test for, don’t start with a black hole test to see what happens when AWS has an S3 outage. Start small. Design a test that only impacts the behavior of your service. If you are not certain that the test will leave upstream and downstream dependencies and systems unaffected, do not perform the test. Only increase the blast radius after validating the resiliency of everything within the blast radius.
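One way to picture a limited blast radius is an experiment runner that only touches a capped share of your own service’s instances, checks a steady-state condition after every injection, and always rolls back. The following is a rough sketch, not a real chaos tool: `Instance`, `choose_targets`, `run_experiment`, and the health flag are hypothetical stand-ins for whatever your platform provides.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    service: str
    healthy: bool = True


def choose_targets(instances, service, max_percent, rng):
    """Pick at most max_percent of one service's instances; never touch others."""
    own = [i for i in instances if i.service == service]
    limit = len(own) * max_percent // 100  # integer math, rounds down
    return rng.sample(own, limit)


def run_experiment(instances, service, max_percent, steady_state, seed=0):
    """Inject failures one at a time, abort when steady state is lost, then roll back."""
    rng = random.Random(seed)
    injected = []
    try:
        for target in choose_targets(instances, service, max_percent, rng):
            target.healthy = False  # the injected failure
            injected.append(target)
            if not steady_state(instances):
                return "aborted", [t.name for t in injected]
        return "completed", [t.name for t in injected]
    finally:
        for target in injected:
            target.healthy = True  # roll back, whatever the outcome


if __name__ == "__main__":
    fleet = [Instance(f"checkout-{n}", "checkout") for n in range(8)]
    fleet += [Instance(f"payments-{n}", "payments") for n in range(2)]

    def enough_checkout_capacity(instances):
        up = sum(1 for i in instances if i.service == "checkout" and i.healthy)
        return up >= 6

    print(run_experiment(fleet, "checkout", 25, enough_checkout_capacity))
```

Growing the blast radius then means raising `max_percent` or widening the target selection, and only after the smaller experiment has completed without an abort.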

Observability enables resiliency. Photo by Daniil Vnoutchkov on Unsplash

Observability is an absolute must for Chaos Engineering. When you begin breaking something in production, you are going to want to see what is broken. There will be things that you expect to break, as well as things that you don’t expect to break. If something unexpected breaks, it will be nearly impossible to fix without observability. I suggest having three forms of observability:

  1. Observability: Spiderweb view of your infrastructure. It is necessary to see what is running where, and what each component is connected to.
  2. Traceability: Flame chart style or any other visual form that lets you see an entire call chain on demand. If a call goes through multiple microservices, where did it go wrong?
  3. Time Series: Graphs of only the important stuff. It is necessary to have dashboards that display the condition of your platform. Be careful not to make these too complex, as too much information can overwhelm users instead of aiding them.
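
To make the traceability question ("where did it go wrong?") concrete, here is a minimal sketch that assumes a trace is available as a flat list of spans with parent links. The `Span` shape and the `failure_origins` helper are hypothetical and do not correspond to any particular tracing library’s API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool = False


def failure_origins(spans):
    """Return failed spans with no failed children: the likely origin of the error."""
    failed = [s for s in spans if s.error]
    parents_of_failures = {s.parent_id for s in failed}
    return [s for s in failed if s.span_id not in parents_of_failures]


if __name__ == "__main__":
    trace = [
        Span("a", None, "gateway", error=True),
        Span("b", "a", "orders", error=True),
        Span("c", "b", "inventory", error=True),
        Span("d", "b", "pricing"),
    ]
    for span in failure_origins(trace):
        print(span.service)
```

Errors usually propagate upward through the call chain, so the gateway and orders spans fail only because inventory did. Filtering for failed spans without failed children points at the service to look at first.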
Bend, don’t break. Turbulent conditions can catalyze resiliency or destruction.

Chaos Engineering can bring your platform to the next level by increasing both the resiliency of the platform and the preparedness of developers and operators to respond to production incidents. By introducing a healthy dose of chaos into your systems, you can begin experimenting with possible failures that will affect your platform. This will strengthen your confidence in the platform by helping you understand its emergent properties.

Overall, by practicing chaos engineering, you will become more prepared each day to resolve actual production issues. Chaos engineering will help you avoid the production issues you can foresee and reduce the time needed to resolve the ones you did not foresee. There are many ways to adopt and implement chaos engineering within your organization, and many companies and organizations have found these practices to be valuable and successful.
