3 Lessons Learned From Implementing Chaos Engineering in the Enterprise

Tanusree McCabe
Capital One Tech
Jan 16, 2020 · 5 min read
Tree bending in the wind as a symbol of resiliency.

Ever since Netflix introduced Chaos Engineering through its Simian Army toolset in 2012, the idea of inducing failure as a preventative measure has become one of the preferred resilience techniques for cloud native distributed systems. Performance and stress testing alone are legacy techniques, inadequate for unearthing comprehensive weaknesses; chaos engineering builds on them by conducting a series of scientific experiments designed to disrupt steady state. As a result, brittle assumptions are exposed, weaknesses are identified, and confidence in a system's reliability is gained by addressing those weaknesses.

As an enterprise architect for resiliency, I promote chaos engineering as a technique to improve the reliability of our highly distributed, cloud native systems. By partnering with numerous delivery teams and supporting the evolution of our chaos engineering capabilities, I have learned several lessons, which I elaborate on below.

#1: It requires the right mindset to work

When done well, chaos engineering enables a deep understanding of a system's behavior. Predictable outcomes enable automation and process improvement, directly contributing to a system's reliability. Assumptions are tested and verified with hard proof, and root causes of failures are identified and addressed to prevent future issues.

In addition, in complex cloud native distributed systems, it helps to highlight weaknesses in the dependency chain so teams can overcome or work around them to provide a better experience for end users. In my experience, assumptions tend to be made about the reliability of downstream or upstream dependencies, both internal and external to the application; chaos engineering allows teams to prove or disprove those assumptions. No wonder companies such as Capital One and Netflix practice chaos engineering!

With all of these benefits, it can be easy to slip into the mindset that the goal is 100% reliability, achieved by identifying and fixing every possible failure. However, this is a misleading goal; the true goal is a balance between the investment of time and labor required to perform these tests and achieving the appropriate service level objective (SLO). Planning and performing chaos engineering takes an investment of time, labor, and tooling. This investment needs to be weighed against the output and criticality of the system.
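To make that trade-off concrete, here is a minimal Python sketch that translates an availability SLO into a monthly error budget, the amount of unreliability a team can "spend," including on experiments. The SLO values are illustrative assumptions, not recommendations.

```python
# Illustrative only: translate an availability SLO into a monthly error budget,
# which helps weigh the cost of chaos experiments against the reliability target.
# The SLO figures below are assumed examples, not recommendations.

def monthly_error_budget_minutes(slo: float, days_in_month: int = 30) -> float:
    """Minutes of allowed unavailability per month for a given availability SLO."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1.0 - slo)

if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: ~{monthly_error_budget_minutes(slo):.1f} minutes of error budget per month")
```

A 99.9% target, for example, leaves roughly 43 minutes of budget per month; the stricter the SLO, the more a team can justify investing in experiments.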

#2: Causes of failures require widespread coverage

Doing chaos engineering well requires well-designed experiments that cover a wide breadth of failure modes, based on a failure mode analysis of the system. A common set of failure modes to start with, and the ones most tools focus on, are listed below (a minimal fault-injection sketch follows the list):

  • Resource — Max out CPU, memory, etc.
  • Network — Bandwidth, throttling, latency, etc.
  • Dependencies — Critical application and infrastructure dependencies.
  • Application / Process — Failure in code, device, processes, container/server, database etc.
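As a rough illustration of the first two failure modes above, the following Python sketch maxes out CPU and wraps a dependency call with artificial latency. It is a minimal sketch only; real experiments typically rely on purpose-built chaos tooling, and the function names here are hypothetical.

```python
# Minimal, illustrative fault injections for two of the failure modes above.
# In practice, teams use purpose-built chaos tooling; this only shows the idea.
import multiprocessing
import random
import time

def burn_cpu(duration_s: float) -> None:
    """Spin in a tight loop to max out one CPU core for duration_s seconds."""
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        pass  # busy-wait to consume CPU

def stress_all_cores(duration_s: float = 30.0) -> None:
    """Resource experiment: saturate every core to observe how the service degrades."""
    procs = [multiprocessing.Process(target=burn_cpu, args=(duration_s,))
             for _ in range(multiprocessing.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

def with_injected_latency(call, min_ms: int = 100, max_ms: int = 500):
    """Network-style experiment: wrap a dependency call with artificial latency."""
    def wrapped(*args, **kwargs):
        time.sleep(random.uniform(min_ms, max_ms) / 1000.0)
        return call(*args, **kwargs)
    return wrapped
```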

However, while conducting experiments in these areas is valuable, it is not enough. The following areas should also be addressed by failure scenarios, because any associated events can also disrupt steady state:

  • Environment — Conditions such as IP usage, access vectors needed for troubleshooting, environment processes, etc. For instance, self-healing automation needs to be tested to ensure that any components it relies on are in turn available when needed.
  • Security — Compliance variables, security violations etc. For instance, a seemingly benign change of updating a chain of trust for a certificate can have unintended consequences.
  • People — Subject matter experts, change roles, etc. For instance, relying on individual human knowledge to troubleshoot and respond to an incident, rather than shared knowledge available through self service, can cause bottlenecks and increase mean time to repair (MTTR).
  • Process — Incident response, build pipelines, availability assumptions etc. For instance, relying on rarely used recovery procedures requires periodic testing to ensure the procedure is valid.

Also, while an outage can stem from a single failure condition, in real life more than one thing can go wrong at the same time. Testing one failure event at a time is good; testing multiple failure events together is even better, because it allows us to understand interactions and cascading failures and to define how to react to them. In addition, ensure there is a rollback capability in case something goes wrong beyond the induced failure itself while performing the experiment. Given that it takes time and effort to design and perform these tests, it's worth noting that starting small and evolving over time is a good strategy.
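The following Python sketch shows one way such an experiment loop could be structured under these assumptions: verify steady state, inject one or more faults, observe, and always roll back. The hook names (steady_state, injections) are hypothetical placeholders, not an actual tool's API.

```python
# Illustrative experiment harness: verify steady state, inject one or more failures,
# observe, and always roll back. The hook names are hypothetical placeholders.
from typing import Callable, Iterable

def run_experiment(
    steady_state: Callable[[], bool],  # e.g. checks an SLO-backed health metric
    injections: Iterable[Callable[[], Callable[[], None]]],  # each returns its own rollback
) -> bool:
    if not steady_state():
        raise RuntimeError("System not in steady state; aborting experiment.")

    rollbacks = []
    try:
        # Inject failures one by one (a single fault or several at once).
        for inject in injections:
            rollbacks.append(inject())
        # Observe: does the system still meet its steady-state hypothesis?
        return steady_state()
    finally:
        # Roll back every injected fault, even if observation or injection failed.
        for rollback in reversed(rollbacks):
            rollback()
```

The try/finally structure is the point of the sketch: rollback runs whether the experiment succeeds, fails, or errors out partway through.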

#3: A prerequisite is enabling observability

Understanding the impact of a chaos engineering experiment first requires an understanding of steady state behavior, which in turn requires a monitoring baseline. Monitoring can be quite hard and complex for a highly distributed, cloud native system, and enabling observability and establishing a baseline of behavior over time is no small feat. An output of chaos engineering should be the tuning of monitors, logs, traces, and alerts. However, until a baseline is established, it can be quite difficult to understand the impact of a test and know what action to take.

Baselines should be established with respect to application versions and require a defined period of stability without change. This allows for more effective analysis of how a particular code release or change contributes to the results of a chaos engineering experiment.
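As a minimal sketch of what that comparison might look like, the Python below flags metrics that deviate from a baseline recorded for a specific application version. The metric names, values, and 20% tolerance are assumptions for illustration only.

```python
# Illustrative baseline comparison: flag metrics that deviate from the steady-state
# baseline recorded for a specific application version. Metric names and the 20%
# tolerance are assumptions for the sake of the example.
from typing import Dict

def deviations_from_baseline(
    baseline: Dict[str, float],  # steady-state values keyed by metric name
    observed: Dict[str, float],  # values captured during the chaos experiment
    tolerance: float = 0.20,     # allowed relative deviation
) -> Dict[str, float]:
    """Return metrics whose observed value deviates beyond the tolerance."""
    flagged = {}
    for metric, base in baseline.items():
        value = observed.get(metric)
        if value is None or base == 0:
            continue
        deviation = abs(value - base) / base
        if deviation > tolerance:
            flagged[metric] = deviation
    return flagged

# Example: baseline recorded for a hypothetical app version 1.4.2
baseline_v142 = {"p99_latency_ms": 180.0, "error_rate": 0.002}
during_experiment = {"p99_latency_ms": 260.0, "error_rate": 0.003}
print(deviations_from_baseline(baseline_v142, during_experiment))
```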

Conclusion

There are many benefits to performing chaos engineering as a normal part of application delivery. It allows for proactive remediation of potential issues before they cause adverse impact, and it increases confidence in system reliability. Effectively wielding chaos engineering requires broad comprehension of failure modes and cascading failures, as well as a well-defined monitoring baseline of system behavior. The investment in chaos engineering is non-trivial, so it should be directed at the most impactful outcomes.

DISCLOSURE STATEMENT: © 2020 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.


Tanusree McCabe
Capital One Tech

Architect at Capital One, focused on Monitoring, Resiliency, Cloud, Containers, Serverless and DevOps