ChAP: Chaos Automation Platform

Netflix Technology Blog · Jul 26, 2017


We are excited to announce ChAP, the newest member of our chaos tooling family! Chaos Monkey and Chaos Kong ensure our resilience to instance and regional failures, but threats to availability can also come from disruptions at the microservice level. FIT (Failure Injection Testing) was built to inject microservice-level failure in production, and ChAP was built to overcome FIT's limitations so that we can increase the safety, cadence, and breadth of experimentation.

At a high level, the platform interrogates the deployment pipeline for a user-specified service. It then launches experiment and control clusters of that service, and routes a small amount of traffic to each. A specified FIT scenario is applied to the experimental group, and the results of the experiment are reported to the service owner.
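
To make those steps concrete, here is a minimal sketch of that flow in Python. Everything in it (the function names, the stubs, the traffic fraction) is hypothetical; it mirrors the steps described above rather than ChAP's actual API.

```python
# A hypothetical sketch of the high-level flow described above -- not ChAP's
# real API. The stubs stand in for the deployment-pipeline, routing, and FIT calls.

def get_baseline_config(service):
    # Stand-in for interrogating the deployment pipeline for the service.
    return {"service": service, "image": f"{service}:current"}

def launch_cluster(config, name):
    print(f"launching {name} from {config['image']}")
    return name

def route_fraction(service, cluster, fraction):
    print(f"routing {fraction:.2%} of {service} traffic to {cluster}")

def apply_fit_scenario(cluster, scenario):
    print(f"applying FIT scenario '{scenario}' to requests served by {cluster}")

def run_chap_experiment(service, scenario, traffic_fraction=0.01):
    baseline = get_baseline_config(service)
    control = launch_cluster(baseline, f"{service}-chap-control")
    experiment = launch_cluster(baseline, f"{service}-chap-experiment")
    # Split a small slice of production traffic evenly between the two clusters.
    route_fraction(service, control, traffic_fraction / 2)
    route_fraction(service, experiment, traffic_fraction / 2)
    # The failure scenario is applied only to the experiment cluster; results
    # are then compared against the control and reported to the service owner.
    apply_fit_scenario(experiment, scenario)

run_chap_experiment("api", "fail calls to ratings")
```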

Experiment Size

The best experiments do not disturb the customer experience. In line with the advanced Principles of Chaos Engineering, we run our experiments in production. To do that, we have to put some requests at risk for the sake of protecting our overall availability. We want to keep that risk to a minimum. This raises the question: What is the smallest experiment we can run that still gives us confidence in the result?

With FIT, the impact of an experiment shows up only in metrics for the whole system: statistics for the experimental population are mixed in with those of the remaining population. The experimental population (and the effect size) therefore has to be large in order to be detectable above the natural noise of the system.

Here’s an example chart from a FIT experiment:

Variable metric from a FIT experiment. Event was introduced around 19:07.

Can you determine when the experiment ran? Did it have an impact greater than the noise of the system? To create differences big enough to be verified by humans and machines, users ended up running larger and longer experiments, risking unnecessary disruption for our customers.

To limit this blast radius, in ChAP we take a small subset of traffic and distribute it evenly between a control and an experimental cluster. We wrote a Mantis job that tracks our KPIs just for the users and devices in each of these clusters. This makes it much easier for humans and computers to see when the experiment and control populations’ behaviors diverge.
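
As a toy illustration of why this segregation matters (this is not the Mantis job; the cohort-assignment scheme, error rates, and traffic volumes are all made up), the following simulation hashes device IDs into control, experiment, and untouched production cohorts and compares per-cohort error rates with the system-wide rate. The injected failure is obvious in the experiment cohort but nearly invisible in the blended, system-wide metric, which is exactly the mixing problem described in the FIT discussion above.

```python
# A toy simulation (not the Mantis job) of segregated, per-cohort KPIs.
# Device IDs are hashed into stable cohorts; the injected failure adds errors
# only for the experiment cohort. All rates and volumes are made up.
import hashlib
import random

def cohort(device_id, fraction=0.01):
    # Stable assignment: hash the device ID into [0, 1) and bucket it.
    h = int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if h < fraction / 2:
        return "control"
    if h < fraction:
        return "experiment"
    return "production"

random.seed(0)
counts = {c: [0, 0] for c in ("control", "experiment", "production")}  # [errors, requests]
for i in range(200_000):
    c = cohort(f"device-{i}")
    error_rate = 0.01 + (0.05 if c == "experiment" else 0.0)  # baseline + injected failure
    counts[c][0] += random.random() < error_rate
    counts[c][1] += 1

for c, (errors, total) in counts.items():
    print(f"{c:>10} error rate: {errors / total:.2%} over {total} requests")
blended = sum(e for e, _ in counts.values()) / sum(t for _, t in counts.values())
print(f"system-wide error rate: {blended:.2%}  (the experiment is barely visible here)")
```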

Here’s an example chart from a ChAP experiment:

Comparative metrics for experimental population (red, bottom line) and control population (blue, top line).

It is much easier to see how the experimental population has diverged from the control even though the impacted population was much smaller than in the FIT experiment.

Automation

Any change to the production environment changes the resilience of the system. At Netflix, our production environment might see many hundreds of deploys every day. As a result, our confidence in an experimental result quickly diminishes with time.

In order to run experiments unsupervised, we had to make them safe. We designed a circuit breaker for the experiment that automatically ends it if a predefined error budget is exceeded. An automated, ongoing analysis hooks into the same system we use for canary analysis.
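
Here is a minimal sketch of such a safety mechanism, assuming "error budget" means a cap on the excess error rate of the experiment cohort over the control cohort. The class, thresholds, and numbers below are illustrative, not ChAP's implementation.

```python
# A minimal sketch of a circuit breaker that halts an experiment once the
# experiment cohort's error rate exceeds the control's by more than a budget.
# Not ChAP's implementation; all thresholds are made up.

class ExperimentBreaker:
    def __init__(self, error_budget=0.02, min_requests=500):
        self.error_budget = error_budget   # allowed excess error rate
        self.min_requests = min_requests   # don't judge on tiny samples
        self.counts = {"control": [0, 0], "experiment": [0, 0]}  # [errors, total]
        self.tripped = False

    def record(self, cohort, is_error):
        errors, total = self.counts[cohort]
        self.counts[cohort] = [errors + int(is_error), total + 1]
        self._evaluate()

    def _rate(self, cohort):
        errors, total = self.counts[cohort]
        return errors / total if total else 0.0

    def _evaluate(self):
        if all(t >= self.min_requests for _, t in self.counts.values()):
            excess = self._rate("experiment") - self._rate("control")
            if excess > self.error_budget:
                self.tripped = True  # end the experiment and restore routing

breaker = ExperimentBreaker()
for i in range(2000):
    breaker.record("control", i % 100 == 0)     # ~1% baseline errors
    breaker.record("experiment", i % 10 == 0)   # ~10% errors under injection
    if breaker.tripped:
        print(f"error budget exceeded after {i + 1} request pairs; experiment halted")
        break
```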

Before ChAP, a vulnerability could be identified and fixed, only to regress later and cause an incident. To keep our results up to date, we have integrated ChAP with Spinnaker, our CI/CD system, so that experiments run often and continuously. Since rolling out this functionality, we have successfully identified and prevented regressions that threaten resiliency.
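
One way to picture that integration, purely as an illustration (this is not Spinnaker's API or pipeline schema, and the registered scenarios are made up): a post-deploy hook that re-runs a service's registered experiments whenever a new version ships.

```python
# Purely illustrative (not Spinnaker's API): a post-deploy hook that re-runs a
# service's registered ChAP-style experiments so a previously fixed
# vulnerability that regresses is caught before it causes an incident.

REGISTERED_EXPERIMENTS = {
    "api": ["fail calls to ratings", "add latency to ratings"],  # hypothetical scenarios
}

def on_deploy_succeeded(service, version):
    for scenario in REGISTERED_EXPERIMENTS.get(service, []):
        # In the real platform this would kick off a full control/experiment run.
        print(f"re-running experiment '{scenario}' against {service} {version}")

on_deploy_succeeded("api", "v123")
```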

Concentration

Some failure modes are only visible when the ratio of failures to total requests in a system crosses certain thresholds. With FIT, load balancing and request routing spread the FIT-decorated requests evenly across our production capacity, so the increased resource consumption was absorbed by normal operating headroom and circuit breakers never tripped. We call this sort of experiment a “diffuse” experiment: it is fine for verifying the logical correctness of fallbacks, but not the characteristics of the system during failure at scale.

There are critical thresholds that are crossed only when a large portion of requests is highly latent or failing. Some examples (the first of which is sketched in code after this list):

• When a downstream service is latent, thread pools may be exhausted.
• When a fallback is more computationally expensive than the happy path, CPU usage increases.
• When errors lead to exceptions being logged, lock contention may become a problem in your logging system.
• When aggressive retries kick in, you may mount a self-inflicted denial-of-service attack.
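
Here is a toy demonstration of the first failure mode on the list: a bounded, Hystrix-style thread pool that rejects work when saturated copes fine with a fast downstream but sheds most requests once the downstream turns latent. The pool size, arrival rate, and latencies are made-up numbers for illustration only.

```python
# A toy demonstration of thread-pool exhaustion under downstream latency.
# With a fast dependency almost nothing is shed; once the dependency turns
# latent, the bounded pool exhausts and most requests are rejected.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

DEPENDENCY_POOL_SIZE = 10
in_flight = threading.Semaphore(DEPENDENCY_POOL_SIZE)

def call_downstream(latency_s):
    time.sleep(latency_s)  # the dependency's response time
    return "ok"

def handle_request(latency_s):
    # Reject immediately if the pool for this dependency is exhausted.
    if not in_flight.acquire(blocking=False):
        return "rejected"
    try:
        return call_downstream(latency_s)
    finally:
        in_flight.release()

def run(latency_s, requests=60, arrival_gap_s=0.005):
    with ThreadPoolExecutor(max_workers=requests) as pool:
        futures = []
        for _ in range(requests):          # steady stream of incoming requests
            futures.append(pool.submit(handle_request, latency_s))
            time.sleep(arrival_gap_s)
        results = [f.result() for f in futures]
    shed = sum(r == "rejected" for r in results)
    print(f"downstream latency {latency_s * 1000:.0f}ms -> {shed}/{requests} requests shed")

run(0.001)  # healthy downstream: the pool never fills
run(0.500)  # latent downstream: the pool exhausts and load is shed
```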

With all this in mind, we want a way to achieve a high ratio of failures or latency while limiting the potential negative impact on users. Just as we segregated the KPIs for the experiment and control populations, we want to subject a small number of machines to extreme duress while the rest of the system remains unaffected.

Example

Let’s say we want to explore how API handles the failure of the Ratings system, which allows people to rate movies by giving them a thumbs-up or thumbs-down. To set up the experiment, we deploy new API clusters that are proportionally scaled to the size of the population we want to study. For an experiment impacting 0.5% of the population on a system with 200 instances, we would spin up experiment and control clusters with one instance each.
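
The sizing arithmetic is simple enough to write down. The proportional-scaling rule is as described above; the helper function itself is hypothetical.

```python
# A worked version of the sizing arithmetic above. Each of the control and
# experiment clusters takes the chosen fraction of production traffic, so each
# is scaled proportionally, never below one instance.
import math

def chap_cluster_size(production_instances, cohort_fraction):
    return max(1, math.ceil(production_instances * cohort_fraction))

# 0.5% of the population on a 200-instance system -> one instance per cluster.
print(chap_cluster_size(200, 0.005))  # 1
```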

We then override the request routing for the control and experimental populations to direct just that traffic to the new clusters instead of the regular production cluster. Since only experimental traffic is being sent to the “experiment” cluster, 100% of the requests between API-experiment and Ratings will be impacted. This will verify that API can actually handle any increased load that the failure scenario may cause. We call this class of experiments “concentrated” experiments.
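
A back-of-the-envelope comparison shows why the concentrated form matters. Assuming, purely for illustration, a fallback that costs three times the happy path in CPU, a diffuse 0.5% injection adds about 1% CPU per instance and disappears into normal headroom, while concentrating the same traffic makes the single experiment instance pay the full cost and actually exercises the thresholds listed earlier.

```python
# Back-of-the-envelope: extra CPU per instance when a fraction of its calls
# fall back. The 3x fallback cost is an assumed figure purely for illustration.

def extra_cpu_fraction(impacted_call_ratio, fallback_cost_multiplier=3.0):
    # Additional CPU spent when `impacted_call_ratio` of an instance's calls
    # take a fallback costing `fallback_cost_multiplier` times the happy path.
    return impacted_call_ratio * (fallback_cost_multiplier - 1.0)

diffuse = extra_cpu_fraction(0.005)      # diffuse: 0.5% of every instance's calls
concentrated = extra_cpu_fraction(1.0)   # concentrated: all of the experiment instance's calls
print(f"diffuse:      +{diffuse:.1%} CPU per instance (absorbed by normal headroom)")
print(f"concentrated: +{concentrated:.0%} CPU on the experiment instance (thresholds exercised)")
```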

The result? ChAP generates emails like the following:

“TL;DR: we ran a ChAP canary which verifies that the [service in question] fallback path works (crucial for our availability) and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”
-a stunning Netflix colleague

With ChAP, we have safely identified mistuned retry policies, CPU-intensive fallbacks, and unexpected interactions between circuit breakers and load balancers.

Learn more about Chaos

“Chaos Engineering,” authored by the Netflix Chaos Team.

We wrote the book on Chaos Engineering, available for free for a limited time from O’Reilly.

Aaron Blohowiak spoke at Velocity 2017 San Jose on the topic of Precision Chaos.

Nora Jones also presented a talk at Velocity San Jose about our experiences with adoption of chaos tools.

Join the Chaos Community Google group to participate in the discussion, keep up to date on the evolution of the industry, and hear announcements about Chaos Community Day.

Ali Basiri, Aaron Blohowiak, Lorin Hochstein, Nora Jones, Casey Rosenthal, Haley Tucker
