Automated Failure Testing

a.k.a. Training Smarter Monkeys

At Netflix, we have found that proactive failure testing is a great way to ensure that we have a reliable product for our members by helping us prepare our systems, and our teams, for the problems that arise in our production environment. Our various efforts in this space, some of which are manual, have helped us make it through the holiday season without incident (which is great if you’re on-call for New Year’s Eve!). But who likes manual processes? Additionally, we are only testing for the failures we anticipate, and often only for an individual service or component per exercise. We can do better!

Imagine a monkey that crawls through your code and infrastructure, injecting small failures and discovering whether they result in member pain.

While looking for a way to build such a monkey, we discovered a failure testing approach developed by Peter Alvaro called Molly. Given that we already had a failure injection service called FIT, we believed we could build a prototype implementation in short order. And we thought it would be great to see how well the concepts outlined in the Molly paper translated into a large-scale production environment. So, we got in touch with Peter to see if he was interested in working together to build a prototype. He was, and the results of our collaboration are detailed below.


“A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome.” [1]

Molly begins by looking at everything that went into a successful request and asking “What could have prevented this outcome?” Take this simplified request as an example:

(A or R or P or B)

At the start, everything is necessary — as far as we know. Symbolically we say that member pain could result from failing (A or R or P or B) where A stands for API, etc. We start by choosing randomly from the potential failure points and rerunning the request, injecting failure at the chosen point.

There are three potential outcomes:

  1. The request fails: we’ve found a member-facing failure. (From this we can prune future experiments containing this failure.)
  2. The request succeeds: the service/failure point is not critical.
  3. The request succeeds, and there is an alternative interaction that takes the place of the failure (i.e. a failover or a fallback).

In this example, we fail Ratings and the request succeeds, producing this graph:

(A or P or B) and (A or P or B or R)

We know more about this request’s behavior and update our failure equation. As Playlist is a potential failure point in this equation, we’ll fail it next, producing this graph:

(A or PF or B) and (A or P or B) and (A or P or B or R)

This illustrates #3 above. The request was still successful, but due to an alternate execution. Now we have a new failure point to explore. We update our equation to include this new information. Now we rinse, lather, and repeat until there are no more failures to explore.
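The bookkeeping in the walkthrough above can be sketched in a few lines. This is a minimal illustration, not Molly's or FIT's actual API: the `Exploration` class, its `record` method, and the representation of the failure equation as a list of sets (one set per successful execution) are all assumptions made for the example.

```python
from typing import FrozenSet, List, Optional, Set

Clause = FrozenSet[str]


class Exploration:
    """Hypothetical bookkeeping for one request; not Molly's or FIT's API."""

    def __init__(self, first_success: Set[str]) -> None:
        # Each clause holds the failure points seen in one successful run.
        self.equation: List[Clause] = [frozenset(first_success)]
        # Injected sets that produced a member-facing failure (outcome 1).
        self.known_failures: List[Clause] = []

    def record(self, injected: Set[str], observed: Optional[Set[str]]) -> None:
        """Fold one experiment in; `observed` is None when the request failed."""
        if observed is None:
            self.known_failures.append(frozenset(injected))
            return
        clause = frozenset(observed)
        if clause not in self.equation:
            # Outcomes 2 and 3: we learned another way the request can succeed.
            self.equation.append(clause)


run = Exploration({"A", "R", "P", "B"})
run.record(injected={"R"}, observed={"A", "P", "B"})   # Ratings was not critical
run.record(injected={"P"}, observed={"A", "PF", "B"})  # Playlist fell back to PF
```

After these two experiments `run.equation` holds the same three clauses as the equation above, and `run.known_failures` is still empty because neither injection caused the request to fail.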

Molly isn’t prescriptive on how to explore this search space. For our implementation we decided to compute all solutions which satisfy the failure equation, and then choose randomly from the smallest solution sets. For example, the solutions to our last representation would be: [{A}, {PF}, {B}, {P,PF}, {R,A}, {R,B} …]. We would begin by exploring all the single points of failure: A, PF, B; then proceed to all sets of size 2, and so forth.
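One way to picture "solutions which satisfy the failure equation" is as hitting sets: a set of failure points that touches every clause, and so breaks every successful execution seen so far. The sketch below is a brute-force illustration under that assumption, not the solver we used; `minimal_solutions` and `pick_next` are hypothetical names. It keeps only minimal sets, so supersets such as {R,A} are dropped, and under this strict reading PF on its own is not reported because it leaves the (A or P or B) execution intact.

```python
import random
from itertools import combinations
from typing import FrozenSet, List, Set


def minimal_solutions(clauses: List[Set[str]], max_size: int) -> List[FrozenSet[str]]:
    """Enumerate minimal hitting sets of the clauses, smallest first."""
    points = sorted(set().union(*clauses))
    found: List[FrozenSet[str]] = []
    for size in range(1, max_size + 1):
        for combo in combinations(points, size):
            candidate = frozenset(combo)
            # Skip anything that merely extends a smaller solution.
            if any(smaller <= candidate for smaller in found):
                continue
            # A solution must intersect every clause of the equation.
            if all(candidate & clause for clause in clauses):
                found.append(candidate)
    return found


def pick_next(solutions: List[FrozenSet[str]], rng: random.Random) -> FrozenSet[str]:
    """Choose at random among the smallest remaining solutions."""
    smallest = min(len(s) for s in solutions)
    return rng.choice([s for s in solutions if len(s) == smallest])


equation = [{"A", "PF", "B"}, {"A", "P", "B"}, {"A", "P", "B", "R"}]
solutions = minimal_solutions(equation, max_size=3)
next_experiment = pick_next(solutions, random.Random(0))
```

For the equation above this yields the sets {A}, {B} and {P, PF}, and `pick_next` returns one of the two single points of failure. Enumerating every combination is only workable for toy inputs like this one.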



What is the lineage of a Netflix request? We are able to leverage our tracing system to build a tree of the request execution across our microservices. Thanks to FIT, we have additional information in the form of “Injection Points”. These are key inflection points in our system where failures may occur. Injection Points include things like Hystrix command executions, cache lookups, DB queries, HTTP calls, etc. The data provided by FIT allows us to build a more complete request tree, which is what we feed into the algorithm for analysis.
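To make the idea of a request tree annotated with Injection Points concrete, here is a small data-structure sketch. The `Call` type, the service names, and the injection point labels are invented for illustration; they are not the format of our tracing system or of FIT.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Call:
    """One node of a hypothetical request tree."""
    service: str
    injection_points: List[str] = field(default_factory=list)
    children: List["Call"] = field(default_factory=list)


def failure_points(call: Call) -> Iterator[str]:
    """Walk the tree depth-first, yielding every place a failure could be injected."""
    for point in call.injection_points:
        yield f"{call.service}/{point}"
    for child in call.children:
        yield from failure_points(child)


# Illustrative tree: an API call fanning out to two services.
tree = Call("api", ["hystrix"], [
    Call("ratings", ["http", "cache"]),
    Call("playlist", ["hystrix"], [Call("playlist-store", ["db"])]),
])
points = list(failure_points(tree))
```

The flattened list of failure points is the kind of input a lineage-driven search would start from: each entry is one candidate for injection.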

In the examples above, we see simple service request trees. Here is the same request tree extended with FIT data:


What do we mean by ‘success’? What is most important is our members’ experience, so we want a measurement that reflects this. To accomplish this, we tap into our device reported metrics stream. By analyzing these metrics we can determine if the request resulted in a member-facing error.

An alternate, more simplistic approach could be to rely on the HTTP status codes for determining successful outcomes. But status codes can be misleading, as some frameworks return a ‘200’ on partial success, with a member-impacting error embedded within the payload.

Currently only a subset of Netflix requests have corresponding device reported metrics. Adding device reported metrics for more request types presents us with the opportunity to expand our automated failure testing to cover a broader set of device traffic.


Being able to replay requests made things nice and clean for Molly. We don’t have that luxury. We don’t know at the time we receive a request whether or not it is idempotent and safe to replay. To offset this, we have grouped requests into equivalence classes, such that requests within each class ‘behave’ the same: they execute the same dependent calls and fail in the same way.

To define request classes, we focused on the information we had available when we received the request: the path, the parameters (?baz=boo), and the device making the request. Our first pass was to see if a direct mapping existed between these request features and the set of dependencies executed. This didn’t pan out. Next we explored using machine learning to find and create these mappings. This seemed promising, but would require a fair amount of work to get right.

Instead, we narrowed our scope to only examine requests generated by the Falcor framework. These requests provide, through the query parameters, a set of JSON paths to load for the request, e.g. ‘videos’, ‘profiles’, ‘images’. We found that these Falcor path elements matched consistently with the internal services required to load those elements.
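A rough sketch of deriving a request class from a Falcor-style request follows. It assumes the paths arrive as a JSON array in a query parameter named `paths`, and it treats the device plus the set of leading path elements as the class key. Both the parameter name and the choice of key are assumptions for illustration, and `request_class` is a hypothetical helper.

```python
import json
from typing import Tuple
from urllib.parse import parse_qs, urlencode, urlsplit


def request_class(url: str, device: str) -> Tuple[str, Tuple[str, ...]]:
    """Map a request to a (device, leading path elements) key."""
    query = parse_qs(urlsplit(url).query)
    roots = set()
    for raw in query.get("paths", []):      # assumed parameter name
        for path in json.loads(raw):
            if not path:
                continue
            head = path[0]
            # A leading element may itself be a list of keys.
            if isinstance(head, list):
                roots.update(str(key) for key in head)
            else:
                roots.add(str(head))
    return device, tuple(sorted(roots))


paths = [["videos", 1, "title"], ["profiles", "name"]]
query_string = urlencode({"paths": json.dumps(paths), "method": "get"})
example = request_class("/api?" + query_string, device="tv")
```

Here `example` is `("tv", ("profiles", "videos"))`: two requests from the same device type that load the same top-level elements land in the same class, regardless of path order.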

Future work involves finding a more generic way to create these request class mappings so that we can expand our testing beyond Falcor requests.

These request classes change as code is written and deployed by Netflix engineers. To offset this drift, we run an analysis of potential request classes daily through a sampling of the device reported metrics stream. We expire old classes that no longer receive traffic, and we create new classes for new code paths.
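The daily refresh can be pictured as follows. This is a sketch only: `refresh_classes` is a hypothetical function, and the seven-day idle window is an assumed value, not the expiry we actually use.

```python
from datetime import date, timedelta
from typing import Dict, Hashable, Iterable


def refresh_classes(
    last_seen: Dict[Hashable, date],
    sampled_today: Iterable[Hashable],
    today: date,
    max_idle: timedelta = timedelta(days=7),  # assumed expiry window
) -> Dict[Hashable, date]:
    """Add classes seen in today's sample and drop ones that have gone quiet."""
    refreshed = dict(last_seen)
    for key in sampled_today:
        refreshed[key] = today  # creates new classes and renews existing ones
    return {key: seen for key, seen in refreshed.items() if today - seen <= max_idle}


today = date(2016, 1, 20)
known = {"boot": today - timedelta(days=1), "legacy": today - timedelta(days=10)}
current = refresh_classes(known, ["boot", "search"], today)
```

In this run `legacy` expires because it has had no traffic for ten days, `boot` is renewed, and `search` appears as a new class.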

Member Pain

Remember that the goal of this exploration is to find and fix errors before they impact a large number of members. It’s not acceptable to cause a lot of member pain while running our tests. In order to mitigate this risk, we structure our exploration so that we are only running a small number of experiments over any given period.

Each experiment is scoped to a request class and runs for a short period (twenty to thirty seconds) for a minuscule percentage of members. We want at least ten good example requests from each experiment. In order to filter out false positives, we look at the overall success rate for an experiment, only marking a failure as found if more than 75% of requests failed. Since our request class mapping isn’t perfect, we also filter out requests which, for any reason, didn’t execute the failure we intended to test.
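The filtering rules in the paragraph above translate into a short decision function. The thresholds (ten requests, 75%) come from the text; the `Sample` type and the `verdict` function are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SAMPLES = 10          # "at least ten good example requests"
FAILURE_THRESHOLD = 0.75  # "greater than 75% of requests failed"


@dataclass
class Sample:
    injected: bool  # did this request actually execute the intended failure?
    failed: bool    # did the device report a member-facing error?


def verdict(samples: List[Sample]) -> Optional[bool]:
    """Return True if a failure was found, False if not, None if inconclusive."""
    # Drop requests that never hit the failure we meant to test.
    relevant = [s for s in samples if s.injected]
    if len(relevant) < MIN_SAMPLES:
        return None
    failure_rate = sum(s.failed for s in relevant) / len(relevant)
    return failure_rate > FAILURE_THRESHOLD


found = verdict([Sample(True, True)] * 8 + [Sample(True, False)] * 2)
not_found = verdict([Sample(True, True)] * 7 + [Sample(True, False)] * 3)
inconclusive = verdict([Sample(True, True)] * 9 + [Sample(False, True)] * 5)
```

The three calls illustrate the cases: eight failures out of ten counts as a found failure, seven out of ten does not, and nine relevant requests is too few to decide even though more requests were observed.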

Let’s say we are able to run 500 experiments in a day. If we are potentially impacting 10 members each run, then the worst-case impact is 5,000 members each day. But not every experiment results in a failure; in fact, the majority of them result in success. If we only find a failure in one in ten experiments (a high estimate), then we’re actually impacting 500 member requests in a day, some of which are further mitigated by retries. When you’re serving billions of requests each day, the impact of these experiments is very small.
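The estimate above is simple enough to check directly. The figures for experiments, members per run, and failure rate are taken from the paragraph; the daily request volume is an assumption, with "billions" read conservatively as one billion.

```python
experiments_per_day = 500
members_per_experiment = 10
failing_experiments = experiments_per_day // 10  # one in ten finds a failure

# Upper bound: every experiment hurts every member it touches.
worst_case_members = experiments_per_day * members_per_experiment

# Estimate: only the experiments that find a failure cause pain.
impacted_requests = failing_experiments * members_per_experiment

# Assumption: "billions of requests each day" taken as at least one billion.
assumed_daily_requests = 1_000_000_000
share_of_traffic = impacted_requests / assumed_daily_requests
```

This gives a worst case of 5,000 members, an estimate of 500 impacted requests, and a share of daily traffic below one in a million under the stated assumption.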


We were lucky that one of the most important Netflix requests met our criteria for exploration — the ‘App Boot’ request. This request loads the metadata needed to run the Netflix application and load the initial list of videos for a member. This is a moment of truth that, as a company, we want to win by providing a reliable experience from the very start.

This is also a very complex request, touching dozens of internal services and hundreds of potential failure points. Brute force exploration of this space would take 2¹⁰⁰ iterations (roughly 1 with 30 zeros following), whereas our approach was able to explore it in ~200 experiments. We found five potential failures, one of which was a combination of failure points.

What do we do once we’ve found a failure? Well, that part is still admittedly manual. We aren’t yet at the point of automatically fixing the failure. In this case, we have a list of known failure points, along with a ‘scenario’ which allows someone to use FIT to reproduce the failure. From this we can verify the failure and decide on a fix.

We’re very excited that we were able to build this proof-of-concept implementation and find real failures using it. We hope to be able to extend it to search a larger portion of the Netflix request space and find more member-facing failures before they result in outages, all in an automated way.

And if you’re interested in failure testing and building resilient systems, get in touch with us — we’re hiring!

— Kolton Andrus (@KoltonAndrus), Ben Schmaus (@schmaus)

Originally published on January 20, 2016.