Chaos Engineering: A Review of Lineage Driven Failure Injection (LDFI)

This is part of the Chaos Engineering series of articles

https://medium.com/becloudy/chaos-engineering-surviving-the-failures-in-distributed-systems-5688c6905dbb

According to Netflix engineers, “there are many unknown real-production scenarios in which a failure recovery might not work”. Amazon has successfully used “GameDay” exercises, which inject real failures such as VM failures and power outages into production systems. Netflix has employed Chaos Monkey to randomly terminate instances and Latency Monkey to inject response delays into services.

Frequently, a major outage starts with a root event whose recovery procedure fails. For example, in 2009 a routine service upgrade took a few servers offline; bad routing then overloaded the routing servers and caused a Gmail outage. In this case, the recovery procedure failed to cover an edge case in the routing code.

Greedy recovery: instead of sacrificing some availability, recovery procedures try to maintain high availability, which can itself lead to outages. Failure drills can capture such recovery problems. However, it is almost impossible to test recovery for all possible scenarios, especially in large-scale deployments.

Injecting failures at various granularities exposes weaknesses before they turn into an inevitable disaster, and the practice has gained significant focus, especially in the cloud era. PagerDuty runs failure injection experiments under a theme called Failure Friday and cites the following benefits:

  • Uncover issues that could reduce resiliency
  • Discover infrastructure deficiencies
  • Ops and Development teams get to work together to build a strong team culture
  • Running experiments in production during the daytime helps teams walk through issues while a significant number of engineers are in the office, and helps spread knowledge to newer team members
  • A reminder that failures are inevitable

Validating the impact of these failures is risky, but resiliency remains critical. Environments below production do not expose all failure conditions. Some failures occur only above a certain load or with certain user inputs, which makes it almost impossible to test all potential failures in a pre-production system, however well it mimics the production system in scale.

Failure injection techniques have produced successful outcomes in large organizations like Amazon and Netflix. They are not yet widely adopted by smaller organizations, as the discipline is still evolving slowly while cloud adoption advances aggressively.

In the following sections, we will review a research study on Failure Injection Testing done by the University of California, Berkeley in collaboration with Netflix. Using the study as a starting point, it will be possible to develop failure injection strategies for various kinds of applications.

Lineage Driven Failure Injection

This section summarizes the research paper by Peter Alvaro, Kolton Andrus, Chris Sanden, Casey Rosenthal, Ali Basiri, and Lorin Hochstein, which is based on a research prototype called LDFI (Lineage Driven Failure Injection).

At the scale at which companies like Netflix and other large enterprises operate, some fault-tolerance code may not be adequately tested, and some error conditions may appear only when running at large scale. Chaos Engineering evolved to experiment with live traffic and build confidence in how systems withstand turbulent conditions. Chaos engineers inject failures into live systems in a controlled fashion.

The LDFI prototype system is called Molly. Molly’s input is a distributed program written in Dedalus, a specification language based on Datalog. It also takes a correctness specification, program inputs, and an execution length. It simulates the program’s execution under a variety of faults and terminates under one of two conditions:

  • A violation of the invariants in the specification is found. Molly returns the details of the error trace and the faults that drove the system into instability.
  • Molly completes the execution without discovering an invariant violation. In this case, the system is certified free of fault-tolerance bugs within the bounds of the execution length and program inputs.

LDFI builds on two insights:

  • Fault tolerance is redundancy: redundancy provides alternate paths when error conditions occur, and that is how fault tolerance is achieved. If we identify the fault scenarios that expose missing redundant paths, we can add redundancy where it is lacking.
  • Navigate backward: instead of searching for bugs from scratch, start from a successful outcome and work backward to identify the combinations of faults that could have prevented it.

Diagrams Courtesy of the paper

Sample Use-Case

Lineage

In lineage analysis, we traverse the path to success and identify the events that contribute to it. Let’s say we want to store data in durable storage, with 2 replicas and 2 broadcasts. That gives 4 events and 2^4 = 16 possible combinations of failures. However, some of these combinations are uninteresting. For example, if broadcast1 and repA both fail, success is still assured through broadcast2 and repB.
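To make the backward traversal concrete, here is a minimal sketch in Python. The graph layout, the node name "stored", and the helper names are assumptions made for illustration; they are not Molly's data structures. It assumes either broadcast can reach either replica, walks backward from the successful outcome, and collects every set of events that is sufficient to produce it.

```python
from typing import Dict, FrozenSet, List

# Hypothetical lineage graph for the durable-storage example. Each derived
# fact maps to its alternative derivations; each derivation lists the events
# it needs. Events that do not appear as keys are leaves.
LINEAGE: Dict[str, List[List[str]]] = {
    "stored": [["RepA"], ["RepB"]],    # either replica makes the data durable
    "RepA": [["Bcast1"], ["Bcast2"]],  # either broadcast can reach RepA
    "RepB": [["Bcast1"], ["Bcast2"]],  # either broadcast can reach RepB
}


def supports(node: str, graph: Dict[str, List[List[str]]]) -> List[FrozenSet[str]]:
    """Walk backward from `node` and return every event set that yields it."""
    if node not in graph:
        return [frozenset([node])]
    found: List[FrozenSet[str]] = []
    for derivation in graph[node]:
        # Every dependency of one derivation is required, so combine them.
        partial: List[FrozenSet[str]] = [frozenset()]
        for dependency in derivation:
            partial = [p | s for p in partial for s in supports(dependency, graph)]
        found.extend(p | {node} for p in partial)
    return found


def paths_to_success(goal: str, graph: Dict[str, List[List[str]]]) -> List[FrozenSet[str]]:
    """Return the failable events on each path, leaving out the goal itself."""
    return [s - {goal} for s in supports(goal, graph)]


PATHS = paths_to_success("stored", LINEAGE)
for path in PATHS:
    print(sorted(path))
```

Under these assumptions the traversal yields four paths, one per broadcast and replica pair. Each path is a separate way of reaching the outcome, so a fault set has to break all of them to prevent success.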

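We try to find a Boolean formula whose solutions invalidate all alternative computations.

For example, each path to a successful outcome depends on one broadcast and one replica, so a path is broken when either of them fails. To invalidate the outcome, every path must be broken:

(Bcast1 OR RepA) AND (Bcast2 OR RepA) AND (Bcast1 OR RepB) AND (Bcast2 OR RepB)

The formula is a conjunction of disjunctions of failures, that is, it is in conjunctive normal form (CNF). Failing both broadcasts, or failing both replicas, satisfies it.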
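The search over failure combinations can be sketched by brute force. The Python below is only an illustration: it assumes one clause per path to success (a broadcast plus a replica, broken when either fails), and it enumerates all 16 combinations instead of calling a real solver. The names EVENTS, CLAUSES, and the helper functions are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

EVENTS = ["Bcast1", "Bcast2", "RepA", "RepB"]

# Assumption: one clause per path to success. A clause is satisfied (the
# path is broken) when at least one of its events is in the fault set.
CLAUSES: List[Set[str]] = [
    {"Bcast1", "RepA"},
    {"Bcast2", "RepA"},
    {"Bcast1", "RepB"},
    {"Bcast2", "RepB"},
]


def breaks_all_paths(faults: Set[str], clauses: List[Set[str]]) -> bool:
    """Evaluate the CNF: every clause must contain at least one fault."""
    return all(faults & clause for clause in clauses)


def satisfying_fault_sets(events: List[str], clauses: List[Set[str]]) -> List[FrozenSet[str]]:
    """Try every subset of events (2^n of them) and keep the ones that satisfy the CNF."""
    hits = []
    for size in range(len(events) + 1):
        for combo in combinations(events, size):
            if breaks_all_paths(set(combo), clauses):
                hits.append(frozenset(combo))
    return hits


def minimal_sets(sets: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Drop any fault set that strictly contains another satisfying set."""
    return [s for s in sets if not any(other < s for other in sets)]


ALL_HITS = satisfying_fault_sets(EVENTS, CLAUSES)
MINIMAL_HITS = minimal_sets(ALL_HITS)
for faults in MINIMAL_HITS:
    print(sorted(faults))
```

Only the minimal sets are worth injecting, because any larger set that contains one of them adds no new information. Brute force is fine for four events; at realistic sizes the formula would be handed to a SAT solver.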

When we inject the failures from a solution, there are two possible outcomes, as discussed for Molly above:

  • The system fails to produce the expected result, which exposes a fault-tolerance bug.
  • The system succeeds, which reveals an alternative strategy for reaching the expected outcome. A new formula should be extracted and solved.

The process repeats: build the lineage graph, encode it as a Boolean formula (CNF), solve the formula to obtain a failure hypothesis, and inject it, until a bug is found or all hypotheses are exhausted.
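The loop can be sketched end to end against a toy system. Everything here is an assumption made for illustration: run_system stands in for a real execution and reports the lineage of a successful run, next_hypothesis is a brute-force stand-in for the solver, and max_faults is a hypothetical budget on how many events may fail together.

```python
from itertools import combinations
from typing import FrozenSet, List, Optional

EVENTS = ["Bcast1", "Bcast2", "RepA", "RepB"]

# Toy system under test: success needs one intact (broadcast, replica) path.
PATHS = [
    frozenset(p)
    for p in (("Bcast1", "RepA"), ("Bcast2", "RepA"),
              ("Bcast1", "RepB"), ("Bcast2", "RepB"))
]


def run_system(faults: FrozenSet[str]) -> Optional[FrozenSet[str]]:
    """Return the lineage of the successful run, or None if the outcome is lost."""
    for path in PATHS:
        if not (path & faults):
            return path
    return None


def next_hypothesis(clauses: List[FrozenSet[str]], events: List[str],
                    max_faults: int) -> Optional[FrozenSet[str]]:
    """Smallest fault set that breaks every lineage observed so far."""
    for size in range(1, max_faults + 1):
        for combo in combinations(events, size):
            faults = frozenset(combo)
            if all(faults & clause for clause in clauses):
                return faults
    return None


def ldfi(events: List[str], max_faults: int) -> Optional[FrozenSet[str]]:
    """Return a fault set that defeats the system, or None if none is found."""
    lineage = run_system(frozenset())
    if lineage is None:
        raise ValueError("the fault-free run must succeed")
    clauses = [lineage]
    while True:
        hypothesis = next_hypothesis(clauses, events, max_faults)
        if hypothesis is None:
            return None            # hypotheses exhausted within the budget
        lineage = run_system(hypothesis)
        if lineage is None:
            return hypothesis      # outcome lost: a fault-tolerance bug
        # The run survived through a path that avoids every injected fault,
        # so the new clause rules this hypothesis out on the next round.
        clauses.append(lineage)


print(ldfi(EVENTS, max_faults=2))
print(ldfi(EVENTS, max_faults=1))
```

With a budget of two simultaneous faults, the loop learns three alternative paths before it proposes failing both broadcasts, which defeats the toy system. With a budget of one, the hypotheses run out and the system is considered safe within that bound.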

Failure Injection Testing (FIT) @ Netflix

FIT is a Netflix platform for precisely controlling which component fails and which users are impacted. It allows failures to propagate across microservices in a consistent and controlled manner. Let’s review some key terminology and the context diagram.

Failure Scope: Defines the potential impact of the failure. It can be limited to a customer, a service, or another attribute, and it controls the blast radius. The FIT server pushes failure simulation metadata to Zuul, a proxy server. Requests matching the failure scope are decorated with the metadata, which can specify, for example, added latency on a service call or the failure of a remote service call.
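The scope check at the proxy can be sketched as follows. This is not the FIT or Zuul API: the class names, the fields, and the choice of a customer ID as the scope attribute are all hypothetical and serve only to show requests being decorated when they fall inside the blast radius.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class FailureScenario:
    """Hypothetical failure simulation metadata pushed to the proxy."""
    customer_ids: FrozenSet[str]   # failure scope: who is affected
    injection_point: str           # which layer should misbehave
    action: str                    # "latency" or "fail"
    delay_ms: int = 0


@dataclass
class Request:
    """Hypothetical request as seen by the proxy."""
    customer_id: str
    path: str
    fit_context: List[FailureScenario] = field(default_factory=list)


def decorate(request: Request, scenarios: List[FailureScenario]) -> Request:
    """Attach metadata only to requests that match the failure scope."""
    for scenario in scenarios:
        if request.customer_id in scenario.customer_ids:
            request.fit_context.append(scenario)
    return request


scenario = FailureScenario(frozenset({"cust-42"}), "Ribbon", "latency", 500)
in_scope = decorate(Request("cust-42", "/play"), [scenario])
out_of_scope = decorate(Request("cust-7", "/play"), [scenario])
print(len(in_scope.fit_context), len(out_of_scope.fit_context))
```

Keeping the scope on the scenario means the blast radius can be widened or narrowed without touching the services that implement the failure.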

Injection Points: Injection Points provide the hook to inject failures. In the above diagram, there are four Injection Points as described below.

  • Hystrix: Isolate failures and define fallbacks
  • Ribbon: Communication layer to remote service
  • EVCache: Provides access to cached data from Memcached
  • Astyanax: Java Client for durable storage in Cassandra

Each of these layers interfaces with the FIT context to determine whether a given request should be impacted. The failure behavior is implemented in the respective layer, for example sleeping for 500 ms or throwing an exception.
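A minimal sketch of an injection point, assuming the FIT context arrives as a plain dictionary keyed by injection point name. The function names, the dictionary shape, and the InjectedFailure exception are illustrative assumptions and do not reflect the real Hystrix, Ribbon, or EVCache interfaces.

```python
import time
from typing import Callable, Dict


class InjectedFailure(Exception):
    """Raised when the FIT context asks this layer to fail (hypothetical)."""


def call_through(point: str, fit_context: Dict[str, dict],
                 remote_call: Callable[[], str],
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Consult the context for this injection point, then act on it."""
    rule = fit_context.get(point)
    if rule is None:
        return remote_call()                 # request is not impacted
    if rule["action"] == "fail":
        raise InjectedFailure(f"injected failure at {point}")
    if rule["action"] == "latency":
        sleep(rule["delay_ms"] / 1000.0)     # e.g. sleep for 500 ms
        return remote_call()
    raise ValueError(f"unknown action: {rule['action']}")


def with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Isolate the failure and fall back, in the role the text gives Hystrix."""
    try:
        return primary()
    except InjectedFailure:
        return fallback()


context = {"EVCache": {"action": "fail"}}
result = with_fallback(
    lambda: call_through("EVCache", context, lambda: "cached value"),
    lambda: "value from durable storage",
)
print(result)
```

The sleep function is passed in as a parameter so the latency path can be exercised in tests without actually waiting.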

Failure Scenarios: FIT needs to know what to fail. Netflix used a tracing system to record all the failure injection points along a request’s path.

Challenges

  • The choice of the Dedalus programming language posed challenges for porting and third-party integrations
  • Call graph vs. lineage graph accuracy: a trade-off has to be made. Call graphs provide a simple view of all services in the request path, while lineage graphs capture the alternative combinations of computation steps and data nodes. Some accuracy may need to be sacrificed to improve the efficiency of failure testing.
  • Defining a successful outcome can be challenging. In a heterogeneous ecosystem, a successful outcome can be anything from a record in a table to an HTTP return code. Some applications return an HTTP 200 code even when the client gets an error.
  • The dynamic nature of services, with user updates and releases, makes it difficult to replay scenarios

Solutions

  • Measuring Success: Capture the right metrics to assess whether the requested outcome was achieved. Sometimes the failure itself impairs metric reporting; that case is treated as a bug in reporting. Consistency is key, hence a rule such as: an induced failure counts as a valid finding only if more than 75% of the affected requests fail. This helps eliminate false positives.
  • Request Classes: From the potentially infinite set of user requests, we have to arrive at a finite set of traces
  • Learning Mappings: The call graphs may not show all possibilities. A fallback service might be invoked and might still give a satisfactory outcome.
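The consistency rule under Measuring Success can be written as a small check. This sketch assumes each request under the induced failure is recorded as a boolean (True if it failed) and that the threshold is strict, matching "> 75%". The function name and the input shape are hypothetical.

```python
from typing import Sequence


def is_valid_finding(request_failed: Sequence[bool], threshold: float = 0.75) -> bool:
    """Accept a finding only if the induced failure breaks requests consistently."""
    if not request_failed:
        return False               # no observations, nothing to conclude
    failure_rate = sum(request_failed) / len(request_failed)
    return failure_rate > threshold


print(is_valid_finding([True] * 8 + [False] * 2))
print(is_valid_finding([True] * 3 + [False]))
```

Under these assumptions, 8 failures out of 10 requests pass the rule, while exactly 75% does not, which keeps intermittent noise from being reported as a bug.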

Conclusion from the Paper

Running LDFI in production will continue to uncover new bugs in releases, as well as existing bugs lurking in request classes that have not yet been explored. It is often difficult to map an idealized model onto production systems in the real world. The LDFI approach successfully approximates successful outcomes, lineage, and replay over real-world services and data structures. It showed that a research prototype can be pushed to production usage.

Closing thoughts

Failure injection can be done at multiple levels. Injecting failures at the infrastructure level, at the application level, and into dependencies is key to these experiments. Failure injection has to simplify real-world scenarios in order to model successful outcomes and identify the failure scenarios that break them. As we improve the system by addressing the gaps, additional models will have to evolve in a closed-loop process until we are convinced that the system will survive failures within the boundaries of the experimental criteria.

I encourage you to review the exhaustive references and apply the knowledge.

Key References

Technical blogs

Research Papers

Failure Testing Frameworks

Communities

Distributed Tracing

Books
