The Verification of a Distributed System
Today’s talk is from one of my favorite speakers on distributed systems, Caitie McCaffrey. During the talk, Caitie delivers a great overview of the various methods for testing and verifying distributed systems.
If one computer goes down and breaks your system — we have a really big problem.
The conference: GOTO Chicago 2016
The speaker: Caitie McCaffrey is a distributed systems engineer and conference speaker with deep experience as a practitioner building large scale systems.
We are all building distributed systems
Whether we know it or not, we’re all building distributed systems. When most people think distributed systems today, they think about large scale systems like Netflix — with 500 or more microservices. Caitie reminds us that we’re all building distributed systems. Even a monolithic application can scale horizontally — with each instance connected to a database — which could also be spread out across multiple nodes.
Whether or not your distributed system is small-scale or large-scale, the same useful methods for testing and verification are equally beneficial.
How do we increase our confidence that our system is doing the right thing?
Safety and liveness
When testing distributed systems, you’re typically checking the validity of certain properties of a system. All such properties can be expressed as the intersection between safety and liveness.
Safety properties assert that something bad will never happen. These statements are expressed as a promise or guarantee by a system — for example, what a database system will not do under a certain condition.
Null values — a database system will never return fields with null values.
Liveness properties assert that something will eventually happen but the system can still make progress. Eventual consistency is an example of a liveness property.
Eventual consistency — if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
There are multiple forms of testing that can help to build confidence that a system will behave as expected under failure modes. Formal verification speaks to forms of testing that proves that a system is correct. Caitie stresses that this kind of verification is the gold standard, since it tests systems for provable correctness.
It’s a good idea to understand a system before building it, so it’s a good idea to write a specification of a system before implementing it.
— Leslie Lamport
TLA+ is a language for formal specification that was first developed by Leslie Lamport. TLA+ is interesting because it allows you to express a formal specification as a range of input states. In this way, you’re able to describe the safety guarantees of a system as a provable specification.
Caitie uses the example above—also available from Wikipedia — as a simple example of a TLA+ specification. Here we provide a range of inputs as
Init, which can be either
1. We then define the value of
Tick — which describes the next-state relation for the range of the two inputs. Finally, we have
Spec, which is the formal specification that describes the safety properties of the system.
Formal methods at Amazon
In 2014 Amazon published a technical report about using formal methods of verification for services deployed on AWS. Caitie introduces the paper as an example of TLA+ being used as a tool in industry and not just by academics. I really enjoyed reading this paper, as it shows a certain passion by the authors for embracing the use of formal specifications to find bugs that would not normally be found using less formal methods of testing.
Formal methods deal with models of systems, not the systems themselves.
Caitie stresses that while Amazon describes being successful in unearthing 2 devious bugs using TLA+, it’s important to remember that TLA+ only tests the system design and has no knowledge of the actual code used to implement it. Coq is another approach that Caitie mentions, which is able to generate code from a formal specification.
You could have a totally correct specification but your code that you write can still be wrong.
Distributed systems testing in the wild
There are many non-formal methods of testing that can build confidence that a distributed system will behave as expected when encountering a failure.
Unit tests are a common method of testing, and are implemented by the developer. Developers can run these tests locally or in a remote CI environment. For testing distributed systems, it’s important that unit tests focus on increasing confidence that their code is doing the right thing.
Integration tests are a common method of testing that checks for expectations at the boundaries of individual components of the system. For distributed systems, integration testing is going to focus on testing for behavior between applications communicating across the boundary of the network.
Integration testing is useful because it is testing the breakdown between your interfaces.
For testing distributed systems, Caitie stresses that simple testing can prevent most failures.
Three nodes or less can reproduce 98% of failures.
Caitie introduces another paper that is well worth studying: Simple Testing Can Prevent Most Critical Failures. The paper analyzes the common causes of failures in both simple and complex systems. One of the myths that the paper debunks is that you need a full replica of a production environment to accurately reproduce a majority of failures. The paper shows that you can reproduce 98% of failures in a test environment with only three nodes or less!
35% of catastrophic failures are caused by very basic things.
Caitie suggests that you use a code coverage tool to understand where there are major gaps in testing. Some of these gaps may have a higher probability of encountering an unexpected error that results in a catastrophic failure.
Testing error handling code could have prevented 58% of catastrophic failures.
If you’re testing just the golden path then you’re not paying attention to basic parts of code that end up causing a majority of catastrophic failures in production. Caitie describes this as straight up laziness. Using an analysis tool to understand which parts of your untested code are more likely to cause catastrophic failures will save a lot of pain later on — for you and your users.
Property-based testing is a testing method that uses a specification to assert that the behavior of a property should hold for a range of data points. Traditional tests—by contrast — will verify only behavior that is based on specific data points that are checked by a test.
You’re going to write properties about your system that you want to hold, and then [the property based test] is going to execute randomly over that state space.
The general idea is that you define a specification instead of a test case. You then use a tool to generate the many inputs required to test the scope defined by the specification. Caitie mentions that a really nice feature of these tests is that if one of the inputs failed for a property-based test that the tool will actually tell you the use case that caused the failure. Now let’s say that you didn’t use this kind of testing and you ended up encountering the failure in production. Since it was a catastrophic failure, you’d have to start debugging the issue by analyzing loads of logs to discover the range of inputs that caused the failure.
Property-based testing is low investment with high reward.
Caitie mentions that this lesser known method of testing is used successfully by a variety of different companies who are building distributed systems — Basho, the creators of Riak, is one example.
There are also some methods of testing that are used to force running systems to fail — sometimes even in production. These methods are associated with a technique that Netflix calls Fault Injection Testing.
If you don’t force a system to fail, all the theories in the world won’t give you any confidence that it will operate correctly in failure modes.
Caitie provides a few examples of frameworks and processes that use fault injection.
Netflix Simian Army
Netflix announced on their blog in 2011 that they were using a collection of tools they called the Netflix Simian Army. Each tool in the collection injects a certain type of failure into Netflix’s production system.
These are a few of the tools from Simian Army that Caitie mentions in the talk:
Chaos Monkey — a tool that randomly disables a production instance to test for common types of failure without any customer impact.
Latency Monkey — a tool that induces artificial delays in RESTful client-server communication to simulate service degradation and measures if upstream services respond appropriately.
Chaos Gorilla — a tool that simulates an outage of an entire Amazon availability zone.
Caitie cautions that these tools require a ton of investment to get rolling with.
Jepsen is a tool written by Kyle Kingsbury that simulates network partitions and then checks for whether or not your consistency guarantees will hold. Kyle has used the tool to test a variety of popular NoSQL databases, which he writes about on his blog.
Testing in production
Caitie mentions that testing in production should be used as a final resort. Just because you’ve read about Netflix being successful with a tool, does not mean you can skip writing critical tests that fault injection tools will end up revealing anyways.
There are a lot of test cases that don’t need to be run in production to prove something bad will happen.
There is no difference between the impact to users caused by fault injection and the impact to users caused by users. Caitie stresses that testing in production should always be a final step.
Monitoring is not testing
Monitoring is super critical but it’s not testing. If you have a graph that tells you there is a failure then maybe you’ll begin to look for the cause of the failure. Caitie says the key thing monitoring does is that it tells you something is wrong — not always the cause.
Canaries are a method for verification in production. Canaries gradually introduce new code in production which reduces the risk of doing new deploys. Caitie mentions that canaries have limitations.
Canaries usually just tell you the golden path is working.
Canaries can tell you that the system is working as well as it was before — but you usually don’t know if the canary is as good as the old version. Caitie says it is important to understand the limitations of canaries as it applies to your system.
Lineage-driven fault injection
Lineage-driven fault injection explores the state space of all inputs and failures. Caitie mentions that this method only explores for the inputs and states that matter. This method of fault injection will perform an operation and then look at the call graph to understand what happened. The tool will start injecting failure along that path that is defined by the call graph, which makes this a provably correct method of verification. Also, Caitie says that a benefit of this approach is that it will run through the state space of failures in a smaller amount of time than model checkers can.
This was the second write-up of a conference talk that I’ve done for this publication. Even after doing two of these, I can honestly say that I’ve realized how tough it is to appreciate the level of effort in preparation that a speaker brings into a conference talk until you spend time studying each part of it. This brilliant talk by Caitie is only 37 minutes long. That’s a shockingly brief period of time to cover all this wonderful content with such depth and clarity. Kudos to Caitie.
I highly recommend checking out Caitie’s other talks that are listed on her GitHub — which also provides links for each reference she mentions in her talks.