Detecting Bugs in Data Infrastructure using Formal Methods (TLA+ Series Part 1)

Published in

Splunk-MaaS

14 min readJun 3, 2021

How the Splunk Messaging as a Service team uses formal methods to improve system reliability and resilience.

Much of Splunk’s data infrastructure is built on the data streaming platform of Apache Pulsar and its storage layer Apache BookKeeper. Together they offer resilient, highly scalable and low latency event streaming and messaging capabilities.

Resiliency is a system’s ability to maintain a certain quality of service and data integrity even under adverse conditions. Servers and disks can fail, availability zones and racks can go offline and networks can suffer degradation and connectivity issues. A resilient system is one that can continue to function and not lose data while these kinds of problems occur.

At Splunk, high availability and integrity of data are of paramount importance and therefore significant effort is employed to ensure these two qualities are met. One such effort is in the form of formal verification. Formal methods have emerged in recent years as an additional tool for builders of distributed data systems to prove the correctness of their systems.

The basic premise of formal verification is to formalise a system or algorithm into a mathematical model which makes it possible for verifiers and model checkers to find bugs. One such technology that has gained popularity within the distributed data systems world is TLA+.

In a series of posts we’ll introduce formal methods, specifically focusing on TLA+ and cover how we have employed it to improve the correctness and robustness of our services which rely on Apache BookKeeper as their storage layer. Be warned, this is a highly technical series about distributed systems and is not light reading!

Introducing Formal Methods

Testing is the bedrock of any quality assurance process and both Pulsar and BookKeeper have a large number of automated tests from unit, integration, chaos and performance tests. But tests have their limits and cannot discover every failure mode of a system. Just like with security, a layered model is required in order to reach the best level of protection.

Formal methods are one such additional layer, usually applied to distributed system protocols or algorithm designs but now starting to make headway in the security space including a recent study that found problems in the Bluetooth protocol.

When we design a system, we specify what it should do but also define certain properties that are important. We usually care very much about these properties. Does it lose data? Will it always remain available if one node is down? Is data always guaranteed to be stored in temporal order? Will the system respond to a request within a certain time period?

With formal methods we break things down into a formal model of the system and the properties it should adhere to. Engineers with experience of property based testing already know how valuable it is to use properties for verification.

Bugs are found by identifying behaviour that violates a property. There are two main methods of doing this:

Constructing mathematical proofs
Using a model checker to do a brute force search of all the possible interactions and flagging any that lead to a property violation. This is called bounded model checking.

Constructing proofs is hard work and takes a considerable amount more effort than using bounded model checking. Proofs may require a dedicated formal verification expert or even a team, whereas model checking is far simpler and regular engineers are able to get value from it. Model checking does have its downsides though, primarily being that brute force search can be slow.

There are many formal verification technologies available but the one most heavily used by engineers of distributed data systems and cloud companies such as AWS is the TLA+ specification language and its bounded model checker TLC. The first major adopter of TLA+ was AWS and the engineers that led that effort wrote about their usage of formal methods in a widely read paper. Since then, other cloud providers such as Azure and many large open source data systems such as MongoDB, Elasticsearch and others have formally verified parts of their systems.

At Splunk we have employed TLA+ to find bugs in Apache BookKeeper and show that proposed fixes are correct.

A Short Guide to TLA+

As we covered earlier, we typically create a formal model of the system being verified and the properties that should hold.

TLA+ is a specification language which we use to write the formal model and describe the properties. This is called a specification. The specification contains the “what the system should do” and the properties.

What the System Should Do

When we model a system in TLA+ we are modelling its design, not its implementation. We model what it should do, not how it should do it. Modelling is partly about learning more about the proposed system and about sharing the design with others, so clarity and simplicity is important. We don’t model every minute detail of a system but rather choose a suitable level of abstraction. Many aspects of a system can be considered optimizations or implementation details and can be left out of our specifications.

The “what the system should do” is represented by variables, constants and actions.

Constants describe aspects of the system that do not change. This could be the actors in the system, or some threshold
Variables describe the mutable state of the system, such as its data or the current status of the different actors (a server being up or down)
Actions modify the system (mutating the variables), for example, a message is sent or received, data is written or deleted.

States, Steps and Time

In TLA+ a state is a mapping of values to variables.

A 24 hour clock can be modelled by the variables hour, minute and second. One state s1, could be the mapping of values to variables: hour=10, minute=34, second=2 while another state s2 could be: hour=23, minute=12, second=56.

This clock specification would have 86400 possible unique states as that is the number of seconds in a day.

A step is the transition from one state to another. If the clock ticks 1 second, then we would transition from s1 to s2, where s2 is one second ahead of s1.

Time is not continuous but is instead represented by discrete units, the step, where each step is the transition from one state to another.

s1 -> s2 -> s3 -> s4 …..sN

We describe the initial state or even multiple possible states that the system can start in. The actions that describe the “what the system should do” will then transition the system into subsequent states.

Properties: Safety and Liveness

We classify our properties into two types: safety and liveness.

A safety property describes something bad that must never happen. The most widely used type of safety property is an invariant. An invariant is a property that must always be true in every reachable state. An example is data loss. The invariant may declare that every item placed into a bag must forever be found in that bag. But if there were to be a hole in that bag, an item may fall out, which would immediately violate that invariant.

Invariants cover individual states only, but we may also want a safety property that describes an illegal transition between states. For example:

a plastic bag must never go from FULL to RECYCLED, it must first be in the EMPTY state.
The clock in state s+1 must be one second ahead of state s.

A liveness property describes something good that must eventually happen. For example, if Bob sends Alice a postcard, Alice must eventually receive it. Or, if I keep placing items in a bag, the bag must eventually become full. If however, the postcard is dropped when the postman empties the post box, then the liveness property is violated as Alice can never receive it. If there is a hole in the bag, then items may fall out at the same rate they get added, causing the bag to never fill — violating that property.

Model Checking

TLA+ comes with both a bounded model checker called TLC and a proof assistant called TLAPS. With our work on Apache BookKeeper, we have focused on model checking as it provides us the best ROI, given that we do not have a full-time formal methods team.

The downside of model checking is that there are limitations to what it can achieve. TLC performs a brute force breadth first search (BFS) of all the possible reachable states. Some specifications can have huge or even infinite state spaces which can make brute force model checking unviable. Limiting the size of the state space is often necessary.

An Example: Two counters

We have two counters and in each step, one of them increments its value by 1. If we set no upper limit on the counter values the state space is infinite, but if we limit the counters to staying equal to or lower than 2 we get the following possible states and state transitions:

Fig: State transitions for two counters, up to the value of 2

Each state is a unique combination of all the variable values that occur at the same time and in our two counters example there are only 9 possible states.

There are also a number of different paths through this tree of possible unique states.

Fig 2: Six different paths through the state space

This particular system has 6 possible paths through the states. In TLA+, these paths are called Behaviours.

Below is the specification. It consists of two constants (set from the model checker):

C is the set {c1,c2}
Limit is set to 2

It has a single variable, counters, which a programmer can think of as a map or dictionary, where the key is the counter id (c1 or c2) and the value is the counter value. In TLA+ this is known as a FUNCTION and the keys are known as the DOMAIN.

In the initial state both counters are set to 0. In the next state we non-deterministically choose a counter using the existential quantifier (there exists \E) and check if its value is below the limit. If so, we set the counters variable in the next state, to be the same as the current state except for this counter, which we increment by 1.

This will look alien to most programmers, there are a number of things that are strange:

the use of the /\symbol which represents AND (&& in many languages)
the use of the ` symbol, known as prime. A variable with the prime symbol is a primed variable and represents the variable in the next state. Each state is immutable, the variable values cannot be changed. We can only assign new values to a variable in the next state and we do that by using the primed form of the variable and the = operator.
EXCEPT is a shorthand for not having to explicitly rebuild the whole function in order to set the primed variable. We can say the function is the same, except for one part of it.

The alternative to EXCEPT would be to explicitly define a new function with the same domain and values but increment the value only of the chosen counter c:

counters’ = [ctr \in C |-> IF ctr = c THEN counters[c] + 1 ELSE counters[c]]

The Init formula sets the initial state. The model checker then explores the state space by repeatedly executing the Next formula, until it has explored all 9 states.

An Example: A Clock

Our clock specification has three variables: hour, minute and second.

Fig 3: State transition for our clock specification

The initial state is our starting point, where we set initial values of our variables. We set the clock to start at 00:00:00 as its initial state.

The next state formula describes the possible actions that could occur in any given state. An action is a step that takes us from the current state to some next state.

Our Clock specification could look like this.

In most specifications, the Next formula will contain multiple actions, but our clock only does one thing, it ticks. Therefore the next state formula has only a single action called Tick. The Tick action takes the time composed of the hour, minute and second variables and increments the time by one second.

When we give this specification to the model checker, it will assign the values as described in the Init state formula.

Limiting the State Space

The clock naturally has a finite state space, but the TwoCounters specification has an infinite state space without a counter limit. We could make the clock have an infinite state space by adding the date as a variable and setting no limit. The model checker would try to run it forever. In contrast, with only the hh:mm:ss parts, there are only a limited number of seconds in a day. There would be only 86,400 possible unique states (one for each second in a day). If we included the date, all we would need to do is tell TLC to stop exploring when a certain number of days have been explored. Wall clock time is not typically part of a formal specification though there may be some virtual clock involved.

With our BookKeeper specification, we also do not have a concept of wall clock time but the state space is still huge so we limit the state space by restricting the number of events that can be written to a cluster.

States Form a Graph

We saw that the state space for our Two Counters example formed a graph where the vertices are the states and the edges are the steps.

Our simple Clock specification has an initial state of hour=0, minute=0 and second=0. That state has only a single next state of hour=0, minute=0 and second=1, and so do all the other states. This forms a graph that is a simple chain where the number of vertices is 86400 and the number of edges is 86399.

Fig 4: Clock graph is a chain as each state has only one next state

On my MacBook Pro, TLC took 6 seconds to find all 86400 unique states.

However, if we changed the specification to allow for the display of the clock to pause for any amount of time, then the clock would appear to jump. In that version, a valid step might be to go from 10:37:00 to 10:37:02 or even from 00:00:00 to 23:59:59.

Such a graph might look more like:

Fig 5: Jumpy clock graph, same number of states but many more possible transitions

In fact, this simple change to our clock specification would inflate the number of edges (steps) from 86399 to 3,732,436,800. But note that the number of vertices (states) is the same at 86400 (there are still only that number of seconds in a day). Now it takes 30 minutes to explore the state space. Although the number of states remained the same, the number of steps was several orders of magnitude higher.

Our JumpyClock specification below uses non-determinism like the TwoCounters specification in order to choose the next values of hour, minute and second. The if statements have been replaced by existential quantifiers (there exists \E) which non-deterministically select a value from the specified range. TLC will explore every possible value of each existential quantifier in its brute-force BFS.

Most specifications of distributed systems are run with state spaces that reach billions or even trillions of states potentially taking days to execute and this is true of our BookKeeper specification.

The existential quantifiers translate to: there exists an hour between 0 and 23, and there exists a minute between 0 and 59, and there exists a second between 0 and 59 SUCH THAT they form a time point which is in the future compared to the time of the current state. Then set the hour, minute and second variables in the next state to those values.

Creating Properties

For our Two Counters example we could have the following properties:

All counter values are non-negative — safety (invariant)
All counter values eventually reach the limit — liveness

We can define those as follows:

AlwaysNonNegative == \A c \in C : counter[c] > -1AllAtLimit == \A c \in C : counter[c] = LimitEventuallyAllAtLimit == <>[]AllAtLimit

AlwaysNonNegative declares that for all counters, their value is above -1. We instruct TLC that AlwaysNonNegative is an invariant and it checks that it is true in every reachable state.

EventuallyAllAtLimit uses the eventually symbol <> and the always symbol [] to declare that eventually all counters will reach the limit and stay there forever. We instruct TLC that EventuallyAllAtLimit is a temporal operator and it ensures that every behaviour conforms to it.

Liveness relates to behaviours whereas invariants relate to individual states. We cannot inspect any given state and say that it does not conform to our liveness constraint, because there could always be some subsequent sequence of states that lead to conformance.

TLC evaluates all the enabled invariants in each step and halts execution when an invariant has been violated and also prints out the steps (known as an error trace) that led to this violation. Likewise, when TLC reaches the end of a behaviour and the behaviour does not conform to a liveness constraint, it halts and prints out an error trace.

Where Do Formal Methods Fit In?

The test pyramid with unit testing at the bottom and end-to-end testing is a well known representation of a layered test strategy where unit testing forms the wide base and end-to-end testing forms the much smaller peak. In this post we’ll avoid the use of a pyramid as there is no one-best ratio of the different types of test. Instead we’ll present the techniques as layers where lower layers form the foundations of the higher layers.

In this presentation we put formal methods at the very bottom because we use them to verify our designs and protocols. The engineering process of very complex systems such as Apache Pulsar and BookKeeper are very design heavy. Building these systems is not like building modern web apps where agile techniques are used to refactor as you go. New features need to be designed upfront, with design documents written and discussed, just like the worldwide web and internet protocols are.

Design documents are great at communicating the general design and functioning of a protocol or algorithm but words can sometimes be ambiguous and the human mind struggles to analyze all the possible ways that the different components can interact. This is where TLA+ and other formal methods shine. The TLA+ specification language is precise and leaves far less room for misinterpretation and gaps. The very process of formalising a design into a TLA+ specification helps the design process by forcing those involved to think and communicate more clearly.

Summary

Modelling and verification is a technique used by system builders for many reasons. It helps to learn more about the system being designed, for communicating those designs, for finding flaws and gaining confidence.

Systems can be modelled in discrete units of time as a series of steps that take the system from one state to another. Properties can be specified to ensure that the system behaves correctly like not losing data or deadlocking. This allows the engineer to verify the correctness of the design, whether that design is only on paper, or has already been implemented.

Before we cover how we applied TLA+ to Apache BookKeeper, it’s worth covering BookKeeper and its replication protocol first. The next post will explain in prose and sequence diagrams how the protocol works.