Modelling and verifying the BookKeeper protocol (TLA+ Series Part 3)
We stated in part 1 of this series, TLA+ is best employed during the design phase of a distributed system as it helps the engineer reason about the design, communicate the design to others and verify its correctness.
In our case, the design and implementation were already in place but TLA+ still showed itself to be highly valuable. When I joined Splunk to work on Apache BookKeeper I had had very little hands on experience of BookKeeper itself. I wrote a series of blog posts on Apache Pulsar/BookKeeper back in 2018 based on a series of chaos testing experiments and discussions with the Pulsar developers on how it all worked, so I had a good high level understanding. However, given that my first projects were all related to improving BookKeeper’s data integrity I felt I should understand the project at a fundamental level, understanding all the subtleties and nuances before making design and code changes. The best way I know of understanding a new distributed system is to reverse engineer it into a TLA+ specification.
TLA+ and Reverse Engineering
The BookKeeper Replication Protocol documentation is good for a high level description but it does not include enough detail for a specification, so the only way to do it was to read the code directly. The whole process of reading code, writing TLA+, rereading code, writing/editing TLA+ took about two weeks then another week to optimize it. 60% percent of that was reading code to understand the protocol and about 40% was writing and testing the TLA+.
It was amazing how useful the invariants were at finding bugs in the specification during the writing process. All it takes is to misread or miss completely a small snippet of implementation code for the specification to be wrong leading to an invariant violation. Time after time I would hit data loss because I had not fully understood what the code was doing and all the checks/decisions the code made. It really taught me all the nuances of the protocol and some surprising behaviours along the way.
Suffice to say that I am a believer in reverse engineering code into formal specifications as a learning process.
The Specification
The specification models the life of a single ledger, from its creation to its closing. Doing anymore in a single specification would be problematic. Even the life of a single ledger is complex enough to model and has a state space already very large. Once the lifecycle of a single ledger is shown to be correct, another specification can be written at yet a higher level that models a chain of ledgers, built on assumptions of the correctness of a single ledger.
In the previous post we talked about BookKeeper clients that perform reads and writes. In our TLA+ we only have clients that write so we refer to them as Writers in this specification.
The Variables and Constants
Metadata:
- Version
- Status (OPEN, IN_RECOVERY, CLOSED)
- Ledger Fragments where each fragment is made of the first entry id and the bookie ensemble.
- LastEntryId
Bookies:
A map containing the state of each bookie, where the value contains:
- the entries stored
- fenced status of the ledger
- LAC
Writers:
- Ledger metadata version
- LAC and LAP (Last Add Pushed)
- Add operations in progress
- Known fenced bookies
- Recovery phase and responses
The Actors and Actions
Our specification has only two types of actor (also known as processes): bookies and writers.
The specification models exactly two writers (known as w1 and w2), one that is the original owner and a second that can, at any time, attempt recovery of the ledger and close it. It also models one or more bookies known as b1 to bN.
The specification uses message passing to model communication between the writers and the bookies. Each message can go through the following three motions:
- Writer sends a bookie a message. The sent message may get lost.
- The message is processed by the bookie and a response is sent to the writer. The response may get lost.
- The writer receives and processes the response.
An example of this message passing is a writer sending an entry to a bookie and getting an ack from the bookie.
Writer Interactions
The full specification is about 1000 lines, so for now we’ll just show the Next formula which contains the actions listed above. The \/
symbol means OR and therefore only one is chosen per step.
Next ==\* Bookies\/ BookieSendsAddConfirmedResponse\/ BookieSendsAddFencedResponse\/ BookieSendsFencingReadLacResponse\/ BookieSendsReadResponse\* W1\/ W1CreatesLedger\/ W1SendsAddEntryRequests\/ W1ReceivesAddConfirmedResponse\/ W1ReceivesAddFencedResponse\/ W1ChangesEnsemble\/ W1TriesInvalidEnsembleChange\/ W1SendsPendingAddOp\/ W1CloseLedgerSuccess\/ W1CloseLedgerFail\* W2\/ W2PlaceInRecovery\/ W2ReceivesFencingReadLacResponse\/ W2SendsReadRequests\/ W2ReceivesNonFinalRead\/ W2CompletesReadSuccessfully\/ W2CompletesReadWithNoSuchEntry\/ W2WritesBackEntry\/ W2ReceivesAddConfirmedResponse\/ W2ChangesEnsemble\/ W2TriesInvalidEnsembleChange\/ W2SendsPendingAddOp\/ W2ClosesLedger
These 25 actions model the core operations of the BookKeeper Replication Protocol. This is a rather large number of operations and so the state space of this specification is massive.
The Invariants
The specification includes the following invariants.
No Divergence Between Writer And MetaData
NoDivergenceBetweenWriterAndMetaData states that the entry id of each confirmed write (confirmed according to writer 1) cannot be larger than the Last Entry Id of the ledger upon ledger closing. A violation of this invariant would mean data loss has occurred and would signal a problem in the Recover + Close operation by a second writer. This invariant was violated.
No Out of Order Entries
NoOutOfOrderEntries states that entries of a ledger are logically stored in temporal order by the ledger ensemble.
All Committed Entries Reach Ack Quorum
AllCommittedEntriesReachAckQuorum states that all confirmed entries achieve and maintain the minimum replication factor as long as no bookies permanently lose their data. A violation would be for a writer to acknowledge a write back to its own client but there not be Ack Quorum copies of the entry across the bookies.
OnlyValidFragments
OnlyValidFragments states that there cannot exist two fragments such that the fragment created first has a higher first entry id than the second. This would be violated if the writer performing recovery creates a new fragment whose first entry id is lower than an existing fragment. In reality there is a check to prevent this from happening and the specification logs the attempt, but does not allow for it to succeed. This invariant was violated.
The Violations
The model checker detected a violation of both the NoDivergenceBetweenWriterAndMetaData and OnlyValidFragments invariants, outputting the sequence of events that produced the violations.
The case of the NoDivergenceBetweenWriterAndMetaData violation is the perfect example that shows how the human mind finds it hard to analyze distributed system interactions and will be the focus of the next section.
The Writer-Metadata Divergence Defect
In part 2 of this series we covered the BookKeeper Replication Protocol in detail. In it we covered how a leader fail-over includes ledger fencing. The fencing operation prevents the old owner from performing enough writes to reach Ack Quorum. This means that there can be two writers that consider themselves the owner, but only one can actually write to the ledger successfully.
But the TLA+ specification demonstrated that this fencing mechanism is not quite enough to prevent the original writer from committing further entries. It showed that if we are unlucky with a certain sequence of events, then the original owner can sneak in a write after the new owner reaches the minimum number of fenced bookies. It requires that a single fence request is lost (or reordered).
Writer 2 receives enough NoSuchEntry responses that it decides that there is no entry 1000 that is committed and it can close the ledger at entry 999.
The issue is that it is possible for writer 1to reach AckQuorum even after “enough” bookies are fenced. For a more in-depth analysis checkout the GitHub issue: https://github.com/apache/bookkeeper/issues/2614
Humans Are Bad At Reasoning About Distributed Systems
It is easy to see why this particular problem was not detected by any human. The original idea seems simple and obvious.
It took a machine with a formal model to find that this obvious mechanism had a flaw and once it was found the solution was trivial.
The fix was very simple: during recovery, once a bookie responds to a recovery read, it cannot be allowed to change its mind later. Once a bookie tells a recovery writer that it doesn’t have an entry, it cannot accept an add for that entry later on. We achieve that by ensuring that recovery reads also perform a fencing operation. This means that once a recovery read says NoSuchEntry or Ledger, it will always return that value. The specification was updated with this new logic and the violation was no longer triggered. The fix is already merged to master and will appear in a release soon.
Conclusions
Distributed systems are complex and humans struggle to reason about all the possible interactions between concurrent processes. Formal methods give us the tools to improve our reasoning as well as detect hard to find bugs that even state-of-the-art testing may struggle to find.
TLA+ is well suited to engineering teams as it is probably the simplest specification language available and allows regular engineers without strong mathematical logic skills to be productive. In about 2–3 weeks most engineers can become proficient enough with TLA+ to be able to write useful specifications of real systems.
Further Reading
Learning TLA+:
- Leslie Lamport’s excellent video series on TLA+
- Leslie Lamport’s Specifying Systems book
- Hillel Wayne’s Learn TLA+ website focuses on a pseudo-code called PlusCal that compiles to TLA+
- Specification examples
Papers and Industrial Use
Community