A Guide to the BookKeeper Replication Protocol (TLA+ Series Part 2)

Jack Vanlightly
Published in Splunk-MaaS · Jun 3, 2021 · 10 min read

A relational database stores tables and serves reads and writes to those tables. Apache BookKeeper stores logs and serves reads/writes to those logs. A log is simply an append-only data structure that supports multiple readers and non-destructive reads.

Queues and logs are often lumped into the same bucket, and they do have their similarities, but a log differs in that multiple readers can read the entire log independently, at the same time and from different positions. For that to work, reads cannot be destructive. A queue, on the other hand, is a destructive-read data structure: elements are read from the head and deleted, meaning each element can only be read by a single reader.

Apache BookKeeper is the distributed log storage layer of Apache Pulsar and is a complex sub-system in itself. BookKeeper offers high availability and data safety by using replication. Replication means that each entry is replicated across multiple servers so that in the event of a failure, it can continue to provide service and doesn’t lose any data. BookKeeper uses its own replication protocol, and this protocol defines how its various components work together to provide both the high availability and data safety guarantees.

Segment Based Logs

Other queue- and log-based systems such as Apache Kafka or RabbitMQ store their queues/partitions as single monolithic entities: the entire log structure must fit on a single server. BookKeeper takes a segmented log approach where each log is in fact a series of logs chained together, so a single Pulsar topic partition is in fact many log segments.

Each Pulsar topic is owned by a single Pulsar broker at a time. The owner has the job of creating the segments and chaining them together to form a single logical log.

Fig 1: A Pulsar topic is a sequence of log segments

BookKeeper calls these log segments Ledgers and they are stored on BookKeeper servers (known as bookies).

Fig 2: Pulsar brokers store topic data on Bookies

The BookKeeper replication protocol is all about the lifecycle of each ledger. The replication protocol itself lives mostly in the BookKeeper client library. Each Pulsar broker uses that client to interact with BookKeeper such as creating ledgers, closing ledgers and reading/writing entries. There is a surprising amount of complexity and nuance that the client hides behind its API, and we’ll be peeling back some (but not all) of the layers to look at the protocol in this blog post.

Firstly, each ledger is owned by a single client: the client that created it. Writes to a ledger can only be performed by its owner client. In the case of Pulsar, that is the broker that created the ledger to form a segment in one of its topic partitions. But if that client were to die for some reason, another client (another broker, in Pulsar's case) will need to step in and take over. This process involves repairing any under-replicated entries (recovery) and then closing the ledger.

Fig 3. Ledger lifecycle

There can be only one open ledger in any given Pulsar topic. All writes go to the open ledger while reads can hit any ledger.

Fig 4. Writes go to the open ledger.

Each ledger is replicated across multiple bookies and the existence of each ledger and the group of bookies that host it (known as the ensemble) is stored in ZooKeeper. Once an open ledger reaches its maximum size, or the owner of that ledger fails, it gets closed and a new ledger is opened. New ledgers are created on a subset of the available bookies according to the replication factor configured.

Fig 4. Ledgers are distributed across multiple bookies and that metadata and the list of ledgers forming a topic are stored in ZooKeeper

Writing to ledgers

BookKeeper uses the following configurations to control replication of any given ledger:

  • Write quorum (how many bookies each entry should be written to), referred to as WQ.
  • Ack Quorum (how many bookies should confirm an entry for that entry to be considered committed), referred to as AQ.
  • Ensemble size (the total pool of bookies that can host the ledger), referred to as E. When E > WQ, entries get striped across bookies.

The writeset of an entry is the subset of the ensemble that it gets written to. When E > WQ, the writeset differs between adjacent entries.
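Striping can be sketched with a few lines of Python. This is a simplified illustration of round-robin placement (the function name and layout are mine, not BookKeeper's actual client code): entry e's writeset starts at index e mod E of the ensemble and wraps around.

```python
def writeset(entry_id, ensemble, wq):
    """Return the WQ bookies that entry `entry_id` is written to,
    striped round-robin across the ensemble."""
    e = len(ensemble)
    start = entry_id % e
    return [ensemble[(start + i) % e] for i in range(wq)]

ensemble = ["B1", "B2", "B3", "B4", "B5"]  # E=5, WQ=3
print(writeset(0, ensemble, 3))  # entry 0 -> ['B1', 'B2', 'B3']
print(writeset(1, ensemble, 3))  # entry 1 -> ['B2', 'B3', 'B4']
```

With E = WQ the start index is irrelevant modulo rotation and every entry lands on every bookie; only when E > WQ do adjacent entries get different writesets.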

Pulsar exposes AQ, WQ and E in its own API for controlling the replication factor of its topics.

Fig 5. Writes and confirms with WQ=3, AQ=2

Last Add Confirmed (LAC)

The BookKeeper client keeps track of the highest contiguous entry id that has reached Ack Quorum. This is known as the Last Add Confirmed (LAC). It acts as the committed-entry watermark: entries above it are not yet committed, while entries at or below it are committed. Each entry sent to a bookie also carries the client's current LAC, so each bookie also tracks the LAC, albeit with some lag. The LAC plays further roles in addition to being the committed entry watermark, as we'll see later.
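The "highest contiguous" part matters: an entry with enough acks does not advance the LAC if a lower entry is still short of Ack Quorum. A minimal sketch (the data structure and names here are illustrative, not the client's internals):

```python
def advance_lac(lac, acked, aq):
    """Advance the LAC past every contiguous entry that has reached
    Ack Quorum. `acked` maps entry id -> number of acks received."""
    while acked.get(lac + 1, 0) >= aq:
        lac += 1
    return lac

# AQ=2: entries 0 and 1 have quorum, entry 2 does not yet, so the LAC
# stops at 1 even though entry 3 already has 2 acks.
acks = {0: 3, 1: 2, 2: 1, 3: 2}
print(advance_lac(-1, acks, 2))  # -> 1
```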

Ledger Fragments

Ledgers themselves can also be broken down into one or more fragments. When a ledger is created, it consists of a single fragment with an ensemble of bookies. When a write to a bookie fails, the client finds a new bookie to replace the failed one, forming a new fragment with a new ensemble, and resends all uncommitted entries to that bookie as well as all future entries. If another bookie write fails, another fragment is created, and so on. A write failure does not necessarily mean a bookie has failed, only that the write failed. Fragments are also simply known as ensembles.

Fig 6. Creation of a 2nd fragment

Fragments are basically metadata that tell BookKeeper clients where to look for the entries of a ledger. The bookies themselves know nothing of fragments, they just store the entries they are told to and index them by ledger id and entry id.
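Conceptually, the fragment metadata is just a list of (first entry id, ensemble) pairs, and a reader walks it to find which bookies can serve a given entry. The structure below mirrors the scenario in Fig 7 but is an illustrative sketch, not BookKeeper's actual metadata format:

```python
# Fragment metadata as (first_entry_id, ensemble) pairs.
fragments = [
    (0,    ["B1", "B2", "B3"]),  # fragment 1: entries 0..999
    (1000, ["B1", "B2", "B4"]),  # fragment 2: B3 replaced after a failed write
]

def ensemble_for(entry_id, fragments):
    """Find the ensemble hosting `entry_id` by locating the last
    fragment whose first entry id is <= entry_id."""
    for first_id, ensemble in reversed(fragments):
        if entry_id >= first_id:
            return ensemble

print(ensemble_for(999, fragments))   # -> ['B1', 'B2', 'B3']
print(ensemble_for(1000, fragments))  # -> ['B1', 'B2', 'B4']
```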

Fig 7. A ledger with two fragments caused by a write failure of entry 1000 at bookie B3

Reading from ledgers

There are various types of read:

  • Regular entry read
  • Long poll LAC read
  • Quorum LAC read
  • Recovery read

Unlike writes, a regular read can be served by a single bookie. If a read fails then the client can simply send the read to another bookie in the ensemble.

A client that is reading typically only wants to read committed entries, so it will only read up to the LAC. In regular read responses, the bookie piggybacks its currently known LAC, which helps the reader know when to stop reading. If a reader client reaches the LAC, it can send a long-poll LAC read to a bookie, which will respond when a new committed entry is ready to be read.

The last two types of read are important during recovery, which we’ll cover soon.

Operations and the Different Quorums

Different operations require different numbers of positive responses from bookies. Some, like a regular read, only require a single positive response, while others need a quorum.

The following quorums are used:

  • Ack quorum (AQ)
  • Write quorum (WQ)
  • Writeset Quorum Coverage (QC), where QC = (WQ - AQ) + 1
  • Ensemble Quorum Coverage (EC), where EC = (E - AQ) + 1

Writeset Quorum Coverage (QC) and Ensemble Quorum Coverage (EC) are both defined by the same property; only the cohort differs:

  • A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
  • There exists no ack quorum of bookies that do not satisfy the property within the cohort.

For QC, the cohort is the writeset of a given entry, and therefore QC is only used when we need guarantees regarding a single entry. For EC, the cohort is the ensemble of bookies of the current fragment. EC is required when we need a guarantee across an entire fragment.

WQ and AQ are primarily used for writes while QC and EC are primarily used in the ledger recovery process.
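The two coverage formulas are the same calculation over different cohort sizes. A quick sketch, plugging in the WQ=3, AQ=2, E=5 configuration used in the figures (function name is mine):

```python
def quorum_coverage(cohort_size, aq):
    """Smallest number of responses such that no ack quorum of the
    cohort is left entirely unheard-from."""
    return (cohort_size - aq) + 1

WQ, AQ, E = 3, 2, 5
QC = quorum_coverage(WQ, AQ)  # coverage over one entry's writeset
EC = quorum_coverage(E, AQ)   # coverage over the fragment's ensemble
print(QC, EC)  # -> 2 4
```

Intuitively: with WQ=3 and AQ=2, any 2 of an entry's 3 write bookies form a possible ack quorum, so hearing from 2 of them guarantees at least one bookie from every possible ack quorum has answered.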

Ledger Recovery

When the owner client of a ledger disappears, another client can step in to perform recovery on the ledger and then close it. For Pulsar, that means the topic's owner broker has died and another broker takes over ownership of the topic.

Recovery involves finding out what the highest committed entry is across the ensemble of bookies and ensuring that each entry is replicated up to the Write Quorum. Once that is complete the new client closes the ledger, setting the status to CLOSED and the Last Entry Id to the highest committed entry.

Avoiding Split Brain

BookKeeper is distributed, meaning that network partitions can break the cluster into two or more chunks. If a client loses its connection to ZooKeeper, it is classed as dead and another client can start recovery. But the first client may still be alive and still connected to the BookKeeper cluster. Now we have two clients interacting with the same ledger, which could cause what is known as split-brain. Split-brain is a term that describes when a cluster is broken into two or more pieces and each piece operates independently for a while. This can cause data inconsistency when the partition heals and the cluster becomes one again.

To avoid this, BookKeeper has the concept of fencing. When a second client (another Pulsar broker, for example) attempts to recover and close a ledger, it fences the ledger, which makes bookies refuse any further writes to it. Once enough bookies are fenced, the original client, if it is still alive, cannot make any more successful writes. The recovery process then continues without the possibility of conflicting clients writing to and recovering the same ledger at the same time.

Fig 8. A new topic owner starts fencing the open ledger, preventing the current owner from reaching Ack Quorum on its writes.

Recovery Phase 1 — Fencing

Fence the ledger and find out the LAC.

The fencing requests are in fact Ensemble Coverage LAC reads with the fencing flag set. Each bookie responds to a fencing LAC read by changing the ledger status to fenced and returning the LAC the bookie has for that ledger. Once the client has received enough responses it can start the next phase. But how many is enough?

The idea of fencing is to prevent the original client from making progress. If fewer than Ack Quorum bookies remain unfenced in the ledger's ensemble, then the original client can't reach Ack Quorum for a write. So the new client doesn't have to wait for all bookies to be fenced, just enough that no Ack Quorum of bookies remains unfenced, which equals Ensemble Coverage responses.
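The stop condition for the fencing phase can be written down directly (a sketch with names of my choosing, not the client's real code):

```python
def fencing_complete(fenced_count, e, aq):
    """Recovery can proceed once fewer than AQ bookies remain
    unfenced, i.e. fenced_count >= (E - AQ) + 1 (Ensemble Coverage)."""
    unfenced = e - fenced_count
    return unfenced < aq

# E=5, AQ=2: Ensemble Coverage is 4 fencing responses.
print(fencing_complete(3, 5, 2))  # 2 unfenced, an ack quorum remains -> False
print(fencing_complete(4, 5, 2))  # 1 unfenced, no ack quorum remains -> True
```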

Recovery Phase 2 — Recovery reads and writes

Next, the client starts sending recovery reads from entry id LAC + 1 onwards, rewriting each entry it reads back to the ensemble of bookies. Writes are idempotent, so if a bookie already has an entry, the write does not create a duplicate. The client continues to read and write until it can no longer read any further entries. This read-and-write phase ensures that any committed entries are fully replicated before the ledger is closed.

Recovery reads are different to regular reads in that they require a quorum. Each recovery read determines whether an entry is committed or uncommitted. The following quorums decide:

  • Committed = Ack Quorum positive responses
  • Uncommitted = Quorum Coverage negative responses (there exists no ack quorum of bookies that might host the entry)

If all responses are received and neither threshold is reached then the result is inconclusive and recovery stalls. Recovery can be run repeatedly until the recovery reads are able to explicitly determine what the final committed entry is.
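The verdict for a single recovery read can be sketched as a function of the positive and negative response counts from the entry's writeset (an illustrative sketch; names are mine):

```python
def recovery_read_verdict(positive, negative, wq, aq):
    """Decide whether an entry is committed, uncommitted, or unknown.
    positive/negative: bookies that do / do not have the entry."""
    qc = (wq - aq) + 1  # Quorum Coverage over the writeset
    if positive >= aq:
        return "committed"
    if negative >= qc:
        return "uncommitted"  # no ack quorum can host the entry
    return "inconclusive"     # recovery stalls and must be retried

# WQ=3, AQ=2 -> QC=2
print(recovery_read_verdict(2, 1, 3, 2))  # -> committed
print(recovery_read_verdict(1, 2, 3, 2))  # -> uncommitted
print(recovery_read_verdict(1, 1, 3, 2))  # -> inconclusive
```

Note the asymmetry: one ack quorum of positives proves the entry could have been committed, but it takes Quorum Coverage negatives to prove that no ack quorum could ever have confirmed it.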

Fig 9. The new client gets enough negative responses for entry 3 to know it is not committed. It then ensures all entries up to entry 2 are fully replicated.

Recovery Phase 3 — Close the Ledger

Once all committed entries have been identified and repaired, the client closes the ledger. Closing the ledger involves updating the ledger metadata in ZooKeeper with the CLOSED status and the Last Entry Id set to the highest committed entry id. Bookies themselves have no concept of open or closed.

This metadata change is a CAS op that is protected by versioning. If another client was also recovering the ledger at the same time and had already closed the ledger, this CAS op would fail. This protects against multiple clients all trying to recover a ledger concurrently.
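The versioned CAS can be illustrated with a stand-in for the metadata store. In reality the client performs a conditional update against ZooKeeper (which rejects writes whose expected znode version is stale); the class below is purely an illustrative sketch of that check:

```python
class LedgerMetadataStore:
    """Toy stand-in for versioned ledger metadata in ZooKeeper."""
    def __init__(self):
        self.state, self.last_entry_id, self.version = "OPEN", None, 0

    def close_ledger(self, last_entry_id, expected_version):
        # CAS: the write only applies if no one changed the metadata
        # since we read it at version `expected_version`.
        if self.version != expected_version:
            return False  # another client already closed the ledger
        self.state, self.last_entry_id = "CLOSED", last_entry_id
        self.version += 1
        return True

store = LedgerMetadataStore()
print(store.close_ledger(2, expected_version=0))  # -> True
print(store.close_ledger(5, expected_version=0))  # -> False (stale version)
```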

Summary

We’ve covered most of the BookKeeper replication protocol in this post. The important thing to remember is that bookies are just dumb storage nodes that are really good at storing and reading entries. The BookKeeper client contains all the smarts: creating ledgers, choosing ledger ensembles, creating fragments, ensuring the Write Quorum and Ack Quorum are respected, and performing recovery/close operations when a leader fail-over needs to take place.

Next we’ll look at how we applied TLA+ to BookKeeper and what we discovered along the way.
