Apache BookKeeper Insights Part 2 — Closing Ledgers Safely

Jack Vanlightly · Published in Splunk-MaaS · 14 min read · Nov 25, 2021

In the last post we looked at how BookKeeper replication is performed not by integrated stateful nodes but by stateless clients. In this post we’re going to look at how the protocol deals with client failures, avoids split-brain scenarios, and handles some tricky areas that can be a little subtle.

I recommend reading the high level description of the protocol in this post first. In this post we’ll be digging into specific aspects of the protocol.

Some aspects of the protocol are required due to the way that the protocol authors designed BookKeeper to offer distributed log segments (ledgers) rather than unbounded logs (like Raft/Kafka). BookKeeper ledgers didn’t have to be designed this way, but for various reasons that is the design choice that was made. As with everything, design is about trade-offs.

Before we look at the interesting and tricky parts we need to understand the ledger lifecycle and how ledgers are chained together to form unbounded logs.

Ledgers Are Log Segments

Each ledger has a lifecycle. Any client can create a ledger, but only the client that created a ledger can write to it. So the happy path is that Client A creates ledger L1, writes entries to it for a while then closes it.

An unhappy path is that Client A creates ledger L1, writes to it then abruptly disappears leaving the ledger dangling in an OPEN state. In order for the segmented log to make progress another client must come in and close the ledger, then create a new one so that writes to the log can continue.

In the context of Apache Pulsar, the brokers use ZooKeeper to agree on which brokers own which topics. Any given topic can only be owned by a single broker, and that broker uses the BookKeeper client to create, write to and close the ledgers of each topic it owns.

Fig 1. A topic is a segmented log built of BookKeeper ledgers.

But if a broker dies, then its topics get their ownership transferred to other brokers. Each broker that assumes ownership of a given topic creates a BookKeeper client to close the last ledger of the topic (created by the BK client of another broker), create a new ledger and carry on serving the topic writes.

Fig 2. Broker failure and topic segments
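The hand-off described above can be sketched in a few lines of Python. This is purely illustrative — Ledger, recover_and_close and assume_ownership are made-up names, not BookKeeper or Pulsar APIs — but it shows the shape of the operation: close the dangling ledger, then append a fresh one.

```python
# Illustrative sketch of a broker taking over a topic's segmented log.
# None of these names are real BookKeeper APIs.

class Ledger:
    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self.state = "OPEN"
        self.last_entry_id = None

def recover_and_close(ledger, discovered_last_entry_id):
    # Stand-in for the full ledger recovery protocol described below.
    ledger.state = "CLOSED"
    ledger.last_entry_id = discovered_last_entry_id

def assume_ownership(topic_ledgers, next_ledger_id, discovered_last_entry_id):
    """Close the previous owner's dangling ledger, then append a new one."""
    last = topic_ledgers[-1]
    if last.state == "OPEN":
        recover_and_close(last, discovered_last_entry_id)
    new_ledger = Ledger(next_ledger_id)   # only this client may write to it
    topic_ledgers.append(new_ledger)
    return new_ledger

log = [Ledger(1)]   # the previous broker died, leaving ledger 1 OPEN
assume_ownership(log, next_ledger_id=2, discovered_last_entry_id=41)
```

After the hand-off, ledger 1 is CLOSED with a Last Entry Id of 41 and ledger 2 is the new OPEN ledger that the new owner writes to.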

A more unhappy path is that the original client didn’t die but was just unreachable for a while, perhaps due to a long GC pause or a network partition. So what happens if one client is closing a ledger while the original client is still trying to write to it? That is what is known as split-brain, and it leads to data loss.

So we need a safe way for one client to close another client’s ledger, and this mechanism is called ledger recovery. Whenever a client closes another client’s ledger, it does so via this mechanism.

Ledger Recovery

When a client closes its own ledger, it sets the status to CLOSED and sets the LastEntryId to the Last Add Confirmed (LAC) which is the highest entry that was committed by the client. For the rest of the life of this closed ledger, other clients that perform reads against the ledger will never read past the Last Entry Id. The Last Entry Id is the end of the ledger.

When a client closes the ledger of another client, it must also set this Last Entry Id, and now that the original client is gone, the only way to find out the LAC is to interrogate the storage nodes (bookies).

Fig 3. Ledger states

Recovery begins with the client changing the status of the ledger from OPEN to IN_RECOVERY in its metadata. Next it asks the bookies of the last fragment for the LAC they have stored for that ledger. The LAC stored on each bookie generally trails the real LAC, so finding the highest LAC among all the bookies is only the starting point of recovery. There may be subsequent committed entries that must still be discovered.

When the client asks for the LAC from each bookie it also sets the fencing flag on the request causing each bookie to fence the ledger. Fencing is a local one-way idempotent operation. Once a bookie fences one of its ledgers, that bookie will not accept any further regular writes to that ledger ever again. This prevents the original client from continuing to perform writes if it is in fact still alive. Without fencing, we could have a split-brain situation where two clients believe they are both in-charge and both making progress on the same ledger which would likely lead to eventual data loss.

Fig 4. Recovery client fences enough bookies such that the original client cannot make progress if it still lives.

NOTE: You can learn more about fencing and some edge cases in the series about the verification of the BookKeeper protocol in TLA+.
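Fencing’s key properties — local, one-way, idempotent — can be captured in a tiny sketch. This is not the real bookie code; BookieLedgerState and its methods are invented for illustration. The point is that once the flag is set, regular writes are rejected forever, while the recovery client’s fenced reads still succeed.

```python
# Sketch (not real bookie code) of fencing as a local, one-way,
# idempotent flag on a bookie's per-ledger state.

class BookieLedgerState:
    def __init__(self):
        self.fenced = False

    def handle_read(self, entry_id, fence=False):
        if fence:
            self.fenced = True   # idempotent: fencing twice changes nothing
        return "READ-OK"         # recovery reads are still served

    def handle_regular_write(self, entry_id):
        if self.fenced:
            return "FENCED"      # the original client's write is rejected
        return "OK"

b = BookieLedgerState()
b.handle_read(entry_id=2, fence=True)   # recovery client fences the ledger
result = b.handle_regular_write(100)    # original client's write now fails
```

Because the flag is one-way, even a long-delayed write from the original client that arrives after recovery has finished is still rejected.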

Fig 5. Recovery client discovers the highest known LAC of the bookie ensemble of the last fragment. In this case the highest is 1 (which lags the LAC of the original client).
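The LAC discovery step from Fig 5 amounts to taking the maximum over the responses that arrive, and aborting if nothing arrives at all. A minimal sketch (highest_known_lac is an invented helper; None stands in for a bookie that timed out):

```python
# Illustrative only: the recovery client asks each bookie of the last
# fragment for its stored LAC and starts recovery from the highest one.

def highest_known_lac(lac_responses):
    """Return the max LAC among the bookies that responded (None = no reply)."""
    known = [lac for lac in lac_responses if lac is not None]
    if not known:
        raise RuntimeError("no bookie responded; abort recovery and retry")
    return max(known)

# Three bookies report trailing LACs of 1 and 0, and one times out.
start_point = highest_known_lac([1, 0, None])
```

Here recovery reads would begin at entry 2, one past the highest known LAC of 1 — which, as in Fig 5, may still lag the original client’s real LAC.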

Learning the highest LAC is only the first step; now the client must find out whether more entries exist beyond this point. The client starts sending read requests (also with the fencing flag) for the entries beyond the highest LAC. For each entry, the client must decide whether the entry counts as recoverable or not. The client keeps reading ahead until it reaches the first entry it decides is unrecoverable; it stops there and treats the previous entry as the Last Entry Id.

For each entry that it successfully reads, it writes it back to the ensemble of bookies again. Writing entries is idempotent and does not cause duplication or ordering issues. We’ll see in another post why rewriting entries during recovery is necessary for correctness. If all the writes are successful (reach AQ) then the last action is to close the ledger by setting the status to CLOSED and the LastEntryId to the last committed entry that it found.
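The read-forward, rewrite and close steps can be sketched as a simple loop. This is a simplification of the real client (which pipelines reads); read_entry, write_entry and close_ledger are illustrative stand-ins, with read_entry returning None once the entry is judged unrecoverable.

```python
# Simplified sketch of the recovery loop: read forward from the highest
# known LAC, rewrite each recoverable entry, close at the last one found.

def recover(ledger, highest_lac, read_entry, write_entry, close_ledger):
    entry_id = highest_lac + 1
    while True:
        payload = read_entry(ledger, entry_id)    # fencing recovery read
        if payload is None:                       # first unrecoverable entry
            break
        write_entry(ledger, entry_id, payload)    # idempotent rewrite
        entry_id += 1
    close_ledger(ledger, last_entry_id=entry_id - 1)

# Suppose entries 2 and 3 exist beyond a highest known LAC of 1.
store = {2: "a", 3: "b"}
rewritten, closed_at = [], []
recover(
    ledger="L1",
    highest_lac=1,
    read_entry=lambda l, e: store.get(e),
    write_entry=lambda l, e, p: rewritten.append(e),
    close_ledger=lambda l, last_entry_id: closed_at.append(last_entry_id),
)
```

In this run entries 2 and 3 are rewritten and the ledger is closed with a Last Entry Id of 3.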

Ledger Truncation (Bad!)

Ledger truncation happens when a client performing recovery sets the Last Entry Id too low, making committed entries unreadable. These entries are safely on disk on at least AQ bookies, but because those entries are now past the Last Entry Id they are unreadable and so are essentially lost.

Fig 6. The recovery client closes the ledger at entry 1 but entry 2 had been previously committed.

Ledger truncation is something that all BookKeeper contributors and administrators need to be wary of. Further down we’ll cover some ways it could happen and how to avoid it. Critical to understanding how ledger truncation can happen is understanding how the ledger recovery process decides which entries are recoverable and which are not.

Deciding when an entry is recoverable

During ledger recovery, for each recovery read response it receives, the client decides if the entry is recoverable, unrecoverable or needs more responses to decide.

NOTE. An entry may be unrecoverable because it doesn’t exist. The recovery client can keep reading until it passes the entries that were written by the original client.

For any given entry, the client can’t wait for a response from every bookie in the ensemble; if it did, ledger recovery couldn’t complete whenever a single bookie was down. But equally, if no bookies respond we can’t just close the ledger at that point, as that could lead to ledger truncation. The entry may exist on all the bookies; if none are reachable at that moment we must abort the ledger recovery, not close the ledger.

We want to avoid ledger truncation but we don’t want to be blocked by slow or unavailable bookies — recovery needs to be fast because as long as it goes on, the segmented log cannot make progress. So there is some tension between getting ledger recovery done fast and done safely.

The approach that BookKeeper takes is to be generous with positive results and strict when it comes to non-positive results that would lead to closing the ledger. That is, a single positive recovery read can lead the client to decide an entry is recoverable. However, in order to treat an entry as unrecoverable (and close the ledger at that point), the bar is set higher. We don’t want to truncate a ledger!

We break down recovery read responses into three categories:

  • Positive (the entry is on that bookie)
  • Negative (the entry does not exist on that bookie)
  • Unknown (an error or timeout occurred so we just don’t know)

NOTE: One mistake that has been made in the past was to treat all non-positive reads as negative. This is wrong and can lead to ledger truncation. Some responses could be timeouts, rejected requests due to bookie overload, transient I/O errors and so on, meaning that the entry could actually exist.

Quorum Coverage

The bar to close a ledger is quorum coverage. We described quorum coverage in the TLA+ series. But basically we define it as:

  • A given property is satisfied by at least one bookie from every possible ack quorum within the cohort.
  • There exists no ack quorum of bookies within the cohort whose members all fail to satisfy the property.

These two definitions are equivalent.

Ledger recovery uses quorum coverage in two places:

  • Fencing. The property is that a given bookie is fenced and the cohort is the current ensemble. Fencing reaches quorum coverage when no ack quorum in the current ensemble remains entirely unfenced or, equivalently, when every possible AQ of bookies contains at least one fenced bookie. If we leave a whole AQ of bookies unfenced, then the original client could continue writing to the ledger for as long as it wants (split-brain).
  • Deciding an entry is unrecoverable. The property is that a given bookie does not have the entry and the cohort is the writeset of the entry (the bookies that should have the entry). If there is no AQ of bookies that could possibly have the entry then the entry is unrecoverable.

The threshold for quorum coverage can be easily computed down to a number:

Bookies that satisfy property = (Cohort size - Ack quorum) + 1

Applying that to fencing gives us a threshold (E-AQ)+1 of bookies that satisfy the “is fenced” property.

Applying that to recovery reads, the negative read threshold = (WQ-AQ)+1 of bookies that satisfy the property “entry does not exist”.

The client simply counts the number of negative responses and if the count reaches quorum coverage, then the entry is unrecoverable.

+----+----+------------------------+
| WQ | AQ | Neg Responses Required |
+----+----+------------------------+
| 2  | 1  | 2                      |
| 2  | 2  | 1                      |
| 3  | 1  | 3                      |
| 3  | 2  | 2                      |
| 3  | 3  | 1                      |
| 4  | 2  | 3                      |
| 4  | 3  | 2                      |
| 4  | 4  | 1                      |
+----+----+------------------------+
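The formula and the table above can be checked against each other in a couple of lines (a sketch — coverage_threshold is an invented name, not BookKeeper’s actual code):

```python
# Quorum coverage threshold from the formula: (cohort size - AQ) + 1.

def coverage_threshold(cohort_size, ack_quorum):
    """Responses needed so that no ack quorum could still satisfy the
    opposite of the property (e.g. 'this entry might still exist')."""
    return (cohort_size - ack_quorum) + 1

# Negative-read thresholds for each (WQ, AQ) pair in the table above.
table = {(2, 1): 2, (2, 2): 1, (3, 1): 3, (3, 2): 2, (3, 3): 1,
         (4, 2): 3, (4, 3): 2, (4, 4): 1}
assert all(coverage_threshold(wq, aq) == n for (wq, aq), n in table.items())
```

Note how AQ=1 pushes the threshold up to the full write quorum, a point we return to below.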

NOTE: If you want a more formal definition of quorum coverage, you can find TLA+ specifications that demonstrate the equivalence of the different ways of measuring it:

https://github.com/Vanlightly/bookkeeper-tlaplus/blob/main/QuorumCoverageFencing.tla

https://github.com/Vanlightly/bookkeeper-tlaplus/blob/main/QuorumCoverageRecoveryReads.tla

Generous on positive, Strict on close

A single positive read can be treated as a successful recovery read, but in order to close a ledger at the first non-recoverable entry, we must apply quorum coverage.

On receiving each response, the client assesses whether an entry is recoverable, not recoverable or needs to wait for more responses.

Examples…

Positive recovery read #1

Fig 7. The 2nd response is positive which is enough to treat the entry as recoverable, further responses are ignored.

Negative recovery read #1

Fig 8. The first two responses are negative which reaches the negative threshold.

Unknown #1

Fig 9. No explicit positive or negative responses meaning we don’t know if the entry is recoverable or not, abort recovery.

Wait for more responses

Fig 10. Neither threshold is reached so wait for the last response.

Positive recovery read #2

Fig 10a. The last response arrives and it is positive, reaching our positive threshold.

Negative recovery read #2

Fig 10b. The last response arrives and it is negative, reaching our negative threshold.

Unknown #2

Fig 10c. The last response arrives reaching neither threshold. There exists an AQ of bookies that might host the entry, so we can’t treat this read as negative. Instead ledger recovery is aborted, to be retried again.

If ledger recovery can complete, then all recoverable entries have been committed and the Last Entry Id has been set. If it can’t complete because it cannot tell if an entry is recoverable or not, then recovery can be repeatedly attempted until it succeeds. Once complete, the segmented log can continue to be written to.
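The per-entry decision illustrated in Figs 7–10c can be condensed into one function. This is a sketch of the logic, not the real client code; classify_entry and its outcome strings are invented names.

```python
# Sketch of the per-entry recovery read decision: generous on positive,
# strict on close, abort when only unknowns remain.

def classify_entry(positives, negatives, pending, write_quorum, ack_quorum):
    """Return 'recoverable', 'unrecoverable', 'wait' or 'abort'."""
    negative_threshold = (write_quorum - ack_quorum) + 1
    if positives >= 1:                       # one positive read is enough
        return "recoverable"
    if negatives >= negative_threshold:      # quorum coverage reached
        return "unrecoverable"
    if pending > 0:                          # more responses may decide it
        return "wait"
    return "abort"                           # only unknowns: retry recovery

# WQ=3, AQ=2: one positive is enough (Fig 7)...
assert classify_entry(1, 0, 2, 3, 2) == "recoverable"
# ...two negatives reach the close threshold (Fig 8)...
assert classify_entry(0, 2, 1, 3, 2) == "unrecoverable"
# ...one negative with a response still pending means wait (Fig 10)...
assert classify_entry(0, 1, 1, 3, 2) == "wait"
# ...and all-unknown responses force an abort (Fig 9 / 10c).
assert classify_entry(0, 0, 0, 3, 2) == "abort"
```

The asymmetry between the positive threshold (1) and the negative threshold ((WQ-AQ)+1) is exactly the “generous on positive, strict on close” policy.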

The Danger of Ack Quorum of 1

Using an ack quorum of 1 is dangerous not only because some entries may have no redundancy, but also because it can cause ledger recovery to stall if a single bookie is offline. With an ack quorum of 1, the negative threshold is (WQ-1)+1 = WQ, so you can only close the ledger if you receive a response from every single bookie — the one bookie that timed out or was offline might host the entry! Stalled recovery translates to an unavailable topic.

Remember that with BookKeeper you can set WQ=AQ thanks to its dynamic ledger membership, so if you only want a replication factor of 2, choose WQ=2 and AQ=2. Using AQ=1 is setting yourself up for data loss or unavailability.

Beware of Ledger Truncation!

We know what ledger truncation is and we know how ledger recovery avoids it, but there are other ways of inadvertently allowing ledger truncation to occur! This warning applies to BookKeeper operators and BookKeeper contributors alike.

The basic rule for avoiding ledger truncation is to never allow a bookie to respond with a NoSuchEntry or NoSuchLedger response for an entry it previously acknowledged. Doing so can cause the truncation of ledgers that undergo ledger recovery.

Example: Let’s say we have a ledger with WQ=3, AQ=2.

  1. Entry 10 was only written to b1 and b3 due to an ensemble change.
  2. Writes continue and the Last Add Confirmed is currently 100.
  3. Entry 10 gets lost on b1 somehow.
  4. A client initiates recovery on that ledger. When it reaches entry 10 the first two responses it receives are NoSuchEntry responses from b1 and b2. The negative threshold has been satisfied.
  5. The recovery client closes the ledger at entry 9 truncating it (losing 91 entries).

All it takes is a single erroneous NoSuchEntry response!
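The arithmetic of the scenario above can be traced directly (a sketch of the numbers, not real client code):

```python
# WQ=3, AQ=2: the negative threshold is (3 - 2) + 1 = 2, so two
# NoSuchEntry responses for entry 10 -- one erroneous from b1, one
# legitimate from b2 (which never had it) -- are enough to close the
# ledger at entry 9, truncating entries 10..100.

write_quorum, ack_quorum = 3, 2
negative_threshold = (write_quorum - ack_quorum) + 1
negatives_for_entry_10 = 2              # b1 (erroneous) + b2 (never had it)
assert negatives_for_entry_10 >= negative_threshold
last_entry_id = 9                       # ledger closed here
entries_lost = 100 - last_entry_id      # 91 committed entries truncated
```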

But why would an acknowledged entry disappear? Let’s look at some examples.

Operator Example 1 — Running without the journal on a version prior to 4.15.

NOTE: At the time of writing my team is working on submitting a change to BookKeeper to allow it to run without the journal without causing ledger truncation. Hopefully this will be out in 4.15.

You can run without the journal today by using the config journalWriteData=false. Writes are added to the ledger storage write cache then acknowledged, skipping the journal completely. Entries stay in the cache until the next batch flush is performed. A crash can cause the loss of acknowledged entries as all the unflushed entries will be lost.

Don’t run without the journal for now.

Operator Example 2 — Bringing back a dead bookie (and deleting the cookie)

If the disk of a bookie fails, as an operator you might be tempted to bring up a new bookie with the same id and a new empty disk. The BookKeeper cookie prevents a bookie from booting in this case but you can use the CLI tooling to delete the cookie in ZooKeeper to allow the bookie to boot with empty disks. You figure that the entries are replicated on other bookies so it will be fine. But you forgot about the possibility of ledger truncation!

You’ve gone from a recoverable data loss scenario to a potentially unrecoverable data loss scenario!

Instead use the decommissioning process to remove a dead bookie safely, then bring back the bookie with empty disks and add it back to the cluster.

Operator Example 3 — Index file corruption

File corruption can happen and it can happen to the DbLedgerStorage entry location and ledger indexes. Luckily there are CLI commands to rebuild these indexes if this should happen. You should run these rebuilds while the bookie is either offline or in read-only mode.

If you run the rebuild while the bookie is running, the rebuilt index won’t have the entries that were added during the rebuild op. Now those ledgers are vulnerable to ledger truncation.

Contributor Example 1 — Storage expansion

This is a broken BookKeeper feature that did not take into account ledger truncation. It allows you to add or remove journal and ledger storage directories. BookKeeper supports multiple directories where you mount different disks, thus scaling-up/down a single bookie.

Each journal directory gets a journal instance and likewise each ledger directory gets a ledger storage instance. Ledger reads and writes are routed to journal and ledger storage instances by their ledger id. If you change the routing you must also rewrite all the existing data else subsequent reads may hit the wrong ledger storage instance. For example expanding from 1 ledger directory to 2 makes 50% of the ledgers unreadable. Unreadable entries of an open ledger now make that ledger vulnerable to truncation.

Instead, use the decommissioning process to remove the bookie safely, then bring it back with the configuration you want.

Contributor Example 2 — Changing the ledger routing algorithm

A change to how bookies route reads and writes to journal and ledger storage instances was proposed without taking into account the existing data. If the change were accepted and a bookie were upgraded with modified routing it would make a lot of existing entries unreadable and thus vulnerable to truncation.

Contributor Example 3 — Returning a NoSuchEntry response due to entry log file corruption

It is possible that file corruption can cause one or more entries to be unreadable on a bookie. In some of these cases this has caused a NoSuchEntry response to be returned.

Contributors: know the impact of returning a NoSuchEntry response for an entry that the index says does exist on the bookie. It is possible for metadata to say an entry exists on a bookie, but it doesn’t and that is valid, it can happen due to ensemble changes. However, if the index on a bookie says it exists locally, then it definitely should exist and returning a NoSuchEntry/Ledger response in those cases makes ledger truncation possible.

There are probably more ways than listed above, such as changing metadata in various ways for example.

Bookie decommissioning process

The decommissioning process uses the external recovery mechanism to migrate data from one bookie to another. It will migrate all ledger fragments hosted on the target bookie to other bookies in the cluster. For each fragment, it re-replicates it and then updates the metadata to include the replacement bookie.

Read about decommissioning a bookie here: http://bookkeeper.apache.org/docs/latest/admin/decomission/

It is not the perfect decommissioning mechanism as you take the bookie down first, then the entries are brought back up to their configured replication factor. A safer way would be to migrate the data first, then bring down the bookie.

In the end the safest way of decommissioning a bookie is simply to set it to readonly and wait for data retention to remove all the data.

Summary

The BookKeeper protocol is quite different to an integrated protocol like Raft or the Kafka replication protocol. You can shoot yourself in the foot by doing things that would be safe with a different system.

So take care and if you contribute code to the project, ensure any changes to BookKeeper cannot cause a bookie to return a NoSuchEntry or NoSuchLedger response for an entry that was previously written. If you run BookKeeper, make sure you don’t cause a previously written entry to become unreadable or simply gone.

In the next post, we’ll be tinkering with the protocol a little, making small changes and watching it catch on fire as a result. Always a good way of learning why things are the way they are.
