Apache BookKeeper Internals — Part 2 — Writes

Jack Vanlightly
Published in Splunk-MaaS
Oct 18, 2021

Based on BookKeeper 4.14 as configured for Apache Pulsar.

In the last post we looked at the 10,000-foot view of a bookie: its components, threads and the read and write paths. In this post we’ll dive deeper into the write path, understanding how the components and threading model combine to deliver fast and durable writes. We’ll still keep this as high level as we can; the aim is simply to build a mental model.

The write path involves a chain of many threads and calls to both the Journal and LedgerStorage APIs. As we already covered in the previous post, the write path is part synchronous (Journal) and part asynchronous (DbLedgerStorage).

Fig 1. How writes are handled by various threads

Remember that multiple Journal instances and DbLedgerStorage instances can be configured, and each instance has its own threads, queues and caches. So when we refer to a particular thread, cache or queue below, bear in mind that several may exist in parallel.

Netty threads

Netty threads handle all TCP connections and the requests that come in over those connections. They dispatch write requests to the write thread pool; each request includes the entry to be written and a callback that is invoked at the very end of the chain to send the response to the client.

Write thread pool

The write thread pool doesn’t do much work, so typically few threads are required (the default is 1). Each write request adds the entry to the DbLedgerStorage write cache and, if that succeeds, adds the write request to an in-memory queue in the Journal. The write thread’s job is complete at this point; other threads do the rest.

There are in fact two write caches (per DbLedgerStorage instance), one that is active and one that is free to be flushed in the background. The two write caches are swapped when a DbLedgerStorage flush is required, allowing the bookie to continue to serve writes on the new empty cache while it flushes the full cache to disk. As long as we can flush the swapped out write cache before the active one gets full, we don’t have a problem.

A DbLedgerStorage flush can occur either via the Sync Thread, which performs checkpoints on a timer, or via a DbStorage thread (one DbStorage thread per DbLedgerStorage instance). If a write cache is full when the write thread tries to add an entry to it, the write thread will submit a flush operation on the DbStorage thread. If the swapped-out cache has already completed its flush, the caches are swapped immediately and the write thread can then add the entry to the newly swapped-in write cache, completing its part in the write operation. However, if the active cache is full and the swapped-out cache is still flushing, the write thread waits for a short time and eventually rejects the write request. The time it waits for an available write cache is set by the config dbStorage_maxThrottleTimeMs, which defaults to 10000 (10 seconds).
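The swap-or-wait behaviour above can be sketched as follows. This is a simplified Python model, not BookKeeper’s actual classes; capacity is counted in entries rather than bytes for brevity, and dbStorage_maxThrottleTimeMs is the only real config name here:

```python
import threading
import time

class WriteCacheManager:
    """Illustrative model of the two-write-cache swap in DbLedgerStorage."""

    def __init__(self, cache_capacity, max_throttle_ms=10000):
        self.active = []          # cache currently accepting writes
        self.flushing = None      # cache being flushed in the background
        self.capacity = cache_capacity
        self.max_throttle_s = max_throttle_ms / 1000
        self.cond = threading.Condition()

    def add_entry(self, entry):
        with self.cond:
            if len(self.active) < self.capacity:
                self.active.append(entry)
                return True
            # Active cache full: wait until the swapped-out cache has
            # finished flushing, up to dbStorage_maxThrottleTimeMs.
            deadline = time.monotonic() + self.max_throttle_s
            while self.flushing is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # timed out: reject the write
                self.cond.wait(remaining)
            # Swap: the empty cache becomes active, the full one flushes.
            self.flushing, self.active = self.active, []
            self.active.append(entry)
            return True

    def flush_done(self):
        """Called by the flushing thread once the cache is safely on disk."""
        with self.cond:
            self.flushing = None
            self.cond.notify_all()
```

In the real bookie the flush runs on the DbStorage thread (or during a Sync Thread checkpoint); the key idea is simply that writes only ever block when both caches are unavailable.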

By default the write thread pool has only a single thread, so if flushes take a long time the lone write thread can block for up to 10 seconds at a time, causing the write thread pool task queue to quickly fill up with write requests and additional writes to be rejected. This is how the back-pressure mechanism works with DbLedgerStorage. Once a flushed write cache becomes available for writes again, the write thread pool becomes unblocked.

The maximum size of the combined write caches defaults to 25% of available direct memory. You can also set this via the config dbStorage_writeCacheMaxSizeMb. The total available memory for the write cache is split between the DbLedgerStorage instances (one per ledger directory) and then between the two write caches of each. So with two ledger directories and 1GB of write cache memory, each DbLedgerStorage instance gets 500MB and each then splits that into two 250MB write caches.
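The arithmetic above can be written down directly (illustrative only, using the example figures from the text):

```python
def write_cache_size_per_cache(total_cache_mb, ledger_dirs):
    """Total write cache memory is split between the DbLedgerStorage
    instances (one per ledger directory), then between each instance's
    two write caches."""
    per_instance = total_cache_mb / ledger_dirs
    return per_instance / 2

# Two ledger directories and 1GB (1000MB) of write cache memory:
# each instance gets 500MB, each write cache gets 250MB.
print(write_cache_size_per_cache(1000, 2))
```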

DbStorage Thread

Each DbLedgerStorage instance has its own DbStorage thread which is responsible for executing on-demand flushes when its write cache is full.

Sync Thread

This thread is outside of both the Journal and DbLedgerStorage. Its job is to periodically perform a checkpoint. The objective of a checkpoint is to:

  • Flush ledger storage (long term storage)
  • Mark the position (log mark) in the journal that is now safely persisted to long term storage. It does this by writing a file with the log mark to disk.
  • Clean up any old journal files that are no longer needed
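The checkpoint sequence can be sketched as below. All the names here are hypothetical (this is not BookKeeper’s API); what matters is the ordering, since the log mark must only be advanced after ledger storage is durably flushed:

```python
def checkpoint(ledger_storage, journal, log_mark):
    """Illustrative checkpoint: flush, persist the log mark, then garbage
    collect journal files that precede the mark."""
    ledger_storage.flush()               # 1. flush long term storage
    journal.persist_log_mark(log_mark)   # 2. write the log mark file to disk
    journal.gc_files_before(log_mark)    # 3. old journal files no longer needed
```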

There is synchronization that prevents two different threads triggering a flush at the same time.

When a DbLedgerStorage flush occurs, the swapped out write cache is written to the current entry log file (there is also log rotation). First the entries are sorted by ledger and entry id, then for each they are written to the entry log file and their locations written to the Entry Locations Index. This sorting of entries on writing is an optimization for reads, which we’ll cover in the next post.

Once all writes are fully flushed to disk the swapped out write cache is cleared, ready to be swapped in again.
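The sort-then-write step can be modelled like this. It is a simplified in-memory sketch (a real bookie appends to an entry log file and stores locations in RocksDB), but it shows why entries of the same ledger end up contiguous on disk:

```python
def flush_write_cache(write_cache):
    """Sketch of a DbLedgerStorage flush: sort entries by (ledger_id,
    entry_id), append them to the entry log, and record each entry's
    byte offset for the Entry Locations Index."""
    entry_log = []   # stand-in for the current entry log file
    locations = {}   # stand-in for the Entry Locations Index
    offset = 0
    for (ledger_id, entry_id), payload in sorted(write_cache.items()):
        entry_log.append(payload)
        locations[(ledger_id, entry_id)] = offset
        offset += len(payload)
    return entry_log, locations
```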

Journal thread

The journal thread is a loop which takes entries from its in-memory queue and writes them to disk. Periodically it adds a force write request to the force write queue which will trigger an fsync.

The journal doesn’t perform a write system call for every entry it reads from the queue; instead it accumulates entries and writes them to disk in batches (called flushes in BookKeeper). This is also referred to as group commit. A flush occurs when any of the following criteria is met:

  • Reaches the maximum wait time (the config journalMaxGroupWaitMSec which defaults to 2 ms)
  • Reaches the maximum number of accumulated bytes (the config journalBufferedWritesThreshold which defaults to 512KB)
  • Reaches the maximum number of accumulated entries (the config journalBufferedEntriesThreshold which defaults to 0 meaning it is not applied)
  • Upon dequeuing the last entry from the queue, aka going from non-empty to empty (the config journalFlushWhenQueueEmpty which defaults to false)

Every flush creates a corresponding Force Write Request which contains those entries flushed.
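The four criteria above can be condensed into a single decision function. The config names and defaults are the real ones; the function itself is an illustrative sketch, not BookKeeper’s code:

```python
def should_flush(elapsed_ms, buffered_bytes, buffered_entries, queue_empty,
                 journalMaxGroupWaitMSec=2,
                 journalBufferedWritesThreshold=512 * 1024,
                 journalBufferedEntriesThreshold=0,
                 journalFlushWhenQueueEmpty=False):
    """Group-commit decision: flush when any configured criterion is met."""
    if elapsed_ms >= journalMaxGroupWaitMSec:
        return True                                   # max wait time reached
    if buffered_bytes >= journalBufferedWritesThreshold:
        return True                                   # max accumulated bytes
    if (journalBufferedEntriesThreshold > 0
            and buffered_entries >= journalBufferedEntriesThreshold):
        return True                                   # max accumulated entries
    if journalFlushWhenQueueEmpty and queue_empty:
        return True                                   # queue drained
    return False
```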

Force Write Thread

The force write thread is a loop that takes force write requests from the force write queue and performs an fsync of the journal file. Each force write request contains the entries that were flushed, including their callbacks; once the entries are persisted to disk, each callback is submitted to the Journal Callback Thread for execution.
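That loop can be sketched as follows (hypothetical names; the real implementation is more involved). Note that one fsync covers the whole flushed batch, which is the payoff of group commit:

```python
def force_write_loop(force_write_queue, journal_file, submit_callback):
    """Illustrative force write thread: fsync once per request, then hand
    each entry's callback off for execution on the callback thread."""
    while True:
        req = force_write_queue.get()    # blocks until a request arrives
        if req is None:                  # shutdown sentinel
            break
        journal_file.fsync()             # one fsync for the whole batch
        for cb in req["callbacks"]:
            submit_callback(cb)          # runs on the Journal Callback Thread
```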

Journal Callback Thread

This thread executes the write request callbacks, which send the responses to the clients.

Common Bottlenecks

Bottlenecks in the write path are usually in the Journal or DbLedgerStorage, as the usual culprit is disk IO. If journal writes or fsyncs are too slow for the write demand, the Journal Thread and Force Write Thread simply won’t be able to take entries from their respective queues fast enough. Likewise, if DbLedgerStorage flushes are too slow, the write caches can’t be emptied and swapped fast enough for the demand.

Often, if the journal is the bottleneck, the write thread pool task queue will reach capacity, because enqueuing entries on the journal queue is a blocking call and the write threads get blocked. Once the thread pool task queue is full, writes are rejected at the Netty layer, as Netty threads are unable to submit more write requests to the write thread pool. If you were to take a flame graph you’d see all the write thread pool threads very busy.

If DbLedgerStorage is the bottleneck, then DbLedgerStorage itself can start rejecting writes, but only after waiting for up to 10 seconds (by default), which also causes the write thread pool to fill fast and writes to start getting rejected from the Netty threads.

Finally, it is possible that disk IO is not the bottleneck and instead you reach very high CPU utilization. This can happen if you have fast disks but too few CPUs to handle all the needs of Netty and the various other threads. This is easy to diagnose by simply using your system resource metrics.

Summary

We’ve covered the path each write takes through both the Journal and DbLedgerStorage components, and which threads do the actual work. In the next post we’ll take a look at the read path.

Series posts:
