Apache BookKeeper Internals — Part 3 — Reads

Jack Vanlightly
Published in Splunk-MaaS
Oct 18, 2021 · 5 min read

Based on BookKeeper 4.14 as configured for Apache Pulsar.

In the last post we looked at the path each write takes from the Netty layer down to file IO and all the threads and components involved. In this post we’ll do the same for the read path.

Read requests are submitted to the read thread pool to be executed, and there can be multiple threads available to do the work (hence it being a pool). There are also long poll reads, which by default are also submitted to the read thread pool but can be configured to run on a separate long poll thread pool.

To perform a read, the selected read thread calls the DbLedgerStorage getEntry(long ledgerId, long entryId) method and the read is performed synchronously.

Fig 1. Reads only go to DbLedgerStorage and are performed synchronously.

The read request goes through the following sequence:

  1. Check the write cache; on a cache hit, return the entry.
  2. On a write cache miss, check the read cache; on a cache hit, return the entry. On a read cache miss, the entry only exists on disk.
  3. Get the location of the entry on disk using the Entry Location Index (RocksDB). The index returns a location (which entry log file and the offset within the file).
  4. Read the indicated entry log file at the indicated offset.
  5. Perform the readahead.
  6. Load all readahead entries into the read cache.
  7. Return the entry.
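The sequence above can be sketched as follows. This is an illustrative model only, assuming simple in-memory maps; the class, field, and method names (other than getEntry) are stand-ins for explanation, not BookKeeper's actual internals, and the readahead step is elided.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Illustrative sketch of the DbLedgerStorage read sequence. All names and
 *  types here are hypothetical stand-ins, not BookKeeper's real classes. */
class ReadPathSketch {
    record Location(long logId, long offset) {}

    final Map<Long, byte[]> writeCache = new HashMap<>();      // unflushed entries
    final Map<Long, byte[]> readCache = new HashMap<>();       // recently read entries
    final Map<Long, Location> locationIndex = new HashMap<>(); // RocksDB in the real bookie
    final Map<Location, byte[]> entryLogs = new HashMap<>();   // simulated entry log files

    static long key(long ledgerId, long entryId) {
        return (ledgerId << 32) | entryId;  // simplified composite cache key
    }

    byte[] getEntry(long ledgerId, long entryId) {
        long key = key(ledgerId, entryId);
        byte[] entry = writeCache.get(key);          // 1. check the write cache
        if (entry != null) return entry;
        entry = readCache.get(key);                  // 2. check the read cache
        if (entry != null) return entry;
        Location loc = locationIndex.get(key);       // 3. look up the disk location
        if (loc == null) throw new NoSuchElementException("no such entry");
        entry = entryLogs.get(loc);                  // 4. read the entry log file
        readCache.put(key, entry);                   // 5-6. readahead omitted in this sketch
        return entry;                                // 7. return the entry
    }
}
```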

The readahead operation reads from the entry log file, starting at the entry after the requested one, and keeps reading entries until one of the following occurs:

  1. Reach the readahead batch size limit (1000 entries by default as configured by Pulsar).
  2. Reach the end of the file.
  3. Reach an entry of a different ledger.
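The three termination conditions can be sketched as a simple loop. This is a toy model assuming the entry log is already decoded into a list of (ledgerId, entryId) pairs; the real code reads framed entries from a file at a byte offset, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative readahead loop over a decoded entry log. Stand-in names,
 *  not BookKeeper's internals. */
class ReadAheadSketch {
    record Entry(long ledgerId, long entryId) {}

    static List<Entry> readAhead(List<Entry> entryLog, int startIndex,
                                 long ledgerId, int batchSize) {
        List<Entry> loaded = new ArrayList<>();
        for (int i = startIndex; i < entryLog.size(); i++) { // 2. stop at end of file
            Entry e = entryLog.get(i);
            if (e.ledgerId() != ledgerId) break;   // 3. stop at an entry of another ledger
            loaded.add(e);
            if (loaded.size() >= batchSize) break; // 1. stop at the batch size limit
        }
        return loaded;  // these entries are then placed in the read cache
    }
}
```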

By default the combined size of the read caches (one per DbLedgerStorage instance) is 25% of available direct memory. So with two ledger directories and 2GB of total read cache, each DbLedgerStorage instance gets a 1GB read cache.
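The arithmetic behind that example can be written out directly; here the 8GB direct memory figure is an assumption implied by the numbers in the text (25% of 8GB = 2GB total, 1GB per instance with two ledger directories).

```java
/** Back-of-envelope sizing from the paragraph above: the combined read
 *  cache defaults to 25% of direct memory, split evenly across the
 *  DbLedgerStorage instances (one per ledger directory). */
class ReadCacheSizing {
    static long perInstanceCacheBytes(long directMemoryBytes, int ledgerDirs) {
        long totalReadCache = directMemoryBytes / 4;  // 25% of direct memory
        return totalReadCache / ledgerDirs;           // even split per instance
    }
}
```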

Readaheads are efficient because entries of the same ledger are laid out on disk in contiguous blocks: flushes sort entries by ledger ID and entry ID before writing them to file. This means that readahead reads do not require the locations index and the disk reads are sequential.

Using sticky reads on the client side helps performance because all reads from a given bookie client go to a single bookie, which leverages this readahead well. If reads are scattered across different bookies, the usefulness of the readahead operations is reduced. For example, if a client (a Pulsar broker) distributes reads of entries 0–99 randomly between three bookies, then each bookie will end up covering most of that range with readahead reads but will only serve roughly a third of them; meanwhile each entry has been loaded into cache three times across the bookies.

There are also some nasty cases such as cache thrashing where the read cache effectiveness is reduced to the point where it actually hurts performance.

The Read Cache

Each read cache consists of one or more segments, each no larger than 1GB, that can be considered, logically, a single ring buffer. Being a ring buffer, the memory is preallocated and newly added entries end up overwriting old ones. Each cache keeps an index of the entries it contains for fast lookup and retrieval.
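A minimal model of one such segment is sketched below: a preallocated set of slots written as a ring, plus an index for lookup. This is a toy, assuming one entry per slot; the real segments store raw entry bytes in preallocated direct memory, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a read cache segment: preallocated ring of slots plus a
 *  lookup index. Illustrative only; not BookKeeper's implementation. */
class RingCacheSketch {
    private final long[] keys;       // preallocated slots (stand-in for the 1GB buffer)
    private final byte[][] values;
    private final Map<Long, Integer> index = new HashMap<>(); // entry key -> slot
    private int writePos = 0;

    RingCacheSketch(int slots) {
        keys = new long[slots];
        values = new byte[slots][];
    }

    void put(long key, byte[] value) {
        if (values[writePos] != null) {
            Integer cur = index.get(keys[writePos]);
            if (cur != null && cur == writePos) index.remove(keys[writePos]); // evict oldest
        }
        keys[writePos] = key;
        values[writePos] = value;
        index.put(key, writePos);
        writePos = (writePos + 1) % keys.length;  // wrap: new entries overwrite old ones
    }

    byte[] get(long key) {
        Integer slot = index.get(key);
        return slot == null ? null : values[slot];
    }
}
```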

Read Cache Thrashing

Cache thrashing is the scenario where the read cache is too small for the demand generated by readahead operations, and entries keep getting evicted from the read cache before they can be read. Entries that were already read from disk must then be read again just to be placed back in the cache.

To avoid read cache thrashing, the read cache of a bookie needs to be big enough so that it can accommodate the readaheads of all current Pulsar consumers. BookKeeper can receive a heavy read load when there are many Pulsar consumers at different positions in a topic or across topics and these positions are outside of the Broker’s own caches. In these scenarios Pulsar must send read requests to bookies and the worst case is that the entries and readaheads are completely disjoint from one another and the total size of those disjoint readaheads exceeds the capacity of the cache.

Fig 2. Clients (consumer objects in Pulsar brokers) evict each other's readahead entries before they can be utilized.

As the demand exceeds the cache size, more and more entries get evicted from the cache that have not yet been read and therefore must be reread from disk.
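The effect can be demonstrated with a toy simulation: two readers stream disjoint entry ranges through a shared FIFO cache. When the cache is large enough for both readahead batches, each batch is loaded from disk once; when it is smaller than one batch, the readers evict each other's entries and nearly every read triggers another batch load. All numbers and names are illustrative, not taken from a real bookie.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

/** Toy cache-thrashing demo: counts how many readahead batch loads from
 *  "disk" are needed for two readers streaming disjoint entry ranges
 *  through one shared FIFO cache. Illustrative only. */
class ThrashingDemo {
    static int diskBatchLoads(int cacheSlots, int batch, int entriesPerReader) {
        ArrayDeque<Integer> fifo = new ArrayDeque<>(); // eviction order (oldest first)
        Set<Integer> cached = new HashSet<>();
        int loads = 0;
        for (int i = 0; i < entriesPerReader; i++) {
            for (int reader = 0; reader < 2; reader++) {
                int entry = reader * 1_000_000 + i;    // disjoint ranges per reader
                if (!cached.contains(entry)) {
                    loads++;                           // cache miss -> readahead from disk
                    for (int j = 0; j < batch; j++) {  // load a batch into the cache
                        int e = reader * 1_000_000 + i + j;
                        if (cached.add(e)) {
                            fifo.add(e);
                            if (fifo.size() > cacheSlots)
                                cached.remove(fifo.poll()); // evict the oldest entry
                        }
                    }
                }
            }
        }
        return loads;
    }
}
```

With a cache of 20 slots and batches of 10, the two readers consume 100 entries each with only 20 batch loads; shrink the cache to 5 slots and every single read becomes a disk load.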

Fig 3. The relationship between how much of the read cache demand fits into the cache and the average number of times each entry ends up being read (and reread) from disk. Based on numbers from a real incident.

This read cache thrashing can be eliminated or reduced by:

  • Increasing the amount of memory available to the read cache
  • Reducing the readahead batch size
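Both levers map to bookie settings. The property names below are those used in the bookie configuration shipped with Pulsar (verify them against your version); the values are illustrative only, not recommendations.

```properties
# Total read cache size per bookie in MB; when unset it defaults to 25% of
# direct memory, split across the DbLedgerStorage instances.
dbStorage_readAheadCacheMaxSizeMb=2048

# Entries loaded per readahead; lowering this reduces cache churn under
# scattered read workloads, at the cost of more frequent disk reads.
dbStorage_readAheadCacheBatchSize=500
```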

Reads Summary

If we set aside high CPU utilization, bottlenecks in the read path come down to disk IO. The simple case is that the underlying storage volume is not fast enough to keep up with demand. Another cause could be that there are not enough read threads to push the volume to its limit. Finally, there is read cache thrashing, which can dramatically increase the amount of disk IO, causing you to hit the limit of your storage volume far sooner than you otherwise would.

In the next post we’ll look at the back-pressure mechanisms that BookKeeper has to protect itself from overload. We’ve already covered many of these but there are some others we have yet to talk about.
