Apache BookKeeper Observability — Part 5 — Read Metrics in Detail

Jack Vanlightly
Published in Splunk-MaaS
Oct 21, 2021

Based on BookKeeper 4.14 as configured for Apache Pulsar.

In the last post we focused on some simple indicators of overload in the read path using utilization and saturation metrics. In this post we’ll describe all the metrics that are available, what they mean and how to use them.

We covered the higher level metrics in the read-path in part 1. This post is focused on the lower level DbLedgerStorage metrics.

It is recommended to have read BookKeeper Internals Part 3 as it describes the inner workings of the DbLedgerStorage component.


Fig 1. Location of the various DbLedgerStorage metrics for reads
  1. bookie_read_cache_hits and bookie_read_cache_misses (both are OpStatsLoggers): Latency histograms and counts of read cache hits and misses. These metrics group the read and write caches together. The latency histograms are not very useful because in-memory accesses are extremely fast. More useful are the counters (bookie_read_cache_hits_count, bookie_read_cache_misses_count), which can help detect cache thrashing (see further down for more).
  2. bookie_read_cache_size (gauge): This metric is pretty meaningless and should be ignored. Read caches are ring buffers which are pre-allocated on start-up. This metric more or less just returns the position in the ring buffer which means nothing.
  3. bookie_read_cache_count (gauge): The combined number of entries in the read caches across all DbLedgerStorage instances. Not a very useful metric.
  4. bookie_readahead_batch_count (OpStatsLogger): A histogram of the number of entries read in each read-ahead operation. We also get a counter for calculating the rate of read-ahead operations (bookie_readahead_batch_count_count) and a sum of the total entries read by read-ahead ops (bookie_readahead_batch_count_sum), from which the read-ahead entry read rate can be derived. This is an extremely useful metric for diagnosing cache thrashing, see below for the details.
  5. bookie_readahead_batch_size (OpStatsLogger): A histogram of the number of bytes read in each read-ahead operation.
  6. bookie_read_entry (OpStatsLogger): Latency histogram of the read operation as a whole (within DbLedgerStorage), meaning it covers the operation start to finish: the cache lookups, reading from the indexes, reading the entry from its entry log file and the read-ahead. Because it encompasses so many parts of the read process, it can only show the latency characteristics of reads as a whole. We also get a counter bookie_read_entry_count (for the read rate) and a latency sum bookie_read_entry_sum (for time spent on reads), which are useful for diagnosing whether reads are a bottleneck (see the example queries just below this list).
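As a minimal sketch (assuming the Prometheus metric names above and a per-pod label, as used in the queries later in this post), the counters and sums of these OpStatsLoggers can be combined into rates and averages:

Read-ahead op rate: sum(rate(bookie_readahead_batch_count_count[1m])) by (pod)

Entries read by read-ahead per second: sum(rate(bookie_readahead_batch_count_sum[1m])) by (pod)

Average read-ahead batch size: sum(rate(bookie_readahead_batch_count_sum[1m])) by (pod) / sum(rate(bookie_readahead_batch_count_count[1m])) by (pod)

Average DbLedgerStorage read latency (in whatever time unit the stats provider exports): sum(rate(bookie_read_entry_sum[1m])) by (pod) / sum(rate(bookie_read_entry_count[1m])) by (pod)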

If the problem lies in RocksDB, the metrics can't currently tell you that, and likewise if the read-ahead is where the time is going, we can't know that from metrics alone. There are other ways of getting that information, such as flame graphs, though that requires access to the bookie server (or pod in a Kubernetes environment).

Detecting Read Cache Thrashing

We covered what read cache thrashing is in BookKeeper Internals Part 3, but as a reminder: it's what happens when read-ahead operations end up evicting entries of other clients before those clients have been able to read them. The more early evictions there are, the more entries must be read from disk again. This occurs when the read cache is too small for the demand.

In bad cases this can cause each entry to be read multiple times from disk and make read throughput drop off a cliff. You really don’t want cache thrashing occurring on your bookies.

Luckily, detecting thrashing isn’t too hard to do.

The following Grafana panels show a load test where I purposely induced cache thrashing by adding more and more consumers on different topics reading from the beginning of the topic. These messages are no longer in the broker caches and so the brokers must send read requests to the bookies.

You can see in the next panel that just before 22:20, the operation rate starts to drop quickly while the request rate continues to climb. Correspondingly, the failure rate climbs because the read thread pool task queues are full. The read operation rate drops because of cache thrashing; the request rate continues to climb because the bookie is rejecting requests immediately, with low latency.

Fig 2. Read operation rate drops as the failure rate climbs due to full thread pool task queues

Read request rate: sum(irate(bookkeeper_server_READ_ENTRY_REQUEST_count[1m])) by (pod)

Failed read request rate: sum(irate(bookkeeper_server_READ_ENTRY_REQUEST_count{success="false"}[1m])) by (pod)

Read operation rate: sum(irate(bookkeeper_server_READ_ENTRY_count[1m])) by (pod)

Read throughput: sum(irate(bookie_READ_BYTES[1m])) by (pod)

These high level metrics were covered in part 1.
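As a small derived sketch (not an official dashboard panel), the same request counters can also be expressed as the percentage of read requests that fail or are rejected:

Failed/rejected read request percentage: (sum(rate(bookkeeper_server_READ_ENTRY_REQUEST_count{success="false"}[1m])) by (pod) / sum(rate(bookkeeper_server_READ_ENTRY_REQUEST_count[1m])) by (pod)) * 100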

How can we detect cache thrashing? We simply measure the number of reads to disk compared to the number of read operations. Anything above 1 disk read to every 1 read op indicates entries are being evicted early from the cache and are being read another time.

Fig 3. A strong signal of read cache thrashing

Reads to disk per read op (see bottom of post for an explanation): (sum(rate(bookie_readahead_batch_count_count[1m])) by (pod) + sum(rate(bookie_readahead_batch_count_sum[1m])) by (pod)) / sum(rate(bookie_read_entry_count[1m])) by (pod)

Cache miss-to-hit ratio: (sum(irate(bookie_read_cache_misses_count[1m])) by (pod) / sum(irate(bookie_read_cache_hits_count[1m])) by (pod)) * 100

Also included is a secondary indicator based on the cache miss-to-hit ratio. Ideally we want the ratio to be 1/"read-ahead batch size". You can see the batch size histogram with the bookie_readahead_batch_count OpStatsLogger metric. In the above case it's hovering between 93 and 110, which means that a healthy ratio is around 1% (as in 1 in 100 read operations should miss the cache). Going above 1% for this batch size indicates that thrashing "may" be occurring, but in the end the only way to know for sure is the "reads to disk per read op" ratio.
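As a rough sketch, the expected miss percentage for the current batch size can be computed from the same counters and compared to the observed miss-to-hit ratio above:

Expected cache miss percentage (1 / average batch size): (sum(rate(bookie_readahead_batch_count_count[1m])) by (pod) / sum(rate(bookie_readahead_batch_count_sum[1m])) by (pod)) * 100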

Remediation Options

If you don’t have cache thrashing then:

  1. First, if you have the journal and DbLedgerStorage running on the same volume, separating them onto dedicated volumes can help a lot.
  2. Add more bookies to the cluster to spread out the load. Dynamic scale-out is one of BookKeeper's strengths.
  3. Configure multiple DbLedgerStorage directories and mount a separate volume per directory. This spreads the file IO over multiple volumes (see the configuration sketch after this list).
  4. If you are using hard drives, try switching to high-performance SSDs.
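A minimal bookkeeper.conf sketch of options 1 and 3, assuming hypothetical mount points (the paths here are examples only; journalDirectories and ledgerDirectories are the standard settings):

# journal on its own volume (example path)
journalDirectories=/mnt/bk-journal/journal

# one DbLedgerStorage directory per mounted volume (example paths)
ledgerDirectories=/mnt/bk-ledgers-1/ledgers,/mnt/bk-ledgers-2/ledgers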

If you have cache thrashing then look at one or more of the following:

  1. Increase the direct memory of the bookie (25% of that will go to read caches)
  2. Specifically set the read cache memory to a higher value (dbStorage_readAheadCacheMaxSizeMb)
  3. Reduce the read-ahead batch size so that the cache can accommodate more Pulsar consumers at different positions/topics (dbStorage_readAheadCacheBatchSize). A configuration sketch follows this list.
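A minimal sketch of options 2 and 3 in bookkeeper.conf. The values are illustrative only, not recommendations; size the read cache against the bookie's available direct memory:

# illustrative values only - tune against your direct memory budget and workload
dbStorage_readAheadCacheMaxSizeMb=2048
dbStorage_readAheadCacheBatchSize=500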

Metrics Blind Spots

The following questions cannot be answered easily by metrics:

  • As with writes, we cannot quickly tell whether failed read requests are genuine errors or rejected reads. You must check your error logs to see if the increase can be explained by genuine errors.
  • In configurations with multiple ledger directories, we can't know if one DbLedgerStorage instance is misbehaving. It could be one bad volume or one RocksDB instance having issues, but all metrics are aggregated across all instances.
  • Are slow reads due to the RocksDB locations index, the initial entry read from an entry log file or the read-ahead operation? Current metrics do not separate these out.
  • When we see failed requests that we believe are actually rejected requests, and the time utilization of the thread pool is less than 100%, then we know one or more read threads have saturated, but we don't know which ones or how the load is distributed across them.

In order to answer some of these questions, other techniques such as flame graphs are required.

Series Summary

This is the end of this metrics series. We’ve covered a lot but we can summarize some things:

  • Utilization and saturation metrics are usually the first place to look to see if your BookKeeper servers are a bottleneck in your Pulsar deployment. We've covered everything that is available right now in version 4.14.
  • We've covered almost all the metrics available regarding reads and writes and placed them visually in the mental model.
  • There are some blind spots and a future BookKeeper release should close those.

The big thing we've not covered is the Pulsar metrics. BookKeeper does not work in isolation and usually diagnosing performance issues requires a more holistic approach. Back pressure exerted from BookKeeper can travel all the way up the chain to the Pulsar client, for example.

It’s complicated

Back pressure, or just higher write latencies, can cause the brokers and then the clients to reduce their write rate. When this happens and BookKeeper is actually the bottleneck, there may not be overly strong signals of overload in BookKeeper itself. In these cases you are still likely to see high utilization but not necessarily saturation. This is where Pulsar broker and Pulsar client metrics are important. Hopefully in the future we'll write a series that looks at Pulsar brokers in the same way we have done for BookKeeper servers.

All the examples in this series are just examples. You are likely to see all kinds of patterns depending on your hardware, specific load patterns etc. Hopefully this series at the very least has armed you with knowledge so you can investigate your own specific snowflake performance issues more expertly.

Series links:

** Explanation of read to disk vs read op metric **

(sum(rate(bookie_readahead_batch_count_count[1m])) by (pod) + sum(rate(bookie_readahead_batch_count_sum[1m])) by (pod)) / sum(rate(bookie_read_entry_count[1m])) by (pod)

We need to take the sum of the entries read in read-ahead ops (bookie_readahead_batch_count_sum) plus the number of normal entry reads. When a cache miss occurs, the entry is read from disk and then a read-ahead is performed. So the number of normal reads = the number of read-ahead ops (bookie_readahead_batch_count_count). Adding them together gives us the number of entry reads to disk, then bookie_read_entry_count gives the number of read operations.
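As a purely illustrative example: if in one minute a bookie performs 10 read-ahead ops (and therefore roughly 10 cache-miss reads) that each pull in 100 entries, that is 10 + 1,000 = 1,010 entry reads from disk. If only 600 read operations were served in that minute, the ratio is about 1.7 disk reads per read op, a sign that many pre-fetched entries were evicted before ever being used.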
