Apache BookKeeper Observability — Part 3 — Write Metrics in Detail

Jack Vanlightly
Published in Splunk-MaaS
Oct 21, 2021 · 8 min read

Based on BookKeeper 4.14 as configured for Apache Pulsar.

In the last post we focused on some simple indicators of overload in the journal and DbLedgerStorage using utilization and saturation metrics. In this post we'll describe all the metrics that are available, what they mean and how to use them.

We covered the higher-level metrics in the write-path in part 1. This post focuses on the lower-level journal and DbLedgerStorage metrics.

It is recommended to have read BookKeeper Internals Part 2 as it describes the inner workings of the journal and DbLedgerStorage components.

Journal Metrics

NOTE! For those who run with multiple journal directories, all the following metrics are aggregated across all journal instances. The implication is that it is not possible to view the metrics of an individual journal instance, so one slow journal attached to a problematic volume may only show up in the latency histograms (for example in the higher percentiles).

The journal has the following relevant metrics.

Fig 1. Location of the various journal metrics
  1. bookie_journal_JOURNAL_QUEUE_SIZE (gauge): The size of the in-memory queue that feeds the Journal Thread. This is a blocking queue with a default size bound of 10,000 entries, which means that once the queue reaches its limit, attempts to add an entry block the caller (a write thread) until space becomes available. If this metric is consistently high then it indicates the journal is saturated.
  2. bookie_journal_JOURNAL_QUEUE_LATENCY (OpStatsLogger): Measures the time spent in the in-memory queue before being dequeued by the Journal Thread. Higher than normal latencies here indicate that the Journal Thread is struggling to keep up.
  3. bookie_journal_JOURNAL_ADD_ENTRY (OpStatsLogger): Measures the latency from the moment an entry is placed in the queue to the moment the callback is called (end-to-end journal latency).
  4. bookie_journal_JOURNAL_NUM_FLUSH_MAX_WAIT (counter): Increments every time the Journal Thread performs a journal flush (a write syscall to the journal file) due to the max wait time being reached (default 2ms). When the journal is not at capacity this should normally be the only, or at least the dominant, counter of the three flush trigger metrics (this metric and the next two). See BookKeeper Internals Part 2 for more on the journal flush triggers.
  5. bookie_journal_JOURNAL_NUM_FLUSH_MAX_OUTSTANDING_BYTES (counter): Increments every time the Journal Thread flushes to the journal file due to reaching either the maximum accumulated bytes or the maximum accumulated entries. When this counter is increasing it indicates that the journal may be under pressure.
  6. bookie_journal_JOURNAL_NUM_FLUSH_EMPTY_QUEUE (counter): By default journals do not flush upon emptying their queue, but this behaviour can be enabled in configuration. This counter records the number of flushes triggered by the queue emptying. A steadily increasing value indicates that the Journal Thread is under light load and has spare capacity.
  7. bookie_journal_JOURNAL_FLUSH_LATENCY (OpStatsLogger): Measures the latency of journal flushes by the Journal Thread. Being an OpStatsLogger, we also get a counter bookie_journal_JOURNAL_FLUSH_LATENCY_count (for flush rate) and a latency sum bookie_journal_JOURNAL_FLUSH_LATENCY_sum (for time spent flushing). The flush latency, rate and time spent flushing are all helpful for diagnosing flushing as a bottleneck (see last post).
  8. bookie_journal_JOURNAL_WRITE_BYTES (counter): The number of bytes written by the journal.
  9. bookie_journal_JOURNAL_FORCE_WRITE_BATCH_BYTES (OpStatsLogger): A histogram of flush sizes in bytes (and therefore of the force write requests that are created after flushes). A bookie under pressure will be accumulating larger batches, and you might see this metric flatline at its maximum flush size.
  10. bookie_journal_JOURNAL_FORCE_WRITE_BATCH_ENTRIES (OpStatsLogger): A histogram for the size of the flushes in terms of number of entries. Not usually an important metric.
  11. bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE (gauge): The number of force write requests in the force write queue. This is a blocking queue with a default size bound of 10,000 entries; once the queue reaches its limit, attempts to add a request block the caller (the Journal Thread) until space becomes available. If this metric is consistently high then journal fsyncs may be saturated.
  12. bookie_journal_JOURNAL_FORCE_WRITE_ENQUEUE (OpStatsLogger): The latency histogram for how long force write requests remain in the force write queue. When these latencies go up, expect the force write queue size to grow and the journal sync latencies (next metric) to rise as well.
  13. bookie_journal_JOURNAL_SYNC (OpStatsLogger): The latency histogram of journal fsync operations performed by the Force Write Thread. As an OpStatsLogger it also gives us a counter bookie_journal_JOURNAL_SYNC_count (for the fsync rate) and a latency sum bookie_journal_JOURNAL_SYNC_sum (for time spent on fsyncs), both useful for diagnosing fsyncs as a bottleneck (see the example queries after this list).
  14. bookie_journal_JOURNAL_CB_QUEUE_SIZE (gauge): Despite the name, this is not an actual queue; it is the number of entries currently being processed by the journal that have not yet had their callbacks executed. The upper bound of this value depends on the size of the journal entry queue, the number of accumulated entries yet to be flushed, the size of the force write request queue and the size of the Journal Callback Thread's task queue, which makes it harder to use this metric to detect saturation.
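As a sketch of how these metrics can be combined, here are a few example queries, assuming the metrics are scraped by Prometheus (as in a typical Pulsar deployment) and that a pod label identifies each bookie; the label names, the 1m range and the latency units are assumptions to adapt to your own setup (if your stats provider reports latency sums in milliseconds, divide by 1,000 to get seconds).

```
# Journal flush rate (flushes per second) per bookie
sum(rate(bookie_journal_JOURNAL_FLUSH_LATENCY_count[1m])) by (pod)

# Time spent flushing per second (a utilization-style signal for the Journal Thread)
sum(rate(bookie_journal_JOURNAL_FLUSH_LATENCY_sum[1m])) by (pod)

# Time spent on fsyncs per second (a utilization-style signal for the Force Write Thread)
sum(rate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod)

# Saturation hint: journal queue depth approaching its default 10,000 bound
max_over_time(bookie_journal_JOURNAL_QUEUE_SIZE[5m])
```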

Remediation Options for Journal Bottlenecks

  1. First, if the journal and DbLedgerStorage are running on the same volume, separating them onto different volumes can help a lot.
  2. Add more bookies to the cluster to spread out the load. One of the features of BookKeeper is its dynamic scaling.
  3. Configure multiple journal directories and mount a separate volume per directory. This spreads the journal file IO over multiple volumes (see the configuration sketch after this list).
  4. If you are using hard drives then try switching to high-performance SSDs.
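As an illustration of option 3, here is a minimal bookkeeper.conf sketch with two journal directories, assuming a separate volume is mounted at each path (the paths are illustrative, not defaults):

```
# bookkeeper.conf: one journal instance per directory, each directory on its own volume
journalDirectories=/mnt/journal-1,/mnt/journal-2
```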

DbLedgerStorage Metrics

NOTE! For those who run with multiple ledger directories, all the following metrics are aggregated across all DbLedgerStorage instances. The implication is that it is not possible to view the metrics of an individual DbLedgerStorage instance, so one slow ledger storage instance attached to a problematic volume may only show up in the latency histograms (for example in the higher percentiles).

Fig 2. Location of all DbLedgerStorage metrics
  1. bookie_throttled_write_requests (counter): When a write thread finds the write cache full this counter is incremented, and a DbLedgerStorage flush is triggered at that point if one isn't already in progress. In itself this metric does not mean that DbLedgerStorage has reached its capacity, just that a write has been forced to wait for a write cache to become available. Part of the synchronous flow of a write request.
  2. bookie_rejected_write_requests (counter): This counter is incremented when the write thread times out waiting for an empty write cache to be swapped in. The write thread stops waiting and rejects the write. This counter is a sure sign of DbLedgerStorage reaching capacity as this indicates that the swapped out cache is still flushing. Part of the synchronous flow of a write request.
  3. bookie_write_cache_size (gauge): The sum in bytes of both the active write cache and the swapped out write cache across all DbLedgerStorage instances. The closer this gauge gets to the maximum write cache size (25% of direct memory by default), the more pressure DbLedgerStorage is under. If the cache regularly reaches this maximum or remains at this maximum for long periods then write requests are likely getting rejected due to the write caches being full.
  4. bookie_write_cache_count (gauge): The combined number of entries in both write caches across all DbLedgerStorage instances. There is no limit on the number of entries, only bytes, so this metric is not as useful as its bytes counterpart.
  5. bookie_flush (OpStatsLogger): Latency histogram and counter for DbLedgerStorage flushes. This metric is useful for diagnosing why write requests are being rejected and why the write cache size is high. It is a combined value covering the flushing of the ledgers index, the entry location index and the entry log file, across all ledger directories, so it is not currently possible to tell whether the problem is a RocksDB instance or slow entry log flushes. Usually it is the entry log file, as the number of bytes written to the indexes is comparatively small. You can get the flush rate from bookie_flush_count (see the example queries after this list).
  6. bookie_flush_size (OpStatsLogger): Size histogram for the number of bytes per flush. Not indicative of anything by itself.
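As with the journal, these metrics are most useful in combination. A few example queries follow, again assuming Prometheus scraping with a pod label per bookie; the names, units and ranges are assumptions to adapt.

```
# Rejected writes per second: a strong sign that DbLedgerStorage has reached capacity
sum(rate(bookie_rejected_write_requests[1m])) by (pod)

# Write cache fill in bytes; compare it against your configured limit (25% of
# direct memory by default), since the limit itself is not exported as a gauge
bookie_write_cache_size

# DbLedgerStorage flush rate and time spent flushing per second
sum(rate(bookie_flush_count[1m])) by (pod)
sum(rate(bookie_flush_sum[1m])) by (pod)
```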

A Note on Flushes

A DbLedgerStorage flush performs three operations serially:

  1. Flush the swapped-out write cache to one or more entry log files.
  2. Add the entry locations (a location being the entry log file and offset) of the flushed entries to the locations index, then flush that index.
  3. Flush the ledger index (the index that stores which ledgers a bookie has, along with other metadata such as whether each ledger is fenced).

Typically you'd expect the entry log flushing to take the most time, but there have been cases where RocksDB has ended up in a weird state and its flushes have taken a long time.

Remediation Options for DbLedgerStorage Bottlenecks

The basic advice is the same as for the journal: make sure the journal and DbLedgerStorage run on separate volumes. Secondly, make use of BookKeeper's ability to scale out dynamically by adding more bookies. Thirdly, spread the ledger storage over multiple volumes.

If you've already got multiple ledger directories then another option is to increase the number of write threads to match. This can reduce the amount of blocking that happens when one write cache becomes full: with only one write thread and four ledger directories, one full DbLedgerStorage instance can end up blocking writes destined for the other instances that are not full.
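A minimal sketch of the relevant bookkeeper.conf settings, assuming four ledger volumes and matching the number of write threads to the number of ledger directories (paths and values are illustrative, so verify the setting names and defaults against your own version):

```
# bookkeeper.conf: one DbLedgerStorage instance per directory, each on its own volume
ledgerDirectories=/mnt/ledgers-1,/mnt/ledgers-2,/mnt/ledgers-3,/mnt/ledgers-4

# Match the number of write threads to the number of ledger directories
numAddWorkerThreads=4
```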

Metrics Blind Spots in the Write-Path

The following questions cannot be answered easily by the current metrics:

  • Going back to the high-level metrics, are failed write requests due to IO failures or were they rejected because of full thread pool task queues? Currently there is no specific metric for requests rejected at the Netty layer, so at a glance you can't attribute failed requests to a saturated write-path; you have to check the server logs to see whether there really are a lot of IO errors.
  • In a multi DbLedgerStorage instance configuration, could the issue be one bad volume? Metrics are aggregated over all instances, making a single problematic volume difficult to diagnose.
  • Are slow DbLedgerStorage flushes due to the RocksDB ledger index, the RocksDB locations index or the entry log file? Current metrics do not separate them out.
  • Calculating utilization requires knowing the maximum number of operations of a given kind that can run concurrently. For example, having two journals means that two threads are performing force writes, so 100% utilization corresponds to 2 seconds of time spent per second. BookKeeper does not emit gauges for these maximums. This is not a big deal if you always use the same config, or you only have one cluster, but it becomes a problem if you have multiple clusters with different configurations (a hard-coded example follows this list).
  • Calculating saturation due to full buffers requires that you know the buffer size limits and these are not currently emitted as gauges.
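Until such gauges exist, the divisor has to be hard-coded to match your configuration. For example, a force-write utilization figure per bookie might look like the query below; it assumes two journal directories (and therefore two Force Write Threads) and latency sums reported in milliseconds, both of which are assumptions to verify against your own deployment.

```
# Fraction of available fsync time being used, with 2 journal instances hard-coded
sum(rate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod) / 1000 / 2
```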

These blind spots are being addressed and a future BookKeeper release will cover them. When that happens we'll blog about it.

Summary

In this post we've covered the journal and DbLedgerStorage metrics that make up the write-path. There are also higher-level metrics in part 1 of this series that should not be overlooked.

Next we’re going to look at the read-path.

Series links:
