Apache BookKeeper Observability — Part 2 — Write USE Metrics

Jack Vanlightly
Published in Splunk-MaaS
Oct 21, 2021 · 8 min read

Based on BookKeeper 4.14 as configured for Apache Pulsar.

In the last post we introduced the types of BookKeeper metrics, the highest level metrics for request rates and latencies, and some high level indicators of saturation. Those metrics can tell us we have a performance problem, but they don't tell us where the problem originates; for that we need to start looking at the journal and DbLedgerStorage metrics.

In this post we’ll focus on the basic USE metrics (Utilization, Saturation, Errors) for the write path (journal and DbLedgerStorage) and in the next post take a further look at all the write path metrics available.

It is recommended to have read BookKeeper Internals Part 2 as it describes the inner workings of the journal and DbLedgerStorage components.

There are two general types of write bottleneck below the Netty layer:

  • Journal
  • Ledger Storage

Time Utilization

Time utilization is the amount of time spent doing something per second of wall-clock time. If a thread spends 500ms per second doing work then it is at 50% utilization. We can measure time utilization in different places to discover whether or not there is a bottleneck.
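In Prometheus terms, the _sum component of each latency metric is a counter of total time spent, so its per-second rate gives exactly this time-spent-per-second figure. All of the utilization queries in this post follow the same pattern, sketched here with a placeholder metric name (the unit the _sum counter is recorded in depends on the exporter, so confirm it before treating any particular value as 100%):

Generic time utilization: sum(irate(<operation_latency>_sum[1m])) by (pod)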

Write thread(s)

The write threads only place entries on the journal queue and in the DbLedgerStorage write cache, both of which are normally very low latency operations.

When a queue or cache is full the write thread blocks, so when a write thread shows higher utilization it usually means that either the journal queue or the write cache is full. As a result, write threads tend to show either low or very high utilization, with not much time spent in the middle.

The case below is with a rather slow spinning disk for the journal. A tipping point is reached when disk IO latencies spike, causing these quite fat spikes in write thread utilization as the write thread waits for space on the journal queue.

Fig 1. Write thread time utilization (time spent per second) oscillates wildly as a downstream bottleneck oscillates.

Write thread time spent: sum(irate(bookkeeper_server_BookieWriteThreadPool_task_execution_sum[1m])) by (pod)

A bookie under pressure can still have low write thread utilization; however, once utilization does spike it means there is saturation in either the journal or DbLedgerStorage (which we'll cover further down).
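Note that the query above aggregates time spent across all write threads in the pool. If you run more than one write thread (the numAddWorkerThreads setting, which defaults to 1), divide by the thread count so that 1 second per second still corresponds to 100% utilization. A sketch assuming a hypothetical two-thread configuration:

Write thread time spent per thread (2 threads): sum(irate(bookkeeper_server_BookieWriteThreadPool_task_execution_sum[1m])) by (pod) / 2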

Time Utilization ~> Resource Utilization

All file IO in BookKeeper is synchronous and, for each journal or DbLedgerStorage instance, serial. This means that time utilization for file IO operations correlates closely with resource utilization.

For example, the force write thread spends the vast majority of its time performing fsyncs of the current journal file. If 99% of its time is spent on fsyncs then the fsync capacity is fully utilized and there probably isn't much more work it can do; the journal is very likely the bottleneck.

Likewise in DbLedgerStorage, if 99% of the time is spent on flushes then we know that DbLedgerStorage hasn't got much more to give and is likely the bottleneck.

NOTE! Disks are hard to predict: even when reaching 99–100% utilization on file IO, it is possible you've still got a little room for extra throughput.

The Journal

Flushes (writes) and force writes (fsyncs) each get a dedicated thread whose only purpose is to perform that operation. Assuming that this bookie has only a single journal, 100% utilization would mean 1 second of time spent per second.

Fig 2. As load increases so does the force write time spent per second.

Journal force write 1 second window: sum(irate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod)

Journal force write 60 second window: avg_over_time((sum(irate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod))[1m:])

Journal force write 5 minute window: avg_over_time((sum(irate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod))[5m:])

Note in the Grafana panel above that we suddenly start getting huge peaks and troughs for the 1 second windows. This occurs when fsyncs take longer than the Prometheus scrape interval, which causes alternate scrapes to see no change in the bookie_journal_JOURNAL_SYNC_sum counter, and so the rate is reported as 0.

This is a rather extreme example, but it can happen, especially on spinning disks. To avoid it you can calculate these rates over longer windows; we can see that using a 5 minute window smooths things out, showing that fsyncs reach 100% utilization (by reaching 1 second per second spent on fsyncs).
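If you want to alert on sustained journal saturation, the 5 minute window version is a reasonable basis. The condition below is only a sketch: it assumes you have confirmed the unit of the _sum counter and scaled the expression so that full utilization of a single journal reads as 1, and the 0.9 threshold is an arbitrary example rather than a recommendation.

Sustained force write saturation (sketch): avg_over_time((sum(irate(bookie_journal_JOURNAL_SYNC_sum[1m])) by (pod))[5m:]) > 0.9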

You can plot the same metrics for journal flushes using the bookie_journal_JOURNAL_FLUSH_LATENCY_sum metric.
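For example, the 1 second window version becomes:

Journal flush 1 second window: sum(irate(bookie_journal_JOURNAL_FLUSH_LATENCY_sum[1m])) by (pod)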

DbLedgerStorage

DbLedgerStorage flushes can be run by either the Sync thread or the DbStorage thread, but only one flush can be executed at a time. So, as with the journal, we can measure the amount of time spent flushing per second to calculate utilization:

Fig 3. As the load is increased, the amount of time per second spent flushing increases.

Flush time, 1 second window: sum(irate(bookie_flush_sum[1m])) by (pod)

Flush time, 60 second window: avg_over_time((sum(irate(bookie_flush_sum[1m])) by (pod))[1m:])

Flush time, 5 minute window: avg_over_time((sum(irate(bookie_flush_sum[1m])) by (pod))[5m:])

Again, note that the 1 second windows suffer from peaks and troughs under heavy load as the flush time exceeds the scrape interval. Longer windows can be used to avoid this. With a 5 minute window we see flush utilization reach 100% (1 second spent flushing per second).

NOTE! In BookKeeper 4.14 and earlier you will likely see the time spent flushing reach 2 seconds per second! This is an anomaly caused by the start of each DbLedgerStorage flush being recorded before the lock that prevents concurrent flushes is acquired. So when both the Sync thread and the DbStorage thread try to flush ledger storage at the same time, a non-trivial amount of time in either thread is spent waiting for the lock. The important thing to note is that once this value reaches 1 second (per DbLedgerStorage instance), DbLedgerStorage has reached full utilization.
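If that double counting makes the panels confusing, one cosmetic option (a workaround, not a fix) is to cap the plotted value at full utilization, assuming your panel is scaled so that 1 means 1 second per second:

Flush time, 5 minute window (capped): clamp_max(avg_over_time((sum(irate(bookie_flush_sum[1m])) by (pod))[5m:]), 1)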

Saturation

A fully utilized system is unlikely to keep up with demand, and so internal buffers (queues, caches) for pending work start to grow until full. These buffers can act as strong signals of a bottleneck: a full buffer indicates that the resource behind it has reached full capacity.

The problem with queue size metrics is that queues are often either empty or full and have a very spiky pattern. Because these are sample based metrics, we can often miss these short-lived spikes.

Write Thread(s)

Write threads can be tricky to use as a saturation signal because they are usually either doing fine or blocked. When they are blocked their queues can fill rapidly, and once the downstream bottleneck is cleared the queues empty just as fast. The queue size metrics are samples and can completely miss these spikes if they are short-lived.

Take the example below. At the top we see 5 spikes of failed requests (requests rejected because the write thread pool task queue is full). This means that the task queue reached 10000 requests on 5 occasions, but the task queue length metrics only show that happening twice.

Fig 4. Queue length metrics sometimes miss short-lived spikes
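The task queue lengths in Fig 4 come from the write thread pool's per-thread queue gauges. The exact gauge name can vary between versions, so the regex match below is an assumption based on how the gauges appear in a typical 4.14 deployment rather than a documented name:

Write thread pool task queue length: sum({__name__=~"bookkeeper_server_BookieWriteThreadPool_queue.*"}) by (pod)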

This kind of spiky behaviour is common when the journal is the bottleneck, as the journal queues themselves tend to behave the same way.

When DbLedgerStorage is the bottleneck we tend to see the write thread queue lengths change more slowly, because when the write cache is full the write thread gets blocked for up to 10 seconds at a time.

The Journal

The journal has two queues:

  • journal queue
  • force write queue

If either of these starts filling up then we know that the journal is not keeping up. When the journal queue is full it causes the write thread to block, which in turn causes the write thread's own task queue to fill up.

The journal queues can fill up very fast when flushes and fsyncs take longer than normal, creating a pattern like the one below. For a period of about an hour we see the max flush latency and the p50 force write latency spike, causing the journal and force write queues to repeatedly fill and empty.

Fig 5. Latency spikes in disk IO can cause queues to fill fast

Journal flush latency (max): sum(bookie_journal_JOURNAL_FLUSH_LATENCY{success="true", quantile="1.0"}) by (pod)

Journal queue length: sum(bookie_journal_JOURNAL_QUEUE_SIZE) by (pod)

Journal force write latency (max): sum(bookie_journal_JOURNAL_SYNC_LATENCY{success="true", quantile="1.0"}) by (pod)

Journal force write queue length: sum(bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE) by (pod)
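Since either queue filling up tells the same story, it can be convenient to plot a single combined journal backlog panel using the same two gauges (purely a presentation choice):

Journal total backlog: sum(bookie_journal_JOURNAL_QUEUE_SIZE) by (pod) + sum(bookie_journal_JOURNAL_FORCE_WRITE_QUEUE_SIZE) by (pod)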

DbLedgerStorage

On the DbLedgerStorage side of things, if the flushes cannot keep up with the write demand then both write caches end up full, blocking further writes to ledger storage. This causes the write thread to block and requests to start getting rejected higher up.

Other indicators of saturation are throttled and rejected writes. Writes are throttled when the active write cache is full but cannot be swapped out because the other write cache is still being flushed. If a write cache doesn't become available soon enough, writes are rejected by DbLedgerStorage. So non-zero throttled writes indicate a DbLedgerStorage instance under pressure, and non-zero rejected writes indicate a saturated one.

Fig 6. As the write cache reaches capacity writes start getting throttled and then rejected

Throttled writes per minute: ceil(sum_over_time((sum(irate(bookie_throttled_write_requests[1m])) by (pod))[1m:]))

Rejected writes per minute: ceil(sum_over_time((sum(irate(bookie_rejected_write_requests[1m])) by (pod))[1m:]))

Write cache size: sum(bookie_write_cache_size) by (pod)

The default time a write will block waiting for a write cache to become available is 10 seconds, which is quite long. This means that even very low values of throttled and rejected writes indicate a problem. In the above case it reached peaks of 2 per minute. With the default wait of 10 seconds, the maximum possible rate is 6 per minute per write thread (and the number of write threads defaults to 1).
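Given that even a handful per minute matters, a simple alerting sketch is to fire on any rejected writes at all over a window (the 5 minute window here is an arbitrary choice):

Any rejected writes (sketch): sum(increase(bookie_rejected_write_requests[5m])) by (pod) > 0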

NOTE! There is no metric for the maximum write cache size in version 4.14 and earlier, so you'll need to calculate it yourself (by default it is 25% of direct memory).
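For example, if your configuration sets the write cache size explicitly (the dbStorage_writeCacheMaxSizeMb setting), you can divide the gauge by that hand-calculated byte value to get a rough cache utilization ratio. The sketch below assumes the gauge is reported in bytes and uses a made-up 1024 MB maximum:

Write cache utilization (sketch, 1024 MB max): sum(bookie_write_cache_size) by (pod) / (1024 * 1024 * 1024)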

Summary

Knowing the rate of disk flushes or the latency of a given operation usually isn't enough to diagnose when a bookie has reached its limit. For most metrics we need to know what normal looks like, and sometimes even that is not possible if a problem has been going on for long enough.

To quickly diagnose when a bookie is a performance bottleneck we employ the tried and tested USE method. We can rapidly identify whether a bottleneck exists and, if so, where it is, without having to interpret the meaning of metrics like rates and latencies.

Just be careful with those queue length metrics which can hide short-lived spikes.

In the next post we’ll dive deeper into all the journal and DbLedgerStorage metrics available to you.

Series links:
