Why Apache BookKeeper? Part 1: consistency, durability, availability

Sijia-w
Published in StreamNative
Jan 9, 2020 · 7 min read

Apache BookKeeper is a scalable, fault-tolerant, and low-latency log storage service optimized for real-time workloads. BookKeeper was originally developed at Yahoo! Research, then incubated as a subproject under Apache ZooKeeper beginning in 2011, and finally graduated to become a top-level Apache project in January 2015. Since its initial introduction, BookKeeper has been widely adopted — including by enterprises like Twitter, Yahoo, and Salesforce — for storing and serving mission-critical data and supporting a variety of use cases. In this blog post, I’ll take a look at how BookKeeper provides durability, consistency, and low latency, and I’ll highlight the guarantees and key features of BookKeeper — all of which are available out of the box, in open source.

In a previous blog post, I gave a high-level overview of Apache BookKeeper and introduced several BookKeeper concepts and terms. A BookKeeper cluster is composed of:

  1. A set of individual storage servers, called bookies
  2. A metadata storage system for service discovery and metadata management

BookKeeper clients can use either the higher-level DistributedLog API (aka the log stream API) or a lower-level ledger API that enables direct interaction with bookies. A typical BookKeeper installation is shown in Figure 1 directly below.

Figure 1. A typical BookKeeper installation (plus applications connecting via multiple APIs)

Stream storage requirements

As we described in the Intro to Apache BookKeeper blog post, a real-time storage platform should meet all of the following requirements:

  • Clients should be able to write and read streams of entries with very low latency (under 5 milliseconds), even when providing strong durability
  • Data storage should be durable, consistent, and fault tolerant
  • The system should enable clients to stream or tail ledgers to propagate data as they’re written
  • The system should be able to store and provide access to both historic and real-time data

BookKeeper meets all of these requirements by providing the following guarantees:

Replication

BookKeeper replicates and stores several identical copies of each data record — typically three or five — on different machines within a datacenter or, optionally, spanning multiple datacenters. Unlike distributed systems that use master/slave or pipeline replication algorithms to copy data between replicas (such as Apache HDFS, Ceph, and Kafka), Apache BookKeeper uses a quorum-vote parallel replication algorithm to ensure predictable low latency. Figure 2 below illustrates replication within a BookKeeper ensemble.

Figure 2. Ensemble, write, and ack quorums in BookKeeper replication

In the diagram above:

  1. A set of bookies is (automatically) chosen from the BookKeeper cluster (in this case bookies 1 through 5). These bookies act as an ensemble for storing data records on a given ledger.
  2. The data records stored on the ledger are striped to the bookies in the ensemble. That is, each record is stored across a number of replicas. This number, which you can configure at the client level, is called the write quorum size. In the diagram above, the write quorum size is three, which means that records are written to bookies 2, 3, and 4.
  3. Whenever a client writes a data record to the ensemble, the client waits until a specified number of replicas send an acknowledgment (ack). This number of replicas is called the ack quorum size. When the required number of acks is received, the client considers the write operation successful. In the diagram above, the ack quorum size is two, which means that an acknowledgment is sent to the client once, for example, bookies 3 and 4 store the record. (Both quorum sizes map directly onto the client API, as shown in the code sketch after this list.)
  4. The makeup of the ensemble changes when bookie failures occur. The failed bookies will be replaced, perhaps just temporarily, by healthy bookies. For example, Bookie x might replace Bookie 5 if it goes down.
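
To make these quorum settings concrete, here is a minimal sketch using BookKeeper’s ledger API to create a ledger with the configuration from Figure 2 (ensemble size 5, write quorum 3, ack quorum 2). The ZooKeeper address and password are placeholders, not values from this post.

```java
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class QuorumConfigExample {
    public static void main(String[] args) throws Exception {
        // Connect through the cluster's ZooKeeper metadata service
        // ("localhost:2181" is a placeholder, not a value from this post).
        BookKeeper bk = new BookKeeper("localhost:2181");

        // Ensemble size 5, write quorum 3, ack quorum 2 --
        // the configuration illustrated in Figure 2.
        LedgerHandle lh = bk.createLedger(
                5, 3, 2,
                BookKeeper.DigestType.CRC32,
                "secret".getBytes());

        // addEntry() returns once the ack quorum (2 of the 3 targeted
        // bookies) has durably stored the record.
        lh.addEntry("hello, bookkeeper".getBytes());

        lh.close();
        bk.close();
    }
}
```

With this configuration, each addEntry() call completes as soon as any two of the three targeted bookies have durably stored the record.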

Replication: core ideas

BookKeeper replication is based on a set of core ideas:

  1. Log streams are record-oriented rather than byte-oriented. This means that data is always stored as indivisible records (including metadata) rather than as individual byte arrays.
  2. The ordering of records in a log (stream) is decoupled from the actual storage of copies of the record.

These two core tenets ensure that BookKeeper replication is able to:

  • Provide a wide variety of options for writing records to bookies, which ensures that writes can always be completed even if many bookies in the cluster are down or slow (as long as there’s still enough capacity to handle the load). Changing the ensemble enables this.
  • Maximize the bandwidth of a single log (stream) by configuring the ensemble size to be larger than the write quorum size, so that a single log is not limited to one machine or a small set of machines (see the striping sketch after this list).
  • Improve tail latency by tuning the ack quorum size. This is crucial to ensuring low latency in BookKeeper while also still providing consistency and durability guarantees as described below.
  • Provide fast re-replication via many-to-many replica recovery (re-replication creates more replicas for records that have become under-replicated, i.e. below the write quorum size). All bookies can act as both donors and recipients of record copies.
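
To illustrate the striping idea from the list above, here is a small, purely illustrative sketch of a round-robin write-set calculation in the spirit of the schedule BookKeeper’s client uses; it is a toy model, not the project’s actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class StripingSketch {
    // Compute which bookies receive entry `entryId` when records are
    // striped round-robin: bookies (e % E), (e+1) % E, ..., (e+W-1) % E.
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }

    public static void main(String[] args) {
        // With E=5 and W=3: entry 0 -> [0, 1, 2], entry 1 -> [1, 2, 3], ...
        for (long e = 0; e < 5; e++) {
            System.out.println("entry " + e + " -> bookies " + writeSet(e, 5, 3));
        }
    }
}
```

With an ensemble of five and a write quorum of three, consecutive entries rotate around the ensemble, so a single log’s bandwidth is spread across all five machines rather than pinned to three.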

We’ll walk you through additional details about BookKeeper replication in a future blog post.

Durability

Every data record written to BookKeeper is guaranteed to be replicated and durably stored in the configured number of bookies. This is achieved using explicit disk fsync and write confirmation.

  1. On a single bookie, data records are explicitly written (fsynced) to disk before acknowledgements are sent to clients, in order to persist data in the event of crashes. This guarantees that data is written to persistent storage that does not depend on power, in a form that allows it to be brought back and used.
  2. Within a cluster, data records are replicated to multiple bookies for fault tolerance.
  3. A data record is acknowledged to the application only after clients receive the configured number of responses (specified by the ack quorum size) from bookies, as the sketch below illustrates.
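
As a sketch of how an application observes these guarantees, the asynchronous write path below (assuming an already-open LedgerHandle, as in the earlier example) invokes its callback only after the ack quorum has been met on durably stored data:

```java
import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.LedgerHandle;

public class DurableWriteSketch {
    // Append `payload` asynchronously; the callback fires only after the
    // ack-quorum number of bookies have fsynced the record to disk.
    static void writeDurably(LedgerHandle lh, byte[] payload) {
        lh.asyncAddEntry(payload, (rc, handle, entryId, ctx) -> {
            if (rc == BKException.Code.OK) {
                // Only now should the application treat the record as durable.
                System.out.println("entry " + entryId + " is durable");
            } else {
                System.err.println("write failed with code " + rc);
            }
        }, null);
    }
}
```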

Most recent NoSQL-style databases, distributed filesystems, and messaging systems (such as Apache Kafka) assume that simply replicating data to multiple nodes in volatile storage/memory is sufficient to meet best-effort durability requirements. The problem with that approach is that it allows for potential data loss. BookKeeper was designed to provide much stronger durability guarantees, with the goal of fully preventing data loss and thereby meeting stringent enterprise requirements.

Consistency

Guaranteeing consistency is a common problem in distributed systems, particularly when replication is introduced to provide durability and availability. BookKeeper provides a simple but strong consistency guarantee — repeatable read consistency — for data stored in logs:

  • If a record has been acknowledged to the application, it must be immediately readable.
  • If a record is read once, it must always be readable.
  • If a write of record R is successful, all records up until R are successfully committed/persisted and will always be readable.
  • The order of stored records must be identical and repeatable across all readers.

This repeatable read consistency is accomplished by a LastAddConfirmed (LAC) protocol used by BookKeeper. I’ll walk you through some additional details in a future blog post.
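
As a preview of the LAC protocol, here is a hedged sketch of how a reader observes the LAC boundary through the ledger API; the ZooKeeper address, ledger id, and password are placeholder values.

```java
import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class LacReaderSketch {
    public static void main(String[] args) throws Exception {
        BookKeeper bk = new BookKeeper("localhost:2181"); // placeholder address
        long ledgerId = 42L;                              // hypothetical ledger id

        // Open without recovery so a live writer can keep appending.
        LedgerHandle lh = bk.openLedgerNoRecovery(
                ledgerId, BookKeeper.DigestType.CRC32, "secret".getBytes());

        // readLastConfirmed() returns the LastAddConfirmed (LAC) id: every
        // entry up to and including it is acknowledged and guaranteed
        // readable; readers never see anything beyond it.
        long lac = lh.readLastConfirmed();
        if (lac >= 0) {
            Enumeration<LedgerEntry> entries = lh.readEntries(0, lac);
            while (entries.hasMoreElements()) {
                LedgerEntry entry = entries.nextElement();
                System.out.println("entry " + entry.getEntryId()
                        + ": " + new String(entry.getEntry()));
            }
        }
        lh.close();
        bk.close();
    }
}
```

Because readers never read past the LAC, any record that has been acknowledged to the writer is readable, and every read is repeatable.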

Availability

In a CAP (Consistency, Availability, Partition tolerance) theorem context, BookKeeper is a CP system. In reality, however, Apache BookKeeper offers very high availability even in the presence of hardware, network, and other failures. To guarantee high write and read availability, BookKeeper uses the following mechanisms:

  • For writes: as long as enough healthy bookies remain in the cluster to form a write quorum, writes always succeed; when a bookie fails or slows down, the client replaces it through an ensemble change and continues writing.
  • For reads: any bookie holding a copy of a record can serve a read, so a record remains readable as long as at least one of its replicas is reachable.

Low latency

Strong durability and consistency guarantees are complex distributed system problems, especially when combined with the goal of meeting enterprise-grade low latency requirements. BookKeeper fulfills these requirements in the following ways:

  • On a single bookie, the bookie server is designed for I/O isolation between different workloads (write, tailing read, and catch-up/random read). A group-committing mechanism is deployed on a journal to balance latency and throughput tradeoffs.
  • The quorum-vote parallel replication scheme is used to mask the latency penalty coming from network failures, JVM garbage collection pauses, and slow disks. This improves tail latency and ensures predictably low 99th-percentile latency.
  • A long-polling mechanism notifies tailing readers and delivers newly written records as soon as they are acknowledged and confirmed (see the sketch after this list).
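
As a sketch of that last mechanism, the snippet below long-polls for the next confirmed entry. It assumes an already-open LedgerHandle and the asyncReadLastConfirmedAndEntry call found in recent client versions; exact signatures vary across releases, so treat this as an approximation.

```java
import org.apache.bookkeeper.client.BKException;
import org.apache.bookkeeper.client.LedgerHandle;

public class LongPollSketch {
    // Long-poll for the entry after `lastRead`: the bookie parks the request
    // for up to `timeoutMs` and responds as soon as a newer entry is
    // confirmed, instead of forcing the reader to busy-poll the LAC.
    static void tailOnce(LedgerHandle lh, long lastRead, long timeoutMs) {
        lh.asyncReadLastConfirmedAndEntry(
                lastRead + 1, timeoutMs, false /* no parallel speculation */,
                (rc, lastConfirmed, entry, ctx) -> {
                    if (rc == BKException.Code.OK && entry != null) {
                        System.out.println("new entry " + entry.getEntryId());
                    }
                }, null);
    }
}
```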

Finally, it is worth pointing out that durability with explicit fsync and write confirmation, together with repeatable read consistency, is crucial to stateful processing, especially effectively-once processing in streaming applications.

Conclusion

In this blog post, I examined the guarantees you can expect from BookKeeper regarding durability, consistency, availability, and low latency. I hope that this post provides a compelling set of reasons to choose BookKeeper as a storage platform for your real-time workloads. In the next few blog posts, I’ll delve more deeply into how BookKeeper replicates data, as well as the mechanisms it uses to ensure low latency while still guaranteeing consistency and durability.

If you’re interested in BookKeeper or DistributedLog, you may want to join the growing BookKeeper community via the BookKeeper mailing list or the Slack channel. You can also download the latest release (version 4.5.0) from here to get started.

For more information about the Apache BookKeeper project in general, please visit the official website at bookkeeper.apache.org and follow the project on Twitter @asfbookkeeper and @distributedlog.

About the author

Sijie Guo is the founder of StreamNative, VP of Apache BookKeeper, and PMC member of Apache Pulsar. Previously, he was the tech lead for the messaging group at Twitter, where he co-created Apache DistributedLog. Prior to Twitter, he worked on push notification infrastructure at Yahoo. You can follow him on Twitter.
