Under-the-hood of a scale-out Storage Engine

Sandeep Uttamchandani · Published in Wrong AI · Oct 5, 2017 · 7 min read

Imagine buying a car. You won't just look at the color and the interiors to make the decision (assuming, of course, that this is not an impulse buy). You will typically shortlist a few different car models and then compare their price/performance, i.e., engine, transmission, braking, etc. How do you do the same for Big Data systems?

After years of trying to answer this question myself, I developed a blueprint for distributed storage systems. Typically, you start by understanding the application requirements in terms of the data schema, operations, and QoS, as well as the deployment infrastructure (disks versus SSDs). This leads to a shortlist of a few candidates for further analysis. At this point, looking under the hood helps you get past the marketing FUD and understand what to expect day-to-day as well as during failures such as disk/node failures, network partitions, etc.

Any distributed storage system (Cassandra, MongoDB, HDFS, Ceph, Redis, etc.) can be understood by looking at 3 key design philosophies plus the strategies used in 9 key services. The rest of this post describes popular algorithms for each of these building blocks. We conclude by using the taxonomy to compare HDFS, Cassandra, MongoDB, and Redis (many more examples to be added in the future).

Sharding refers to the ability to divide the responsibility of serving data and metadata operations among the nodes within the cluster. A good sharding algorithm distributes resource usage evenly among the nodes, both in terms of incoming requests and storage space utilization. Sharding policies fall into three broad categories (a minimal lookup sketch follows the list):

  • Directory-based look-up: The namespace is distributed across nodes of the cluster, with a centralized directory to track the mapping of the namespace to nodes.
  • Range-based lookup: This is a common pattern in Masterless systems. The key-space is pre-divided into ranges and assigned to individual nodes. The mapping is computed based on the range.
  • Hybrid (Directory + Range-based): Uses a combination — writes are distributed using range-based sharding combined with deterministic policy-based functions; reads follow the directory lookup.
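To make the lookup styles concrete, here is a minimal Python sketch, with made-up node names and key-space boundaries, contrasting a directory-based lookup with a range-based lookup over a pre-divided hashed key-space. It is illustrative only, not how any particular system implements sharding.

```python
import bisect

# Directory-based lookup: an explicit (often centralized) map from key to node.
directory = {"user:1001": "node-a", "user:1002": "node-c"}   # hypothetical entries

def directory_lookup(key):
    return directory[key]                    # requires consulting/maintaining the directory

# Range-based lookup: the key-space is pre-divided into ranges owned by nodes,
# so the owner can be computed without a central directory (common in masterless systems).
range_starts = [0, 64, 128, 192]             # boundaries of a 0..255 hashed key-space
range_owners = ["node-a", "node-b", "node-c", "node-d"]

def range_lookup(key):
    h = hash(key) % 256                      # hash the key into the key-space
    idx = bisect.bisect_right(range_starts, h) - 1
    return range_owners[idx]                 # owner is computed, not looked up

print(directory_lookup("user:1001"))         # 'node-a'
print(range_lookup("user:1001"))             # depends on the hash, e.g. 'node-b'
```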

The clustering model is one of the most important and distinguishing characteristics of any Big Data system. It defines the approach by which the individual nodes within a scale-out cluster coordinate to manage resources, handle events, and serve storage data and metadata operations for the clients. Broadly speaking, there are three prevalent clustering models:

  • Master Taxonomy: In this taxonomy, there is a single node (called the Master) within the cluster with special responsibilities for managing the cluster state and coordinating activities/tasks (described later). HDFS is an example of this taxonomy.
  • Masterless Taxonomy: All the nodes within the cluster share equal responsibilities for managing the cluster state and activities. This taxonomy became popular with Amazon's Dynamo paper, and is currently used in systems such as OpenStack Swift and Cassandra.
  • Multi-Master Taxonomy: This is the evolution of the single-Master taxonomy: the global Master acts as a lightweight coordinator and divides the cluster management tasks among the nodes within the cluster. This is in contrast to the Masterless taxonomy, where no single global state is maintained for the entire cluster. Ceph, GPFS, and Bigtable are examples of this taxonomy.


The mutability model defines the strategy for updating existing data, and it impacts the design of several other services within the system, namely caching, buffering, consistency, disk layout, and the read/write model. There are three broad buckets of mutability models (a minimal update sketch follows the list):

  • In-place: Existing data locations are modified directly. There are two variants:
      • Direct updates: Data is directly updated in its original location. Ensuring atomicity in this model requires blocking synchronization, as well as coordination with the buffering module.
      • Temporary out-of-place: The updates are aggregated in a new location (separate from the original data). This persistence is temporary, and the updates are eventually applied to the original data.
  • Out-of-place: Updates to the namespace object are persisted at a new resource location (i.e., a new logical or physical address). The metadata associated with the namespace object is updated to reflect the change.
  • Versioning: The update is persisted as a new version of the object.
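As an illustration of the difference, here is a minimal Python sketch contrasting an in-place update with an out-of-place update of a single object. The slot/metadata structures are hypothetical and only show where the write lands and which metadata changes.

```python
# Hypothetical storage: a list of "slots" standing in for physical locations,
# plus per-object metadata pointing at the slot currently holding the data.
slots = ["v1-of-object", None, None]
metadata = {"object-1": 0}              # object-1 currently lives in slot 0

def update_in_place(obj, new_value):
    # In-place: overwrite the existing location; the metadata is unchanged,
    # but atomicity now depends on synchronizing concurrent readers and writers.
    slots[metadata[obj]] = new_value

def update_out_of_place(obj, new_value):
    # Out-of-place: write to a fresh location, then flip the metadata pointer.
    # The old slot becomes garbage to be reclaimed later (see Garbage Collection).
    free_slot = slots.index(None)
    slots[free_slot] = new_value
    metadata[obj] = free_slot

update_out_of_place("object-1", "v2-of-object")
print(slots, metadata)                  # old copy still in slot 0; the pointer moved to slot 1
```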


1: Cluster Manager Service

The Cluster Manager is responsible for tracking group membership and failures. Given the nature of distributed systems, nodes can constantly leave and join the cluster. The Cluster Manager is additionally responsible for tracking the state of the replicas and taking the corresponding corrective actions. The design involves 4 building blocks (a minimal membership sketch follows the list):

  • Tracking membership and node failures
  • Completing IO operations when one or more replicas are offline (either due to a network partition or a node being down)
  • Update of the offline replica
  • Tracking the lag of the replicas
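The membership-tracking building block is commonly implemented with heartbeats and a timeout. The following single-process Python sketch (the timeout value and node names are made up) shows how a cluster manager might classify nodes as alive or suspected-failed; a real system would feed the "suspect" list into replica repair.

```python
import time

HEARTBEAT_TIMEOUT = 5.0                  # seconds; an assumed value

last_heartbeat = {}                      # node -> timestamp of its last heartbeat

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def membership_view(now=None):
    # Nodes that have not sent a heartbeat within the timeout are suspected failed;
    # the cluster manager would then trigger corrective actions (e.g., re-replication).
    now = time.time() if now is None else now
    return {node: ("alive" if now - ts < HEARTBEAT_TIMEOUT else "suspect")
            for node, ts in last_heartbeat.items()}

record_heartbeat("node-a")
record_heartbeat("node-b")
print(membership_view())                 # both alive right after their heartbeats
```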

2: Consensus Service

Several distributed operations require consensus: ordering of IOs and replica updates, resource management, cluster creation, and distributed locking are a few examples. The key design choices are (a minimal quorum sketch follows the list):

  • Consensus orchestration: The Consensus Service can be implemented using either a leader-based model or a quorum-based model.
  • Leader election: Decides how the leader is actually chosen:
      • Elected: Using one of the classic algorithms such as Paxos, 2PC, ABD, etc.
      • Pre-selected: This typically involves a heuristic to pick one of the nodes as the leader to orchestrate consensus.
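To sketch the two options in Python (a toy illustration, not a Paxos or Raft implementation): quorum-based orchestration accepts a decision once a strict majority agrees, while a pre-selected leader can be chosen with a simple deterministic heuristic such as the lowest node id.

```python
def quorum_accepts(acks, cluster_size):
    # Quorum-based: a decision (e.g., a replica update or a leader) is accepted only
    # with a strict majority, so any two quorums overlap in at least one node.
    return acks >= cluster_size // 2 + 1

def preselected_leader(alive_nodes):
    # Pre-selected leader: a deterministic heuristic (here, lowest node id)
    # instead of running a full election protocol.
    return min(alive_nodes)

print(quorum_accepts(3, 5))                                # True: 3 of 5 is a majority
print(quorum_accepts(2, 5))                                # False: no majority
print(preselected_leader(["node-c", "node-a", "node-b"]))  # 'node-a'
```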

3: Cluster HA Service

Cluster High Availability (HA) ensures that data remains accessible despite cluster-level failures. The service needs to be designed to handle three common types of cluster failures:

  • Metadata node failure: For master/multi-master architectures such as HDFS and HBase, where the namespace metadata is separate from the data.
  • Data node failure: Handling failures of the data nodes that are responsible for subsets of the namespace. This also covers masterless architectures such as Cassandra, where data and metadata coexist on the same nodes.
  • Partition handling: Multiple nodes within the cluster are impacted at once (either crashed or disconnected by a network failure).

The design options for implementing failover depend on the state of the replicas, i.e., Active-Active, Active-Passive (the passive replica serves data only during failures), and Active-Backup, where the replica is reconstructed only during a failure.
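A minimal Python sketch of the Active-Passive option (the class and node names are hypothetical): a standby metadata node is promoted only when the active one is reported failed.

```python
class MetadataFailover:
    def __init__(self, active, standby):
        self.active = active            # node currently serving metadata operations
        self.standby = standby          # kept up to date, but serves only after a failover

    def handle_failure(self, failed_node):
        # Active-Passive: promote the standby only if the failed node was the active one.
        if failed_node == self.active and self.standby is not None:
            self.active, self.standby = self.standby, None
        return self.active

svc = MetadataFailover(active="namenode-1", standby="namenode-2")
print(svc.handle_failure("namenode-1"))   # 'namenode-2' now serves the metadata
```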

4: Namespace Mgr Service

The Namespace Lookup Service translates a namespace entry to the location where its data is persisted within the cluster. The design of the service has a few key aspects (a minimal lookup sketch follows the list):

  • Cluster-wide metadata representation: Whether the metadata maps directly to the data location, or instead maps to the Object Manager for further translation.
  • Managing the freshness of cluster-wide metadata: Whether it is fully materialized or derived on the fly.
  • Object lookup: The architecture model for translating a namespace request.
  • Replica lookup: The architecture model for replica access.
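A minimal Python sketch of the lookup path, assuming a fully materialized cluster-wide map (the paths, node names, and object ids are made up): the namespace entry resolves to a node and an object id, and the Object Manager on that node does the final translation to resource units.

```python
# Fully materialized cluster-wide metadata: namespace path -> (node, object id).
namespace_map = {
    "/logs/2017/10/05": ("node-b", "obj-1734"),
    "/users/alice/profile": ("node-d", "obj-0042"),
}

def lookup(path):
    # Indirect mapping: the caller still asks the Object Manager on the returned node
    # to translate the object id into concrete resource units (blocks/files).
    return namespace_map[path]

print(lookup("/logs/2017/10/05"))        # ('node-b', 'obj-1734')
```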

5: Consistency Service

The goal of the Consistency Service is to ensure that the data model assumptions of the application programmer are satisfied. For instance, if the application expects reads to be atomic (i.e., to always return the most recently written data), that requirement needs to be serviced by the Consistency Service. The service is built using 4 building blocks:

  • Mutual Exclusion (ME) primitive: RW and WW exclusion.
  • Ordering primitives: RW and WW ordering
  • Atomicity primitives: The unit of all-or-nothing (0-or-1) update semantics
  • Transaction support primitives: Different forms of transaction semantics can be implemented using ME, atomicity, and ordering

The granularity of guarantees can be on a per-object basis or at a broader namespace granularity e.g., objects within the same container/directory.
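As a small illustration of the mutual-exclusion primitive at per-object granularity, here is a Python sketch using one lock per object. This is a single-process simplification; real systems typically rely on distributed lock managers or leases.

```python
import threading
from collections import defaultdict

# One lock per object: write-write and read-write exclusion at per-object granularity.
object_locks = defaultdict(threading.Lock)
store = {}

def atomic_write(obj, value):
    with object_locks[obj]:        # WW exclusion: only one writer per object at a time
        store[obj] = value         # the update appears all-or-nothing to readers

def atomic_read(obj):
    with object_locks[obj]:        # RW exclusion: a reader never sees a partial update
        return store.get(obj)

atomic_write("object-1", "v1")
print(atomic_read("object-1"))     # 'v1'
```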

6: Durability Service

Data durability ensures minimal data loss in the event of hardware failures, component corruption, or software bugs. The most popular approach to data durability is to create multiple replicas of the data. Replication imposes overhead w.r.t. space usage (e.g., 3x the capacity), but is cheaper w.r.t. partial-update overhead and the amount of data that must be read during recovery. In contrast, erasure coding across nodes has the inverse pros/cons compared to replication. With the adoption of all-flash environments, erasure coding has been getting a lot of attention in recent research.
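To put rough numbers on that space trade-off: N-way replication consumes N bytes of raw capacity per byte of user data, while a (k, m) erasure code consumes (k + m) / k. A quick Python sketch (the specific code parameters are just examples):

```python
def replication_overhead(copies):
    # Raw capacity consumed per byte of user data with N-way replication.
    return float(copies)

def erasure_coding_overhead(data_fragments, parity_fragments):
    # Raw capacity per byte of user data with a (k, m) code such as Reed-Solomon.
    return (data_fragments + parity_fragments) / data_fragments

print(replication_overhead(3))           # 3.0x capacity, tolerates 2 lost copies
print(erasure_coding_overhead(10, 4))    # 1.4x capacity, tolerates 4 lost fragments
```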

Building a Durability Service requires deciding the following design aspects (a replica-placement sketch follows the list):

  • Durability model: Defines how redundancy is implemented to handle hardware failures
  • Replica byte-level fidelity: Defines the degree to which the replicas are similar w.r.t. contents
  • Replica consistency: Different levels of consistency semantics
  • Propagating replica updates: How the updates are actually propagated
  • Selecting replica coordinator
  • Ordering replica updates
  • Replica selection during reads
  • Replica placement
  • Replica access
  • Geo-redundancy protocol
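For the replica-placement aspect, a common policy is to spread the copies across failure domains such as racks. Here is a minimal Python sketch of that idea with a made-up rack map; it is not meant to model the placement policy of any specific system.

```python
# Hypothetical mapping of nodes to their failure domain (rack).
node_rack = {"node-a": "rack-1", "node-b": "rack-1",
             "node-c": "rack-2", "node-d": "rack-3"}

def place_replicas(num_replicas):
    # Greedy rack-aware placement: prefer nodes in racks not used yet,
    # so a single rack failure cannot take out all replicas of an object.
    chosen, used_racks = [], set()
    for node, rack in node_rack.items():
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == num_replicas:
            break
    return chosen

print(place_replicas(3))    # ['node-a', 'node-c', 'node-d'], one replica per rack
```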

7: Object Manager Service

The Object Manager Service is responsible for managing the resources associated with the namespace objects. To persist data on durable media, the Object Manager abstracts the hardware into resource units. Designing the Object Manager involves the following 5 aspects (a minimal chunking sketch follows the list):

  • Resource unit: The smallest unit of data read/written from durable media. The unit can be a file, a block, or a memory address.
  • Namespace-to-resource-unit mapping: Whether the namespace object is mapped to a single resource unit or to multiple resource units
  • Metadata of the namespace object: Whether the metadata is persisted along with the data within the same resource unit
  • Garbage Collection: How space is reclaimed
  • Delete Model: How the deletion of the objects is handled
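A minimal Python sketch of the namespace-to-resource-unit mapping, assuming fixed-size resource units (the 4 KB unit size and the naming scheme are arbitrary): one namespace object is split across multiple units, and the Object Manager keeps the mapping so reads can reassemble it.

```python
RESOURCE_UNIT_SIZE = 4096        # bytes; an assumed fixed unit size

def split_into_resource_units(object_id, data):
    # Map one namespace object to multiple resource units and record the mapping.
    units = {}
    for offset in range(0, len(data), RESOURCE_UNIT_SIZE):
        unit_id = f"{object_id}-unit-{offset // RESOURCE_UNIT_SIZE}"
        units[unit_id] = data[offset:offset + RESOURCE_UNIT_SIZE]
    return units

units = split_into_resource_units("obj-0042", b"x" * 10000)
print(list(units))               # ['obj-0042-unit-0', 'obj-0042-unit-1', 'obj-0042-unit-2']
```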

8: Caching/Buffering Service

Typically Write Buffering and Distributed Caching are implemented as separate services.

Buffering was originally built to bridge the gap between CPU and disk speeds. The idea is to group updates together before writing them to durable media, in order to avoid milliseconds of delay for each individual operation; this is similar to group commit in databases. Caching focuses on keeping commonly accessed data in memory to reduce disk traffic and improve latency. It is most useful when the workload exhibits access locality, or a predictable pattern that enables prefetching of data. The key design aspects are (a minimal buffering sketch follows the list):

  • Buffer Flushing Model
  • Buffering RW Coherence
  • Placement of caching service
  • Caching Strategy
  • Cache Invalidation Protocol
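A minimal Python sketch of write buffering with a group flush; the flush threshold is arbitrary, and the durability of the buffer itself (e.g., backing it with NVRAM or a log) is ignored for brevity.

```python
class WriteBuffer:
    def __init__(self, flush_threshold=4):
        self.pending = []                        # updates not yet on durable media
        self.flush_threshold = flush_threshold

    def write(self, update):
        # Buffer the update; the caller is acknowledged before the data reaches disk,
        # which is exactly the durability trade-off that buffering introduces.
        self.pending.append(update)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Group the buffered updates into a single write to durable media,
        # similar in spirit to group commit in databases.
        batch, self.pending = self.pending, []
        print(f"flushing {len(batch)} updates in one IO")

buf = WriteBuffer()
for i in range(10):
    buf.write(f"update-{i}")                     # flushes after every 4 buffered updates
```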

9: Resource Management Service

The Resource Management Service decides how the data is actually laid out on the physical hardware. There are three design aspects involved (a minimal striping sketch follows the list):

  • On-disk layout: Different layouts are optimized for different access patterns and data types.
  • Striping the data across nodes and disks: There are multiple disks within the node/server, as well as multiple servers within the cluster.
  • Accounting for cluster heterogeneity: This is an important aspect when hardware is acquired over multiple generations, with differing properties.
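For the striping aspect, here is a minimal round-robin Python sketch across disks (the disk names and stripe size are made up); a heterogeneity-aware scheme would additionally weight the choice by each disk's capacity and performance.

```python
disks = ["node-1/disk-0", "node-1/disk-1", "node-2/disk-0", "node-2/disk-1"]
STRIPE_SIZE = 1 << 20            # 1 MiB stripe units; an assumed value

def stripe(data_length):
    # Round-robin the stripe units across disks so large reads and writes
    # spread their bandwidth over many spindles/SSDs in parallel.
    placement = []
    for offset in range(0, data_length, STRIPE_SIZE):
        disk = disks[(offset // STRIPE_SIZE) % len(disks)]
        placement.append((disk, offset, min(offset + STRIPE_SIZE, data_length)))
    return placement

for entry in stripe(3 * STRIPE_SIZE + 100):
    print(entry)                 # which disk holds each stripe unit of the object
```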

Examples: Applying the Blueprint to Real-world Systems
