Under-the-hood of a scale-out Storage Engine

Sandeep Uttamchandani · Published in Wrong AI · Oct 5, 2017 · 7 min read

Imagine buying a car. You won't just look at the color and the interiors to make the decision (assuming, of course, that this is not an impulse buy). You will typically shortlist a few different car models and then compare their price/performance, i.e., engine, transmission, braking, etc. How do you do the same for Big Data systems?

After years of trying to answer this question myself, I developed a blueprint for distributed storage systems. Typically, you start by understanding the application requirements in terms of the data schema, operations, and QoS, as well as the deployment infrastructure (disks versus SSDs). This leads to a shortlist of a few candidates for further analysis. At this point, looking under the hood helps you get past the marketing FUD and understand what to expect day-to-day as well as during failures such as disk/node failures, network partitions, etc.

Any distributed storage system (Cassandra, MongoDB, HDFS, Ceph, Redis, etc.) can be understood by looking at 3 key design philosophies plus the strategies used in 9 key services. The rest of this post describes popular algorithms for each of these building blocks. We conclude by using the taxonomy to compare HDFS, Cassandra, MongoDB, and Redis (many more examples to be added in the future).

Sharding refers to the ability to divide the responsibility of serving data and metadata operations among the nodes within the cluster. A good sharding algorithm distributes resource usage evenly among the nodes, both in terms of incoming requests and storage space utilization. Sharding policies fall into three broad categories (a minimal lookup sketch follows the list):

  • Directory-based look-up: The namespace is distributed across nodes of the cluster, with a centralized directory to track the mapping of the namespace to nodes.
  • Range-based lookup: This is a common pattern in Masterless systems. The key-space is pre-divided into ranges and assigned to individual nodes. The mapping is computed based on the range.
  • Hybrid (Directory + Range-based): Uses a combination — writes are distributed using range-based sharding combined with deterministic policy-based functions; reads follow the directory lookup.
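To make the lookup styles concrete, here is a minimal Python sketch, with made-up node names and key-space boundaries, contrasting a directory-based lookup with a range-based lookup over a pre-divided hashed key-space. It is illustrative only, not how any particular system implements sharding.

```python
import bisect

# Directory-based lookup: an explicit (often centralized) map from key to node.
directory = {"user:1001": "node-a", "user:1002": "node-c"}   # hypothetical entries

def directory_lookup(key):
    return directory[key]                    # requires consulting/maintaining the directory

# Range-based lookup: the key-space is pre-divided into ranges owned by nodes,
# so the owner can be computed without a central directory (common in masterless systems).
range_starts = [0, 64, 128, 192]             # boundaries of a 0..255 hashed key-space
range_owners = ["node-a", "node-b", "node-c", "node-d"]

def range_lookup(key):
    h = hash(key) % 256                      # hash the key into the key-space
    idx = bisect.bisect_right(range_starts, h) - 1
    return range_owners[idx]                 # owner is computed, not looked up

print(directory_lookup("user:1001"))         # 'node-a'
print(range_lookup("user:1001"))             # depends on the hash, e.g. 'node-b'
```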

The clustering model is one of the most important and distinguishing characteristics of any Big Data system. It defines the approach by which the individual nodes within a scale-out cluster coordinate to manage resources, handle events, and serve storage data and metadata operations for the clients. Broadly speaking, there are three prevalent clustering models:

  • Master Taxonomy: In this taxonomy, there is a single node (called the Master) within the cluster with special responsibilities for managing the cluster state and coordinating activities/tasks (described later). HDFS is an example of this taxonomy.
  • Masterless Taxonomy: All the nodes within the cluster share equal responsibilities for managing the cluster state and activities. This taxonomy became popular with Amazon's Dynamo paper, and is currently used in systems such as OpenStack Swift and Cassandra.
  • Multi-Master Taxonomy: This is the evolution of the single-Master taxonomy: the global Master acts as a lightweight coordinator and divides the cluster management tasks among the nodes within the cluster. This is in contrast to the Masterless taxonomy, where no single global state is maintained for the entire cluster. Ceph, GPFS, and Bigtable are examples of this taxonomy.


The mutability model defines the strategy for updating existing data, and it impacts the design of several other services within the system, namely caching, buffering, consistency, disk layout, and the read/write model. There are three broad buckets of mutability models (a minimal update sketch follows the list):

  • In-place: Existing data locations are modified directly. There are two variants:
      • Direct updates: Data is directly updated in its original location. Ensuring atomicity in this model requires blocking synchronization, as well as coordination with the buffering module.
      • Temporary out-of-place: The updates are aggregated in a new location (separate from the original data). This persistence is temporary, and the updates are eventually applied to the original data.
  • Out-of-place: Updates to the namespace object are persisted at a new resource location (i.e., a new logical or physical address). The metadata associated with the namespace object is updated to reflect the change.
  • Versioning: The update is persisted as a new version of the object.
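As an illustration of the difference, here is a minimal Python sketch contrasting an in-place update with an out-of-place update of a single object. The slot/metadata structures are hypothetical and only show where the write lands and which metadata changes.

```python
# Hypothetical storage: a list of "slots" standing in for physical locations,
# plus per-object metadata pointing at the slot currently holding the data.
slots = ["v1-of-object", None, None]
metadata = {"object-1": 0}              # object-1 currently lives in slot 0

def update_in_place(obj, new_value):
    # In-place: overwrite the existing location; the metadata is unchanged,
    # but atomicity now depends on synchronizing concurrent readers and writers.
    slots[metadata[obj]] = new_value

def update_out_of_place(obj, new_value):
    # Out-of-place: write to a fresh location, then flip the metadata pointer.
    # The old slot becomes garbage to be reclaimed later (see Garbage Collection).
    free_slot = slots.index(None)
    slots[free_slot] = new_value
    metadata[obj] = free_slot

update_out_of_place("object-1", "v2-of-object")
print(slots, metadata)                  # old copy still in slot 0; the pointer moved to slot 1
```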


1: Cluster Manager Service

The Cluster Manager is responsible for tracking group membership and failures. Given the nature of distributed systems, nodes can constantly leave and join the cluster. The Cluster Manager is additionally responsible for tracking the state of the replicas and taking the corresponding corrective actions. The design involves 4 building blocks (a minimal membership sketch follows the list):

  • Tracking membership and node failures
  • Completing IO operations when one or more replicas are offline (either due to a network partition or a node being down)
  • Update of the offline replica
  • Tracking the lag of the replicas
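The membership-tracking building block is commonly implemented with heartbeats and a timeout. The following single-process Python sketch (the timeout value and node names are made up) shows how a cluster manager might classify nodes as alive or suspected-failed; a real system would feed the "suspect" list into replica repair.

```python
import time

HEARTBEAT_TIMEOUT = 5.0                  # seconds; an assumed value

last_heartbeat = {}                      # node -> timestamp of its last heartbeat

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def membership_view(now=None):
    # Nodes that have not sent a heartbeat within the timeout are suspected failed;
    # the cluster manager would then trigger corrective actions (e.g., re-replication).
    now = time.time() if now is None else now
    return {node: ("alive" if now - ts < HEARTBEAT_TIMEOUT else "suspect")
            for node, ts in last_heartbeat.items()}

record_heartbeat("node-a")
record_heartbeat("node-b")
print(membership_view())                 # both alive right after their heartbeats
```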

2: Consensus Service

Several distributed operations require consensus: ordering of IOs and replica updates, resource management, cluster creation, and distributed locking are a few examples. The key design choices are (a minimal quorum sketch follows the list):

  • Consensus orchestration: The Consensus Service can be implemented using either a leader-based model or a quorum-based model.
  • Leader election: Decides how the leader is actually chosen:
      • Elected: Using one of the classic algorithms such as Paxos, 2PC, ABD, etc.
      • Pre-selected: This typically involves a heuristic to pick one of the nodes as the leader to orchestrate consensus.
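To sketch the two options in Python (a toy illustration, not a Paxos or Raft implementation): quorum-based orchestration accepts a decision once a strict majority agrees, while a pre-selected leader can be chosen with a simple deterministic heuristic such as the lowest node id.

```python
def quorum_accepts(acks, cluster_size):
    # Quorum-based: a decision (e.g., a replica update or a leader) is accepted only
    # with a strict majority, so any two quorums overlap in at least one node.
    return acks >= cluster_size // 2 + 1

def preselected_leader(alive_nodes):
    # Pre-selected leader: a deterministic heuristic (here, lowest node id)
    # instead of running a full election protocol.
    return min(alive_nodes)

print(quorum_accepts(3, 5))                                # True: 3 of 5 is a majority
print(quorum_accepts(2, 5))                                # False: no majority
print(preselected_leader(["node-c", "node-a", "node-b"]))  # 'node-a'
```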

3: Cluster HA Service

Cluster High Availability (HA) ensures that data remains accessible despite cluster-level failures. The service needs to be designed to handle three common types of cluster failures:

  • Metadata node failure: For master/multi-master architectures such as HDFS and HBase, where the namespace metadata is separate from the data.
  • Data node failure: Handling failures of the data nodes that are responsible for subsets of the namespace. This also covers masterless architectures such as Cassandra, where data and metadata coexist on the same nodes.
  • Partition handling: Multiple nodes within the cluster are impacted at once (either crashed or disconnected by a network failure).

The design options for implementing failover depend on the state of the replicas, i.e., Active-Active, Active-Passive (the passive replica serves data only during failures), and Active-Backup, where the replica is reconstructed only during a failure.
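A minimal Python sketch of the Active-Passive option (the class and node names are hypothetical): a standby metadata node is promoted only when the active one is reported failed.

```python
class MetadataFailover:
    def __init__(self, active, standby):
        self.active = active            # node currently serving metadata operations
        self.standby = standby          # kept up to date, but serves only after a failover

    def handle_failure(self, failed_node):
        # Active-Passive: promote the standby only if the failed node was the active one.
        if failed_node == self.active and self.standby is not None:
            self.active, self.standby = self.standby, None
        return self.active

svc = MetadataFailover(active="namenode-1", standby="namenode-2")
print(svc.handle_failure("namenode-1"))   # 'namenode-2' now serves the metadata
```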

4: Namespace Mgr Service

The Namespace Lookup Service translates a namespace entry to the location where its data is persisted within the cluster. The design of the service has a few key aspects (a minimal lookup sketch follows the list):

  • Cluster-wide metadata representation: Whether the metadata maps directly to the data location, or instead maps to the Object Manager for further translation.
  • Managing the freshness of cluster-wide metadata: Whether it is fully materialized or derived on the fly.
  • Object lookup: The architecture model for translating a namespace request.
  • Replica lookup: The architecture model for replica access.
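A minimal Python sketch of the lookup path, assuming a fully materialized cluster-wide map (the paths, node names, and object ids are made up): the namespace entry resolves to a node and an object id, and the Object Manager on that node does the final translation to resource units.

```python
# Fully materialized cluster-wide metadata: namespace path -> (node, object id).
namespace_map = {
    "/logs/2017/10/05": ("node-b", "obj-1734"),
    "/users/alice/profile": ("node-d", "obj-0042"),
}

def lookup(path):
    # Indirect mapping: the caller still asks the Object Manager on the returned node
    # to translate the object id into concrete resource units (blocks/files).
    return namespace_map[path]

print(lookup("/logs/2017/10/05"))        # ('node-b', 'obj-1734')
```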

5: Consistency Service

The goal of the Consistency Service is to ensure that the data model assumptions of the application programmer are satisfied. For instance, if the application expects reads to be atomic (i.e., to always return the most recently written data), that requirement needs to be serviced by the Consistency Service. The service is built using 4 building blocks:

  • Mutual Exclusion (ME) primitive: RW and WW exclusion.
  • Ordering primitives: RW and WW ordering
  • Atomicity primitives: The unit of all-or-nothing (0-or-1) update semantics
  • Transaction support primitives: Different forms of transaction semantics can be implemented using ME, atomicity, and ordering

The granularity of guarantees can be on a per-object basis or at a broader namespace granularity e.g., objects within the same container/directory.
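As a small illustration of the mutual-exclusion primitive at per-object granularity, here is a Python sketch using one lock per object. This is a single-process simplification; real systems typically rely on distributed lock managers or leases.

```python
import threading
from collections import defaultdict

# One lock per object: write-write and read-write exclusion at per-object granularity.
object_locks = defaultdict(threading.Lock)
store = {}

def atomic_write(obj, value):
    with object_locks[obj]:        # WW exclusion: only one writer per object at a time
        store[obj] = value         # the update appears all-or-nothing to readers

def atomic_read(obj):
    with object_locks[obj]:        # RW exclusion: a reader never sees a partial update
        return store.get(obj)

atomic_write("object-1", "v1")
print(atomic_read("object-1"))     # 'v1'
```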

6: Durability Service

Data durability ensures minimal data loss in the event of hardware failures, component corruption, or software bugs. The most popular approach to data durability is to create multiple replicas of the data. Replication imposes overhead w.r.t. space usage (e.g., 3x the capacity), but is cheaper w.r.t. partial-update overhead and the amount of data that must be read during recovery. In contrast, erasure coding across nodes has the inverse pros/cons compared to replication. With the adoption of all-flash environments, erasure coding has been getting a lot of attention in recent research.
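To put rough numbers on that space trade-off: N-way replication consumes N bytes of raw capacity per byte of user data, while a (k, m) erasure code consumes (k + m) / k. A quick Python sketch (the specific code parameters are just examples):

```python
def replication_overhead(copies):
    # Raw capacity consumed per byte of user data with N-way replication.
    return float(copies)

def erasure_coding_overhead(data_fragments, parity_fragments):
    # Raw capacity per byte of user data with a (k, m) code such as Reed-Solomon.
    return (data_fragments + parity_fragments) / data_fragments

print(replication_overhead(3))           # 3.0x capacity, tolerates 2 lost copies
print(erasure_coding_overhead(10, 4))    # 1.4x capacity, tolerates 4 lost fragments
```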

Building a Durability Service requires deciding the following design aspects (a replica-placement sketch follows the list):

  • Durability model: Defines how redundancy is implemented to handle hardware failures
  • Replica byte-level fidelity: Defines the degree to which the replicas are similar w.r.t. contents
  • Replica consistency: Different levels of consistency semantics
  • Propagating replica updates: How the updates are actually propagated
  • Selecting replica coordinator
  • Ordering replica updates
  • Replica selection during reads
  • Replica placement
  • Replica access
  • Geo-redundancy protocol
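For the replica-placement aspect, a common policy is to spread the copies across failure domains such as racks. Here is a minimal Python sketch of that idea with a made-up rack map; it is not meant to model the placement policy of any specific system.

```python
# Hypothetical mapping of nodes to their failure domain (rack).
node_rack = {"node-a": "rack-1", "node-b": "rack-1",
             "node-c": "rack-2", "node-d": "rack-3"}

def place_replicas(num_replicas):
    # Greedy rack-aware placement: prefer nodes in racks not used yet,
    # so a single rack failure cannot take out all replicas of an object.
    chosen, used_racks = [], set()
    for node, rack in node_rack.items():
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == num_replicas:
            break
    return chosen

print(place_replicas(3))    # ['node-a', 'node-c', 'node-d'], one replica per rack
```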

7: Object Manager Service

The Object Manager Service is responsible for managing the resources associated with the namespace objects. To persist data on durable media, the Object Manager abstracts the hardware into resource units. Designing the Object Manager involves the following 5 aspects (a minimal chunking sketch follows the list):

  • Resource unit: The smallest unit of data read/written from durable media. The unit can be a file, a block, or a memory address.
  • Namespace-to-resource-unit mapping: Whether the namespace object is mapped to a single resource unit or to multiple resource units
  • Metadata of the namespace object: Whether the metadata is persisted along with the data within the same resource unit
  • Garbage Collection: How space is reclaimed
  • Delete Model: How the deletion of the objects is handled
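A minimal Python sketch of the namespace-to-resource-unit mapping, assuming fixed-size resource units (the 4 KB unit size and the naming scheme are arbitrary): one namespace object is split across multiple units, and the Object Manager keeps the mapping so reads can reassemble it.

```python
RESOURCE_UNIT_SIZE = 4096        # bytes; an assumed fixed unit size

def split_into_resource_units(object_id, data):
    # Map one namespace object to multiple resource units and record the mapping.
    units = {}
    for offset in range(0, len(data), RESOURCE_UNIT_SIZE):
        unit_id = f"{object_id}-unit-{offset // RESOURCE_UNIT_SIZE}"
        units[unit_id] = data[offset:offset + RESOURCE_UNIT_SIZE]
    return units

units = split_into_resource_units("obj-0042", b"x" * 10000)
print(list(units))               # ['obj-0042-unit-0', 'obj-0042-unit-1', 'obj-0042-unit-2']
```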

8: Caching/Buffering Service

Typically Write Buffering and Distributed Caching are implemented as separate services.

Buffering was originally built to bridge the gap between CPU and disk speeds. The idea is to group updates together before writing them to durable media, in order to avoid milliseconds of delay for each individual operation; this is similar to group commit in databases. Caching focuses on keeping commonly accessed data in memory to reduce disk traffic and improve latency. It is most useful when the workload exhibits access locality, or a predictable pattern that enables prefetching of data. The key design aspects are (a minimal buffering sketch follows the list):

  • Buffer Flushing Model
  • Buffering RW Coherence
  • Placement of caching service
  • Caching Strategy
  • Cache Invalidation Protocol
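A minimal Python sketch of write buffering with a group flush; the flush threshold is arbitrary, and the durability of the buffer itself (e.g., backing it with NVRAM or a log) is ignored for brevity.

```python
class WriteBuffer:
    def __init__(self, flush_threshold=4):
        self.pending = []                        # updates not yet on durable media
        self.flush_threshold = flush_threshold

    def write(self, update):
        # Buffer the update; the caller is acknowledged before the data reaches disk,
        # which is exactly the durability trade-off that buffering introduces.
        self.pending.append(update)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Group the buffered updates into a single write to durable media,
        # similar in spirit to group commit in databases.
        batch, self.pending = self.pending, []
        print(f"flushing {len(batch)} updates in one IO")

buf = WriteBuffer()
for i in range(10):
    buf.write(f"update-{i}")                     # flushes after every 4 buffered updates
```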

9: Resource Management Service

The Resource Management Service decides how the data is actually laid out on the physical hardware. There are three design aspects involved (a minimal striping sketch follows the list):

  • On-disk layout: Different layouts are optimized for different access patterns and data types.
  • Striping the data across nodes and disks: There are multiple disks within the node/server, as well as multiple servers within the cluster.
  • Accounting for cluster heterogeneity: This is an important aspect when hardware is acquired over multiple generations, with differing properties.
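For the striping aspect, here is a minimal round-robin Python sketch across disks (the disk names and stripe size are made up); a heterogeneity-aware scheme would additionally weight the choice by each disk's capacity and performance.

```python
disks = ["node-1/disk-0", "node-1/disk-1", "node-2/disk-0", "node-2/disk-1"]
STRIPE_SIZE = 1 << 20            # 1 MiB stripe units; an assumed value

def stripe(data_length):
    # Round-robin the stripe units across disks so large reads and writes
    # spread their bandwidth over many spindles/SSDs in parallel.
    placement = []
    for offset in range(0, data_length, STRIPE_SIZE):
        disk = disks[(offset // STRIPE_SIZE) % len(disks)]
        placement.append((disk, offset, min(offset + STRIPE_SIZE, data_length)))
    return placement

for entry in stripe(3 * STRIPE_SIZE + 100):
    print(entry)                 # which disk holds each stripe unit of the object
```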

Examples: Applying the Blueprint to Real-world Systems
