Distributed SQL Database Internals (3): Distributed Storage Engine

Li Shen
Jan 5, 2024


Storage Engine

The storage engine is responsible for managing how the data is stored, organized, and accessed on a disk or other permanent storage media. It plays a crucial role in the overall architecture and performance of the database, influencing how data is written, read, and maintained. For distributed databases, the storage engine should also be distributed. Distributed storage engines manage data across multiple nodes in a network, often spanning different geographical locations.

In this chapter, we will focus on the following key aspects:

  • Data Storage and Retrieval: The primary function of a storage engine is to store data on a disk or other persistent storage in a manner that makes it efficient to retrieve when needed. This involves organizing data into files, indexes, and other structures.
  • Replication and Fault Tolerance: To ensure high availability and data durability, distributed storage engines replicate data across multiple nodes. This means that even if one node fails, the data is still accessible from other nodes, ensuring the database remains operational.
  • Data Distribution: The engine handles the partitioning and distribution of data across various nodes in the network. This can involve sharding (splitting data into smaller, manageable pieces) and determining the optimal placement of these shards to balance load and optimize performance.
  • Distributed Transaction Processing: Storage engines are often responsible for handling transactions and for ensuring data integrity and consistency. They employ mechanisms such as locking, logging, and journaling to maintain transactional integrity and to support atomicity, consistency, isolation, and durability (the ACID properties).

In our discussion about TiKV, it’s important to set aside conventional SQL concepts and focus instead on the implementation of TiKV. This system is essentially a large, distributed, and ordered map, designed with an emphasis on both high performance and reliability. Understanding TiKV requires a shift in perspective, concentrating on its unique architecture and the innovative approaches it employs to manage data efficiently in a distributed environment.

Key-Value

At the core of any data storage system lies the choice of data model, which determines the format in which data is maintained. TiKV adopts the Key-Value model, offering an efficient and orderly method for data traversal; this model is also where the name TiKV comes from.

A Key or Value in TiKV is simply a sequence of bytes; TiKV itself does not interpret the content. What that content means is defined by higher-level subsystems, which will be discussed in the following blog post.

Conceptually, envision TiKV as a giant map, where both Key and Value are maintained as their original byte arrays. Within this map, the keys are organized in a specific order, based on the raw binary bits of the byte array. The ordering of the keys enables efficient data location and access. For instance, one can seek a particular key’s position and then proceed to the next key-value pairs, all of which have greater binary values than the initially sought key.
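As a minimal sketch of this ordered-map behavior, the snippet below uses Rust's standard BTreeMap as a stand-in for TiKV's actual storage structures; the keys and values are made up for illustration.

use std::collections::BTreeMap;

fn main() {
    // Keys and values are raw byte arrays; the map keeps keys in byte order.
    let mut kv: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    kv.insert(b"user0001".to_vec(), b"alice".to_vec());
    kv.insert(b"user0002".to_vec(), b"bob".to_vec());
    kv.insert(b"user0105".to_vec(), b"carol".to_vec());

    // Seek: jump to the first key >= "user0002", then scan forward in key order.
    for (key, value) in kv.range(b"user0002".to_vec()..) {
        println!("{} -> {}",
                 String::from_utf8_lossy(key),
                 String::from_utf8_lossy(value));
    }
}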

Data Storage and Retrieval

To efficiently store Key-Value data on persistent storage mediums and rapidly access the stored data, a sophisticated data storage structure is required. TiKV opts for the LSM tree (Log-Structured Merge-tree), with one of the best open-source implementations being RocksDB. Here is a good article that provides detailed insights into RocksDB.

The LSM tree is renowned for its high-speed write capabilities, making it an ideal choice for systems like TiKV that require fast data ingestion and efficient OLTP operations. Additionally, RocksDB functions as an embedded storage engine, meaning it seamlessly integrates within TiKV’s architecture. This integration allows TiKV to leverage RocksDB’s optimized disk interactions, contributing to TiKV’s overall performance and durability. The choice of RocksDB thus aligns with TiKV’s goals of providing a high-performance, distributed key-value store capable of handling large-scale data with speed and efficiency.
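To make the LSM idea concrete, here is a deliberately simplified, in-memory sketch; it is not RocksDB's actual design, which adds a persistent write-ahead log, leveled compaction, bloom filters, column families, and much more. Writes land in a sorted memtable and are acknowledged quickly; the memtable is periodically flushed as an immutable sorted table, and reads consult the memtable first, then flushed tables from newest to oldest.

use std::collections::BTreeMap;

// A toy LSM-style store: writes go into an in-memory, sorted memtable (a real
// engine also appends each write to a write-ahead log for crash recovery).
// When the memtable grows past a threshold it is flushed as an immutable,
// sorted table; reads check the memtable first, then tables newest to oldest.
struct ToyLsm {
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
    sstables: Vec<BTreeMap<Vec<u8>, Vec<u8>>>, // flushed tables, newest last
    flush_threshold: usize,
}

impl ToyLsm {
    fn new(flush_threshold: usize) -> Self {
        ToyLsm { memtable: BTreeMap::new(), sstables: Vec::new(), flush_threshold }
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        // Writes only touch the in-memory structure, which is why LSM ingestion is fast.
        self.memtable.insert(key.to_vec(), value.to_vec());
        if self.memtable.len() >= self.flush_threshold {
            // Flush: the sorted memtable becomes an immutable table (kept in memory here).
            self.sstables.push(std::mem::take(&mut self.memtable));
        }
    }

    fn get(&self, key: &[u8]) -> Option<&Vec<u8>> {
        self.memtable
            .get(key)
            .or_else(|| self.sstables.iter().rev().find_map(|t| t.get(key)))
    }
}

fn main() {
    let mut db = ToyLsm::new(2);
    db.put(b"a", b"1");
    db.put(b"b", b"2"); // second write triggers a flush
    db.put(b"a", b"3"); // newer value shadows the flushed one on reads
    assert_eq!(db.get(b"a"), Some(&b"3".to_vec()));
    println!("toy LSM read-your-writes check passed");
}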

Now we have a simple wrapper on top of RocksDB that provides fast key-value operations on a single node. To build a distributed storage engine, we need to expand from a single node to multiple nodes.
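The surface of such a wrapper can be imagined as a small trait like the one below. This is an illustrative sketch, not TiKV's actual engine API; a RocksDB-backed implementation would map these calls onto RocksDB's get/put/delete and iterator operations.

pub trait KvEngine {
    /// Read the value stored under `key`, if any.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    /// Write or overwrite `key` with `value`.
    fn put(&mut self, key: &[u8], value: &[u8]);
    /// Remove `key` if it exists.
    fn delete(&mut self, key: &[u8]);
    /// Position at the first key >= `from` and return up to `limit` pairs in key order.
    fn scan(&self, from: &[u8], limit: usize) -> Vec<(Vec<u8>, Vec<u8>)>;
}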

Data Redundancy and Replication

In distributed systems like TiKV, data redundancy and replication are fundamental for ensuring data durability and availability. Redundancy means storing copies of data across different nodes or locations, protecting against data loss due to hardware failures, network issues, or other disruptions. Replication involves synchronizing these data copies across the nodes to maintain a consistent state, ensuring that even if one node fails, the system can continue to operate seamlessly using the data from the remaining nodes.

Raft is chosen for its simplicity and effectiveness in managing distributed consensus. This consensus algorithm ensures that all nodes in a TiKV cluster agree on the data state, which is crucial for maintaining consistency across replicas. Raft reduces the complexity of distributed consensus, making it easier to understand and implement. It is particularly known for its strong leadership and easy-to-follow log replication model, which are essential for reliable and consistent data replication. Those who are interested in Raft can refer to this paper for more details. I want to point out that the Raft paper only presents a basic solution, and performance would be poor if the paper were followed strictly. We have made numerous optimizations in our Raft implementation; for more details, please refer to this blog.

Raft works by electing a leader among the nodes in the cluster. The leader is responsible for managing log replication. All changes to the data (writes and updates) are sent to the leader, which appends them to its own log and replicates the entries to the follower nodes. Once a majority of nodes have written these log entries, the changes are committed across the cluster.
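As a small, hedged illustration of the "committed once a majority has written it" rule, the function below derives a commit index from the highest log index each peer is known to have written. The function and variable names are invented for this sketch; a real Raft implementation (such as the raft-rs library used by TiKV) tracks far more state.

// Given the highest log index known to be written on each peer (including the
// leader itself), return the largest index that a majority of peers have
// written: that index and everything before it can be committed.
// Assumes at least one peer is present.
fn majority_commit_index(mut match_index: Vec<u64>) -> u64 {
    match_index.sort_unstable();
    // After sorting ascending, the element at position (n - 1) / 2 is present
    // on at least a strict majority of the n peers.
    match_index[(match_index.len() - 1) / 2]
}

fn main() {
    // Leader has written up to index 7; followers have acknowledged 7, 5, 4, 3.
    let acked = vec![7, 7, 5, 4, 3];
    // A majority (3 of 5 peers) has reached index 5, so entries up to 5 commit.
    assert_eq!(majority_commit_index(acked), 5);
    println!("commit index computed");
}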

TiKV leverages Raft to achieve high availability by ensuring that data is replicated across multiple nodes in a consistent and coordinated manner. This replication guarantees that even in the event of node failures, the data remains accessible and the system operational. The leader election mechanism of Raft ensures that there’s always a node managing the replication process, which is vital for the continuous availability of the system. By using Raft, TiKV maintains a high level of data integrity and availability, which is crucial for distributed database systems where downtime or data loss can have significant implications.

In summary, with single-node RocksDB we can store data on disk rapidly; with Raft we can replicate data to multiple machines to guard against machine failure. Data is written through the Raft interface rather than directly to RocksDB. Thanks to this Raft layer, we now have a distributed Key-Value store and no longer need to worry about machine failure.

Data Distribution

In this section, I aim to introduce a crucial concept known as ‘Region,’ which forms the backbone of understanding several mechanisms in TiKV. To start with, let’s set aside the concept of Raft for a moment and consider a scenario where all data has only a single replica.

TiKV is conceptualized as a vast, ordered Key-Value Map. To achieve horizontal scalability in storage, it’s necessary to distribute data across multiple machines. Generally, there are two common methods for distributing data in a Key-Value system. The first method involves creating a hash of the data and allocating it to a storage node based on this hash value. The second method, which TiKV employs, is based on range. This approach involves storing a sequential range of keys in a single storage node. TiKV divides the entire Key-Value space into numerous segments, with each segment containing a series of adjacent keys. These segments are referred to as ‘Regions.’ Each Region has a size limit for data storage (defaulting to 96MB, though this is configurable). Furthermore, each Region is defined by a left-closed-right-open interval, stretching from StartKey to EndKey.

The default size of each Region is a result of careful consideration. It strikes a balance, being neither too small — which would result in an excessive number of regions and an overload of metadata management — nor too large, as large regions can lead to slow data movement and delayed recovery. This size works effectively for most scenarios, ensuring efficient data distribution and management.
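Given the left-closed, right-open [StartKey, EndKey) definition above, routing a key to its Region is conceptually a lookup over sorted, non-overlapping ranges. The sketch below is illustrative only: the Region struct and helper function are invented for the example rather than TiKV's or PD's actual types, and an empty EndKey is treated here as "unbounded".

// Illustrative description of a Region: a left-closed, right-open key range.
// An empty `end_key` is treated as unbounded, covering everything to the end.
struct Region {
    start_key: Vec<u8>,
    end_key: Vec<u8>,
}

// Find the Region whose [start_key, end_key) interval contains `key`,
// assuming `regions` is sorted by start_key and the ranges do not overlap.
fn find_region<'a>(regions: &'a [Region], key: &[u8]) -> Option<&'a Region> {
    regions.iter().find(|r| {
        key >= r.start_key.as_slice()
            && (r.end_key.is_empty() || key < r.end_key.as_slice())
    })
}

fn main() {
    let regions = vec![
        Region { start_key: b"".to_vec(),  end_key: b"g".to_vec() },
        Region { start_key: b"g".to_vec(), end_key: b"p".to_vec() },
        Region { start_key: b"p".to_vec(), end_key: b"".to_vec() },
    ];
    let hit = find_region(&regions, b"hello").expect("key must fall in some region");
    println!("key 'hello' -> region [{:?}, {:?})",
             String::from_utf8_lossy(&hit.start_key),
             String::from_utf8_lossy(&hit.end_key));
}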

Once data is segmented into Regions, two crucial tasks are undertaken:

  • Distribution of Data Across the Cluster: Using Regions as the basic unit, data is distributed across all nodes in the cluster. It is essential to ensure that the number of Regions on each node is approximately equal. This approach not only achieves horizontal scalability of storage capacity (especially when new nodes are added and the system automatically reallocates Regions across nodes) but also ensures load balancing. This prevents scenarios where one node is overloaded with data while others have significantly less. To facilitate data accessibility for upper-level clients, another system component tracks the distribution of Regions across nodes. This setup enables clients to query the exact Region for a specific Key and identify the node where that Region is located.
  • Raft Replication and Membership Management within Each Region: Implementing Raft replication in each Region is vital for ensuring data consistency and fault tolerance. This involves managing the membership of nodes in the cluster, ensuring that data is replicated and synchronized across nodes within a Region.

Both tasks are fundamental to the efficient functioning of TiKV and will be discussed in more detail later. The first task involves careful division and allocation of data to maintain system balance and efficiency. The second task ensures data integrity and availability through replication and membership management within the Raft framework. These components work together to provide a robust, scalable, and reliable storage solution.

Now, let’s delve into the second task. In TiKV, data replication is managed within Regions. This means that each Region’s data has multiple copies, known as “Replicas.” To ensure consistency among these Replicas, TiKV employs the Raft consensus algorithm. These Replicas, belonging to a single Region, are stored across different nodes, forming what is known as a Raft Group. Within this group, one Replica is designated as the Leader, while the others act as Followers.

All read and write operations are initially processed by the Leader. Subsequently, the Leader replicates these operations to the Followers. This mechanism ensures that all Replicas maintain a consistent state, as the Leader coordinates the data replication process. By utilizing this structure, TiKV effectively manages data consistency and reliability across its distributed network.

The distribution and replication of data across Regions in TiKV create a distributed Key-Value system with inherent disaster recovery capabilities. This framework significantly mitigates concerns regarding storage capacity limits and data loss due to disk failures. While this is a substantial achievement, it’s not the endpoint. To enhance the system further, additional functionalities are necessary to address more complex requirements and scenarios, pushing beyond the basic framework of data distribution and replication.

Distributed Transaction

Concurrency Control

To facilitate ACID (Atomicity, Consistency, Isolation, Durability) transactions, a storage engine must implement a mechanism for concurrency control. This control manages the interactions between transactions that are executed concurrently. Among the prevalent technologies for concurrency control are Optimistic Concurrency Control (OCC), Pessimistic (or conservative) Concurrency Control (PCC), and Multiversion Concurrency Control (MVCC). MVCC is widely favored, and it’s the method chosen by TiDB. For more information on other concurrency control mechanisms, additional resources are available. In the upcoming sections, we will delve into how TiKV implements MVCC, providing insights into its efficient management of concurrent data operations.

Consider a scenario where two clients simultaneously update the value of the same key. Without MVCC, the data would have to be locked, potentially causing performance bottlenecks and deadlocks in a distributed environment. TiKV addresses this by implementing MVCC, which appends a version to each key, allowing concurrent operations to be handled more efficiently.

As we discussed in the previous section, TiKV's logical data layout without MVCC looks like this:

Key1 -> Value1
Key2 -> Value2
……
KeyN -> ValueN

With MVCC, a version is appended to each key. The version is a timestamp generated by PD (Placement Driver). With MVCC, the data layout of TiKV looks like this:

Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
……
Key2-Version4 -> Value
Key2-Version3 -> Value
Key2-Version2 -> Value
Key2-Version1 -> Value
……
KeyN-Version2 -> Value
KeyN-Version1 -> Value

It’s important to note that for multiple versions of a Key in TiKV, the version with the larger number is placed first. This is in line with the Key-Value structure mentioned earlier, where the Key is part of an ordered array. By adopting this approach, when a user retrieves a Value using Key + <Version>, they can construct the MVCC Key as Key-<Version>. This allows them to directly execute Seek(Key-Version), efficiently locating the first position that is greater than or equal to this specific Key-Version. For a more in-depth understanding of this process, refer to this blog post about MVCC in TiKV, where these concepts are elaborated further.
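The sketch below shows one way to get this "larger version first" ordering with plain byte comparison: append the version encoded as big-endian (u64::MAX - version), so newer versions of the same key sort before older ones, and a Seek at Key-<Version> lands on the newest version that is less than or equal to the requested one. This encoding is illustrative only; TiKV's real MVCC key format also makes the user key itself memcomparable, which is omitted here.

use std::collections::BTreeMap;

// Append the version so that, for the same user key, larger versions sort
// first under plain byte order: big-endian (u64::MAX - version) achieves this.
fn mvcc_key(user_key: &[u8], version: u64) -> Vec<u8> {
    let mut k = user_key.to_vec();
    k.extend_from_slice(&(u64::MAX - version).to_be_bytes());
    k
}

fn main() {
    let mut kv: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    kv.insert(mvcc_key(b"Key1", 1), b"v1".to_vec());
    kv.insert(mvcc_key(b"Key1", 2), b"v2".to_vec());
    kv.insert(mvcc_key(b"Key1", 3), b"v3".to_vec());

    // Read Key1 as of version 2: seek to the first entry >= Key1-<2>.
    // Version 3 sorts before the seek key and is skipped; version 2 is hit.
    // (Real code must also check the entry still belongs to the same user key.)
    let (_, value) = kv.range(mvcc_key(b"Key1", 2)..).next().expect("found");
    assert_eq!(value, b"v2");
    println!("snapshot read at version 2 sees {}", String::from_utf8_lossy(value));
}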

2-Phase Commit

TiKV implements distributed transactions by utilizing a two-phase commit protocol, ensuring data consistency across multiple nodes. This is essential for maintaining the ACID properties (Atomicity, Consistency, Isolation, Durability) in a distributed environment.

In the first phase, the ‘prepare’ phase, TiKV locks the keys that are part of the transaction. This step is crucial to avoid conflicts with other transactions. Once all involved keys are locked and the changes are prepared, the transaction moves to the second phase.

The second phase is the ‘commit’ phase. Here, TiKV checks if all the involved keys are still locked and haven’t been modified by other transactions. If this check passes, the transaction is committed across all nodes, ensuring that all changes are applied simultaneously. If any issues are detected, such as a deadlock or a conflict, the transaction is rolled back to maintain data integrity.
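Below is a heavily simplified, single-process sketch of the two phases, just to show the shape of the protocol: prewrite locks the keys and detects conflicts, and commit makes the writes visible and releases the locks. The ToyStore type and its methods are invented for illustration; TiKV's Percolator-based implementation spreads this work across Regions, designates a primary lock, persists locks and data in separate RocksDB column families, and handles failure and rollback paths that are omitted here.

use std::collections::HashMap;

// A toy single-process model of the two phases; real TiKV spreads this across
// nodes, persists state in RocksDB, and survives crashes mid-transaction.
#[derive(Default)]
struct ToyStore {
    data: HashMap<Vec<u8>, (u64, Vec<u8>)>, // key -> (commit_version, value)
    locks: HashMap<Vec<u8>, u64>,           // key -> start_version of the locking txn
}

impl ToyStore {
    // Phase 1 (prewrite): lock every key in the transaction, failing on conflict.
    fn prewrite(&mut self, start_version: u64, writes: &[(Vec<u8>, Vec<u8>)]) -> bool {
        for (key, _) in writes {
            let locked = self.locks.contains_key(key);
            let newer = self.data.get(key).map_or(false, |(v, _)| *v > start_version);
            if locked || newer {
                return false; // conflict: another txn holds the key or committed after us
            }
        }
        for (key, _) in writes {
            self.locks.insert(key.clone(), start_version);
        }
        true
    }

    // Phase 2 (commit): make the writes visible at commit_version and release locks.
    fn commit(&mut self, commit_version: u64, writes: &[(Vec<u8>, Vec<u8>)]) {
        for (key, value) in writes {
            self.data.insert(key.clone(), (commit_version, value.clone()));
            self.locks.remove(key);
        }
    }
}

fn main() {
    let mut store = ToyStore::default();
    let writes = vec![(b"k1".to_vec(), b"v1".to_vec()), (b"k2".to_vec(), b"v2".to_vec())];
    if store.prewrite(10, &writes) {
        store.commit(11, &writes);
    }
    println!("committed: {}", store.data.len() == 2);
}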

TiKV’s approach to distributed transactions is designed to handle the complexities of a distributed database system, such as network delays and node failures. It ensures that transactions are executed atomically across multiple nodes, maintaining consistency and reliability, which are paramount in distributed systems. This makes TiDB suitable for scenarios that require strong consistency and high availability, even across large and distributed datasets.

The transaction model in TiKV is based on the Percolator model and includes several optimizations. Here is a great blog post about how TiDB implements distributed transactions.

Miscellaneous

So far, I have covered the fundamental concepts and intricate details of TiKV, highlighting its layered architecture as a distributed and transactional Key-Value engine. Additionally, I’ve delved into the implementation of multi-datacenter disaster recovery within this framework. In the next article, I will explore the construction of the SQL layer, building upon the storage model of Key-Value that underpins TiKV. This upcoming discussion aims to bridge the gap between the underlying storage mechanisms and the more familiar SQL interface, providing a comprehensive understanding of TiKV’s capabilities and design.


Li Shen

Author of TiDB. Focused on modern infrastructure software. Opinions are my own.