Distributed SQL Database Internals (4) — SQL Engine

Li Shen
13 min read · Jan 11, 2024


My previous blog post introduced how TiDB/TiKV stores data. As a quick recap, TiKV provides the following capabilities:

  • Stores data safely and quickly on the disk of a single machine with the help of RocksDB.
  • Replicates data across machines efficiently and consistently.
  • Automatically manages data by splitting or merging it into segments, referred to as “regions”, which ensures efficient data management and scalability.
  • Provides ACID transaction semantics for operations on Key-Value pairs.
  • Provides operations on Key-Value (KV) data, including Get/Set/Seek/Delete/…. It abstracts away and shields the complexities of underlying processes like data distribution and replication. From a higher-level perspective, TiKV can be viewed as a massive, ordered Key-Value map. This abstraction allows users to interact with the data without needing to manage the intricate details of its distributed architecture.

With these capabilities, it is possible to build an SQL layer on top of TiKV that provides full SQL functionality. In this article, I'll elaborate on TiDB's SQL layer, also known as the SQL engine. Let's set aside complex topics like Raft and RocksDB and focus on the SQL engine's logic.

The implementation of an SQL engine involves numerous components and concepts. To make them easier to grasp, I will use a specific example throughout this article. Below you will find the table structure definition and data for this example, providing a clear and practical context for understanding how the SQL engine operates.

Assume we have a table like this:

CREATE TABLE Employee (
    ID INT PRIMARY KEY,
    NAME VARCHAR(128),
    DEPARTMENT VARCHAR(64),
    AGE INT UNSIGNED,
    SSN CHAR(11)
);

And two indexes:

CREATE INDEX NAME_IDX ON Employee (NAME);
CREATE UNIQUE INDEX SSN_IDX ON Employee (SSN);

A few records in the table:

This is a very simple table, but it is enough to help you understand the essential parts of TiDB's SQL engine:

  • How table records are stored in TiKV
  • How indexes are stored in TiKV
  • How SQL metadata is managed
  • How TiDB reads data from TiKV
  • How the distributed SQL execution engine works

The Translation to Key-Value Pairs

Relational databases organize data into tables, which consist of rows, columns, and indexes. Each table has a defined schema specifying the columns and their data types. For instance, an Employee table might have columns such as ID, Name, Department, Age, and SSN. These fields store the corresponding attributes for each employee, enabling complex queries and data relationships. Indexes exist to optimize query performance. TiDB, as the SQL engine, translates this relational data model into a Key-Value model.

Record to Key-Value

Each row of the table is converted into a Key-Value pair, where the key is a unique identifier composed of the Table ID and the row’s primary key (or record ID). The value is a serialized string or binary blob representing the row’s data.

Key: TableID_RecordPrefix_PrimaryKey
Value: [col1, col2, col3, col4]
The RecordPrefix is a constant with the value "r".
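
To make this mapping concrete, here is a minimal sketch in Go of how one row of the Employee table could be turned into a Key-Value pair under this simplified scheme. The helper names, the flat string format, and the sample row are assumptions for illustration only; TiDB's real codec produces a compact, order-preserving binary encoding.

package main

import "fmt"

// encodeRecordKey builds a key of the form TableID_RecordPrefix_PrimaryKey,
// using the readable notation from the text instead of TiDB's binary codec.
func encodeRecordKey(tableID, primaryKey int64) string {
    return fmt.Sprintf("%d_r_%d", tableID, primaryKey)
}

// encodeRecordValue serializes the non-key columns of a row. TiDB actually
// uses a compact binary row format; a printable string is used here for clarity.
func encodeRecordValue(name, department string, age uint32, ssn string) string {
    return fmt.Sprintf("[%q, %q, %d, %q]", name, department, age, ssn)
}

func main() {
    // A hypothetical row of the Employee table (TableID = 1, ID = 1).
    key := encodeRecordKey(1, 1)
    value := encodeRecordValue("Alex", "Engineering", 30, "111-11-1111")
    fmt.Println(key, "=>", value)
    // Output: 1_r_1 => ["Alex", "Engineering", 30, "111-11-1111"]
}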

Index to Key-Value

TiDB supports both primary and secondary indexes. Indexes serve two purposes: speeding up queries (higher query performance) and enforcing constraints such as uniqueness. Indexes are also stored as Key-Value pairs in TiKV: for each index, every record gets one extra Key-Value pair. The key is constructed from the Table ID, the Index ID (generated when the index is created and unique within the table), and the value(s) of the indexed column(s). The value is a reference to the primary key. For example, since the Employee table has indexes on the NAME and SSN columns (NAME_IDX and SSN_IDX), TiDB generates keys for these indexes that reference the primary keys of the corresponding rows, allowing for efficient search and retrieval.

The encoding mechanism of TiDB is as follows:

Unique Index:
Key: TableID_IndexPrefix_IndexID_Value(s)
Value: PrimaryKey

Non-Unique Index:
Key: TableID_IndexPrefix_IndexID_Value(s)_PrimaryKey
Value: NULL

The IndexPrefix is also a constant with the value "i".
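
Continuing the sketch above, the two indexes of the Employee table could be encoded as follows. Again, the helper names and the readable string format are assumptions; the real codec is binary and order-preserving.

package main

import "fmt"

// Unique index: the indexed value alone identifies the entry,
// and the primary key is stored in the value.
func encodeUniqueIndexKV(tableID, indexID int64, colValue string, primaryKey int64) (string, string) {
    key := fmt.Sprintf("%d_i_%d_%s", tableID, indexID, colValue)
    return key, fmt.Sprintf("%d", primaryKey)
}

// Non-unique index: the primary key is appended to the key so that rows with
// the same indexed value still get distinct keys; the value is left empty.
func encodeNonUniqueIndexKV(tableID, indexID int64, colValue string, primaryKey int64) (string, string) {
    key := fmt.Sprintf("%d_i_%d_%s_%d", tableID, indexID, colValue, primaryKey)
    return key, ""
}

func main() {
    // TableID = 1; NAME_IDX (IndexID 1) is non-unique, SSN_IDX (IndexID 2) is unique.
    k1, v1 := encodeNonUniqueIndexKV(1, 1, "Alex", 1)
    k2, v2 := encodeUniqueIndexKV(1, 2, "111-11-1111", 1)
    fmt.Printf("%s => %q\n%s => %q\n", k1, v1, k2, v2)
    // Output:
    // 1_i_1_Alex_1 => ""
    // 1_i_2_111-11-1111 => "1"
}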

Encoding Algorithm

We carefully design the encoding algorithm for keys so that the comparison relation is preserved by encoding: for any data type, comparing two values before encoding gives the same result as comparing their encoded byte arrays. This guarantees that record and index data are stored in TiKV in the right order. For more detailed information, please refer to the codec package of TiDB. With this encoding, all record data of a table is arranged in TiKV's key space in primary-key order, and the data of a given index is stored in order of the indexed column value(s).
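
To give a flavor of what "order-preserving" means, here is one common technique for integers, similar in spirit to what the codec package does (the real package covers strings, decimals, and many other types):

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// encodeInt64 produces bytes whose memcmp order matches the numeric order of
// the input: flipping the sign bit maps the int64 range monotonically onto
// uint64, and big-endian keeps byte order aligned with numeric order.
func encodeInt64(v int64) []byte {
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], uint64(v)^(1<<63))
    return buf[:]
}

func main() {
    a, b := encodeInt64(-10), encodeInt64(3)
    // Comparing the encoded bytes agrees with the numeric comparison -10 < 3.
    fmt.Println(bytes.Compare(a, b) < 0) // true
}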

Let's use the previous example to show the logical view of the record and index data stored in TiKV. (Assume the Table ID is 1, the Index ID of NAME_IDX is 1, and the Index ID of SSN_IDX is 2.)

You can see that both record and index data are encoded with the Table ID as a key prefix, so all the data of a particular table is stored in a contiguous key range. Since Key-Value pairs are stored in key order, the SQL engine can leverage this characteristic to do filtering or sorting by scanning specific range(s) of index data. For example:

  • Point query, which uses an equality condition on a primary key or unique key, e.g. `SELECT name FROM Employee WHERE id = 1;`, locating a single row of data through an index.
  • Range filtering, e.g. `SELECT name FROM Employee WHERE SSN > '111-11-0000' AND SSN < '222-22-5555';`

These queries can leverage indexes to speed up execution: scanning a small range of index data is much faster than scanning all the records. After scanning the index data, since the primary key is stored in the index Key-Value entry, the execution engine can locate the corresponding records for further processing. This is essentially how a globally ordered, distributed Key-Value store helps an SQL engine.

Putting it all together:

The SQL engine converts relational data into Key-Value pairs according to the table metadata, then writes them into TiKV. TiKV hides the complexity of data distribution, splitting, and replication, so the SQL engine can treat it as a giant, ordered Key-Value map. This abstraction makes the SQL engine's work much easier.

Metadata Management

Having explained how a table's records and indexes are mapped to Key-Value pairs, this section introduces how metadata is stored.

The metadata of TiDB is similar to the Data Dictionary of MySQL. It stores the description of the actual data within the database. It’s essentially information about the database’s structure, configuration, and characteristics, rather than the actual data (like customer names, product information, etc.). Common types of metadata in SQL databases include:

  • Table Metadata: This includes information about the database tables, such as table names, the number of columns in each table, and the data type of each column (e.g., integer, text, date). It also involves constraints (like primary keys, foreign keys, and unique constraints) and indexes.
  • Column Metadata: Details about each column in a table, such as column names, data types, whether null values are allowed, default values, and any constraints specific to the column.
  • Index Metadata: Information about the indexes created on the tables, including index names, the columns they are associated with, and the type of index (e.g., unique, non-unique).
  • User and Privilege Metadata: Details about the database users and their privileges, such as which users can access which tables, and the types of operations (select, insert, update, delete) they can perform.
  • Database Schema Information: The overall structure of the database, including the schema names and how the tables are organized within these schemas.
  • Database Configuration Metadata: Information about the database configuration settings, such as character encoding, timezone settings, and other parameters that control the database’s behavior.

This metadata is crucial for understanding the database’s structure and behavior and is used by database administrators and developers for managing, maintaining, and optimizing the database. SQL databases typically provide system tables or information schema views that allow users to query and examine this metadata.

Metadata Storage

In TiDB, all of this information needs to be persisted in TiKV, but it is stored in two different ways.

The most fundamental metadata, such as the definitions of databases, tables, and indexes, is stored in TiKV as Key-Value (KV) pairs. Each database/table gets a unique ID, and this ID serves as its unique identifier. When encoding into Key-Value, this ID is encoded into the key with the prefix m_. The value is encoded JSON data that represents the definition or structure of the database/table/index. The SQL engine uses the lower-level KV APIs (Get/Set/Seek) to interact with this metadata in TiKV.
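
As a hedged sketch of this first style, the snippet below serializes a table definition as JSON and builds a key from the m_ prefix and the table ID. The TableInfo struct and the exact key layout are simplified assumptions; TiDB's real table metadata structure has many more fields.

package main

import (
    "encoding/json"
    "fmt"
)

// TableInfo is a drastically simplified stand-in for TiDB's table metadata.
type TableInfo struct {
    ID      int64    `json:"id"`
    Name    string   `json:"name"`
    Columns []string `json:"columns"`
}

func main() {
    tbl := TableInfo{ID: 1, Name: "Employee",
        Columns: []string{"ID", "NAME", "DEPARTMENT", "AGE", "SSN"}}
    // The unique table ID goes into the key under the "m_" prefix; the value is
    // the JSON-encoded definition, written and read via the low-level KV API.
    key := fmt.Sprintf("m_Table_%d", tbl.ID)
    value, _ := json.Marshal(tbl)
    fmt.Println(key, "=>", string(value))
}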

Other metadata, such as user privileges, statistics, and system variables, is stored in normal tables. You can find the schemas of those tables here. In this way, we can leverage the higher-level SQL APIs to manage this metadata: operations on user privileges are converted into operations (insert/select/update/delete) on the corresponding tables. This makes things easier.

Metadata Retrieval

Metadata is persisted in TiKV. However, it is not efficient to read it from TiKV every time it is needed. For instance, checking whether a query matches the schema of a table requires low-latency access to the metadata, and loading table metadata from TiKV for every SQL execution is not affordable. So most metadata is preloaded and cached in the memory of the SQL engine, and the caches are updated periodically to pick up changes.
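
The caching pattern itself is simple; below is a minimal sketch, assuming a hypothetical loadFromTiKV loader. TiDB's real mechanism (versioned schemas plus a schema lease) is considerably more involved.

package schemacache

import (
    "sync"
    "time"
)

// Cache keeps table definitions in memory so that statements do not need to
// read metadata from TiKV on every execution.
type Cache struct {
    mu     sync.RWMutex
    tables map[string]string // table name -> serialized definition
}

// Get serves a table definition from the in-memory snapshot.
func (c *Cache) Get(name string) (string, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    def, ok := c.tables[name]
    return def, ok
}

// RefreshLoop periodically swaps in a fresh snapshot loaded from storage;
// loadFromTiKV stands in for the real metadata reader.
func (c *Cache) RefreshLoop(interval time.Duration, loadFromTiKV func() map[string]string) {
    for range time.Tick(interval) {
        snapshot := loadFromTiKV()
        c.mu.Lock()
        c.tables = snapshot
        c.mu.Unlock()
    }
}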

Caching metadata in memory benefits performance. However, since there are multiple SQL engine nodes in a cluster, maintaining the consistency of metadata is a challenge. For example, suppose table metadata is not consistent across the cluster: one node is performing a schema change such as adding a column, while another node is running a DML statement such as inserting a record. The mismatch of schemas cached on different nodes may cause data inconsistency issues.

Schema Change

To resolve schema metadata inconsistency, TiDB employs an online schema change mechanism. A dedicated background thread monitors TiKV for any alterations to the recorded schema; upon detecting a change, it ensures that the update is picked up within a predetermined timeframe. Throughout the process of schema evolution, TiDB guarantees the safe and consistent rollout of the new schema across the entire cluster. For more detailed information, please refer to The Implementation of Asynchronous Schema Change of TiDB.
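
The key idea of that design, which follows the approach of Google's F1 online schema change paper, is that a new index or column moves through intermediate states, and at any moment no two nodes in the cluster are more than one state apart. The constants below are a simplified sketch of those states, not TiDB's exact definitions.

package ddl

// SchemaState sketches the intermediate states an index or column passes
// through during an online schema change. Each transition happens only after
// every node is guaranteed to have loaded the previous state.
type SchemaState int

const (
    StateNone       SchemaState = iota // the object does not exist yet
    StateDeleteOnly                    // only deletes maintain it; it is invisible to reads and inserts
    StateWriteOnly                     // all writes maintain it, but reads still ignore it
    StateWriteReorg                    // existing data is backfilled (e.g., building index entries)
    StatePublic                        // fully visible and usable by queries
)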

Safety and consistency are must-haves for schema management, and the online schema change mechanism provides them. As a distributed database, the data volume stored in a table may be huge: not just millions but billions or even trillions of records. DDL performance may be a nice-to-have for monolithic SQL databases, but it is a must-have for a distributed SQL database. Starting from v7.1.0, TiDB introduces a Distributed Execution Framework (DXF) to further leverage the resource advantages of the distributed architecture. The goal of the DXF is to implement unified scheduling and distributed execution of tasks, and to provide unified resource management capabilities for both overall and individual tasks, which better meets users' expectations for resource usage. For more information about DXF, see the design doc.

Query Execution

Understanding how TiDB’s SQL engine processes queries and reads data from TiKV is fundamental to appreciating its powerful capabilities. TiDB processes queries in a methodical fashion, designed to efficiently utilize the distributed nature of its storage layer, TiKV. Here is a breakdown of the steps involved:

  1. Generating Query Execution Plans: The initial step in query processing is the creation of potential execution plans. TiDB evaluates the SQL query’s structure, including its predicates, aggregation functions, and sort requirements, and uses its data encoding algorithm to determine which parts of the data should be accessed. These parts could be specific ranges of records or indexes within the TiKV nodes.
  2. Plan Selection via Cost Estimation: Once TiDB has a set of potential plans, it employs a cost-based approach to select the most efficient execution path. It uses statistics about the data distribution, such as row counts and index selectivity, to estimate the cost of each plan. The planner then chooses the plan that is expected to consume the least amount of resources and complete the query as quickly as possible (a toy illustration follows this list).
  3. Executing the Optimal Plan: With the optimal plan selected, TiDB proceeds to the execution phase. This step involves the actual data retrieval and processing tasks as per the plan’s directives. The execution process is highly optimized to leverage the parallel processing capabilities of TiDB, ensuring a swift response to the query.
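
To illustrate step 2, here is a toy cost comparison between a full table scan and an index lookup. The constants and formulas are made up for illustration; TiDB's optimizer relies on much richer statistics (such as histograms and Count-Min sketches) and cost formulas.

package main

import "fmt"

// chooseAccessPath compares a crude cost estimate for scanning every row with
// one for reading only the matching index entries plus the extra row lookups.
func chooseAccessPath(totalRows, selectivity float64) string {
    const rowScanCost = 1.0     // cost to scan one row
    const indexLookupCost = 2.0 // index entry read plus the lookup of the full row

    fullScanCost := totalRows * rowScanCost
    indexScanCost := totalRows * selectivity * indexLookupCost
    if indexScanCost < fullScanCost {
        return "index scan"
    }
    return "full table scan"
}

func main() {
    // A highly selective predicate favors the index;
    // a predicate matching most rows favors scanning the whole table.
    fmt.Println(chooseAccessPath(1000000, 0.001)) // index scan
    fmt.Println(chooseAccessPath(1000000, 0.9))   // full table scan
}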

Throughout this process, TiDB aims to minimize the query’s execution time and resource consumption while ensuring that the results are accurate and consistent with the data’s current state.

Let’s see some concrete examples of how TiDB executes queries.

Query Processing and Data Reading from TiKV

TiDB operates by translating SQL queries into operations on a distributed key-value store. The process of reading data from TiKV varies depending on the query’s requirements. Let’s explore two examples: a full table scan and index access using a secondary index.

Example 1: Full Table Scan

When a query requires a full table scan, such as `SELECT * FROM Employee;`, TiDB fetches all data related to the Employee table. TiDB performs the following steps (a sketch follows the list):

  • Identifying Key Ranges: It first identifies the key range that encompasses all the data of the Employee table. The range begins with the smallest possible key and extends to the largest possible key for that table; both can always be derived from the data type of the primary key.
  • Retrieving Data: TiDB then scans the keys in this range from TiKV, retrieving all row data corresponding to each Employee record.
  • Assembling Results: The row data, which includes all fields (ID, NAME, DEPARTMENT, AGE, SSN), is decoded and assembled into the result set that the SQL query requires.
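
Here is a minimal sketch of those steps, reusing the order-preserving integer encoding shown earlier. The scan parameter is a hypothetical stand-in for TiKV's range-scan API, not the real client interface, and is assumed to return the values of all keys in [startKey, endKey].

package fullscan

import (
    "encoding/binary"
    "fmt"
    "math"
)

// recordKey builds TableID_r_<encoded ID> with an order-preserving integer
// encoding, so a byte-wise range scan visits rows in primary-key order.
func recordKey(tableID, id int64) []byte {
    key := []byte(fmt.Sprintf("%d_r_", tableID))
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], uint64(id)^(1<<63))
    return append(key, buf[:]...)
}

// FullTableScan covers every possible INT primary key of the Employee table
// (TableID = 1) with a single range scan; decoding the returned rows into the
// result set is omitted here.
func FullTableScan(scan func(startKey, endKey []byte) [][]byte) [][]byte {
    start := recordKey(1, math.MinInt32) // smallest possible INT primary key
    end := recordKey(1, math.MaxInt32)   // largest possible INT primary key
    return scan(start, end)
}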

Example 2: Index Access Path

For a query that utilizes an index, like `SELECT * FROM Employee WHERE NAME = 'Alex';`, TiDB uses the secondary index NAME_IDX to access the data (a sketch follows the list):

  • Locating Index Keys: TiDB locates the keys in TiKV that correspond to the NAME_IDX index where the NAME column matches 'Alex'.
  • Retrieving Row Keys: These index entries contain the keys of the rows where NAME is 'Alex'.
  • Fetching Rows: TiDB uses these row keys to fetch the corresponding rows from the table.
  • Delivering Results: The complete row information for each Employee named 'Alex' is then returned.
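
Under the simplified string encoding from earlier, these four steps boil down to a prefix scan over the index entries followed by point reads of the rows. A minimal sketch, with an in-memory map standing in for TiKV and hypothetical row contents (a real TiKV scan is an ordered range scan; a Go map is used here only for brevity):

package main

import (
    "fmt"
    "strings"
)

// store is an in-memory stand-in for TiKV, holding one index entry of
// NAME_IDX (TableID 1, IndexID 1) and the row it points to.
var store = map[string]string{
    "1_i_1_Alex_1": "", // non-unique index entry: the key ends with the primary key
    "1_r_1":        `["Alex", "Engineering", 30, "111-11-1111"]`, // the row itself
}

// lookupByName resolves `SELECT * FROM Employee WHERE NAME = ?`.
func lookupByName(name string) []string {
    var rows []string
    prefix := fmt.Sprintf("1_i_1_%s_", name)
    // Step 1: find all index keys with the prefix; each suffix is a primary key.
    for key := range store {
        if strings.HasPrefix(key, prefix) {
            pk := strings.TrimPrefix(key, prefix)
            // Step 2: fetch the full row using that primary key.
            if row, ok := store["1_r_"+pk]; ok {
                rows = append(rows, row)
            }
        }
    }
    return rows
}

func main() {
    fmt.Println(lookupByName("Alex"))
}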

Distributed SQL Execution

TiDB’s architecture leverages TiKV’s distributed capabilities through a component known as the Coprocessor. Here’s how TiDB uses TiKV to execute queries in a distributed environment:

  1. Coprocessor Mechanism: TiKV includes a feature called the Coprocessor, which allows it to perform computation tasks locally. This means that instead of simply acting as a passive storage layer, TiKV nodes can process data, reducing the amount of data transferred over the network and speeding up query execution.
  2. Query Plan Generation and Pushdown Decision: The SQL layer in TiDB is responsible for generating query execution plans. It analyzes these plans to determine which operations can be pushed down to be executed directly on the TiKV nodes. Operations that are commonly pushed down include filters (WHERE clauses), aggregations (SUM, COUNT, AVG), and some sort-plus-limit (TopN) operations.
  3. Dispatching Coprocessor Tasks: Once the SQL layer decides on the plan and identifies the operations that can be pushed down, it dispatches Coprocessor tasks to the TiKV nodes that hold the relevant data segments. The distribution of tasks is based on key ranges identified in the query plan that correspond to the data’s location within the TiKV cluster.
  4. Local Data Processing in TiKV Nodes: Each TiKV node processes its portion of the data locally. By executing tasks in parallel across different nodes, TiKV effectively utilizes distributed computing resources. The nodes read the necessary data from their local storage and process it according to the instructions in the Coprocessor tasks.
  5. Aggregating Results at SQL Layer: After the TiKV nodes have processed their respective data, the SQL layer collects the intermediate results from all the relevant nodes. It then performs any necessary aggregation, sorting, or additional computations required to construct the final result set.

By using the Coprocessor for distributed query execution, TiDB efficiently distributes the computational load across multiple nodes, minimizing network overhead and accelerating the query processing time. This approach ensures TiDB can handle large-scale data with high performance and scalability.

Example: Aggregating Data across Nodes

Consider a query that calculates the average age of employees in each department: `SELECT DEPARTMENT, AVG(AGE) FROM Employee GROUP BY DEPARTMENT;`.

  • Distributing Tasks: The query is broken down into smaller Coprocessor tasks distributed to the TiKV nodes holding parts of the Employee table data.
  • Local Computations: Each TiKV node computes partial aggregates for the data it holds, grouped by DEPARTMENT; for AVG this means a sum of AGE and a row count rather than a local average.
  • Partial Aggregation: These partial results are then returned to TiDB.
  • Final Aggregation: Once TiDB has received all partial results, it merges them to compute the overall average age per department.
  • Result Set: The final result set with DEPARTMENT and the calculated average AGE is then returned to the client (see the sketch after this list).
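
The crucial detail is that the partial results are not averages but (sum, count) pairs, which can be merged safely across nodes. Below is a minimal sketch of that final merge, with made-up per-node data.

package main

import "fmt"

// partial is what each TiKV node returns for AVG(AGE) ... GROUP BY DEPARTMENT:
// a sum and a row count per department, so results from different nodes can
// be combined without losing accuracy.
type partial struct {
    sum   int64
    count int64
}

// finalAggregate merges the per-node partial results and only then computes
// the averages, mirroring the last step performed by the SQL layer.
func finalAggregate(perNode []map[string]partial) map[string]float64 {
    merged := map[string]partial{}
    for _, node := range perNode {
        for dept, p := range node {
            m := merged[dept]
            m.sum += p.sum
            m.count += p.count
            merged[dept] = m
        }
    }
    avg := map[string]float64{}
    for dept, p := range merged {
        avg[dept] = float64(p.sum) / float64(p.count)
    }
    return avg
}

func main() {
    // Two TiKV nodes each hold a part of the Employee table.
    node1 := map[string]partial{"Engineering": {sum: 60, count: 2}}
    node2 := map[string]partial{"Engineering": {sum: 40, count: 1}, "Sales": {sum: 35, count: 1}}
    fmt.Println(finalAggregate([]map[string]partial{node1, node2}))
    // Engineering: (60+40)/3 ≈ 33.3, Sales: 35
}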

The Architecture of the SQL Engine

The previous sections introduced some functions of the SQL layer, and I hope you now have a basic idea of how an SQL statement is processed. In reality, TiDB's SQL layer is much more complicated, with many modules and layers. The following diagram lists the important modules and their call relationships:

SQL requests are sent, directly or via a load balancer, to the SQL layer (which we call TiDB or TiDB-server). It parses the MySQL protocol packet for the requested content, then performs syntax parsing, builds and optimizes the query plan, and executes the plan to access and process data. Since all data is stored in the TiKV cluster, TiDB-server needs to interact with TiKV during this process. Finally, TiDB-server returns the query result to the user.

Summary

Thus far, we’ve explored data storage and operational procedures through the lens of SQL. Insights into the SQL layer — including optimization principles and the intricacies of the distributed execution framework — await in forthcoming discussions.

In our next installment, we’ll delve into the intricacies of Placement Driver (PD), focusing particularly on cluster management and scheduling mechanisms. This promises to be a fascinating exploration, as we uncover the typically unseen yet critical elements that ensure TiDB’s seamless functionality across the entire cluster.
