Introduction to Apache Doris Storage Layer Design — Explanation of Storage Structure Design

Apache Doris
13 min readJul 8, 2022

Author: ZhangYu0123

Apache Doris is an interactive SQL data warehouse based on MPP structure, mainly used for real-time reporting and multidimensional analysis, and the efficient import and query of Doris can not be achieved without the sophisticated design of its storage structure. In this study, we analyze the implementation principle of the Doris BE module storage layer in detail by reading the code of the Doris BE module, and explaining and deciphering the core technology behind the efficient writing and querying ability of Doris. It includes the design of Doris column storage, index design, data read/write process, Compaction process, and other functions.

Here are three articles to introduce it step by step, which are:

  • “Introduction to Doris Storage Layer Design 1 — Explanation of Storage Structure Design”
  • “Introduction to Doris Storage Layer Design 2 — Analysis of Write and Delete Processes”
  • “Introduction to Doris Storage Layer Design 3 — Analysis of Reading and Compaction Processes”

Doris’ official website: http://doris.apache.org

Doris Github: https://github.com/apache/doris

This article is the first “Introduction to Doris Storage Layer Design 1 — Explanation of Storage Structure Design”, which introduces the storage layer structure of Segment V2, including ordered storage, sparse index, prefix index, bitmap index, BloomFilter, and other rich functions, which can cope with various complex scenarios and provide extremely fast query capability.

Design Goals

  • Bulk import, few updates
  • Majority of reading requests
  • Wide table scenario, read a large number of rows and a small number of columns
  • Non-transaction scenarios
  • Good extensibility

Storage file format

Storage directory structure

The management of storage data in the storage layer is configured via storage_root_path, which can be multiple paths. The next level of the storage directory is organized in buckets, which store specific tablets and name the subdirectories according to tablet_id.

Segment files are stored in the tablet_id directory and managed by SchemaHash. Segment files can have more than one, and are generally split by size, with the default being 256MB. The Segment v2 file naming rule is: ${rowset_id}_${segment_id}.dat. The specific storage directory storage format is shown in the following figure.

Segment v2 File Structure

The overall file format of Segment is divided into three parts: data area, index area, and footer, as shown in the following figure.

  1. Data Region: Used to store the data information of each column, the data here is loaded on separate pages on demand.
  2. Index Region: Doris stores the index data of each column in the Index Region, the data here is loaded according to the column granularity, so it is stored separately from the column data information.
  3. Footer information:
  • SegmentFooterPB: Defines the metadata information of the file
  • 4-byte checksum of FooterPB content
  • 4-byte FileFooterPB message length for reading FileFooterPB
  • 8 bytes of MAGIC CODE, which is stored at the end to facilitate the identification of file types in different scenarios

4. Footer information

The Footer information segment is at the end of the file and stores the overall structure of the file, including the location of the data fields, the location of the index fields, etc. There are four parts: SegmentFooterPB, CheckSum, Length, and MAGIC CODE.

SegmentFooterPB data structure is as follows.

SegmentFooterPB uses PB format for storage, which mainly contains column meta information, index meta information, short key index information of Segment, and a total number of rows.

Column meta information

  • ColumnId: the serial number of the current column in the schema
  • UniqueId: the globally unique id
  • Type: the type information of the column
  • Length: the length information of the column
  • Encoding: an encoding format
  • Compression: Compression format
  • Dict Page Pointer: dictionary information

Column index meta information

  • OrdinalIndex: holds the meta information of the sparse index of the column.
  • ZoneMapIndex: stores the meta information of the ZoneMap index, including the maximum value, minimum value, whether there is a null value, and whether there is no non-null value. SegmentZoneMap stores the global ZoneMap information, and PageZoneMaps stores the statistical information of each page.
  • BitMapIndex: stores the meta information of BitMap index, including BitMap type, dictionary data BitMap data.
  • BloomFilterIndex: stores the BloomFilter index information.

In order to prevent the index itself from having too much data, ZoneMapIndex, BitMapIndex, BloomFilterIndex adopt two levels of Page management. When a Page is sufficient, the current Page directly stores the index data, i.e., a 1-level structure is used; when a Page is not enough, the index data is written to a new Page, and the Root Page stores the address information of the data Page.

Ordinal Index

Ordinal Index provides the ability to find the physical address of a Column Data Page by its row number. Ordinal Index can store data by column aligned by row, which can be understood as a first-level index. All other indexes look for data by Ordinal Index to find the location of the Data Page. Therefore, the Ordinal Index is introduced here first.

In a segment, data is always stored in the order of key (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY), i.e., the ordering of key determines the physical structure of data storage. After determining the physical structure of column data order, when writing data, Column Data Page is managed by the Ordinal index, which records the position offset, size, and first data item row number information of each Column Data Page, i.e. Ordinal. Ordinal index uses a sparse index structure, which is like a book table of contents, recording the page number corresponding to each chapter.

Storage structure

The Ordinal index meta information is stored in the OrdinalIndexMeta of each column in the SegmentFooterPB. The specific structure is shown in the following figure.

In OrdinalIndexMeta, the root page address of the index data is stored. When there is only one page of data, the address here can directly point to the unique data page; when one page is not enough, it points to the secondary structure index page of OrdinalIndex type. Each data item in the index data corresponds to the offset position, size, and ordinal row number information of the Column Data Page. The Ordinal index granularity is the same as the page granularity, and the default is 64*1024 bytes.

Column Data Storage

Column data is stored in blocks according to the unit of Page, and the size of each Page is generally 64*1024 bytes. the location and size of the Page in the storage are managed by the ordinal index.

Data Page Storage Structure

DataPage is mainly composed of two parts: Data and Page Footer.

The Data section holds the data of the current Page’s columns. When Null values are allowed, a separate Bitmap of Null values is stored, and the row number of Null values is recorded by RLE format encoding with bool type.

Page Footer contains the Page Type, Uncompressed Size, FirstOrdinal RowId of the first row of the current Page, NumValues for the number of rows of the current Page, NullMapSize corresponds to the size of the NullBitmap.

Data Compression

Different encodings are used for different field types. By default, the following correspondence is used for the different types.

  • TINYINT/SMALLINT/INT/BIGINT/LARGEINT - BIT_SHUFFLE
  • FLOAT/DOUBLE/DECIMAL — BIT_SHUFFLE
  • CHAR/VARCHAR — DICT
  • BOOL — RLE
  • DATE/DATETIME — BIT_SHUFFLE
  • HLL/OBJECT — PLAIN

By default, the data is compressed using the LZ4F format.

Short Key Index

Storage Structure

Short Key Index is an indexing method to query data quickly based on the key (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY) sorting.

Short Key Index also uses a sparse index structure, which generates an index item every certain number of rows during the data writing process. This number of rows is the default index granularity of 1024 rows, which can be configured. The process is shown in the following figure.

The KeyBytes hold the index item data and the OffsetBytes hold the offset of the index item in the KeyBytes.

Index Generation Rules

Short Key Index uses the first 36 bytes as the prefix index for this row of data. When a VARCHAR type is encountered, the prefix index is directly truncated.

Application Examples

  • The prefix index of the following table structure is user_id (8Byte) + age (4Bytes) + message (prefix 24 Bytes).
  • The prefix index of the following table structure is user_name (20 Bytes). Even if it does not reach 36 bytes, it is directly truncated and does not continue further because it encounters VARCHAR.

When our query condition is prefixed with a prefix index, it can greatly speed up the query. For example, in the first example, we execute the following query:

SELECT * FROM table WHERE user_id=1829239 and age=20;

This query will be far more efficient than the following query:

SELECT * FROM table WHERE age=20;

So when building a table, the correct choice of column order can greatly improve the query efficiency.

ZoneMap Index

ZoneMap index stores the statistics of Segment and each column corresponding to each Page. These statistics can help speed up the query and reduce the amount of scanned data. The statistics include information about MinMax, MaxMin, HashNull, HasNotNull.

Storage Structure

The ZoneMap index storage structure is shown in the following figure.

In the SegmentFootPB structure, each ColumnIndexMeta holds the ZoneMapIndex data information for the current column.ZoneMapIndex has two parts, SegmentZoneMap and PageZoneMaps. SegmentZoneMap stores the global ZoneMap index information of the current Segment, and PageZoneMaps stores the ZoneMap index information for each Data Page.

PageZoneMaps corresponds to the IndexedColumnMeta structure of the Page information stored in the indexed data, which is not compressed and encoded as Plain. The OrdinalIndexPage in IndexedColumnMeta refers to the offset and size of the root page of the indexed data, which is also optimized for secondary Page optimization. When there is only one DataPage, OrdinalIndexMeta points directly to this DataPage; when there are multiple DataPage, OrdinalIndexMeta points to OrdinalIndexPage first. OrdinalIndexPage is a secondary Page structure, the data items inside are the address offset, size, and ordinal information of the indexed DataPage.

Index generation rules

Doris turns on ZoneMap indexing for key columns by default; when the table model is DUPULCATE, ZoneMap indexing is turned on for all fields. When the column data is written to the Page, the data is automatically compared and the ZoneMap of the current Segment and the ZoneMap index information of the current Page is continuously maintained.

Application Examples

When querying data, the fields that are filtered by range conditions will be selected for scanning according to the ZoneMap statistics. For example, in Case 1, the age field is filtered. The query statement is as follows.

SELECT * FROM table WHERE age > 20 and age < 1000

If the Short Key Index is not hit, the ZoneMap index is used to find the ordinary range of data that should be scanned according to the query condition of age in the condition statement, and the number of pages to be scanned is reduced.

BloomFilter

Doris provides BloomFilter index when some fields can not use Short Key Index and the field exists with a relatively large degree of distinction.

Storage Structure

The storage structure of BloomFilter is shown in the following figure.

The information of BloomFilterIndex stores the Hash policy and Hash algorithm which it produced and the corresponding data Page information after BloomFilter.

The Hash algorithm uses HASH_MURMUR3, the Hash strategy uses BlockSplitBloomFilter, and the expected false positive rate FPP is configured to 0.05 by default.

The storage of BloomFilter indexed data corresponding to the data Page is similar to ZoneMapIndex, and the optimization of the secondary Page is done, so we will not elaborate here.

Index generation rules

BloomFilter is generated at Page granularity, when data is written to a complete Page, Doris will generate the BloomFilter index data for this Page according to the Hash strategy at the same time. Currently, BloomFilter does not support tinyint/hll/float/double types, but all other types are supported. To use it, you need to specify in PROPERTIES the fields of bloom_filter_columns to use for the BloomFilter index.

Application Examples

In a data query, the query criteria are filtered on the fields with the BloomFilter set. When the BloomFilter does not hit, it means that there is no data on the page, so the number of pages to be scanned can be reduced.

For example, the schema of the table is as follows.

The query SQL here is as follows.

SELECT * FROM table WHERE name = ‘zhangsan’

Due to the high discrimination of name, in order to improve the query performance of SQL, the bloomfilter index, properties (“bloom_filter_columns” = “name”), is added to the name data. A large number of pages can be filtered out through the bloomfilter index during query.

Bitmap Index

Doris also provides a BitmapIndex to speed up the querying of data.

Storage Structure

Bitmap storage format is as follows.

The meta information of BitmapIndex is also stored in SegmentFootPB. BitmapIndex contains three parts, the type of BitMap, the dictionary information DictColumn, and the bitmap index data information BitMapColumn. The DictColumn and BitMapColumn both correspond to the IndexedColumnData structure, which stores the Page address offset and size of dictionary data and index data respectively.

The difference with other index storage structures is that DictColumn is compressed with LZ4F, and the first value in Data Page is stored when recording the secondary Page offset.

Index Generation Rules

BitMap is created by CREATE INDEX, and the index of Bitmap is the index of the Column field in the whole Segment, instead of generating a separate one for each Page.

When writing data, a map structure is maintained to record the row number corresponding to each key value, and a Roaring bitmap is used to encode the rowid. The main structure is as follows.

When generating index data, the dictionary data is written first, and the key value of the map structure is written to the DictColumn.

Then, the key corresponds to the rowid encoded by Roaring and writes the data to the BitMapColumn in byte form.

Application examples

In the data query, for data columns with little differentiation and a small column base, the bitmap index can be used for optimization. For example, gender, marriage, geographic information, etc.

For example, the schema of the table is as follows:

The query SQL here is as follows:

SELECT * FROM table WHERE city in (“beijing”, “Shanghai”)

Since city has fewer values, the data dictionary and bitmap can be created to quickly find the matching rows by scanning the bitmap. And after the bitmap is compressed, the amount of data itself is small, so by scanning less data, it is possible to match the whole column accurately.

Query process of index

When querying the data in a Segment, the data is first filtered according to the field indexed according to the query conditions executed. Then the data is read, the overall query process is as follows:

  1. First, a row_bitmap is constructed according to the number of rows in the Segment, indicating the data to be read, and all data to be read if no index is used.
  2. When the key is used in the query condition according to the prefix index rule, it will be filtered by ShortKey Index first, and the ordinal row number range that can be matched in ShortKey Index will be merged into the row_bitmap.
  3. When the BitMap Index exists in the column field of the query condition, the ordinal row numbers that match the condition will be checked directly according to the BitMap Index and filtered with the row_bitmap. The filtering here is exact, after removing the query condition, the field will not be filtered by the index later.
  4. When the BloomFilter index exists in the column field of the query condition and the condition is equal (eq, in, is), it will be filtered by BloomFilter index, here it will go through all indexes and filter the BloomFilter of each Page to find all Pages that can be hit by the query condition. The row_bitmap intersects the filter.
  5. When there is a ZoneMap index in the column field of the query condition, it will be filtered by the ZoneMap index, here we will also go through all the indexes and find out all the Pages whose query condition can intersect with the ZoneMap. intersect the ordinal row number range in the index information with the row_bitmap.
  6. After generating the row_bitmap, we find the specific Data Page by the OrdinalIndex of each Column in the batch.
  7. Each column of the Column Data Page is read in batch. When reading, for the page with a null value, determine whether the current row is null according to the null value bitmap and if it is null, fill it directly.

Summary

Apache Doris currently uses a fully columnar storage structure and provides rich indexes for different query scenarios, providing a solid foundation for Doris’ efficient write and query performance.

Contact US

Apache Doris Website:http://doris.apache.org

Apache Doris Github:https://github.com/apache/doris

Apache Doris Mailing list:dev@doris.apache.org

--

--