The Hierarchy of files in StarRocks Cloud Native

Published in

StarRocks Engineering

7 min readAug 13, 2024

Introduction

In the product of StarRocks’ cloud native, the data file organization format has been redesigned to better adapt to the characteristics of object storage. This article will use SSB lineorder table as an example to illustrate how data files are organized on object storage.

Table Storage Path

When creating a table in a cloud native cluster, you can specify a Storage Volume in the parameters (for more information on Storage Volume, please refer to ). The Storage Volume specifies the storage bucket and sub-path. Once the table is successfully created, all its data will be stored under the specified storage path in the Storage Volume.

The specific storage path rules for the table are as follows:

For older versions (3.1.3 and earlier), the storage path for table data is organized in the following format:

s3://${storage_volume_location}/${cluster_id}/${table_id}/

In intermediate versions (3.1.4 and after 3.2.0), the storage path for table data is organized in the following format:

s3://${storage_volume_location}/${cluster_id}/${table_id}/${partition_id}/

In the latest versions (3.1.8 and after 3.2.3), the storage path for table data is organized in the following format:

s3://${storage_volume_location}/${cluster_id}/db{dbid}/${table_id}/${partition_id}

For example, the Storage Volume information used in our testing environment is as follows:

mysql> DESC STORAGE VOLUME builtin_storage_volume\G
*************************** 1. row ***************************
     Name: builtin_storage_volume
     Type: S3
IsDefault: true
 Location: s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762
   Params: {"aws.s3.region":"us-west-2","aws.s3.use_instance_profile":"true","aws.s3.use_aws_sdk_default_behavior":"false","aws.s3.endpoint":"xxx"}
  Enabled: true

The data for this table will be stored in the path s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762.

Next, we will create an SSB lineorder table in this cluster with the following structure:

CREATE TABLE `lineorder` (
  `lo_orderkey` int(11) NOT NULL COMMENT "",
  `lo_linenumber` int(11) NOT NULL COMMENT "",
  `lo_custkey` int(11) NOT NULL COMMENT "",
  `lo_partkey` int(11) NOT NULL COMMENT "",
  `lo_suppkey` int(11) NOT NULL COMMENT "",
  `lo_orderdate` int(11) NOT NULL COMMENT "",
  `lo_orderpriority` varchar(16) NOT NULL COMMENT "",
  `lo_shippriority` int(11) NOT NULL COMMENT "",
  `lo_quantity` int(11) NOT NULL COMMENT "",
  `lo_extendedprice` int(11) NOT NULL COMMENT "",
  `lo_ordtotalprice` int(11) NOT NULL COMMENT "",
  `lo_discount` int(11) NOT NULL COMMENT "",
  `lo_revenue` int(11) NOT NULL COMMENT "",
  `lo_supplycost` int(11) NOT NULL COMMENT "",
  `lo_tax` int(11) NOT NULL COMMENT "",
  `lo_commitdate` int(11) NOT NULL COMMENT "",
  `lo_shipmode` varchar(11) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`lo_orderkey`)
COMMENT "OLAP"
PARTITION BY RANGE(`lo_orderdate`)
(PARTITION p1 VALUES [("-2147483648"), ("19930101")),
PARTITION p2 VALUES [("19930101"), ("19940101")),
PARTITION p3 VALUES [("19940101"), ("19950101")),
PARTITION p4 VALUES [("19950101"), ("19960101")),
PARTITION p5 VALUES [("19960101"), ("19970101")),
PARTITION p6 VALUES [("19970101"), ("19980101")),
PARTITION p7 VALUES [("19980101"), ("19990101")))
DISTRIBUTED BY HASH(`lo_orderkey`) BUCKETS 1
PROPERTIES (
"replication_num" = "1"
);

Then use the following command to obtain the storage path of the table on S3 (tmp is the database name):

mysql> show proc '/dbs/tmp';
+---------+-----------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------------------------------------------------------------------------------------------------+
| TableId | TableName | IndexNum | PartitionColumnName | PartitionNum | State  | Type         | LastConsistencyCheckTime | ReplicaCount | PartitionType | StoragePath                                                                                                            |
+---------+-----------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------------------------------------------------------------------------------------------------+
| 26019   | lineorder | 1        | lo_orderdate        | 7            | NORMAL | CLOUD_NATIVE | NULL                     | 7            | RANGE         | s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762/632add3b-f7e1-477c-a925-f71aab66bbd2/db26010/26019 |
+---------+-----------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------------------------------------------------------------------------------------------------+

StoragePath field clearly shows the path of the table on S3:

s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762/632add3b-f7e1-477c-a925-f71aab66bbd2/db26010/26019

Parsing the path:

s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762: This part is the prefix path defined in the Storage Volume earlier
632add3b-f7e1–477c-a925-f71aab66bbd2: cluster id, each StarRocks storage and computation separated cluster will generate a globally unique id as part of the path
db26010: db id to which the table belongs
26019: table id

Since the table is a partitioned table with 7 partitions, we can see that there are 7 subdirectories under the table’s storage path:

aws s3 ls s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762/632add3b-f7e1-477c-a925-f71aab66bbd2/db26010/26019/
                           PRE 26012/
                           PRE 26013/
                           PRE 26014/
                           PRE 26015/
                           PRE 26016/
                           PRE 26017/
                           PRE 26018/

Each subdirectory is named after the Partition Id.

mysql> show partitions from lineorder;
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------------------------------------------------------------------------+-----------------+---------+----------+----------+-----------------+------------+-------+-------+-------+
| PartitionId | PartitionName | CompactVersion | VisibleVersion | NextVersion | State  | PartitionKey | Range                                                                   | DistributionKey | Buckets | DataSize | RowCount | EnableDataCache | AsyncWrite | AvgCS | P50CS | MaxCS |
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------------------------------------------------------------------------+-----------------+---------+----------+----------+-----------------+------------+-------+-------+-------+
| 26012       | p1            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [-2147483648]; ..types: [INT]; keys: [19930101]; ) | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26013       | p2            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19930101]; ..types: [INT]; keys: [19940101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26014       | p3            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19940101]; ..types: [INT]; keys: [19950101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26015       | p4            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19950101]; ..types: [INT]; keys: [19960101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26016       | p5            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19960101]; ..types: [INT]; keys: [19970101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26017       | p6            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19970101]; ..types: [INT]; keys: [19980101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
| 26018       | p7            | 0              | 1              | 2           | NORMAL | lo_orderdate | [types: [INT]; keys: [19980101]; ..types: [INT]; keys: [19990101]; )    | lo_orderkey     | 1       | 0B       | 0        | true            | false      | 0.00  | 0.00  | 0.00  |
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------------------------------------------------------------------------+-----------------+---------+----------+----------+-----------------+------------+-------+-------+-------+
7 rows in set (0.21 sec)

There are also three subdirectories under the above partition path: data/, meta/ and log/, used to store different types of files. The following will explain one by one. Before discussing this, we need to briefly describe the multi-version data technology in cloud native.

Multi-Version Data Files

Currently, StarRocks storage and computation separated tables use multi-version data technology, and the overall storage structure is as shown in the following diagram:

In the above diagram, three data import transactions were generated, including:

Load Txn 1: During the transaction data writing phase, new data files file 1 & file 2 were generated, and after this transaction was committed, Tablet Meta V1 was generated, which records the list of visible files for this version as {file-1, file-2}
Load Txn 2: During the transaction data writing phase, new data files file 3 & file 4 were generated. Upon submission, based on the previous version (i.e., Tablet Meta V1) and the new data files generated in this import transaction (file-3 & file-4), a new Tablet Meta V2 was created, and thus the list of visible files for this version is {file-1, file-2, file-3, file-4}
Load Txn 3: During the transaction data writing phase, a new data file file 5 was generated. Upon submission, based on the previous version (i.e., Tablet Meta V2) and the new data file generated in this import transaction (file-5), a new Tablet Meta V3 was created, and thus the list of visible files for this version is {file-1, file-2, file-3, file-4, file-5}

In addition to user-imported transactions generating new data versions, in cloud native, system background Compaction tasks will also generate new data versions. The purpose of Compaction is twofold: 1) to merge multiple versions of small files into large files, reducing the number of random IO operations during queries, 2) to eliminate duplicate data records, reducing the overall data volume.

In StarRocks cloud native, each Compaction will also generate a completely new version. For the example above, if a new transaction Txn 4 after Txn 3 is a Compaction task and merges the 4 files file1 ~ file4 into file-6, then upon submission of this transaction, the newly generated Tablet Meta V4 will record the file list as {file-5, file-6}.

data file

This directory mainly stores Tablet Data files, and the data file format remains consistent with storage and computation together (still Segment files of columnar storage).The data file naming format is:

{txn_id}-{uuid}.dat

After importing a batch of data into the lineorder table, check the files under the data directory of partition 26012:

aws s3 ls s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762/632add3b-f7e1-477c-a925-f71aab66bbd2/db26010/26019/26012/data/
2024-05-28 10:59:46   21651388 000000000000271e_225023df-78e4-4d58-89c9-1ec828c44f98.dat

meta file

This directory mainly stores Tablet Meta files. As mentioned earlier, this file is mainly used to record the data file paths contained in a specific version. The file naming format is:

{tablet_id}_{version}.meta

For example, the following shows:

aws s3 ls s3://starrocks-common/skr2veiad-jeff_io_optimize_test-1712281975762/632add3b-f7e1-477c-a925-f71aab66bbd2/db26010/26019/26012/meta/
2024-05-28 10:14:01        823 00000000000065A5_0000000000000001.meta

log file

This directory mainly stores log files generated during the running of import transactions, which mainly record information such as the data file names generated in this import transaction. After the transaction is successfully completed, this file is deleted. As mentioned earlier, this file is mainly used to record the data file paths contained in a specific version. The file naming format is:

{tablet_id}_{txn_id}.log