Analysis of the Storage Mechanism in InfluxDB
Analysis of the design of time series data storage and indexing in InfluxDB
This article describes the design of time series data storage and indexing in InfluxDB. Because the clustering version of InfluxDB is no longer open source from version 0.12 onward, the content described in this article relates to standalone InfluxDB unless otherwise specified.
1. Evolution of the InfluxDB Storage Engine
In the three years since the release of InfluxDB, the architecture of its storage engine has undergone several significant changes. This article will briefly describe the evolution process of the InfluxDB storage engine.
1.1. A Brief Evolution History
- Earlier than version 0.9.0
**LevelDB-based LSMTree scheme**
- Version 0.9.0–0.9.4
**BoltDB-based mmap COW B+tree scheme**
- Version 0.9.5–1.2
**WAL + TSM file-based scheme** (TSM files became generally available in version 0.9.6; version 0.9.5 only provided the prototype)
- Version 1.3–present
**WAL + TSM file + TSI file-based scheme**
1.2. Evolution Considerations
InfluxDB has tried to use different storage engines, including LevelDB and BoltDB. However, these storage engines cannot perfectly support the following requirements inside InfluxDB:
- After downsampling, time series data will be deleted in large batches.
=> *With LSMTrees inside LevelDB, the cost of deleting data is too high*
- Storing large amounts of data in a standalone environment must not occupy too many file handles.
=> *In LevelDB, lots of small files will be generated over time*
- Data storage requires hot backup.
=> *LevelDB only supports cold backup*
- High throughput is required in big data scenarios.
=> *The B+tree write operation inside BoltDB has a throughput bottleneck*
- Storage requires excellent compression performance.
=> *BoltDB does not support compression*
In addition, for the consistency of the technology stack and the simplicity of deployment (container-based deployment), the InfluxDB team wanted the storage engine, like the upper-layer TSDB engine, to be written in Go. Therefore, the potential RocksDB option was excluded.
To solve the preceding pain points, the InfluxDB team decided to write their own storage engine.
2. InfluxDB Data Model
Before explaining the storage engine inside InfluxDB, let’s first review the data model in InfluxDB.
In InfluxDB, time series data supports multi-value models. The following shows a piece of typical data at a specific point in time.
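As a concrete illustration, a single point written in InfluxDB's line protocol (with hypothetical values) combines all four parts of the model:

```
cpu,host=server01,region=us-west usage_user=23.5,usage_system=7.2 1465839830100400200
```

Here `cpu` is the measurement, `host` and `region` are tags, `usage_user` and `usage_system` are two fields (the multi-value model), and the trailing number is the nanosecond timestamp.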
A measurement is a metric object (namely, a data source object). Each measurement can have one or more metric values (namely, the fields described later). In practice, a real object to be measured (for example, a CPU) can be defined as a measurement.
Tags are conceptually similar to tags in most time series databases. Generally, tags can uniquely identify data sources. The key and value of each tag must be strings.
A field records the actual metric value. Each metric is called a “field”, and the metric value is the value of that field.
A timestamp is the data timestamp. In InfluxDB, a timestamp can be accurate down to nanoseconds (ns).
Additionally, InfluxDB has the database concept (in a traditional DBMS) on top of measurement. Logically, a database can have multiple measurements beneath. In the standalone implementation of InfluxDB, each database actually corresponds to a directory under the file system.
2.1. SeriesKey
A SeriesKey in InfluxDB is similar to the timeline concept in common time series databases. In memory, a SeriesKey is represented as a byte array concatenating the measurement and the tag pairs (with commas and spaces escaped).
A SeriesKey cannot be longer than 65,535 bytes.
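The concatenation and escaping described above can be sketched as follows. `buildSeriesKey` and `escapeTag` are hypothetical helpers for illustration, not InfluxDB's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// escapeTag escapes commas, spaces, and equals signs, mirroring the
// line-protocol escaping that SeriesKeys inherit.
func escapeTag(s string) string {
	r := strings.NewReplacer(",", `\,`, " ", `\ `, "=", `\=`)
	return r.Replace(s)
}

// buildSeriesKey concatenates the measurement with sorted tag pairs:
// measurement,tagK1=tagV1,tagK2=tagV2
func buildSeriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys) // tags appear in sorted order inside a SeriesKey
	var b strings.Builder
	b.WriteString(escapeTag(measurement))
	for _, k := range keys {
		b.WriteString("," + escapeTag(k) + "=" + escapeTag(tags[k]))
	}
	return b.String()
}

func main() {
	key := buildSeriesKey("cpu", map[string]string{
		"host":   "server 01", // the space must be escaped
		"region": "us-west",
	})
	fmt.Println(key)
}
```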
2.2. Supported Field Types
A field value in InfluxDB can be one of the following data types: float (float64), integer (int64), string, and boolean.
In InfluxDB, the field data type must remain unchanged within the following scope; otherwise, a type conflict error is reported during data writing:
The same SeriesKey + the same field + the same shard
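The scope of this rule can be sketched with a small registry keyed by SeriesKey + field + shard. This is an illustrative sketch of the rule only, not InfluxDB's actual implementation:

```go
package main

import "fmt"

// fieldTypeRegistry records the first-seen type of each field. The key
// combines SeriesKey, field name, and shard ID, matching the scope over
// which InfluxDB requires a field's type to stay fixed.
type fieldTypeRegistry map[string]string

// checkWrite returns an error when a write would change the field's type
// within the same SeriesKey + field + shard.
func (r fieldTypeRegistry) checkWrite(seriesKey, field string, shardID int, typ string) error {
	k := fmt.Sprintf("%s#%s#%d", seriesKey, field, shardID)
	if prev, ok := r[k]; ok && prev != typ {
		return fmt.Errorf("field type conflict: %s is %s, got %s", field, prev, typ)
	}
	r[k] = typ
	return nil
}

func main() {
	reg := fieldTypeRegistry{}
	fmt.Println(reg.checkWrite("cpu,host=server01", "usage_user", 1, "float"))
	fmt.Println(reg.checkWrite("cpu,host=server01", "usage_user", 1, "integer")) // conflict
	fmt.Println(reg.checkWrite("cpu,host=server01", "usage_user", 2, "integer")) // new shard: allowed
}
```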
2.3. Retention Policy and Shards
In InfluxDB, each database has a default retention policy (RP), and additional RPs can be defined. An RP describes how long a database keeps time series data (the duration). The shard concept is derived from the duration: once the duration of a database is determined, the time series data within that duration is further sharded by time so that it can be stored as shards.
The following table shows the relation between the shard duration and the duration of an RP:

| RP duration | Shard duration |
| --- | --- |
| < 2 days | 1 hour |
| >= 2 days and <= 6 months | 1 day |
| > 6 months | 7 days |
By default, if no RP is explicitly specified for a newly created database, the duration of the RP is infinite and the shard duration is 7 days.
Note: In the closed-source clustering version of InfluxDB, users can use RPs to specify that data is further sharded by SeriesKey after it has been sharded by time.
3. Analysis of the Storage Engine Inside InfluxDB
A storage engine inside a time series database should be able to meet the performance requirements of the following three scenarios:
1. Write time series data in large batches.
2. Scan data in a specified timestamp range directly according to a timeline (SeriesKey in InfluxDB).
3. Query all the matching time series data in a specified timestamp range indirectly by using measurements and some tags.
Based on the considerations in Section 1.2, the InfluxDB team developed their own solution, the WAL + TSM file + TSI file solution, which is described in the following sections.
3.1. WAL
Like many other databases, InfluxDB first writes time series data into the WAL and a cache, and flushes it to disk later, to ensure data integrity and availability. The following figure shows the main procedure for InfluxDB to write time series data.
In InfluxDB, SeriesKey data and time series data are separated and written into different WALs. The structure is shown as follows:
WALs of Index Data
InfluxDB supports deleting measurements, tag keys, and tag values, and new SeriesKeys are added as more time series data is written. Each entry in the index-data WAL therefore records its operation type. The structure of the index-data WAL is shown in the following figure.
WALs of Time Series Data
Because InfluxDB only ever appends time series data, entries do not need to distinguish operation types and simply record the written data.
3.2. TSM File
TSM files are the storage solution for time series data in InfluxDB. At the file system level, each TSM file corresponds to a shard.
The storage structure of TSM files is shown in the following figure.
In a TSM file, time series data (namely, timestamp + field value pairs) is stored in the data partition. SeriesKeys and field names are stored in the index partition, where an in-file index built on SeriesKey + FieldKey, similar to a B+ tree, is used to quickly locate the data blocks holding the time series data.
Note: In the current version, the maximum size of a single TSM file is 2 GB. When the actual size exceeds this limit, a new TSM file is created to save the data even if the data belongs to the same shard. This article does not describe TSM file splitting for the same shard, which is more complicated.
Structure of index blocks
The structure of an index block is as follows:
The **Index entry** is called `directIndex` in the InfluxDB source code. In a TSM file, index blocks are organized **sorted** by SeriesKey + FieldKey.

Knowing the structure of the index partition of a TSM file, we can easily understand how InfluxDB efficiently scans time series data in a TSM file:

1. The specified SeriesKey and field name are used to perform a binary search in the **index partition** to locate the **index data block** for that SeriesKey + FieldKey.
2. InfluxDB searches the **index data block** by using the specified timestamp range to locate **index entries** where the matching data is.
3. InfluxDB loads the corresponding **time series data blocks** of the matching **index entries** into the memory for further scans.*Note: The three preceding steps are just a brief description of the query mechanism. Practical implementations may include many complicated scenarios, for example, the time range of a scan may span across index blocks.*
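The three steps above can be sketched as follows. The structures and the `findBlocks` helper are simplified illustrations, not InfluxDB's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry points at one time series data block and records the time range
// it covers (a simplification of a TSM index entry).
type indexEntry struct {
	minTime, maxTime int64
	offset           int64 // where the data block lives in the file
}

// indexBlock groups the entries for one SeriesKey+FieldKey.
type indexBlock struct {
	key     string // SeriesKey + FieldKey
	entries []indexEntry
}

// findBlocks performs the lookup: binary-search the sorted index blocks by
// key (step 1), then keep the entries whose time range overlaps [minT, maxT]
// (step 2). The caller would then load the matching data blocks (step 3).
func findBlocks(index []indexBlock, key string, minT, maxT int64) []indexEntry {
	i := sort.Search(len(index), func(i int) bool { return index[i].key >= key })
	if i == len(index) || index[i].key != key {
		return nil
	}
	var out []indexEntry
	for _, e := range index[i].entries {
		if e.maxTime >= minT && e.minTime <= maxT { // time ranges overlap
			out = append(out, e)
		}
	}
	return out
}

func main() {
	index := []indexBlock{
		{key: "cpu,host=a#usage", entries: []indexEntry{{0, 99, 1024}, {100, 199, 2048}}},
		{key: "cpu,host=b#usage", entries: []indexEntry{{0, 99, 4096}}},
	}
	fmt.Println(findBlocks(index, "cpu,host=a#usage", 150, 300))
}
```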
- Storage of time series data
Figure 2 shows the structure of time series data: all the timestamp-value pairs of the same SeriesKey + FieldKey are split into two partitions, Timestamps and Values. This storage mode allows different compression algorithms to be used for timestamps and field values in actual storage scenarios, reducing the size of time series data blocks.
The compression algorithms used in InfluxDB are as follows:
- Timestamp: Delta-of-delta encoding
- Field Value: Because field values in a single data block are all of the same data type, a type-specific compression algorithm can be used for each data type.
- Float: the Float compression algorithm of Gorilla
- Integer: Delta Encoding + Zigzag Conversion + RLE / Simple8b / None
- String: Snappy Compression
- Boolean: Bit packing
When querying data, after the time series data blocks are located by using the index in a TSM file, the blocks are loaded into memory, and the Timestamps and Field Values are decompressed for subsequent processing.
3.3. TSI File
TSM files can perfectly meet the requirements of Scenario 1 and Scenario 2 described at the beginning of Section 3. However, how does InfluxDB maintain query performance when a user specifies more complex query criteria (measurements and some tags) instead of a SeriesKey? Generally, an inverted index can be used to solve this problem.
An inverted index inside InfluxDB relies on the following two data structures:

map<SeriesID, SeriesKey>
map<tagkey, map<tagvalue, List<SeriesID>>>
They are represented as follows in memory:
However, in an actual production environment, because the number of SeriesKeys can grow very large, an inverted index may consume too much memory. That is why TSI files were introduced in InfluxDB.
The overall storage mechanism of TSI files is similar to that of TSM files. Like a TSM file, a TSI file is generated for each shard. The storage structure is not described here.
The preceding content is a brief analysis of the storage mechanism in InfluxDB. Since I have only examined standalone InfluxDB, I do not know whether the clustering version differs in its storage design. Nevertheless, we can still learn a lot from the storage mechanism of standalone InfluxDB.