Column-oriented database: TSDB Draft (Part 2)

In part 2 we will create a draft for a column-oriented time series database (TSDB). Part 1 explains why going column-oriented is a good idea in some cases and how it is different from a row-oriented approach.

column-oriented approach

On-Disk concept

Our time series database (TSDB) is column-oriented. Therefore we can represent each column as a file. The first column will always be the timestamp. Other attributes can follow.

timestamp       value1     value2
====================================
10:01:00.000 324 34654
11:05:00.000 342 23463
12:17:00.000 446 47232
13:03:00.000 234 60383
14:31:00.000 213 73947
15:14:00.000 342 46286

We make a few very important assumptions here!

Assumptions

First: Entries are sorted by timestamp in ascending order.

Second: Adding new entries always happens at the end of the table. If we always add new entries at the end of the table they are naturally sorted by timestamp.

Third: All our data types are fixed length like integer, float or boolean. We don’t support strings or BLOBs.

Fourth: All entries of one column are of the same type (like in traditional databases).

To make things easy we will use unsigned int (4 bytes) for timestamps and floats (4 bytes) for values in the following post.

Query

The most important part of a query will be selecting on timestamp to limit the amount of data. Remember that everything is sorted by timestamp! Therefore we can use some binary search technique to find the first entry of the result.

Keep in mind that all entries are fixed length so we can calculate the position in every other column for each timestamp!

Big Data?

To support big data we need some kind of partitioning (or segments or sharding). We can use fixed partitions like the day (2014-01-01):

/
2014-01-01/
timestamp.dat
value1.dat
value2.dat
2014-01-02/
timestamp.dat
value1.dat
value2.dat

or dynamic partitions like every N elements and name the partitions as the first included timestamp.

/
10:01:00.000/
timestamp.dat
value1.dat
value2.dat
14:31:00.000/
timestamp.dat
value1.dat
value2.dat

Write concept

We have to write to multiple files simultaneously (timestamp.dat, value1.dat, value2.dat). We musst guarantee that all files have the same “length” (number of entries). Otherwise we could read N entries from timestamp.dat but only N minus M from value1.dat.

To solve this problem we make new assumptions:

Always read from timestamp.dat first. Never read more than length(timestamp.dat) entires from all other files.

Always write to timestamp.dat last.

So we can never run into the problem that length(timestamp.dat) > length(value1.dat). This is far more efficient than some artificial synchronization.

Read concept

To read fast we need to read only what we need. The proposed On-Disk concept allows us to only read the columns we need.

We can further think about compressing the data in each column. If you think of sensor data you can imagine that two data values will not be very different.

We should also ensure that our query engine must not read everything into memory to compute a result. We should make use of streaming algorithms. MapReduce algorithms allows us to read in parallel from multiple IO channels to compute a final result.

The database can rely on the OS to handle IO caching. If the database has to answer queries based on the same files the second query will run much faster because the file is still cached.

Summary

Just using files, directories and a few assumptions we can imagine a quiet powerful time series database. Implementing such a system can be done with modest effort because we rely heavily on the operating system to handle all the hard stuff.