Column-oriented database: Introduction (Part 1)

4 min readJan 27, 2015

What is a column-oriented database and where can I use it? Before we answer those questions we have a look at traditional databases and the row-oriented approach.

row-oriented approach

If we instantiate an object or save a tuple we use a row-oriented approach.

We will use an example of persons in the following post. Each person has the attributes name, sex and age. Our data set looks like this:

name          sex       age
==============================
Alex          male      26
Bettina       female    22
Clara         female    23
Dieter        male      28
Emil          male      29
Frederike     female    27

How is this data represented in memory? Each character is 1 byte (simplified!)

address 00 |  A l e x \0 m a l e \0 26
address 11 |  B e t i n a \0 f e m a l e \0 22
address 26 |  C l a r a \0 f e m a l e \0 23
address 40 |  D i e t e r \0 m a l e \0 28
address 53 |  E m i l \0 m a l e \0 29
address 64 |  F r e d e r i k e \0 f e m a l e \0 27

\0 is the Null character and terminates a string.

Let’s focus on two aspects. Predictability and Locality.

Predictability

Can we predict the address for each tuple? Can we predict the address within a tuple for each attribute? In our example we can’t predict the address where we will find Dieter without looking at all the data. Tuples with a variable length prevent such predictions.

Locality

Are the attributes of two tuples next to each other? Computing the average age of our persons we need to look at address 10, 25, 39, 52 and 63. That’s a problem because we reduce the probability to read the data from memory cache or the same block on disk. We further complicate any forecasts of the CPU what memory area we need next.

On-Disk representation

File data.txt
    Alex;male;26
    Betina;female;22
    Clara;female;23
    Dieter;male;28
    Emil;male;29
    Frederike;female;27

To calculate the average age we need to read the complete file data.txt. Although we are only interested in the age attribute.

column-oriented approach

A different approach is column-oriented. You can compare the approach with an array holding the data for each attribute (column).

How is our data represented in memory now?

address 00 |  A l e x \0 B e t i n a \0 C l a r a \0 D i e t e r \0 E m i l \0 F r e d e r i k e \0
address 40 |  m a l e \0 f e m a l e \0 f e m a l e \0 m a l e \0 m a l e \0 f e m a l e \0
address 77 |  26 22 23 28 29 27

Predictability

We still can’t guess the address where we find Dieter. But if we have found Dieter (Dieter is at array index 3) we can easily calculate where we can find his age by adding 77 + 3= 80 => 28. If we deal with fixed length types (integer, float, boolean) we can predict exactly where we find our data.

Locality

To calculate the average age we only need to look at the memory block 77–82. The probability to hit the cache (or block) is very high during iteration.

On-Disk representation

The data set can be persisted to disk like this:

File name.txt
    Alex
    Betina
    Clara
    Dieter
    Emil
    Frederike
File sex.txt
    male
    female
    female
    male
    male
    female
File age.txt
    26
    22
    23
    28
    29
    27

To calculate the average age we only need to read file age.txt. We only read data that we need to calculate our result!

Summary

A column-oriented approach increases performance. We need to read less. The probability of cache hits are higher. If we are dealing with fixed length types we can easily combine multiple attributes.

TimeSeries.Guru is a column-oriented Time Series Database as a Service
MonetDB is an Open Source column-oriented database
Wikipedia

Part 2

In part 2 we will create a draft for a column-oriented time series database (TSDB).

Column-oriented database: Introduction (Part 1)

row-oriented approach

Predictability

Locality

On-Disk representation

column-oriented approach

Predictability

Locality

On-Disk representation

Summary

Part 2

Written by Michael Wittig