Column-oriented database: Introduction (Part 1)
What is a column-oriented database and where can I use it? Before we answer those questions we have a look at traditional databases and the row-oriented approach.
If we instantiate an object or save a tuple we use a row-oriented approach.
We will use an example of persons in the following post. Each person has the attributes name, sex and age. Our data set looks like this:
name sex age
Alex male 26
Bettina female 22
Clara female 23
Dieter male 28
Emil male 29
Frederike female 27
How is this data represented in memory? Each character is 1 byte (simplified!)
address 00 | A l e x \0 m a l e \0 26
address 11 | B e t i n a \0 f e m a l e \0 22
address 26 | C l a r a \0 f e m a l e \0 23
address 40 | D i e t e r \0 m a l e \0 28
address 53 | E m i l \0 m a l e \0 29
address 64 | F r e d e r i k e \0 f e m a l e \0 27
\0 is the Null character and terminates a string.
Let’s focus on two aspects. Predictability and Locality.
Can we predict the address for each tuple? Can we predict the address within a tuple for each attribute? In our example we can’t predict the address where we will find Dieter without looking at all the data. Tuples with a variable length prevent such predictions.
Are the attributes of two tuples next to each other? Computing the average age of our persons we need to look at address 10, 25, 39, 52 and 63. That’s a problem because we reduce the probability to read the data from memory cache or the same block on disk. We further complicate any forecasts of the CPU what memory area we need next.
To calculate the average age we need to read the complete file data.txt. Although we are only interested in the age attribute.
A different approach is column-oriented. You can compare the approach with an array holding the data for each attribute (column).
How is our data represented in memory now?
address 00 | A l e x \0 B e t i n a \0 C l a r a \0 D i e t e r \0 E m i l \0 F r e d e r i k e \0
address 40 | m a l e \0 f e m a l e \0 f e m a l e \0 m a l e \0 m a l e \0 f e m a l e \0
address 77 | 26 22 23 28 29 27
We still can’t guess the address where we find Dieter. But if we have found Dieter (Dieter is at array index 3) we can easily calculate where we can find his age by adding 77 + 3= 80 => 28. If we deal with fixed length types (integer, float, boolean) we can predict exactly where we find our data.
To calculate the average age we only need to look at the memory block 77–82. The probability to hit the cache (or block) is very high during iteration.
The data set can be persisted to disk like this:
To calculate the average age we only need to read file age.txt. We only read data that we need to calculate our result!
A column-oriented approach increases performance. We need to read less. The probability of cache hits are higher. If we are dealing with fixed length types we can easily combine multiple attributes.
- TimeSeries.Guru is a column-oriented Time Series Database as a Service
- MonetDB is an Open Source column-oriented database