Understanding C-Store: Columnar Databases and Data Intensive Analytics
From what I understand, the first prototype for a columnar database was introduced in a 2005 paper from MIT’s school of computer science and artificial intelligence, called C-Store: A Column Oriented Database. (Stonebraker et al). MIT CSAIL.
The key idea from this paper seems to be that, in columnar databases we store each column’s values sequentially in a database, as opposed to storing each row’s values sequentially. Let’s try to visualize this idea.
Consider the following examples given on Amazon Redshift’s Database Developer Guide.
Some points to note regarding the example above:
- Notice that each block has values of different types (string, integer, etc).
- What happens if the row size is greater than a block? We will need to read more than a block to read a single row.
- Similarly, what happens if the row size is smaller than a block? In this case we will have inefficient use of disk space.
- Note that a data block is the smallest unit of data used by a database. See this link for more information: Introduction to Data Blocks. Oracle.
- This is a ‘write-optimized’ database as it has a very low cost to write a new record to disk. The DB can easily add a new block with the record’s data.
Some points to note regarding the example above:
- Notice that each block has the same data type. This turns out to be a significant advantage as the DB can use compression techniques to significantly reduce the disk usage and I/O.
- This is a ‘read optimized’ database, particularly in a big data analytics context (which is what we use the large datasets in our columnar database for). Consider how most queries only require a few of the column values. So when we have a large dataset, with many rows and columns, only retrieving the columns that we need to execute our analytics query is substantially more efficient than retrieving all the records in the database.
The original CSAIL paper notes that write optimized database systems are naturally ideal write heavy applications (such as Online Transaction Processing applications) whereas read optimized database systems are naturally ideal for read heavy applications (such as Data Warehouses).
It should be evident why columnar databases are well suited for big data analytics architectures. They offer (i) significant data compression, and (ii) highly optimized read operations that are typically used for analytics jobs. My Director, Dan Woicke wrote about the value of using a columnar database to our team’s work here: Cerner Advances Big Data Analytic Capabilities. Dan Woicke. CIO Review. There are more features that columnar databases offer which you can explore in the additional reading section below.
Some additional reading,
(1) C-Store: A Columnar Database. Ameya. Medium.
(2) Rowwise vs Columnar Database? Theory and in Practice. Modi. Medium.
(3) The beauty of column oriented data. Zaks, Maxim. Towards Data Science.
(4) Column Oriented Database Systems. Stavros Harizopoulous. UC Davis Presentation.
(5) Integrating Compression and Execution in Column-Oriented Database Systems. Abadi et Al. MIT.
(6) The Vertica Analytic Database: C-Store 7 Years Later. (Lamb et al). Vertica.