made by me

Deciding between Row- and Columnar-Stores | Why We Chose Both

Alexa Griffith
Bluecore Engineering
8 min readAug 10, 2020

--

Deciding which database to use for a service requires some understanding of how different databases store data, especially at scale. Choosing between a database that stores data as rows or columns, for example, has a big impact on the database performance. For this reason, certain access patterns are better suited for row oriented databases (row-stores), and others for column oriented databases (columnar-stores). At Bluecore, our production infrastructure for analytics includes both Google CloudSQL (row-store) and BigQuery (columnar-store) in our analytics pipeline. We carefully chose these databases after weighing the tradeoffs of each and analyzing our access patterns. Since we process data for the many millions of messages we send each day, choosing the right database for our applications is vital.

Why does this matter?

columnar-stores store data by columns in storage

There are two main types of storage: hard disk drives (HDDs) and solid-state drives (SSDs). We store the database in storage. Data is written to storage in contiguous chunks. We won’t get into the difference in pages, sectors, and blocks here because it is only important to know that the data is grouped together and written/read according to its location in storage. In row oriented databases, these chunks contain a row consisting of all of its column values as a tuple. In column oriented databases, these chunks contain only the values in each row that belongs to that column. Given a certain amount of space on an SSD or HDD, a chunk of rows/columns data is read at once. This difference has a significant impact on performance.

row-stores store data by rows in storage

If the data you need to access is stored mostly in a small number of columns and it is not necessary to query each field in the rows, you may be better off with a columnar-store. On the other hand, if you need many columns in each row to determine which rows are relevant, a row-store may be a better fit. There are a few more distinctions between row and column oriented databases that are important to discuss. (1, 2)

Row oriented databases

Row-stores are considered “traditional” because they have been around longer than columnar-stores. Most row oriented databases are commonly known for OLTP (online transactional processing). This means that row-stores are most commonly known to perform well for a single transaction like inserting, updating, or deleting relatively small amounts of data. (3)

Writing one row at a time is easy for a row-store because it appends the whole row to a chunk of space in storage. In other words, the row oriented database is partitioned horizontally. Since each row occupies at least one chunk (a row can take up more than one chunk if it runs out of space), and a whole chunk of storage is read at a time, this makes it perfect for OLTP applications where a small number of records are queried at a time. (4)

row-stores save to storage in chunks

Using a row-store

Row-stores (ex: Postgres, MySQL) are beneficial when most/all of the values in the record (row) need to be accessed. Row oriented databases are also good for point lookups and index range scans. Indexing (creating a key from columns) based on your access patterns can optimize queries and prevent full table scans in a large row-store. If the value needed is in the index, you can pull it straight from there. Indexing is an important component of row-stores because while columnar-stores also have some indexing mechanisms to optimize full table scans, it is not as efficient for reducing seek time for individual record retrieval than an index on the appropriate columns. Note that creating many indices will create many copies of data, and a columnar-store is a better alternative (see When to Enable Indexing?). (5,6)

row-store concept overview

If only one field of the record is desired, then using a row-store becomes expensive since all the fields in each record will be read. Even data that isn’t needed for the query response will be read, assuming it isn’t indexed properly. Consequently, many seek operations are required to complete the query. For this reason, a columnar-store is favored when you have unpredictable access patterns, whereas known access patterns are well accommodated by a row-store. (7)

Column oriented databases

As more records in a database are accessed, the time to transfer data from disk to memory starts to outweigh the time it takes to seek the data. For this reason, columnar-stores are typically better for OLAP (online analytical processing) applications. Analytical applications often need aggregate data, where only a subset of a table’s attributes are needed. (8)

Column oriented databases are partitioned vertically — instead of storing the full row, the individual values are stored contiguously in storage by column. The advantage of a columnar-store is that partial reads are much more efficient because a lower volume of data is loaded due to reading only the relevant data instead of the whole record.

For example, if a chunk of storage can hold five values and the database has five columns (ex: one row has five values), one row will take up one chunk and be read together. If only one column value is needed for the query response, a columnar-store can read 5x as fast because you will read five column values in one chunk as opposed to one column value in the chunk containing the row. You also avoid reading the other column values that are irrelevant to the query response.

columnar-stores save columns to storage in chunks

Additionally, in column-stores compression is achieved more efficiently than in row-stores because columns have uniform types (ex: all strings or integers). These performance benefits apply to arbitrary access patterns, making them a good choice in the face of unpredictable queries. (8)

Using a columnar-store

example database

Columnar-stores (examples: RedShift, BigQuery) are good for computing trends and averages for trillions of rows and petabytes for data. (8)

Assuming this table continues for millions of rows, what if we wanted to know the sum of the amount spent on online purchases for company A? Well, for company A’s online purchases table, we would need to sum all of the online purchase values. Instead of going through each row and reading the email, type of purchase, and any other columns this table could have, we just need to access all of the values in the “amount” column.

columnar-store concept overview

Practical use

In our analytics pipeline, we use both Google CloudSQL (row-store) and BigQuery (columnar-store). Although the “OLTP” part of our application (row oriented database) is used for reporting data, the analytics pipeline is a good example of how OLTP and OLAP applications typically use the same dataset but different access patterns. The same data is necessary for the result, but the difference in their access patterns/use cases necessitates two different database structures. The analytics data stored in CloudSQL is regularly transferred to BigQuery for use with Looker.

Columnar-store: BigQuery

For part of our analytics pipeline, Looker dashboards are available to our customers to query and analyze their data. For Looker, BigQuery is our preferred choice because we need to aggregate large amounts of data by a specific subset of columns. Additionally, our customers can create their own queries as they wish. Being able to support ad hoc querying is an advantage of column oriented databases. BigQuery bandwidth is great — it scales automatically, allowing for querying of large amounts (petabytes) of data. This is the bandwidth we need since many of our Looker dashboards compare years’ worth of data at a time. (9)

Row-store: CloudSQL

We use MySQL with Google CloudSQL in another part of our analytics pipeline where we provide dashboards to our customers that detail the performance metrics of individual email send types. These preconfigured dashboards allow us to take advantage of fast access via indices and reduce the amount of data scanned for each lookup. A small amount of data is involved in each dashboard, and the queries are known ahead of time. In this use case, the dashboards are more for reporting data than aggregating it. Additionally, being fully managed by Google, just like BigQuery, is advantageous because the maintenance of the database is taken care of by Google. For CloudSQL in particular, the replica and transactional support are helpful, especially for easy rollbacks.

Conclusion

We discussed the differences in column and row oriented databases.

comparing row- and columnar-store characteristics

For analytics, both our CloudSQL and BigQuery databases contain the same data stored in different ways. This is done to accommodate the different access patterns, which impacts performance. We can build better services by weighing the tradeoffs between response latency and data volume and understanding the difference between columnar- and row-stores. When clients rely on our service for data and analytics vital to their company, choosing the right database can save a lot of headaches.

Resources:

(1)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(relevant)

--

--

Alexa Griffith
Bluecore Engineering

Software Engineer at Bloomberg. All opinions are my own! :)