Data Modelling in a Columnar Data Store

Anup Moncy · Published in Data Engineering · 4 min read · Apr 6, 2023

This article introduces columnar data stores and their advantages, lists popular examples, and covers key considerations for modelling data in a columnar data store, along with traditional practices to avoid.

A columnar data store is a type of database that organises physical storage of data by columns instead of rows.

Advantages:

  1. Faster data retrieval and processing times for analytical queries that involve filtering or aggregations. This is because the database engine only needs to read and process the relevant columns, rather than scanning and processing the entire row.
  2. Columnar databases can often compress data more efficiently than row-based databases, resulting in less storage space and lower storage costs.
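The compression advantage comes from storing like values together: a single column often contains long runs of repeated values, which encode into almost nothing. A minimal sketch (with made-up data, using run-length encoding as one representative technique):

```python
# Why column-oriented layout compresses well: values within one column
# are often highly repetitive, so run-length encoding (RLE) collapses
# long runs of equal values into a handful of (value, count) pairs.

def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# A hypothetical "country" column from a large, sorted fact table.
country_column = ["US"] * 5000 + ["DE"] * 3000 + ["IN"] * 2000

encoded = rle_encode(country_column)
print(len(country_column))  # 10000 raw values
print(encoded)              # [('US', 5000), ('DE', 3000), ('IN', 2000)]
```

In a row-based layout the same values would be interleaved with every other column of each row, so runs like these never form and this encoding would buy little.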

Popular examples:

Apache HBase, Amazon Redshift, Google BigQuery, Apache Kudu

Key Considerations:

  1. Denormalize data: Instead of normalizing your data into multiple tables like in traditional databases, denormalize your data into a single table. This allows for faster queries and more efficient compression.
  2. Choose appropriate data types: The data types you choose affect both query performance and compression. For example, fixed-length data types often compress more efficiently.
  3. Data distribution: Columnar data stores are often designed to support parallel processing, where data is distributed across multiple nodes for faster query performance. When modeling data for a columnar data store, it’s important to consider how data will be distributed across nodes and how this will impact query performance. This can be done for example by choosing the right distribution keys.
  4. Data ingestion: In a columnar data store, data ingestion can be time-consuming, especially when data comes from multiple sources or requires complex transformations. Such datasets or columns can be placed in separate tables and combined only at read time. Consider how data will be ingested and how the data model can be optimized to streamline this process.
  5. Partition data: Partitioning can improve query performance by allowing queries to be processed in parallel and irrelevant partitions to be skipped. Partitioning can be based on a variety of criteria, such as date ranges or geographic regions.
  6. Optimize for query patterns: Consider the types of queries that will run on your data and optimize the data model accordingly. For example, if aggregate queries run frequently, consider precomputing those aggregates and storing them in the data model.
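Points 3 and 5 above boil down to two placement decisions made per row. A hypothetical sketch (node count, key names, and partition granularity are all illustrative): a hash-based distribution key spreads rows evenly across nodes, while a date-based partition key groups rows so queries can prune whole partitions.

```python
# Two per-row placement decisions in a distributed columnar store:
# which node a row lives on (distribution) and which partition it
# falls into (pruning). Assumes 4 nodes and monthly partitions.
import hashlib

NUM_NODES = 4

def node_for(distribution_key: str) -> int:
    """Stable hash so the same key always lands on the same node."""
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def partition_for(event_date: str) -> str:
    """Partition by month, e.g. '2023-04-06' -> '2023-04'."""
    return event_date[:7]

row = {"customer_id": "C-1042", "event_date": "2023-04-06"}
print(node_for(row["customer_id"]))      # a node index in 0..3
print(partition_for(row["event_date"]))  # '2023-04'
```

Choosing `customer_id` as the distribution key here would co-locate all of a customer's rows on one node, which helps queries that aggregate per customer but can skew load if a few customers dominate the data.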

What should I do differently while modelling on a columnar data store?

Avoid Normalizing:

Prefer denormalised datasets to dimensional modelling (facts and dimensions). This is because:

  1. Storing all data in a single table reduces the need for complex joins when querying, which can lead to faster query performance, especially with large volumes of data. This was impractical in traditional row-based stores, where retrieving a few columns from a very wide table was expensive.
  2. In a denormalized data model, redundant data is stored in multiple rows, which can be compressed more efficiently than in a dimensional model where the same data might be stored in multiple tables. This can lead to significant storage savings, which can be particularly important in a columnar data store where storage costs can be high.
  3. In a denormalized data model, data is not split across multiple tables, which can make it easier to understand, maintain and to add new columns or attributes to the model without having to modify multiple tables. This can make it easier to adapt to changing business requirements or data sources.
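A toy contrast of the two models (schema and numbers are made up): the dimensional version must join the fact rows to a dimension table at query time, while the denormalized version answers the same question by scanning just the two columns it needs.

```python
# Dimensional model: fact rows reference a separate dimension table,
# so answering "total widget sales" requires a lookup/join per row.
dim_product = {1: "widget", 2: "gadget"}
fact_sales = [(1, 10.0), (2, 25.0), (1, 12.0)]  # (product_id, amount)

total_by_join = sum(
    amount for pid, amount in fact_sales if dim_product[pid] == "widget"
)

# Denormalized model: the product name is repeated on every row.
# A columnar engine reads only the "product" and "amount" columns,
# and the repeated names compress well (see the RLE example above).
sales = [
    {"product": "widget", "amount": 10.0},
    {"product": "gadget", "amount": 25.0},
    {"product": "widget", "amount": 12.0},
]
total_by_scan = sum(r["amount"] for r in sales if r["product"] == "widget")

print(total_by_join, total_by_scan)  # 22.0 22.0
```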

Avoid Surrogate Keys:

Surrogate keys are often used in traditional databases to enforce referential integrity and simplify join operations between tables. In a columnar data store, however, data is typically denormalized into a single table, so surrogate keys offer little benefit.

  • Unnecessary complexity: Surrogate keys can add unnecessary complexity to the data model, making it harder to manage and modify over time, and they carry processing overhead to create and maintain.
  • Use natural keys instead: Columnar data stores often rely on natural keys, which are unique identifiers that already exist within the data, such as a product code or customer ID. This can simplify the data model, improve query performance, and optimize compression.

Other traditional practices to avoid in a columnar data store:

  1. Large numbers of small tables: If you have a large number of small tables, it becomes difficult to denormalize the data effectively, and the resulting data model may be less efficient.
  2. Complex hierarchies: While it’s possible to model complex hierarchies in a columnar data store, doing so can add unnecessary complexity to the data model and may impact query performance. Instead, consider flattening the hierarchy or using a simpler hierarchy that can be easily represented in a single table.
  3. Excessive use of null values: Null values can be problematic in a columnar data store, as they can add overhead to the data model and may impact query performance. Instead, consider using default values or alternative data structures to represent missing data.
  4. Unnecessary data types: In a columnar data store, the data types used for each column can impact query performance and storage efficiency. To optimize performance, avoid using data types that are larger than necessary, such as using an integer data type for a column that only contains small values.
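Point 4 is easy to quantify. A small standard-library illustration (the values are hypothetical): a column whose values all fit in one byte wastes eight times the space if it is declared as a 64-bit integer.

```python
# Storage cost of oversized integer types for a column of small values.
# '<q' packs each value as a standard 8-byte signed int, '<b' as 1 byte.
import struct

values = list(range(100))  # every value fits comfortably in one byte

as_int64 = struct.pack(f"<{len(values)}q", *values)  # 8 bytes per value
as_int8 = struct.pack(f"<{len(values)}b", *values)   # 1 byte per value

print(len(as_int64), len(as_int8))  # 800 100
```

Real columnar engines apply further encodings on top of the base type, but a narrower type still shrinks the starting point for compression and the bytes scanned per query.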
