Understanding Normalized vs. Denormalized Schemas in Data Warehousing

shubham badaya
4 min read · Jun 1, 2024

While working on an analytics project and using data from table X, have you ever wondered why data from table Y is not included in table X, and why only certain fields are kept in table X?

Or have you ever wondered how data flows into table X?

The flow of data into table X is a separate topic. In this article, we will take a high-level look at how data is stored.

You might have heard many terms about data storage, like facts, dimensions, partitioning, normalization, denormalization, third normal form, and more. This can be very confusing and create a barrier to learning.

I felt the same way while learning about data modeling. In a series of articles, I will do my best to explain these concepts in the simplest way possible.

In this article, we will discuss two concepts we have to deal with when storing data: normalized and denormalized schemas.

When designing a data warehouse, one of the critical decisions to make is whether to use a normalized or denormalized schema. Both approaches have their advantages and are suited to different scenarios. Here’s an in-depth look at normalized vs. denormalized schemas:

Normalized Schema

Definition: Normalization is the process of organizing data to minimize redundancy, meaning there should be no unnecessary or duplicate data in a table.

In a normalized schema, data is divided into multiple related tables, each describing a single concept (such as customers, orders, or sellers).

Example: Consider the sales database below. Instead of a single table, there is one customer table, one orders table, and one seller table.

Table: Customer

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| customer_id   | int     |
| customer_name | varchar |
+---------------+---------+
customer_id is the primary key for this table.
Each row of this table contains the information of one customer of the web store.

Table: Orders

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| order_id      | int     |
| sale_date     | date    |
| order_cost    | int     |
| customer_id   | int     |
| seller_id     | int     |
+---------------+---------+
order_id is the primary key for this table.
Each row of this table contains one order made in the web store.
sale_date is the date when the transaction was made between the customer (customer_id) and the seller (seller_id).

Table: Seller

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| seller_id     | int     |
| seller_name   | varchar |
+---------------+---------+
seller_id is the primary key for this table.
Each row of this table contains the information of one seller.
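
To make this structure concrete, here is a minimal SQL sketch of the normalized schema. The table and column names follow the examples above; the exact types and constraints are illustrative assumptions rather than anything prescribed.

-- Customer: one row per customer, no order or seller details here.
CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100)
);

-- Seller: one row per seller.
CREATE TABLE seller (
    seller_id   INT PRIMARY KEY,
    seller_name VARCHAR(100)
);

-- Orders: one row per order, linked to the customer and the seller by
-- foreign keys instead of repeating their names on every row.
CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    sale_date   DATE,
    order_cost  INT,
    customer_id INT REFERENCES customer(customer_id),
    seller_id   INT REFERENCES seller(seller_id)
);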

Key Characteristics:

  1. Reduction of Redundancy: Data is stored in its most atomic form, so there is little or no duplication.
  2. Complex Queries: Since data is spread across multiple tables, retrieving it often requires more complex queries with multiple joins (see the query sketch after this list).
  3. Storage Efficiency: Generally requires less storage because of reduced data duplication.
  4. Update Efficiency: Updates are more efficient since data is not duplicated across multiple locations. Imagine if order information were kept in both the customer table and the orders table; we would have to update it in both places.
  5. Data Integrity: Easier to maintain data integrity and enforce referential integrity. It also keeps concerns separate: for example, you can share the customer table with another team without also giving them access to orders 😉.
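
For example, answering a simple question such as "how much did each customer spend with each seller?" already needs two joins in this normalized schema. A sketch, using the assumed tables from the example above:

-- Total amount each customer spent with each seller:
-- the orders table must be joined to both other tables.
SELECT c.customer_name,
       s.seller_name,
       SUM(o.order_cost) AS total_spent
FROM orders o
JOIN customer c ON o.customer_id = c.customer_id
JOIN seller   s ON o.seller_id   = s.seller_id
GROUP BY c.customer_name, s.seller_name;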

When to Use:

  • When data integrity is crucial.
  • When updates, inserts, and deletes are frequent.
  • When storage efficiency is a priority.
  • In OLTP (Online Transaction Processing) systems, for example in banking.

Denormalized Schema

Definition: Denormalization is the process of combining tables to reduce the number of joins needed in queries, often resulting in some redundancy.

Example: Instead of the three tables in the sales data warehouse above, we might have a single big table containing all the details.
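
As a rough sketch (the exact column set is an assumption derived from the three tables above), such a single big table might look like this:

-- One wide table: customer and seller details are repeated on every order row.
CREATE TABLE sales_denormalized (
    order_id      INT PRIMARY KEY,
    sale_date     DATE,
    order_cost    INT,
    customer_id   INT,
    customer_name VARCHAR(100),  -- duplicated for every order by this customer
    seller_id     INT,
    seller_name   VARCHAR(100)   -- duplicated for every order from this seller
);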

Key Characteristics:

  1. Query Performance: Improved read performance, since most questions can be answered without expensive joins.
  2. Redundancy: Data duplication is common, which can lead to inconsistencies if not managed carefully.
  3. Simplified Queries: Queries are simpler and often faster because they require fewer joins (see the sketch after this list).
  4. Increased Storage: Requires more storage due to duplicated data.
  5. Complex Updates: Updates are more complex and prone to anomalies, since multiple copies of the same data need to be kept in sync.
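
The trade-off shows up directly in the SQL. A sketch against the assumed sales_denormalized table above: the same report as before now needs no joins at all, but renaming a single customer has to touch every one of their order rows (the customer_id and new name below are just placeholders).

-- Read side: the spend-per-customer-per-seller report, with no joins.
SELECT customer_name,
       seller_name,
       SUM(order_cost) AS total_spent
FROM sales_denormalized
GROUP BY customer_name, seller_name;

-- Write side: one customer changes their name, and every one of their
-- order rows must be updated to keep the copies consistent.
UPDATE sales_denormalized
SET customer_name = 'New Name'
WHERE customer_id = 42;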

When to Use:

  • When read performance is a priority.
  • In data warehousing and OLAP (Online Analytical Processing) systems like Hive, where read-heavy operations are common.
  • When complex reporting and data retrieval are required.

Comparison

A quick summary of the differences discussed above:

+------------------+-----------------------------+------------------------------+
| Aspect           | Normalized                  | Denormalized                 |
+------------------+-----------------------------+------------------------------+
| Redundancy       | Minimal                     | Common                       |
| Query complexity | More joins, complex queries | Fewer joins, simpler queries |
| Read performance | Slower for analytics        | Faster for analytics         |
| Storage          | Less                        | More                         |
| Updates          | Efficient, consistent       | Complex, prone to anomalies  |
| Typical use      | OLTP (e.g. banking)         | OLAP / data warehousing      |
+------------------+-----------------------------+------------------------------+

Conclusion

Choosing between a normalized and denormalized schema depends on the specific needs of your data warehouse:

  • Normalized Schema: Best suited for environments where data integrity and storage efficiency are paramount, and where the workload involves frequent updates, such as banking.
  • Denormalized Schema: Best suited for environments where read performance and query simplicity are critical, and where the workload involves complex reporting and analytics, such as telecom.

In many real-world scenarios, a hybrid approach is often used, where critical transactional data is kept normalized to ensure integrity, and non-critical, read-heavy data is denormalized to optimize performance.
