Data Modelling: Normalization and Denormalization

Ajayi Ayodeji
6 min read · Apr 25, 2024


Data modelling is a foundational concept in data engineering. Without a good model in place, there can't be data integrity, and ultimately the insights derived from such data cannot be relied upon, making the entire end-to-end effort fruitless. Data integrity is simply the accuracy and consistency of data. This article discusses two of the most significant concepts in relational data modelling: normalization and denormalization.

To help you understand better, I'll briefly unpack the concept of keys in a relational database (a schema sketch follows this list):

Primary Key: A single column used to uniquely identify each row in a table.

Foreign Key: A set of attributes in a table that references the primary key of another table, linking these two tables.

Composite Key: A primary key that comprises two or more columns.

Candidate Key: Any set of columns with a unique combination of values in each row, with the additional constraint that removing any column would produce duplicate combinations of values.
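To make these key types concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (artist, album, album_song, track_no, and so on) are illustrative assumptions rather than anything prescribed by this article.

```python
import sqlite3

# Minimal sketch (in-memory SQLite) illustrating the key types above.
# All table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (
    artist_id   INTEGER PRIMARY KEY,    -- primary key: uniquely identifies each row
    artist_name TEXT NOT NULL UNIQUE    -- also a candidate key: unique on its own
);

CREATE TABLE album (
    album_id   INTEGER PRIMARY KEY,
    artist_id  INTEGER NOT NULL,
    album_name TEXT NOT NULL,
    year       INTEGER,
    FOREIGN KEY (artist_id) REFERENCES artist (artist_id)  -- foreign key: links album to artist
);

CREATE TABLE album_song (
    album_id INTEGER NOT NULL,
    track_no INTEGER NOT NULL,
    song     TEXT NOT NULL,
    PRIMARY KEY (album_id, track_no),   -- composite key: two columns together identify a row
    FOREIGN KEY (album_id) REFERENCES album (album_id)
);
""")
conn.close()
```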

What is Data Normalization?

Data Normalization is the process of structuring a relational database through a series of normal forms (which will be explained later in the article) to mitigate data redundancy (duplication) and increase data integrity. This includes eliminating redundant and unstructured data and ensuring the data is represented consistently across all records and fields.

The primary aim of Codd’s first normal form, introduced in 1970, was to enable data to be easily queried and manipulated through a “universal data sub-language” based on first-order logic. The objectives of data normalization, or normal forms, are:

1. To prevent unwanted insertion, update, and deletion dependencies on the database

2. To mitigate the need for refactoring the database when we need to introduce new data types

3. To make the relational model more intuitive to others

4. To make the database neutral to query statistics, rather than relying on a query-first approach, as some NoSQL databases like Apache Cassandra do

What About Normal Forms?

The concept of normal forms was introduced by Edgar F. Codd. There are six numbered normal forms, defined in succession, but only the first three (1st, 2nd and 3rd normal forms) are commonly used in production; the rest are mostly of academic and research interest. There are also other normal forms, such as the following:
EKNF: Elementary key normal form
BCNF: Boyce–Codd normal form
ETNF: Essential tuple normal form
DKNF: Domain-key normal form

1st Normal Form

The things you want to look out for before declaring data to be in its first normal form are:

  1. Atomic Values: Each cell should have a single value, and there should be no sets, lists, or collections of items.
  2. There is a primary (or composite) key that ensures each row is unique.
  3. Different entities should have their own tables. You do not want one large table encapsulating several entities, but smaller tables, each representing a single entity.
  4. While separating entities into their own tables (as mentioned in the 3rd point), you also want to preserve the relationships between tables with foreign keys.

Let's say we have a table like the one below.

Table 1

From observation, we can see that the table isn’t atomic, i.e. the songs column holds a collection of items in both rows. To transform the table into its first normal form, we break up the collections by moving the items in the songs column into individual rows.

Table 2: First Normal Form
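Since the tables themselves appear as images in the article, here is a small Python sketch of the same 1NF step; the sample albums and songs are made-up stand-ins for the data in Table 1.

```python
# Assumed sample data: each album row carries a multi-valued "songs" cell,
# mirroring the structure of Table 1.
albums = [
    {"album_id": 1, "album_name": "Album A", "artist_name": "Artist X",
     "year": 2001, "songs": ["Song 1", "Song 2", "Song 3"]},
    {"album_id": 2, "album_name": "Album B", "artist_name": "Artist Y",
     "year": 2005, "songs": ["Song 4", "Song 5"]},
]

# First normal form: one atomic value per cell, so each song gets its own row.
first_normal_form = [
    {"album_id": a["album_id"], "album_name": a["album_name"],
     "artist_name": a["artist_name"], "year": a["year"], "song": song}
    for a in albums
    for song in a["songs"]
]

for row in first_normal_form:
    print(row)
```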

2nd Normal Form

The things you want to look out for before declaring data to be in its second normal form are:

  1. Must have reached the first Normal form: Table 2 is already in first normal form.
  2. All columns in the table must depend on the primary key: While each record in Table 2 is unique, the album_id, album_name, artist_name, and year values are repeated across rows because every song gets its own row (only the songs column has unique values in every cell). To make album_id a proper primary key, we separate the songs from the table, giving us two tables. Remember that a column with duplicates can’t be a primary key, so we create a new primary key for the songs_table, named songs_id.
  3. Remove partial dependencies: When a table has a composite key (a primary key containing two or more attributes), the other columns should depend on all the columns that make up the composite key, not on just one of them (a partial dependency). Columns that depend on only one of the composite-key columns are relocated to a new table with its own primary key. Don’t forget to include a foreign key so the tables are still related, in this case album_id.

Now, we have two tables:

Table 3: album_table
Table 4: songs_table
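Here is a minimal sketch of the resulting 2NF schema, again using sqlite3; the column names follow the article's tables, while the exact data types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Album attributes now depend only on album_id.
CREATE TABLE album_table (
    album_id    INTEGER PRIMARY KEY,
    album_name  TEXT NOT NULL,
    artist_name TEXT NOT NULL,
    year        INTEGER
);

-- Songs move to their own table, with their own key and a foreign key back to the album.
CREATE TABLE songs_table (
    songs_id INTEGER PRIMARY KEY,
    album_id INTEGER NOT NULL,
    song     TEXT NOT NULL,
    FOREIGN KEY (album_id) REFERENCES album_table (album_id)
);
""")
conn.close()
```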

3rd Normal Form

The things you want to look out for before declaring data to be in its third normal form are:

  1. Must have reached the second normal form
  2. No transitive dependencies: Non-key columns cannot depend on other non-key columns. A transitive functional dependency is when a change in a non-key column may cause another non-key column to change. Removing this dependency (here, the artist details) leaves us with three tables:
Table 5: artist table
Table 6: album table
Table 7: song table
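A sketch of the 3NF schema: the artist details that previously depended on the album (a transitive dependency) move into their own table. Column names beyond those mentioned in the article, such as artist_id, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist_table (
    artist_id   INTEGER PRIMARY KEY,
    artist_name TEXT NOT NULL
);

-- album_table now references the artist instead of repeating artist_name.
CREATE TABLE album_table (
    album_id   INTEGER PRIMARY KEY,
    album_name TEXT NOT NULL,
    artist_id  INTEGER NOT NULL,
    year       INTEGER,
    FOREIGN KEY (artist_id) REFERENCES artist_table (artist_id)
);

CREATE TABLE songs_table (
    songs_id INTEGER PRIMARY KEY,
    album_id INTEGER NOT NULL,
    song     TEXT NOT NULL,
    FOREIGN KEY (album_id) REFERENCES album_table (album_id)
);
""")
conn.close()
```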

What is Denormalization?

Denormalization is the process of improving the read performance of a database, at the expense of some write performance, by adding redundant copies of data or by grouping data. It is applied to a previously normalized database to increase read performance; in other words, denormalization comes after normalization.
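As a sketch of what this can look like in practice, the query below joins the (assumed) 3NF tables from the previous section back into a single wide, read-optimized table. Every song row now repeats the album and artist details, which is exactly the redundancy denormalization trades for faster reads.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The same assumed 3NF tables as before (columns trimmed for brevity).
CREATE TABLE artist_table (artist_id INTEGER PRIMARY KEY, artist_name TEXT);
CREATE TABLE album_table  (album_id INTEGER PRIMARY KEY, album_name TEXT,
                           artist_id INTEGER, year INTEGER);
CREATE TABLE songs_table  (songs_id INTEGER PRIMARY KEY, album_id INTEGER, song TEXT);

-- Denormalized, read-optimized copy: no joins needed at query time,
-- at the cost of duplicated artist and album data on every song row.
CREATE TABLE album_songs_denormalized AS
SELECT s.songs_id, s.song, al.album_name, al.year, ar.artist_name
FROM songs_table s
JOIN album_table  al ON al.album_id  = s.album_id
JOIN artist_table ar ON ar.artist_id = al.artist_id;
""")
conn.close()
```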

Differences between Normalization and Denormalization

Normalization and denormalization are two processes used in database management to organize data efficiently. Normalization involves structuring data into tables and establishing relationships between those tables according to predefined principles, aiming to reduce redundancy and inconsistent dependencies.

On the other hand, denormalization is the opposite process, resulting in a schema with redundant information compared to the normalized version. This redundancy can enhance read performance, although keeping the duplicated copies consistent requires extra care. Denormalization becomes necessary when an overly normalized structure leads to increased overhead for query processing.

Key differences between normalization and denormalization include:

1. Data Organization:
i. Normalization involves separating data into multiple tables to eliminate redundancy and ensure data integrity.

ii. Denormalization consolidates information into a single table to expedite data retrieval.

2. System Usage:
i. Normalization is typically used in Online Transaction Processing (OLTP) systems to speed up insert, delete, and update operations.
ii. Denormalization is favoured in Online Analytical Processing (OLAP) systems, where the focus is on accelerating search and analysis.

3. Data Integrity:
Maintaining data integrity is generally easier during normalization compared to denormalization.

4. Redundancy:
Normalization reduces redundant data, whereas denormalization increases redundancy.

5. Table Structure:
i. Normalization typically involves adding tables and joins to the schema.
ii. Denormalization aims to minimize the number of tables and joins.

6. Disk Space Optimization:
i. Normalized tables optimize disk space usage.
ii. Denormalization wastes disk space because duplicated data is stored in multiple locations.

In summary, while normalization focuses on reducing redundancy and ensuring data integrity, denormalization aims to enhance performance by consolidating data and reducing query processing overhead, albeit at the cost of increased redundancy. Each approach has its advantages and trade-offs, and the choice between normalization and denormalization depends on the specific requirements and characteristics of the database system.

If you like my article, you can connect with me on LinkedIn.
