Master “Normalization” in SQL

Karan
Learning SQL
Published in
10 min readMay 13, 2024
Photo by Samuel Regan-Asante on Unsplash

Normalization is a critical concept in database management, particularly in SQL (Structured Query Language), where it ensures the efficiency, integrity, and scalability of databases. It’s essentially the process of organizing data in a database to reduce redundancy and dependency. Mastering normalization in SQL is fundamental for creating efficient, scalable, and maintainable databases. In this article, we’ll delve into the principles and techniques of normalization, exploring its various forms and how to implement them effectively.

Understanding Normalization:

Normalization is based on a set of rules that help organize data into tables in such a way that redundancy and dependency are minimized. The process typically involves dividing large tables into smaller, more manageable ones and defining relationships between them.

Why Use Normalization:

Photo by Evan Dennis on Unsplash

Before mastering normalization, we need to understand why we use normalization and what's the need for normalization, so that we can have a reference of “Why I Learn Normalization”.

Normalization is a fundamental concept in database design that is used to organize data efficiently and maintain data integrity. It serves several key purposes, each contributing to the overall effectiveness and reliability of a database system:

1. Minimize Redundancy:

Normalization helps eliminate redundant data by breaking down large tables into smaller, more manageable ones. Redundancy can lead to inconsistencies and anomalies in the database. By storing data in a normalized form, each piece of information is stored in only one place, reducing the chances of inconsistencies.

2. Prevent Update Anomalies:

Update anomalies occur when updating data in one place which leads to inconsistencies or errors elsewhere in the database. Normalization minimizes these anomalies by ensuring that data is stored in a structured manner, reducing the risk of conflicting updates.

3. Enhance Data Integrity:

Data integrity refers to the accuracy and consistency of data stored in a database. Normalization improves data integrity by enforcing constraints and rules that prevent incorrect or inconsistent data from being stored. For example, it ensures that each column contains atomic (indivisible) values and that relationships between tables are properly defined.

4. Improve Query Performance:

Normalized databases are often more efficient for querying. By organizing data into smaller tables with well-defined relationships, queries can be executed more quickly and effectively. This is because normalized tables are typically designed to store data relevant to a specific entity or concept, reducing the amount of data that needs to be processed for each query.

5. Simplify Database Maintenance:

Normalized databases are easier to maintain and update over time. When modifications to the database structure are required, such as adding new fields or tables, normalization ensures that changes can be made without impacting other parts of the database. This makes the database more adaptable to changing requirements and reduces the risk of errors during maintenance activities.

6. Facilitate Scalability:

Normalized databases are inherently more scalable. As the volume of data or the complexity of the database increases, normalization allows for easier expansion without sacrificing performance or data integrity. New tables can be added, and existing tables can be modified or extended to accommodate new requirements without significant disruption to the overall database structure.

Types Of Normalization

Normalization is a process used in database design to organize data into well-structured tables, minimizing redundancy and dependency. There are several normal forms, each with its own set of rules. Let’s explore the different types of normalization:

1. First Normal Form (1NF):

In 1NF, a table is said to be normalized if:

  • Each column contains atomic (indivisible) values.
  • There are no repeating groups or arrays within columns.
  • Each cell in the table holds a single value.

For example, a table where each cell contains a single value without any repeating groups like arrays or lists is in 1NF.

2. Second Normal Form (2NF):

A table is in 2NF if:

  • It is in 1NF.
  • All non-key attributes are fully functionally dependent on the entire primary key.

This means that every non-key attribute must depend on the entire primary key, not just part of it. If a table has a composite primary key, each non-key attribute must depend on the entire composite key, not just part of it.

3. Third Normal Form (3NF):

A table is in 3NF if:

  • It is in 2NF.
  • It does not have transitive dependencies.

In other words, non-key attributes should not depend on other non-key attributes. Any such dependency should be removed by creating separate tables.

4. Boyce-Codd Normal Form (BCNF):

BCNF is a stronger form of 3NF and is achieved when:

  • It is in 3NF.
  • For every non-trivial functional dependency 𝑋→𝑌X→Y, X is a superkey.

Here, a superkey is a set of attributes that uniquely identifies each row in a table.

5. Fourth Normal Form (4NF):

4NF addresses multi-valued dependencies and ensures that a table is free of multi-valued attributes. It is achieved by removing multi-valued dependencies and creating separate tables for them.

6. Fifth Normal Form (5NF):

5NF, also known as Project-Join Normal Form (PJ/NF), deals with join dependencies. It is achieved by decomposing tables to ensure that each table represents a single subject and can be joined with others without redundancy.

Let’s Start the Main Journey

So, until now, we have the basic idea of what normalization is, why normalization, and types of normalization. With all this basic knowledge we can proceed further to master normalization. Let’s start the main journey.

First Normal Form / 1NF

First Normal Form (1NF) is the most basic level of normalization and focuses on ensuring that each table in a relational database contains only atomic values, meaning that each cell holds a single, indivisible piece of data. This helps eliminate repeating groups or arrays within columns. Let’s explore 1NF in more detail with an example:

Example:

Consider a table storing information about students and their courses. This table violates 1NF as it contains repeating groups and non-atomic values:

Issues with the Example:

  1. Repeating Groups: The “Courses” column contains multiple values separated by commas. This violates the atomicity rule of 1NF.
  2. Non-Atomic Values: The values in the “Courses” column are not atomic; they represent multiple pieces of data.

To bring this table into 1NF, we need to break it down into multiple rows, each containing only atomic values.

Restructured Table in 1NF:

Explanation:

  • Each row now contains only one value in the “Course” column, adhering to the atomicity rule.
  • Repeating groups are eliminated, and each student-course combination is represented as a separate row.

Second Normal Form / 2NF

Second Normal Form (2NF) builds upon First Normal Form (1NF) by addressing partial dependency issues. It ensures that every non-key attribute in a table is fully functionally dependent on the entire primary key, not just part of it. Let’s delve deeper into 2NF with an example:

Example:

Consider a table that stores information about books and their authors. Here’s how the table might look:

Issues with the Example:

  • The table contains a composite primary key (BookID, AuthorID), where BookID uniquely identifies a book, but AuthorID does not uniquely identify an author.
  • The “AuthorName” column depends only on AuthorID, not on the entire primary key.
  • This table violates 2NF because AuthorName is functionally dependent on part of the primary key (AuthorID), not the whole key.

Restructured Table in 2NF:

To bring the table into 2NF, we split it into two tables: One for books and another for authors.

Explanation:

  • In the Books table, BookID is the primary key.
  • The AuthorID in the Books table is now a foreign key referencing the Authors table.
  • The Authors table contains AuthorID as its primary key.
  • The Authors table ensures that each author’s name is stored only once, avoiding redundancy.
  • The Authors table allows the AuthorName to be functionally dependent on the entire primary key, as AuthorName is determined by AuthorID, which is the primary key of the Authors table.

Third Normal Form / 3NF

Third Normal Form (3NF) extends the principles of First Normal Form (1NF) and Second Normal Form (2NF) by addressing transitive dependencies. It ensures that non-key attributes are not dependent on other non-key attributes but only on the primary key. Let’s explore 3NF with an example:

Example:

Consider a table that stores information about employees and their departments, including the department manager’s phone number:

Issues with the Example:

  • The table contains a transitive dependency: ManagerPhone depends on Manager, which is not the primary key.

Restructured Table in 3NF:

To bring the table into 3NF, we need to eliminate the transitive dependency. We’ll split the table into three separate tables: Employees, Departments, and Managers.

Explanation:

  • The Employees table contains information about each employee, with EmployeeID as the primary key.
  • The Departments table contains information about each department, with DepartmentID as the primary key.
  • The Managers table contains information about each manager, with ManagerID as the primary key.
  • In the original table, ManagerPhone is dependent on the Manager, which is not the primary key. By splitting the table, we remove this transitive dependency.

Boyce-Codd Normal Form (BCNF)

Boyce-Codd Normal Form (BCNF) is a higher level of normalization that builds upon the principles of Third Normal Form (3NF) by addressing certain types of functional dependencies. BCNF ensures that for every non-trivial functional dependency 𝑋→𝑌XY, the determinant (X) is a superkey. Let’s explore BCNF with an example:

Example:

Consider a table that stores information about courses and the instructors who teach them, including their contact information:

Issues with the Example:

  • The table contains a non-trivial functional dependency: 𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑜𝑟𝐼𝐷→𝐼𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑜𝑟𝑁𝑎𝑚𝑒,𝐸𝑚𝑎𝑖𝑙
  • The InstructorID determines both the InstructorName and Email, but InstructorID is not a superkey since multiple instructors can have the same ID.

Restructured Table in BCNF:

To bring the table into BCNF, we need to ensure that every determinant of a non-trivial functional dependency is a superkey. We’ll split the table into two: Courses and Instructors.

Explanation:

  • The Courses table contains information about each course, with CourseID as the primary key.
  • The Instructors table contains information about each instructor, with InstructorID as the primary key.
  • The original table violates BCNF because InstructorID determines both InstructorName and Email, and InstructorID is not a superkey. By splitting the table, we ensure that every determinant of a non-trivial functional dependency is a superkey.

Fourth Normal Form / 4NF

Fourth Normal Form (4NF) is a level of normalization that goes beyond Third Normal Form (3NF) by addressing multi-valued dependencies. It ensures that there are no non-trivial multi-valued dependencies in the database schema. Let’s explore 4NF with an example:

Example:

Consider a table that stores information about projects and the employees assigned to them, along with the tasks each employee is responsible for:

Issues with the Example:

  • There are multi-valued dependencies. For example, the combination of (ProjectID, EmployeeID) determines multiple TaskID and TaskName pairs.
  • The table violates 4NF because it contains non-trivial multi-valued dependencies.

Restructured Table in 4NF:

To bring the table into 4NF, we need to eliminate the multi-valued dependencies by splitting it into separate tables.

Explanation:

  • The Projects table contains ProjectID as the primary key.
  • The Employees table contains EmployeeID as the primary key.
  • The Tasks table contains ProjectID and EmployeeID as foreign keys, along with TaskID as the primary key.
  • Each row in the Tasks table represents a single task assigned to an employee on a project.
  • By splitting the original table into three separate tables, we eliminate the multi-valued dependencies, ensuring that the database schema adheres to 4NF.

Fifth Normal Form / 5NF

Fifth Normal Form (5NF), also known as Project-Join Normal Form (PJ/NF), deals with join dependencies and ensures that a database schema is free from join anomalies. It involves decomposing tables to ensure that each table represents a single subject and can be joined with others without redundancy. Let’s explore 5NF with an example:

Example:

Consider a table that stores information about students, courses, and the grades students received in each course:

Issues with the Example:

  • The table contains join dependencies, where certain attributes are dependent on combinations of other attributes.
  • For instance, CourseName is determined by CourseID, and Grade is determined by both StudentID and CourseID.

Restructured Tables in 5NF:

To bring the table into 5NF, we decompose it into multiple tables to remove join dependencies:

Explanation:

  • The Students table contains StudentID as the primary key and Name.
  • The Courses table contains CourseID as the primary key and CourseName.
  • The Grades table contains StudentID and CourseID as foreign keys, with Grade.
  • By decomposing the original table into three separate tables, we eliminate the join dependencies, ensuring that the database schema adheres to 5NF.

Conclusion:

Mastering normalization in SQL is crucial for designing efficient and scalable databases. By understanding the normal forms and implementing them effectively, you can ensure data integrity, reduce redundancy, and simplify database management. Whether you’re designing a new database or optimizing an existing one, normalization principles are essential for maintaining a robust and reliable data infrastructure.

--

--

Karan
Learning SQL

Senior Software Developer, Tech Geek and little bit of everything. I am here just to help others