Database Normalization 1NF to 5NF

Syamsul Bachri
6 min readDec 4, 2023

--

Database normalization is a crucial process for structuring data, aimed at reducing data redundancy and improving data integrity. Data redundancy refers to the repetition of the same data, which can lead to input errors. Data integrity involves maintaining data that is easy to manage, accurate, and consistent. In this article, we will explore the journey of normalizing data from its Unnormalized Form (UNF) to Normalized Form 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, and DKNF. Unnormalized Form typically exhibit high redundancy and low integrity, making normalization essential to achieve both of these goals. To better understand this process, let’s apply it using a film-related database as an example:

1NF (First Normal Form)

The first normal form (1NF) requires that there are no multivalued columns in the table. In the “director” table, the “phone_no” column with NIK 999 is multivalued because it contains more than one phone number, violating 1NF. This can create complications when a director with NIK 999 changes one of their numbers. To address this issue, we need to normalize the data:

Now, there are no multivalued columns, and the table complies with 1NF.

Key Concepts in Database Normalization

· Candidate Key: A candidate key is one or a combination of several unique columns that identify a set of data representing other columns. In the “director” table, {nik} and {phone_no} are candidate keys.

· Primary Key: The primary key is a unique column that distinguishes individual rows and cannot be null. While candidate keys can be chosen as primary keys, it is not recommended in cases where the values may change. An auto-incremented value is often used as the primary key, while candidate keys are given unique constraints.

· Unique Key: A unique key is a column with unique values but can be nullable or non-nullable.

· Foreign Key: A foreign key is a column that establishes a relationship between tables by referencing data from one table to another.

2NF (Second Normal Form)

The second normal form (2NF) builds on the first and requires that the table meets 1NF criteria and has no partial or functional dependencies. Partial or functional dependency occurs when a table has multiple candidate keys, and a column depends on one candidate key only. In the “director” table, both {nik} and {phone_no} are candidate keys. However, the “name” and “origin” columns depend only on {nik}. To eliminate redundancy and achieve 2NF, we split the table into two:

Now, there is no data redundancy in the non-candidate key columns, and the relationship between the two tables is established through the “director_id” foreign key.

3NF (Third Normal Form)

The third normal form (3NF) requires that a table meets 2NF criteria and has no transitive dependencies. A transitive dependency occurs when a column depends on another column that is not a candidate key, and that column depends on a candidate key. In the “director” table, the “origin_province” column depends on “origin_code,” which is not a candidate key. To achieve 3NF, we create a separate table for “origin”:

Now, there are no transitive dependencies, and the “director” table references “origin” through the “origin_id” foreign key.

BCNF (Boyce-Codd Normal Form)

BCNF, or Boyce-Codd Normal Form, is similar to 3NF and is sometimes referred to as 3.5NF. BCNF requires that no functional dependencies exist in tables with combinations of candidate keys, aiming to avoid redundancy. The key principle here is “Nothing but the key.” To illustrate, consider the following scenario:

In this case, BCNF is violated because the “name” column depends on “nik,” which is not a candidate key. The issue arises because a director can work on multiple movies, and a movie can have multiple directors. To address this, we create a new table:

Now, there is no functional dependency, and redundancy is eliminated. This structure suits many-to-many relationships.

4NF (Fourth Normal Form)

The fourth normal form (4NF) extends the BCNF requirement by eliminating multivalued dependencies. A multivalued dependency occurs when there are more than one dependencies in a table. Consider the following scenario:

In this example, BCNF is met, but there are multivalued dependencies due to two candidate keys: {director_id, movie_id} and {movie_id, genre_id}. To eliminate this issue, we create two separate tables:

Now, there are no multivalued dependencies, and the data is organized efficiently.

5NF (Fifth Normal Form)

The fifth normal form (5NF) requires that a table meets 4NF criteria and has no join dependencies. A join dependency occurs when there is redundancy when joining tables, leading to unwanted data duplication. To illustrate, consider the following scenario:

In this example, 4NF is met, but 5NF is violated because when joining these tables, there is redundancy in subgenre data. To address this, we create an additional table:

Now, there is no redundancy when joining tables, and the data is organized efficiently.

DKNF (Domain Key Normalization Form)

DKNF (Domain Key Normalization Form) specifies that there should be no constraints other than key constraints and domain constraints to prevent anomalies. Key constraints ensure that columns reference unique columns in other tables, while domain constraints are based on business rules. To illustrate, consider the following scenario:

Here, DKNF is violated because there can be an anomaly if a user inserts a budget of Rp1,000,000 with a budget status of “Milyaran,” which should be “Jutaan” according to business rules. To resolve this, we create a table for “budgets” as a domain constraint:

Now, the data is in DKNF, and there are no anomalies because the budget status is sourced from the “budgets” table.

Conclusion

In this article, we have explored the process of normalizing data from Unnormalized Form (UNF) to Normalized Form 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, and DKNF. While 6NF and other normal forms exist, they are rarely needed in practice. Normalized data is suitable for transactional systems that require strong data integrity with minimal redundancy. However, denormalized data is often used for data analytics, focusing on historical data and typically utilizing NoSQL databases. The choice of normalization level depends on the specific requirements of the database design.

--

--