Data normalization and normal forms

Mark Maher
3 min readApr 1, 2019

--

04hutts [Public domain] via Wikimedia Commons

Data normalization is a process by which we can reduce data redundancy, increase data integrity, and reduce data anomalies in relational databases. This allows us to save disk space while ensuring that we don’t corrupt our data while attempting to insert, update, or delete data from our database. These rules are encompassed by consecutive, accumulative normal forms. So, what are they and how can we use them to normalize our database?

Normal forms are just rule sets which, when we design our database schema to adhere strictly to these rules, produce increasingly normalized data. The first normal form — abbreviated 1NF — is the most basic set of rules for normalizing data. It follows four basic tenets:

  1. Data should be atomic — Only 1 piece of data in each cell. No lists or containers
  2. All values in a column should be in the same type — They should all be the same data type (e.g. number, string, date) and should all represent the same thing.
  3. Columns should have unique names
  4. Data should be order agnostic — as long as the columns are labeled correctly and the data is in the right row, the order of the data shouldn’t matter.

If our database follows all of these rules all of the time, congrats! We just made a database in first normal form!

If we want to move beyond the bare minimum and further normalize our data, we can move on to second normal form (2NF). All normal forms accumulate the rules of the forms before them: I.e. the first requirement for 2NF is that the data be in 1NF. The only additional requirement for 2NF is the elimination of partial dependency. Essentially, this means that all columns in a given table should be dependent on that table’s primary key (or what makes it unique). This usually involves destructuring tables into smaller, more specialized tables which each hold references to others. If our table holds the name of our students and their student ID’s, along with the name of their thesis projects and their project info, it is best to break this into 2 tables, one containing student info and one containing project info.

To achieve 3NF, we must again make sure our database has reached 1NF and 2NF. 3NF also requires the removal of all transitive dependencies. Transitive data is data that logically follows another piece of data in the table. If we store our students ages and dates of birth in the same table, it’s easy to see where any changes could result in anomalies, to normalize this data, we can just remove the age column and only store the date of birth.

Beyond 3NF is elementary key normal form (EKNF) and Boyce–Codd normal form (BCNF), and more and more. Normalizing data is a science and we can always normalize further, but at some point, creating more joins and more intricate queries can make your database sower and more cumbersome or introduce more bugs and room for human error. At times, you may even find that it’s best for you to denormalize your database to achieve the results you’re shooting for. Normalization is a goal to strive for, but we shouldn’t let it get in the way of the product we’re delivering.

--

--