From Spreadsheets to Database

Creating A Normalized (and Denormalized) Database

Arpita Deb
17 min read · Nov 1, 2023

I love reading books! And I have a lot of them.

The first thing I did when I learnt Excel was to create a spreadsheet with all the details of my books. When I learnt Python, I did a super quick EDA (you can read about it here), and finally, with Tableau, I created a beautiful dashboard. I love my books so much that I didn't even stop there.

Last month I completed my PostgreSQL course, and I really wanted to use that spreadsheet of my books to create a database (it looks cool on Linux too).

So here I am, giving an account of how I turned my boring spreadsheet into a supercool database. But first, some theory…

What is Database Normalization?

According to Wikipedia -

“Database normalization is the process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.”

In short, normalization is the process of organizing the data in a database. It divides larger tables into smaller ones and links them using relationships.

The reasons for normalizing a database are -

  1. It reduces data redundancy, which means less data duplication and more efficient storage usage.
  2. Increases integrity and consistency, which means the data is always accurate and unambiguous.
  3. Improves query performance and organization, which means the data is easier to find and use.
  4. Increases security and connection, which means the data is protected and can be linked to other systems.

The cons, on the other hand, are -

  1. You end up with lots of leaf tables that need to be linked together with sophisticated JOINs, resulting in slower read times.
  2. As the number of tables increases, so does the maintenance effort.
  3. The performance may degrade when normalizing the relations to higher normal forms, i.e., 4NF, 5NF.

Now that we’ve seen what database normalization is and weighed some of its pros and cons, let’s dive right into building one.

STEP 1 — Loading the Data into the Database

Since I already had all my data in a compact CSV file (here is the schema), all I need to do to start working on my database design is load it into the PostgreSQL shell. But first, I need to create a table that’ll hold all the data.

-- Creating a table
DROP TABLE IF EXISTS books_raw ;

CREATE TABLE books_raw(
name VARCHAR(120) NOT NULL,
writer VARCHAR(120) NULL,
original_language VARCHAR(50) NULL,
genre VARCHAR(50) NULL,
binding VARCHAR(50) NULL,
publication VARCHAR(150) NULL,
price INTEGER NULL,
transaction_method VARCHAR(10) NULL,
year INTEGER NULL,
read_unread VARCHAR(25) NULL,
rating VARCHAR(10) NULL,
writer_gender VARCHAR(10) NULL,
author_id INTEGER,
publisher_id INTEGER,
transaction_method_id INTEGER,
read_unread_id INTEGER ,
genre_id INTEGER,
binding_id INTEGER ,
language_id INTEGER,
rating_id INTEGER,
gender_id INTEGER,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);

Using the \copy command, I loaded the CSV file into this table so that we can start normalizing it.

\copy books_raw(name,writer,original_language,genre,binding,publication,price,transaction_method,year,read_unread,rating,writer_gender ) FROM 'C:\Users\Dell\Desktop\Projects\Do i read all the books i buy\Books_Management.csv' WITH DELIMITER ',' CSV HEADER;

The data looks like this —

The loaded data in the books_raw table. Notice the empty columns suffixed with _id (author_id, publisher_id, etc.). They are there to hold the foreign keys of the leaf tables.

Now that the data is loaded into the database, we can run some queries to make sure we’ve got all the records.

SELECT COUNT(DISTINCT name) FROM books_raw;

SELECT name, writer, original_language, genre, binding, publication, price, transaction_method, year FROM books_raw LIMIT 5;

Both spreadsheets and databases store and organize data. Even though the data looks just like an Excel spreadsheet with rows and columns, the important difference between these two tabular representations is that databases can be relational while spreadsheets cannot. Spreadsheets are better suited to self-contained data, where all the data lives in one place. A database management system spreads the data across related tables, allowing us to change data in several places by editing it in only one place.

The data in a database is spread across multiple tables, which takes up less space than in a spreadsheet, making queries perform faster as data grows.

STEP 2 — Start Normalizing The Data

There are six normal forms in the database normalization process.

Normal forms are a series of guidelines that help to ensure that the design of a database is efficient, organized, and free from data anomalies.

For this small dataset, we’ll try to achieve only the first three of them.

  1. First Normal Form:
  • In the first normal form each field contains a single value. A field may not contain a set of values or a nested record.

To apply first normal form to a database, we ask ourselves the following questions:

  • Does the combination of all columns make a unique row every single time?
    No. There could be multiple books of the same name, or even multiple copies of the same book with similar attributes.
  • What field can be used to uniquely identify the row?
    It could be the book name, but then again, there could be multiple books of the same name or multiple copies of a book. The writer’s name? But a writer may have multiple books, and one book may have multiple writers. An ISBN is unique to each book, but since we don’t have ISBNs here, we can give each row a surrogate primary key instead.

A primary key is a unique identifier for a database record. It is unique to each row in a database table and used to identify it.

It is important to make primary and foreign keys integers, because integers are faster for indexing and other query optimizations.

-- SERIAL is a special kind of database object generator in PostgreSQL. 
-- It is used to generate a sequence of integers which are often used as the Primary key of a table.
ALTER TABLE books_raw ADD COLUMN id SERIAL;

ALTER TABLE books_raw ADD CONSTRAINT books_id PRIMARY KEY (id);

The table looks like this.

Original table holds all the raw data that we’ll split into several look up tables

This satisfies the condition for first normal form.

2. Second Normal Form:

The rule of second normal form on a database can be described as:

  • Fulfil the requirements of first normal form.
  • Each non-key attribute must be functionally dependent on the primary key.

Functionally dependent means each field that is not the primary key is determined by that primary key, so it is specific to that record.

We now need to determine which columns are functionally dependent on, or independent of, the primary key, i.e., in our case the book id, which represents a book.

  • name: Yes, this is dependent on the primary key. Each different book id means a different book.
  • writer: Yes, for the time being. Each book is associated with a writer.
  • original_language: No, this column is not dependent on the book. Many books may have been written in the same original language.
  • genre: No, genre doesn’t depend on book.
  • binding: No, many books may be paperback, hardcover or eBook. It doesn’t depend on the book id.
  • publication: No, same as before.
  • price: Yes, each book has its associated price.
  • year: Yes, each book has its associated year of buying.
  • transaction_method: No, it does not depend on book id.
  • read_unread: No, same as transaction_method.
  • rating: Yes, it depends on the name of the book.
  • writer_gender: No, it doesn’t depend on the name of the book, but it depends on the writer. We’ll take care of it later.

Since we have some columns depending on the book id, we’ll create a separate table named book_title with those columns.

CREATE TABLE book_title (id SERIAL,
name VARCHAR(120) NOT NULL,
writer VARCHAR(120),
price INTEGER,
year INTEGER,
rating VARCHAR(10),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY(id)
);
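
One way to fill this table is to copy the relevant columns straight over from books_raw; a minimal sketch of that load:

-- A sketch of filling book_title from the raw data (assumed step, not spelled out in the walkthrough)
INSERT INTO book_title (name, writer, price, year, rating)
SELECT name, writer, price, year, rating
FROM books_raw;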

2.1 One-to-Many Relationship

Now that we’ve taken care of the functionally dependent columns, we can deal with the functionally independent ones. In the case of language, one language (e.g. English or French) can have multiple books, but one book will have only one language. Thus, it is a one-to-many relationship. The same goes for the genre, binding_type, publisher, read_unread, transaction_method and gender tables.

A one-to-many relationship in a database occurs when each record in Table A may have many linked records in Table B, but each record in Table B may have only one corresponding record in Table A.

Language:
It is not dependent on the book name, as many books may have been written in the same original language (e.g. English, French or Swedish). So, we create a separate table for it and give each of its unique rows a primary key.

-- Creating leaf table
CREATE TABLE language (
id SERIAL,
language VARCHAR(20) UNIQUE,
PRIMARY KEY(id)
);
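
One way to fill this look-up table, and to record the matching keys back in books_raw, is a pair of statements like the following (a minimal sketch, assuming the text values match exactly):

-- Filling the language table with the distinct values from books_raw (assumed step)
INSERT INTO language (language)
SELECT DISTINCT original_language FROM books_raw
WHERE original_language IS NOT NULL;

-- Recording the matching foreign key back in books_raw
UPDATE books_raw
SET language_id = language.id
FROM language
WHERE books_raw.original_language = language.language;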

We do that for the remaining six tables. We’ll create separate tables for all the columns in books_raw that are not functionally dependent on the primary key, book id.

-- Creating leaf tables
CREATE TABLE genre (
id SERIAL,
genre VARCHAR(25) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE binding_type (
id SERIAL,
binding_type VARCHAR(10) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE publisher (
id SERIAL,
publisher_name VARCHAR (150) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE transaction_method (
id SERIAL,
method VARCHAR(10) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE read_unread (
id SERIAL,
read_unread VARCHAR(15) UNIQUE,
PRIMARY KEY(id)
);

CREATE TABLE gender (
id SERIAL,
gender VARCHAR(20) UNIQUE,
PRIMARY KEY(id)
);

The data from the CSV file is now spread across 8 tables. So far, all our leaf tables look like this -

Book_title table and the 8 look-up tables will be joined by foreign keys

The tables are not linked together yet. So, we’ll need foreign keys now.

A foreign key is a column in one table that refers to the primary key in another table.

Let’s take the genre table first. To link genre with the books table, we need to put the primary key of one table into the other.

In this case we can ask-

  • Does one book have many genres, or does one genre have many books?

Since it’s the latter, we add a genre_id from the genre table to the books table. This also means that each genre name is stored in one place only and can be added/removed/updated without impacting other tables.

The same goes for binding_type, i.e., whether the book is paperback, hardcover or eBook. Since there could be many books of any given type, we add a binding_id to the books table.

The same one-to-many relations (one language -> many books, one publisher -> many books) hold for the rest of the five tables. So, we add those foreign keys to the books table as well.
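
The statements for this aren’t spelled out above, but a minimal sketch for one of the foreign keys (genre; the other look-up tables follow the same pattern, and the genre_id column on book_title is an assumption) could be:

-- Sketch: adding a genre_id foreign key to book_title (assumed column name)
ALTER TABLE book_title ADD COLUMN genre_id INTEGER REFERENCES genre(id);

-- Filling it in by matching the genre text from books_raw
UPDATE book_title
SET genre_id = genre.id
FROM books_raw
JOIN genre ON books_raw.genre = genre.genre
WHERE book_title.name = books_raw.name;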

The tables are joined now, and they look like this —

The leaf tables are now joined together by foreign keys.
queries showing the elements in different leaf tables.

2.2 Many-to-Many Relationship

The writer column in book_title does not look right, because writer and book name do not have a one-to-many relation like all the other tables. To check that, we can ask this question -

  • Does a book have many writers, or does a writer have many books?

The answer is both.

Let’s take an example to make it a bit clearer. The following two queries give us the four books written by J.R.R. Tolkien (one writer has many books) and The Age of AI, written by three people (one book has many writers).

Showing the problem of book-writer relation.

The problem now arises: how do we make sense of this book-writer relation? This is an example of a many-to-many relationship in a database.

A many-to-many relationship occurs when multiple records in a table are related to multiple records in another table.

Clearly, we first need to create a separate author table and give every individual writer a unique id. Let’s do that first.

-- Creating leaf table
CREATE TABLE author(
id SERIAL,
name VARCHAR(120),
PRIMARY KEY(id)
);

To link these tables, we cannot simply add a foreign key to one side, because in a many-to-many relationship a row in one table can have many matching rows in the other, and vice versa.

To represent a many-to-many relationship, a third table (called a junction or join table) is needed, whose primary key is composed of the foreign keys from both related tables.

-- Creating junction table
CREATE TABLE author_book_junction(
book_id INTEGER REFERENCES book_title(id) ON DELETE CASCADE,
author_id INTEGER REFERENCES author(id) ON DELETE CASCADE,
PRIMARY KEY(author_id, book_id)
);

book_id is a foreign key to the book_title table, and author_id is a foreign key to the author table. Together, (author_id, book_id) act as the composite primary key of the author_book_junction table.

author_book_junction table acts as a junction table to book_title and author table

This is how the tables look.

Not so interesting though! To get the actual names, we have to write a query that JOINs the three tables together.

SELECT author_book_junction.book_id,
book_title.name AS book_name,
author_book_junction.author_id,
author.name AS author_name
FROM
author_book_junction
JOIN
book_title ON author_book_junction.book_id= book_title.id
JOIN
author ON author_book_junction.author_id = author.id
ORDER BY
author_book_junction.book_id
LIMIT 5;

This is how the result looks now -

So far, we end up with these tables. They are all in second normal form. Every table has its own primary key (or a composite primary key, in the case of the author_book_junction table), and all the columns in each table are functionally dependent on their respective primary keys.

The database in second normal form

3. Third Normal Form:

Third normal form is the final stage of the most common normalization process. The rules for it are:

  • It fulfils the requirements of second normal form.
  • Has no transitive functional dependency.

A transitive functional dependency exists when a non-key attribute depends on another non-key attribute instead of depending on the primary key directly. In third normal form, every attribute that is not the primary key must depend on the primary key, and the primary key only.

Let’s take an example.

Column A determines column B.
Column B determines column C.
Therefore, column A determines C.
Here, column A determines column C only through column B. This is a transitive functional dependency, and it should be removed: column C should be in a separate table.

In our database, we have such a dependency chain. The book (A) determines the author (B), and the author in turn determines the gender (C). So, we need to remove the gender column from the book data. We use the gender table created earlier and join it to the author table with a foreign key, gender_id.
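
A minimal sketch of that step (matching authors on the writer name from books_raw is an assumption here):

-- Sketch: moving gender onto the author table via a foreign key
ALTER TABLE author ADD COLUMN gender_id INTEGER REFERENCES gender(id);

-- Filling it in by matching writer names against books_raw
UPDATE author
SET gender_id = gender.id
FROM books_raw
JOIN gender ON books_raw.writer_gender = gender.gender
WHERE author.name = books_raw.writer;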

The book database now looks like this after achieving third normal form -

Final Database design

It surely looks pretty (and complex)! But it does ensure that -

  • the same data is not being stored in more than one place (“insert anomaly”)
  • updates are not being made to some data but not others (“update anomaly”)
  • data is not being lost unintentionally when something else is deleted (“delete anomaly”)

Step 3 — Performing some CRUD Operations

What’s the point of creating a database if we can’t insert a new row, or update or delete an existing one? So, let’s do that now.

3.1 Inserting a new book into the database

Inserting data with the following attributes —

Title of the Book — A Man’s Place, Author — Annie Ernaux, gender — Female, Publisher — Seven Stories Press, price — 181, year — 2023, transaction method — online, read/unread — yes, genre — Nonfiction, binding — Paperback, original language — French, rating — Excellent.

-- first need to enter a new publisher name in publisher table
INSERT INTO publisher(publisher_name ) VALUES ('Seven Stories Press');

-- inserting the new book in the book_title
-- instead of typing out the entire words, I'll add just the ids associated with them
INSERT INTO book_title(name, price,year, transaction_method_id , read_unread_id ,genre_id ,binding_id,language_id,rating_id) VALUES ('A Man''s Place',181, 2023, 3, 3, 3, 3, 3, 2);

-- using subquery to update publisher_id in the book_title
UPDATE book_title SET publisher_id = (SELECT id FROM publisher WHERE publisher_name = 'Seven Stories Press') WHERE name = 'A Man''s Place';

-- a new author name is added in the author table
INSERT INTO author(name,gender_id) VALUES ('Annie Ernaux', 3);

-- getting the book_id and author_id for inserting into the junction table
SELECT id FROM book_title WHERE name = 'A Man''s Place';
SELECT id FROM author WHERE name = 'Annie Ernaux';

-- inserting the values in junction table
INSERT INTO author_book_junction(book_id, author_id) VALUES (194,141);

Yes, it looks complicated. But whenever I add a new book to the database, I don’t have to repeat the same text over and over again. I can simply add the name of the book and the ids of the respective lookup values to the book_title table, and add the author’s name (if it’s a new entry) to the author table.
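
As a small variation, the hard-coded ids in the junction insert could also be written with subqueries (just a sketch, reusing the same book and author names), so I don’t have to look the generated keys up by hand first:

-- Alternative: inserting into the junction table without hard-coded ids
INSERT INTO author_book_junction (book_id, author_id)
VALUES (
(SELECT id FROM book_title WHERE name = 'A Man''s Place'),
(SELECT id FROM author WHERE name = 'Annie Ernaux')
);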

3.2 Reading some data

-- showing the number of books by rating levels

SELECT COUNT(book_title.name) AS number_of_books, rating.rating
FROM book_title JOIN rating
ON book_title.rating_id = rating.id
GROUP BY rating.rating
ORDER BY COUNT(book_title.name) DESC ;

-- showing 10 books by female authors

SELECT book_title.name AS book_name, author.name AS author_name
FROM author_book_junction
JOIN book_title ON author_book_junction.book_id= book_title.id
JOIN author ON author_book_junction.author_id = author.id
WHERE author.gender_id =2
ORDER BY author_book_junction.book_id
LIMIT 10;

3.3 Updating a column
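
The book_title table was created earlier without a comment column, so one has to be added first (an assumed step):

-- Adding a comment column to book_title (assumed step)
ALTER TABLE book_title ADD COLUMN comment TEXT;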

-- Adding a comment to a book in the book_title table
UPDATE book_title SET comment = 'lent it to my friend in 2021, still have not received' WHERE name = 'Pride & Prejudice';

SELECT name, comment FROM book_title WHERE name = 'Pride & Prejudice';

3.4 Deleting a book

If I want to delete an entry (which I don’t like to do), I can do it from the book_title table, but that only deletes the book, not the author. So, I have to delete the author from the author table as well.

-- Deleting a book called 'The Alchemist' by Paulo Coelho. 
-- I can use ids as well to delete them.
DELETE FROM book_title WHERE name = 'The Alchemist';
DELETE FROM author WHERE name = 'Paulo Coelho';

One interesting point to mention here: when I delete records from the book_title and author tables, the corresponding records in the junction table get deleted automatically, thanks to the ON DELETE CASCADE clauses on its foreign keys. How cool is that!

Update to the Database:

After using the above database for 8 months to manage my books, I found it quite difficult to see all the data at once without writing multiple JOINs. So, I finally decided to “denormalize” this database, i.e., have a single table that combines all the data from the look-up tables so that I don’t have to JOIN these tables all the time to find relevant information.

The data may now look like the raw data, but “denormalization” is quite different from having no normalization at all!

I took the following steps to de-normalize the data:

  1. Creating a ‘demo_library’ table to hold all the joined data
-- Creating a new joined table where I collected all the data from the look-up tables

DROP TABLE IF EXISTS demo_library;

CREATE TABLE demo_library (
book_id INT,
book_name VARCHAR(150),
author VARCHAR(150),
price INT,
year TEXT,
comment TEXT,
created_at DATE,
updated_at DATE,
publisher TEXT,
transaction_type VARCHAR(150),
read_unread VARCHAR(150),
genre VARCHAR(150),
binding_type VARCHAR(150),
language VARCHAR(150),
rating VARCHAR(150),
author_book_id INT
);

2. Inserting the joined data into this table

-- Inserting the values into the new 'demo_library' table - 

INSERT INTO demo_library
(
SELECT
b.id AS book_id,
b.name AS book_name,
a.name AS author,
price,
year,
comment,
created_at,
updated_at,
p.publisher_name AS publisher,
t.method AS transaction_type,
r.read_unread AS read_unread,
g.genre AS genre,
bi.binding_type AS binding_type,
l.language AS language,
ra.rating AS rating,
author_book_junction.author_id AS author_id
FROM
public.book_title b
JOIN
publisher p ON b.publisher_id = p.id
JOIN
transaction_method t ON b.transaction_method_id = t.id
JOIN
read_unread r ON b.read_unread_id = r.id
JOIN
genre g ON b.genre_id = g.id
JOIN
binding_type bi ON b.binding_id = bi.id
JOIN
language l ON b.language_id = l.id
JOIN
rating ra ON b.rating_id = ra.id
JOIN
author_book_junction ON b.id = author_book_junction.book_id
JOIN
author a ON a.id = author_book_junction.author_id
ORDER BY
created_at,
b.name,
a.name
);

There are some books that I don’t have anymore, so I need to delete them from the database to reflect the real-world situation.

3. Deleting some data

-- Removing 4 book entries
DELETE FROM
demo_library
WHERE
book_id IN (17, 88, 8, 182);

I ran into these problems with the previously normalized tables:

  1. The primary key for the ‘book_title’ table no longer identifies a unique book in the joined data. Since the normalized design stores one row per book-author pair, a book authored by 3 people shows up as 3 rows of the same book with individual authors. This introduces duplicates.
  2. I can’t directly find out the author of a book without joining the book_title and author tables through the author_book_junction table, which is too much work!
  3. I also wanted consistency in the categorical data (i.e., having everything in Proper Case).

So, one way I can solve these problems is to have one unique book per row with one primary key. If there are multiple authors for a book, I’ll show the author names together, separated by ‘,’. For example, if I have a book ‘XYZ’ which has authors John, George and Lily, I’ll show them as ‘John, George, Lily’.

I can perform some complex operations on the demo_library table to fix all the above problems, but creating a new table is a simpler and cleaner approach.

This is what I’m going to do now.

4. Creating the final table that’ll show the denormalized data i.e., combined data

-- Creating the final table called 'library'
-- This is similar to the demo_library except now I have an autoincrement serial id

CREATE TABLE library (
book_id SERIAL PRIMARY KEY, -- This is autoincrement Serial Id
book_name VARCHAR(255),
author VARCHAR(255),
price NUMERIC,
year TEXT,
comment TEXT,
created_at TIMESTAMP,
updated_at TIMESTAMP,
publisher VARCHAR(255),
transaction_type VARCHAR(255),
read_unread VARCHAR(255),
genre VARCHAR(255),
binding_type VARCHAR(255),
language VARCHAR(255),
rating VARCHAR(255)
);

5. Inserting the data

-- Here I'll aggregate multiple authors grouped by book using STRING_AGG() function
-- Change the cases into Proper case using INITCAP() function

INSERT INTO library (
book_name, author, price, year, comment, created_at, updated_at, publisher, transaction_type, read_unread, genre, binding_type, language, rating
)
SELECT
INITCAP(book_name) AS book_name, -- INITCAP changes the text into Proper case
INITCAP(STRING_AGG(author, ', ')) AS author, -- Shows the authors separated by ','
price,
year,
comment,
created_at,
updated_at,
INITCAP(publisher) AS publisher,
INITCAP(transaction_type) AS transaction_type,
INITCAP(read_unread) AS read_unread,
INITCAP(genre) AS genre,
INITCAP(binding_type) AS binding_type,
INITCAP(language) AS language,
INITCAP(rating) AS rating
FROM
public.demo_library
GROUP BY
book_name, price, year, comment, created_at, updated_at, publisher, transaction_type, read_unread, genre, binding_type, language, rating
ORDER BY
created_at,
book_name;
-- Removing the demo_library table since I don't need it anymore
DROP TABLE
demo_library;
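
As a quick sanity check (just an illustrative query), multi-author books should now sit on single rows with the names aggregated:

-- Checking that multi-author books were aggregated onto one row
SELECT book_name, author
FROM library
WHERE author LIKE '%,%'
LIMIT 5;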

6. A few final touches

-- Changed the column header 'read_unread' to read_status
ALTER TABLE library
RENAME COLUMN read_unread TO read_status;

-- Dropped the updated_at column
ALTER TABLE library
DROP COLUMN updated_at;

-- Changed publisher = Iias into IIAS
UPDATE library
SET publisher = 'IIAS'
WHERE publisher = 'Iias';

-- Changed the genre from Non-Fiction to Fiction for these books
UPDATE library
SET genre = 'Fiction'
WHERE book_id IN (137, 143,144);

-- updated the rating for these books;
UPDATE library
SET rating = 'Excellent'
WHERE book_id IN (140, 144);

-- Inserted new entries
INSERT INTO library (
book_name, author, price, year, comment, created_at, publisher, transaction_type, read_status, genre, binding_type, language, rating
)
VALUES
(
'Brief Answers To The Big Questions', 'Stephen Hawking', 317, 2024,'One of my favorite books on Cosmology', NOW(), 'John Murray', 'Online', 'Yes','Non-Fiction','Paperback', 'English', 'Excellent'
),
(
'Predictably Irrational', 'Dan Ariely', 298, 2024,'' ,NOW(),'Harper','Online', 'No','Non-Fiction','Paperback', 'English', 'None'
),
(
'The Hundred-Page Machine Learning Book', 'Andriy Burkov', 449, 2024,'', NOW(),'','Online', 'No','Non-Fiction','Paperback', 'English', 'None'
);

Conclusion:

So, the key takeaways from this rather long article are as follows —

  1. SQL is super fun!
  2. While creating a database, identify which facts need to be stored. Think about what kind of information you want to get out of the database.
  3. When storing data in a database, don’t replicate data; reference it. Ideally, data should be stored in one place (but not always, as you can see in the denormalized case).
  4. When grouping facts into tables, consider tables as ‘nouns’ and columns as ‘adjectives’.
  5. Use integers for primary and foreign keys. They’re faster for searching/updating/indexing/deleting data.
  6. Even though there are 6 major normal forms, achieving third normal form is often enough for a good database design.
  7. Everything depends on how you’ll use the database. Sometimes introducing redundant data is more effective than continuously referencing them through foreign keys.

This is not a perfect database design, but it’s still a little better (and more exciting) than a simple spreadsheet.

Hope this article was helpful in making database normalization a little more comprehensible, and fun too!

THANKS FOR READING!

About the Author:

Arpita Deb

Arpita Deb wears multiple hats: she’s a Data Analyst, an Electrical Engineer, a leisurely Painter and an avid Reader. She loves to experiment with various data analysis tools, document her data analytics experiences, achievements, and even her failures on platforms like Medium, GitHub, and LinkedIn. Follow her on LinkedIn and Medium to explore her exciting data analytics journey!
