SQL Tutorial

When Spreadsheets Aren’t Good Enough: A Lesson in Relational Databases

Database normalization, relational databases, and why you need them

Stephanie Lo
Towards Data Science
11 min read · May 12, 2023


Photo by Ivan Liu Hu on Unsplash

Have you ever waited minutes for an Excel spreadsheet to fully load and open? Does it continue to lag and freeze multiple times as you edit the spreadsheet and attempt to extract some insights? It’s not uncommon for this to be, or quickly become, the case for any organization that is constantly collecting data online. But one of the ways you can get results faster is by implementing a structured relational database.

So what exactly is a relational database? It’s a collection of tables. Think of each one like an Excel spreadsheet that stores different pieces of information about your business, but instead of these spreadsheets living separately, imagine if you could connect them all together.

Spreadsheets live in silos whereas relational databases have established relationships between tables

Relational databases like MySQL or Oracle are useful because they’re designed to organize a heavy load of data, which can help you manage the data more efficiently. If designed correctly, they can save you hours on tasks and analytics.

So if you’re working in a company and you expect the amount of data to grow — whether this be related to orders, customers, purchasing history or marketing engagement — then this article aims to explain how to start thinking about restructuring your current data practices to convert to a relational database.

What are the advantages of a relational database?

Here are some advantages of implementing a relational database and the challenges a business could overcome with them:

  • Efficient analysis: Relational databases allow information to be retrieved quickly and then analyzed by running queries in SQL (Structured Query Language).
  • Centralized data management: Since relational databases often require a certain type or format of data to be input into each column of a table, it’s less likely that you’ll end up with duplicate or inconsistent data.
  • Scalability: If your business is experiencing high growth, this means that the database will expand, and a relational database can accommodate an increased volume of data.

You still may be wondering whether a relational database would be a suitable course of action for your company. Let’s consider an example to put this into context.

Example: E-commerce retailer experiencing high-growth

Photo by Iga Palacz on Unsplash

You work for a local e-commerce retailer that sells second-hand furniture, and every week you provide your team with a report that summarizes sales data, which helps the buying team know what they should be sourcing more of.

So far, since the inventory has been input into Excel, you carry out most of your analyses in spreadsheets. It’s likely you use pivot tables or VLOOKUPs to calculate week-on-week, month-on-month or even year-on-year percentage changes in sales of certain products or brands. Recently, you’ve been noticing that, as more inventory is sold, it takes a long time not only to open your spreadsheet but also to perform simple analyses in it.

In a situation where a company is in a growth stage and therefore has more records of purchases, customers and inventory, Excel may no longer be efficient enough to carry out your usual tasks. That’s where relational databases can help: they store information at scale and let you extract the pieces of data you need, typically using SQL.

If you relate to these situations, then you might be wondering how you can shift your current data setup to a relational one. A good place to start thinking about structuring your database is to carry out a process called database normalization.

The basics of database normalization

Database normalization is the process of modifying an existing data schema so that it complies with a series of progressive normal forms. More simply put, you are ensuring that data is organized in a structured way. It gives you a set of rules for categorizing your data and forming a layout that works for you.

By establishing structure in a database, you gain two important things: data integrity and scalability. Data integrity ensures that data is entered correctly and accurately, and scalability ensures the data is organized in a way that is more computationally efficient when you start to run SQL queries.

But what do we mean when we say normal forms?

A quick breakdown of the normalization process

Each stage of the normalization process is called a “normal form”.

Edgar F. Codd, who first introduced the concept here, originally described the normalization process, and up to seven normal forms have since been defined. However, to keep this article concise we will explain and walk through only the first three, especially since a database is often considered “normalized” once it meets the third normal form. A more comprehensive overview of the forms with a great tutorial can be found here.

First normal form (1NF)

  • Every value in each column of a table must be reduced to its most simple, or atomic, value. An atomic value is one that cannot be broken down further, meaning a column should not hold sets of values.
  • There are no repeating columns or rows within the database.
  • Each table should have a primary key, which can be defined as a non-null, unique value that identifies each row.

Second normal form (2NF)

  • Conforms to first normal form rules.
  • Adjust columns so that each table only contains data relating to the primary key.
  • Foreign keys are used to establish relationships between tables.

Third normal form (3NF)

  • Conforms to both first and second normal form rules.
  • It is necessary to move or remove columns (attributes) that are transitively dependent, meaning they rely on other columns that aren’t primary or foreign keys.

Now that we’ve gotten to grips with some definitions, let’s apply these normal form rules in a working example to let these concepts sink in.

Tutorial: Segmenting audiences in a Canadian-based retailer

Photo by charlesdeluvio on Unsplash

For this example, I used MySQL to carry out the queries, so each script may need to be adapted if you want to follow along with an alternative database such as PostgreSQL or Oracle. If you are newer to MySQL, I would suggest reading these articles here for installation instructions and here to get started. Just a caveat that the follow-along code assumes some foundational SQL knowledge.

The following data is not based on actual e-commerce/subscription data and should be considered for demonstrative purposes only. Synthetic data has been created in order to explore the concepts discussed in the article (in a real-life enterprise, you would be working with thousands of rows of data and not just twenty). To follow along with the tutorial, please check out my GitHub repository here, which includes a complete transformation of the data, including data creation, data transformation and queries.

Database Normalization in practice

Let’s say you’ve just been hired as an analyst at an e-commerce retailer. Your company is running an important sale and you’ve been assigned to carry out the following task:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

When you access the database, a few immediate things stand out. Most notably, email subscriptions and customer information are housed in separate tables without any established relationship, which poses a challenge in carrying out the assigned task. However, there is a silver lining in that you have relatively complete data, and with a few tweaks relationships can be established between the two to carry out the task. Here’s a breakdown of the current database:

The tables outlined above contain a multitude of variables relevant to the customer and email subscribers, including their geographic location, the emails they have previously received, as well as associated email metrics.

First Normal Form (1NF)

The first thing we can do is take a look at the data and the data types contained in each column after we have uploaded each file into the database. We can do this by performing the following queries:
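A minimal sketch of what these inspection queries could look like in MySQL, assuming the two starting tables are named customer and email_newsletter as described later in the article:

    -- Inspect the columns and data types of each table
    DESCRIBE customer;
    DESCRIBE email_newsletter;

    -- Preview a few rows of the data itself
    SELECT * FROM customer LIMIT 10;
    SELECT * FROM email_newsletter LIMIT 10;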

It looks like we have a long way to go to conform to the first normal form! Recall that the requirements for first normal form concern proper row identification and grouping data correctly. Our current setup violates all three rules of first normal form, with the issues mainly concentrated in the email_newsletter table:

  • Currently email_newsletter doesn’t have a primary key. This forces a database user to search for each subscriber using their email address, which is not considered good practice due to the possibility of users updating their email address and privacy concerns.
  • The data in email_newsletter is not in its most reduced form. Upon further analysis, the contents column can be separated into two columns: subject_line and promo_code.
  • In email_newsletter, the repeating group is the email sends and their content. One user can receive many emails, so the best course is to separate these out into their own table.

To solve our first issue, we can add a unique primary key to email_newsletter as the first column of the table by running the following query, where email_sign_up_id is the column name and the data type is INTEGER:
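A minimal sketch of what this statement could look like in MySQL (the exact definition may differ slightly from the one in the repository):

    -- Add an auto-incrementing primary key as the first column of the table
    ALTER TABLE email_newsletter
        ADD COLUMN email_sign_up_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;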

With this query, we added a unique identifying primary key that automatically increments with each new data record.

For the second revision, we need to split contents out into its respective fields of subject_line and promo_code. As a general pattern, the two values are separated by a comma, which we can use to split them apart:
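One way this could be done in MySQL is sketched below; the sample value in the comment is purely illustrative:

    -- Add the two new columns
    ALTER TABLE email_newsletter
        ADD COLUMN subject_line VARCHAR(255),
        ADD COLUMN promo_code VARCHAR(50);

    -- Split a value like 'Spring Sale, SPRING20' on the comma into the new columns
    UPDATE email_newsletter
    SET subject_line = TRIM(SUBSTRING_INDEX(contents, ',', 1)),
        promo_code   = TRIM(SUBSTRING_INDEX(contents, ',', -1));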

The result is that each column is separated out into its most reduced form, so we are able to see the categories of each column more clearly.

Lastly, we can remove repeating groups by organizing all of the email content-related data into a new table named email_distribution. Don’t forget to add in the data from the original tables! (See GitHub here for the full SQL, including data loading.)
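A hedged sketch of creating and populating the new table; the send_date column is an assumption for illustration:

    -- New table holding one row per distinct email sent
    CREATE TABLE email_distribution (
        email_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        subject_line VARCHAR(255),
        promo_code   VARCHAR(50),
        send_date    DATE
    );

    -- Load the distinct email content from the original table
    INSERT INTO email_distribution (subject_line, promo_code, send_date)
    SELECT DISTINCT subject_line, promo_code, send_date
    FROM email_newsletter;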

With the query above, we have separated the email contents out into their own table, making sure that we don’t lose any information. We now need to connect this to the email_newsletter table using a foreign key, with the following queries:
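A minimal sketch of the foreign key setup, assuming an email_id column is added to email_newsletter and back-filled to match the rows in email_distribution:

    -- Reference each email record from the newsletter table
    ALTER TABLE email_newsletter
        ADD COLUMN email_id INT;

    ALTER TABLE email_newsletter
        ADD CONSTRAINT fk_newsletter_email
        FOREIGN KEY (email_id) REFERENCES email_distribution (email_id);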

This helps us conform to the first normal form. As a final step, we can drop any redundant columns and look at our layout before moving on to the second:
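For example, dropping the columns that now live in email_distribution could look like this sketch (column names as used above):

    -- Remove columns that have been moved to email_distribution
    ALTER TABLE email_newsletter
        DROP COLUMN contents,
        DROP COLUMN subject_line,
        DROP COLUMN promo_code;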

Second Normal Form (2NF)

To adhere to the rules of the second normal form, we need to achieve conformity with the first normal form ✓, establish relationships between tables and ensure non-key columns depend on the primary key. Here’s how we can approach transforming our setup into second normal form:

  • The customer table is currently disconnected from the other two tables, which means customers who have given their email consent in the email_opt_in column are currently not being reached.
  • There are columns in email_newsletter that don’t relate to the primary key: email metrics such as email_open and click_through aren’t really part of a subscriber’s personal information, so we should consider separating these out.

We want to establish a relationship between the customer and email_distribution tables in order to have our two segments of email prospects and current customers. However, if we drew a connection as things stand, it would get messy, since one customer can receive many emails, which would disturb the primary key. Therefore, a logical intermediary step is to create a new table placed between the two, which we will name customer_newsletter_metrics, so that these attributes can live in a table of their own:
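A hedged sketch of such an intermediary table; the metric columns follow the names mentioned above, while the key column names are assumptions:

    -- Junction table holding per-customer, per-email engagement metrics
    CREATE TABLE customer_newsletter_metrics (
        customer_newsletter_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id            INT NOT NULL,
        email_id               INT NOT NULL,
        email_open             BOOLEAN,
        click_through          BOOLEAN,
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id),
        FOREIGN KEY (email_id)    REFERENCES email_distribution (email_id)
    );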

This table ties the rest of the database together, since the foreign keys were already declared when the table was created, and each column in it is relevant to and dependent on the primary key.

The setup we have now is almost complete but a little unbalanced, since we have an established metrics table for customers but not one for email prospects. Applying the same treatment to this group could help mitigate future issues: if a prospect’s email address changes, fewer revisions are required when this information lives in a separate table (again, please follow along with the full code on GitHub here).
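A minimal sketch of the prospect-side changes described next; the email_prospect columns are assumptions for illustration:

    -- Rename the remaining newsletter table to reflect its new role
    RENAME TABLE email_newsletter TO email_newsletter_metrics;

    -- New table holding subscribers who are not (yet) customers
    CREATE TABLE email_prospect (
        email_prospect_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        first_name        VARCHAR(100),
        email             VARCHAR(255) NOT NULL
    );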

After renaming email_newsletter to email_newsletter_metrics and creating our email_prospect table, here is our database after the 2NF transformation:

Third Normal Form (3NF)

The final step of our example is to follow the third normal form rules, which consist of meeting the first and second normal form rules ✓ and, lastly, splitting out non-key columns that are dependent on one another.

Overall, we’ve done a good job so far of mitigating interdependency between non-key columns. However, if we take a closer look at the customer table, we can see that city and postal_code are dependent on one another, since the postal code is tied to where the customer lives and therefore to the city where the customer resides. This may present problems in the long run: if a customer moves, one column may be updated but not the other.

With this being said, we can create a new table named postal_code, which separates postal_code and city out from the customer table, and form a connection between these two tables:
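A hedged sketch of this 3NF split; the exact column definitions are assumptions:

    -- Lookup table mapping each postal code to its city
    CREATE TABLE postal_code (
        postal_code VARCHAR(7) NOT NULL PRIMARY KEY,
        city        VARCHAR(100)
    );

    INSERT INTO postal_code (postal_code, city)
    SELECT DISTINCT postal_code, city
    FROM customer;

    -- The customer table keeps only postal_code and references the new table
    ALTER TABLE customer
        DROP COLUMN city,
        ADD CONSTRAINT fk_customer_postal_code
        FOREIGN KEY (postal_code) REFERENCES postal_code (postal_code);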

The last block of code above completes our normalization process (yay!), and our final transformed database can be visualized below:

Our initial ask revisited: using a relational database

Now that we have our relational database ready, having conformed to the first, second and third normal forms, let’s check out the original request that was given to us:

Separate out our current customers and email subscribers so that we can direct different email offers to each segment in order to drive purchases during our sale period.

Let’s form two simple query statements to grab the information for each segment:
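A minimal sketch of what these two queries could look like; column names such as first_name and email are assumptions based on the tables described above:

    -- Segment 1: current customers who have opted in to emails
    SELECT customer_id, first_name, email
    FROM customer
    WHERE email_opt_in = TRUE;

    -- Segment 2: email prospects who are not yet customers
    SELECT email_prospect_id, first_name, email
    FROM email_prospect;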

The queries will result in two tables, and have an output like the one below:

Now you’re ready to send out your next sale email with confidence that there is no duplication or missed contacts between the tables. What’s more, we’ve structured the database in a way that makes data analysis easier and leaves it ready and equipped for more customers.

Some closing thoughts

If you recognize pain points in your daily work routine, such as a slow or overcomplicated data setup, this article aims to offer a relational database as a solution. By using techniques like the first to third normal forms, you can break your data down into manageable chunks that are easier to work with and less prone to errors.

As a friendly final reminder, for a complete breakdown of the code used, please visit my GitHub repository here. I hope this article has inspired a newfound interest in relational databases ✨. Thanks for reading and following along!

All images unless otherwise noted are by the author.

References

  1. E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, June 1970.
  2. E. Burns, “A Complete Guide to Database Normalization in SQL,” Medium, Feb 2021.
  3. C. Andreou, “An Introduction to SQL,” Medium, Aug 2019.
