Creating a Database: Converting a Spreadsheet to a Relational Database (Part 1)

Yohan Jeong
The Startup
Published in
7 min readAug 28, 2020

Part 1: Creating an Entity Relational Diagram (ERD)

Photo by Caspar Camille Rubin on Unsplash

A Relational Database Management System (RDMS) is a program that allows us to create, update, and manage a relational database. Structured Query Language (SQL) is a programming language used to communicate with data stored in the RDMS. The SQL skill for using a RDMA is required for many data-related positions these days. In the social media forums like Quora or Reddit, there are many people who search for a public database for practicing their SQL querying skills. However, although there are many public data sets in a single spreadsheet, there are not many public databases online. Even if you found a data set for the topic you have interest in, the format of the data is usually just one spreadsheet, not a database for most cases. Therefore, it will be very useful to know how to convert a data set in one spreadsheet to a database with multiple tables fitting a relational database format. Knowing the process of this conversion can give us many chances to practice SQL querying skills with a variety of databases.

This is the first article in a two part series. The goal of this series is to show how to create a relational database for SQL. The whole process is to convert a data in one spreadsheet to a relational database for SQL. In this first article I create an Entity Relational Diagram (ERD) which is a graphical representation showing the relationships between entities.

The Data Set

Let’s find a data set for this practice. There are many public data on the Kaggle dataset. Among a bunch of data sets, I selected a data set named Sample Sales Data. The following pictures show the original format of the data which is contained in one spreadsheet.

Sample Sales Data from the Kaggle Dataset

The data has 25 columns and 2824 rows including headers. The list of the headers is as follows:

  • ORDERNUMBER: the identification number for each order
  • QUANTITYORDERED: the quantity ordered
  • PRICEEACH: the actual price paid for the transaction in terms of percentage of the MSRP (variable across transactions)
  • ORDERLINENUMBER: the number of the order line
  • SALES: the amount of sales
  • ORDERDATE: the order date
  • STATUS: the shipping status (Shipped, Resolved, Cancelled, On Hold, Disputed, and In Progress)
  • QTR_ID: the quarter of the order date
  • MONTH_ID: the month of the order date
  • YEAR_ID: the year of the order date
  • PRODUCTLINE: the category of products
  • MSRP: the manufacture’s suggested retail price (constant across transactions)
  • PRODUCTCODE: the identification code for each product
  • CUSTOMERNAME: the names of customers
  • PHONE: the phone numbers of customers
  • ADDRESSLINE1: addressline 1 for customers
  • ADDRESSLINE2: address line 2 for customers
  • CITY: city names for customers
  • STATE: state names for customers (only for customers located in the US)
  • POSTALCODE: postal codes for customers
  • COUNTRY: countries for customers
  • TERRITORY: the regional names of each country (NA, EMEA, Japan, and APAC)
  • CONTACTLASTNAME and CONTACTFIRSTNAME: the last and first names of customers
  • DEALSIZE: the deal sizes of orders

Creating an Entity Relationship Diagram (ERD)

The information contained in the data can be basically divided into three entities: Customers, Products, and Orders.

Customers

Based on the columns in the data, the attributes to be included in the entity for Customers are CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME, PHONE, ADDRESSLINE1, ADDRESSLINE2, POSTALCODE, CITY, STATE, COUNTRY, and TERRITORY. CustomerID is added into the entity as the primary key. And for convenience sake, the cases of the attributes are changed as follows:

The problem of keeping City, State, Country, and Territory in the entity for Customers is that there will be many duplicate rows in the table for Customers. Therefore, entities for City, Country, and Territory are separately created, and then the links between them and Customers are made using foreign keys. The reason why the entity for State is not created is that there are many null values in State so that states do not determine countries and territories in this data.

In the ERD, there are three types of relationships between entities: One-to-One, One-to-Many(or Many-to-One), and Many-to-Many. The relationships between Territories and Countries, Countries and Cities, and Cities and Customers are One-to-Many relationships. For example, each customer lives in one city in this data while one city has many customers. Therefore, using the Crow’s Foot Notation, the ERD between Customers, Cities, Countries, and Territories is as follows:

The Crow’s Foot Notations for Customers, Cities, Countries, and Territories

Products

The attributes to be included in the entity for Products are PRODUCTCODE, PRODUCTLINE, MSRP, and ORDERNUMBER. Among these attributes, PRODUCTCODE can be a candidate for the primary key in the entity since it is unique and does not have null values. However, in order to keep the format of the primary keys consistent across entities, ProductID is added into the entity and is set to be the primary key. The letter cases are changed as follows:

As we saw in the case for the entity for Customers, including Productline in this entity generates duplicate rows as well. Therefore, the entity for ProductLines is created separately, and is linked to the entity for Products using a foreign key. Since each product is classified into one product line while each product line has many products, the relationship between ProductLines and Products is One-to-Many.

The Crows’ Notation for ProductLines and Products

This is not the final version for Products since OrderNumber will be included in the entity for Orders as well. Including OrderNumber in the entity for Orders will make some changes in the entity for Products. This will be covered after the entity for Orders is created.

Orders

The entity for Orders should include ORDERNUMBER, ORDERDATE, PRODUCTCODE, STATUS, QUARTER_ID, MONTH_ID, YEAR_ID,and CUSTOMERNAME as its attributes. OrderID is added into this entity as the primary key. The names and cases of the attributes are changed as follows for convenience sake:

In order to remove duplicate rows generated by including Status and CustomerName in this entity, the entities for Status and CustomerNames are created separately and are linked to the entity for Orders using foreign keys. The entity for CustomerNames (which is equivalent to the Customers) was already created previously. The relationships of Orders with Status and Customers are Many-to-One, respectively.

The Crow’s Foot Notations for Orders, Customers, and Status

Orders and Products

Each order contains a list of products in it. This means that Orders and Products are related to each other. What is the relationship between these two entries? Each product can be ordered by multiple orders and each order can contain multiple products. Therefore, the relationship is many-to-many.

The Crow’s Foot Notation for Orders and Products 1

The way to create tables which are in the many-to-many relationship is to create a junction table between the two original tables and then to make one-to-many relationships between the original tables and the junction table. The junction table is the table for the combination of Orders and Products, and then one-to-many relationships are made between Orders and the junction, and between Products and the junction. The primary key for this junction entity should be the composite of OrderID and ProductID.

The remaining columns which are not included in any entity so far are QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, and DEALSIZE. These columns cannot be determined by Orders or Products by themselves. For example, the quantity of a product in an order cannot be identified by an order or a product alone. Only the combination of an order and a product can identify it. Therefore, those columns are included in the junction entity as its attributes. The entities and their relationships for Orders and Products are as follows:

The Crow’s Foot Notations for Orders and Products 2

Due to duplicate rows, the entity for DealSizes is separately created and linked to the entity for Orders_Products using foreign key for SizeID.

The Entire ERD

Now, we completed creating all the entities for this data and making relationships between them. The entire ERD is as follows:

The Final Version of the ERD for Sample Sales Data

The usefulness of the ERD is that it makes us to easily understand the relationships between the columns (or the entities) in the data set and to see the entire structure of the data set at a glance.

What’s Next?

This article covered how to create an Entity Relationship Diagram using a data set. The data for this practice is Sample Sales Data obtained from the Kaggle dataset. It is briefly covered how to divide the original data set into entities and make relationships between the entities.

In the next article, the practice of converting a data in a single spread sheet to a relational database by using MySQL will be covered.

--

--

Yohan Jeong
The Startup

Business Analyst at Samsung Electronics America. PhD in Economics from University of California, Davis.