THE DESIGN AND ANALYSIS OF COVID-19 DATA USING SQL PART-ONE

10 min readApr 21, 2020

In this phase, We will be writing on the design and creation of the COVID-19 database while the analysis of the COVID-19 data using SQL will be discussed in the next phase.

What is COVID-19?

COVID-19 is also known as coronavirus. Coronaviruses are diseases transmitted between animals and humans. They are caused by a new strain of coronavirus (SARS-CoV-2). It was first reported to the World Health Organisation(WHO) on the 31st of December, 2019 in Wuhan, China. You can visit this link to read more about the disease.

What is database design?

Database design is a group of rules that enable you to decide on a list of correlated data elements to be stored in your database and future implementation plans on this database. This process helps to generate accurate, efficient, fast and cost-effective database systems.

What is SQL?

SQL means Structured Query Language. This is a language used to access and manipulate databases.

In this tutorial, we will be designing a database for the COVID-19 data obtained from the European Centre for Disease Prevention and Control(ECDC). The design will be written in our own language. Also, we will be using SQL to create our database and tables.

To start with, it is often a good practice to give a description report of our data before use. This will give first-time users an insight into our existing data structure without accessing the data.

Data format: Comma-Separated Values(CSV)
Method of Data capture: Download
Data size: 10 columns, 11328 rows
Data types: Integer and Text
Data source: European Centre for Disease Prevention and Control(ECDC)
Data collection date: 17th of April 2020

*You can click the link on the Data Source above to download the data and save the file as download.csv.

Why Information, Not Data

Below is the diagram that represents each column in our existing COVID-19 data.

Fig 1: COVID-19 Data columns

Data are raw facts while information is organized data. Individually, data elements such as dateRep(17/04/2020), country or number of deaths may appear vague. However, when dateRep is correlated to other data elements such as the country and number of deaths then it makes a lot of sense because you can deduce information from it. Information is what end-users especially the non-technical stakeholders would like to see. To illustrate, the Director of Health in a United States-based health organization could request to see the information about the coronavirus death toll in the United States for the 17th of April, 2020. This is an example of the information every stakeholder would like to see which differs from database design for storage purposes. In this article, the collection of this variety of information using SQL is our end goal. However, in order to reach this goal, we must ensure our data is logically stored. At the moment, the existing COVID-19 table is large and it contains some redundant data. Hence the reason for database design.

Database design is a vital concept in the path to successful information retrieval. We will employ two database design techniques;

Entity Relationship Modeling
Normalization

Entity Relationship Modeling

This is a graphical representation of database architecture and its requirements analysis. This representation is called the Entity-Relationship Diagram(ERD) or the Entity-Relationship Model. Entities in ERD are what eventually become tables in a database. You might consider creating your models on ’Draw.io’.

Normalization

This is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. We are going to be looking at the first three forms of normalization because they are popular and of great interest to an average user.

Goals of Database Design

To the minimum, here are goals a Data analyst should achieve at the end of successful design of a database:

A database that permits easy and quick modification without data integrity issues.
A database that permits logical groupings in which each group describes a little part of the whole.
A database organized in such a way that it only allows modification of a unit at one point rather than several modifications.
A database with reduced duplicate data.

An analogy between a Building Architect and a Data Analyst role

Mr. Johnson employed the services of a Building Architect for the design of a five-bedroom duplex apartment. Mr. Johnson gives his rules and requirements such as the number of windows, number of toilets/bathrooms, number of sitting rooms, size of a standard room, positioning of the garage and many more. The architect takes note of all requirements and utilizes them in designing a befitting building plan for his client.

Mrs. Smith employed the services of a Data Analyst for the design of a COVID-19 database. She gives her requirements such as the number of confirmed cases of coronavirus per country, the number of confirmed cases of coronavirus per date, number of deaths per country, number of deaths per date and many more. The analyst takes note of all requirements and utilizes them in building a COVID-19 database design for his client.

The comparison between these two professions shows how related they are when it comes to design.

Corrections to note about the COVID-19 table and column naming

The column geoId can be mistaken for a primary key by some and should be renamed.
Table and column names should be short and descriptive so some of the field names would be changed accordingly.
In addition to the table and column naming styles, some people prefer camelCase or PascalCase to distinguish two-worded fields/tables. Whichever works for you, just ensure you practice a consistent naming convention. I prefer using an underscore in order to avoid case sensitivity issues and also have a common naming standard across all Database Management systems(DBMS).
The change in naming styles would affect columns such as ‘geoId’, ‘dateRep’, ‘countriesAndTerritories’, ‘countryterritoryCode’, and ‘popData2018’.
In the end, the affected columns are better renamed as two_digit_country_code, occurred_at, countries_territories, three_digit_country_code, population_2018 respectively.

Normalizing with ERD

Without any normalization, our data is stored in a single table. This table is large and consists of repeated data. Table covid_19 has ten attributes. It is represented below;

Now let’s apply each normal form rule with ERD to further decompose the table.

First Normal Form (1NF)

Rules:

Each cell needs to be atomic: its data values have no repeating groups.
Entries in each column are of the same type.
Each record needs to be unique in the table.

Methodology:

Identify repeated array of data in any cell. Refactor into more rows/columns as appropriate.
Identify columns with mixed data types. Choose the most appropriate data type to represent the data.
Identify and remove duplicate records.

Result:

Each cell is atomic. There are no repeating groups in data.
The data in each column is of the same type.
No duplicate records in data.
For our goal, It appears to be an overkill having ‘two_digit_country_code’ and ‘three_digit_country_code’ in the table. Let us stick to the former. Since we now have a single country code representation then we could change the ‘two_digit_country_code’ to ‘country_code’.
Also, let’s eliminate attributes: day, month and year. The ‘occurred_at’ attribute stores the summary of the day, month and year. Keeping all the columns is redundant, so we will keep the ‘occurred_at’ attribute and discard the rest.

Second Normal Form (2NF)

Rules:

The table should be in the first normal form.
The table should not have partial dependency; compose a single-column primary key.

Methodology:

Check for partial dependency: a non-key attribute in a table which depends on only a part of the primary key and not on the whole key.
Extract the partially dependent fields into their own tables and link the tables via foreign key references.

Result:

Looking at the content of the covid_19 data, there are two fields(‘occurred_at’ and the ‘country_code’) which are composite keys(whose combination guarantees uniqueness for every record). Each row represents the data for a specific country on a particular date. However, the table defaults 2NF rule. The population_count is only dependent on the ‘countries’ table and not on the ‘cases’ table. So, there is a need to create two entities from the covid_19 table. Let’s start with the ‘countries’ table, and then add the ‘name’ ‘population_count’, and ‘code’ fields.
The ‘country_territory’ is the country's ‘name’. We can assign ‘country_code’ as the primary key of the ‘countries’ table but we might desist from doing that if we consider scalability. For instance, What happens if the International Organisation for Standardization(ISO) changes any country’s code in the future and we already made the ‘country_code’ our primary key? It breaks our Database application, so we had better play smart and safe by generating a numeric primary key(‘id’) for the ‘countries’ table.
Also, we create our second table(‘cases’) with a foreign key attribute ‘country_id’ and the other fields — ‘occurred_at’, ‘confirmed_case’, and ‘death_toll’. The foreign key helps in enforcing referential integrity in the database. To illustrate if a user inputs a country id that does not exist in cases it flags a referential integrity error.

Third Normal Form(3NF)

Rules:

The table should be in the second normal form
There is no dependency among non-key attributes in the form of transitive dependency.

Methodology:

At this point, you might be tempted to split the cases entity more. Considering the fact that each country has death records with respect to date but if we create a separate entity for ‘deaths’ then we are introducing redundancy. The death toll is a result of the coronavirus infection and its a vital attribute people would want to see alongside the number of confirmed cases in order to gain an insight about the threat this disease can pose to human lives. Supposing the world has a death toll as a result of a hybrid of diseases such as diabetes, cancer, hypertension, and coronavirus then, in that case, we can have a separate entity for ‘deaths’ to include attributes such as ‘death_type’ and ‘death_toll’. Since we have ‘countries’ table referenced as foreign keys in the ‘cases’ table we can easily obtain ‘countries' data associated with cases and the death_toll in that manner. So we don’t need to make any change in our entities in this phase.

Result:

Check for transitive dependency: changing a non-key column might cause any of the other non-key columns to change. However, this condition does not exist.

Now we have reached a point that our entities cannot further be decomposed to attain higher forms of normalization. There are higher levels of normalization aside these three but those levels are usually needed in more complex databases.

Next, we define relationships between entities, associate keys(primary and foreign keys) as we already know them, define the attribute’s data type.

Identifying relationships

Countries → Cases; 1 country can record several confirmed coronavirus cases. 1:N
Countries ← Cases; 1 confirmed coronavirus case belongs to 1 country at a time. 1:1

In all, there are two entities, 2 relationships, the cardinality consists of one optional to many optional relationships. Below is the complete ERD;

Next, let us create our covid19 database tables using the data model above. We will be using the MYSQL database management system. Prior to that, you can go through the prerequisites to create your database.

Prerequisites

Download and install MySQL; Windows/macOS
Download and install MYSQL Workbench for database administration and development.
Go to your terminal or command prompt. Create your MYSQL database(covid19), you can follow this link to learn how to create and manage MySQL databases, tables, and users.

Now that we have created our database as represented below;

Figure 5: COVID19 Database in MYSQLWORKBENCH

As you can see that there are no tables in the database at the moment. We will be writing SQL to create all the required tables and values from our single covid19 table obtained from ECDC.

Import the table via Table Data Import Wizard as ‘download’ on MYSQLWorkbench. We need to include an id with auto-incremented values so we can track the total number of items in the table and uniquely identify each record. Using the query below;

ALTER TABLE covid19.download 
      ADD id INT NOT NULL AUTO_INCREMENT 
      PRIMARY KEY FIRST;

Here is the table generated from the query;

Figure 6: Autogenerated Primary Key Downloaded Data

Next, we create our two tables(countries and cases) from the download table using SQL. Let’s start with our independent table which is ‘countries’. You can run this query below on your query panel to create the ‘countries’ table.

USE covid19;
CREATE TABLE countries
       AS(SELECT DISTINCT countriesAndTerritories name, geoId code,   
          popData2018 population_count FROM covid19.download);
ALTER TABLE covid19.countries
       ADD id INT NOT NULL AUTO_INCREMENT 
       PRIMARY KEY FIRST

Next, run the query below to create the ‘cases’ table

USE covid19;
CREATE TABLE cases
      AS(SELECT cases confirmed_case, deaths death_toll, dateRep   
         occurred_at, DENSE_RANK() OVER (ORDER BY    
         countriesAndTerritories)country_id
      FROM covid19.download);
ALTER TABLE covid19.cases
      ADD id INT NOT NULL AUTO_INCREMENT
      PRIMARY KEY FIRST;

Now that we have extracted the two tables from our download table, we may discard the download table. At this point, you should have the two tables in your database as represented below;

Good job on following this article to this point. In the next phase of this article, we will be performing an analysis on the COVID19 database using SQL. Please feel free to drop your feedback and comments.

THE DESIGN AND ANALYSIS OF COVID-19 DATA USING SQL PART-ONE

Written by Akande Imisioluwa