COVID-19 Global Health Data Exploration

A closer look at the global infection rate.

Brittany Phelps
brittany-phelps-analytics
5 min readNov 19, 2023

--

Photo by Martin Sanchez on Unsplash

Quick Links:

Dataset:

COVID Deaths

COVID Vaccinations

SQL:

Queries

Tableau Queries

Visualization:

Tableau Dashboard

Tools Used: Excel, SQL, Tableau

About this Project

In this short project I am exploring international and domestic public health data related to the COVID-19 pandemic between January 2020 and October 2021. In particular I am looking at the infection rates and death counts of the population by both country and continent.

My goal with this dataset is to showcase data organization, data cleaning, and data manipulation using Excel and SQL.

I will be organizing the data for visualizations in Tableau.

Let’s begin!

Checking Data + Initial Setup

*All SQL queries can be found on my GitHub, here*

I opened both .xlsx files in Excel:

  • After importing, I filtered everything for nulls, mispelled wording, and inappropriate spaces.
  • Removed unnecessary columns with any data I won’t be analyzing, using SELECT, CTRL+ALT+RIGHT CLICK, DELETE.
  • Cut the “population” column and inserted the cut cells closer to the “date” column at the beginning of the dataset for organizational purposes. I would like to sort the data by location, date, and compare against population. So this is to keep the data organized simpler when having to join the two tables later on.

Data Cleaning

These are VERY large datasets, so SQL it is!

*Note: After running the query to find the largest amount of deaths, I noticed a few errors in the resulting table. “World” was set as a location when querying for country data — as well as North America, Africa, South America. I added “where country is not null” to all prior queries as well as any future ones.*

I imported both datasets in SQL Management Studio:

  • Taking a look at the entire table and grouping by location and date
  • Selecting which data I’ll be using for queries:

Data Analysis

Now that all of the data is cleaned and sorted by location and date, I start my analysis:

1. Total Cases vs. Total Deaths (By Country)

I added “like ‘%states%’ to this query out of pure curiosity since I’m based in the United States. It grouped the data by country which helped in this case, so I’ll negate that part of the query later on when exploring the global statistics.

2. Total Cases vs. Population (By Country)

Left: Percentage of the population that contracted covid Right: Highest number of infections in each country
For whatever reason my original query wasn’t giving me the correct results for the highest number of deaths per country. After looking at the the data, “total_deaths” was a nvarchar(255) datatype. I converted it to an integer using the cast function in order to calculate the max number of deaths.

3. By Continents Instead of Countries

I am going to run the exact same 4 queries as above with the following changes:

  • SELECT location → SELECT continent
  • GROUP BY location → GROUP BY continent
  • negate “where location like ‘%states%’ since I am now exploring data based on continents instead of countries.

4. Total Cases vs. Total Deaths (Global)

Negated “group by date” so I could look at total amount of cases globally.

Data Merging

Time to join both tables on continent, location, date, population, and new vaccinations.

1. Total Population vs. Amount of Vaccinations

  • dea = covid death table
  • vac = covid vaccinations table
I ordered by location AND date instead of just location because partitioning by location only resulted in the total sum of vaccinations by location instead of a continuing count to find the max amount of people vaccinated.

I can’t calculate the percentage of vaccinations because the vaccination count (RollingPeopleVaccinated) column is newly created and can’t be used in that same query. To resolve this, I’ll make both a temp table and a common table expression(CTE) to see which works best in this case.

*I’m making both types of tables for my own learning purposes*

Left: CTE Right: TEMP TABLE

I ran the query before negating “where dea.continent is not null” which isn’t really giving me the organization I’m going for. When I run it again after removing that portion, it gives me an error because that table already exists. Nothing a little DROP TABLE if exists function can’t fix!

2. Creating a View

Creating a view to store the percentage of vaccinations data for visualizations I’ll be making in Tableau.

Queries for Tableau

My Tableau queries can be found here.

I’m creating 4 tables for data visualization in Tableau.

1. Total Cases vs. Total Deaths (Globally)

2. Total Cases vs. Total Deaths (by Country)

Removing “World,” “European Union, and ”International” from the query since I didn’t include them in previous queries.

3. Highest Infection Rates (by Country)

4. Highest Infection Rates per Country (by Date)

Data Cleaning, Again

The tables had a LOT of nulls which means I’ll be using Excel again to clean the data in the Tableau queries before importing them. All 4 tables were saved separately in order for them to all be able to be tied together be a common relationship in the visualization program.

  • changed all NULLS to a numeric datatype by making them “0” with the Find and Replace feature.
  • the date was showing as “00:00.0” which is the incorrect format. I changed it to the shortdate date format.

Visualizations

I am posting a screenshot of my COVID-19 Public Health Dashboard below, but it can also be found here on Tableau Public.

You can view my Tableau Portfolio on Tableau Public as well.

Key Takeaways

  • Europe had the highest continental death count overall, with North America following in second place.
  • Even though Europe had a larger amount of deaths, the US maintained a higher average rate of infection.
  • Africa and Oceania had the lowest death counts.
  • China’s and India’s rate of infection increased slightly between March 2020 and October 2020, but maintained a continuous lower rate of infection for the rest of the year. This could be due to the introduction of the vaccine.
  • The US’ and Europe’s infection rate rapidly and continuously increased for the rest of 2021, even after the vaccine was distrubited.

--

--