Drawing Insights from a Covid-19 Data Set
A step-by-step approach on how Exploratory Data Analysis was used to clean, visualize and draw insights from a Covid-19 data set.
The Covid-19 pandemic has been a global threat since its outbreak in Wuhan, China, in 2019. Surveys and interviews have been carried out to collect data from cities in different countries and continents in order to draw insights.
To a data scientist, or anyone enthusiastic about data science, data are characteristics or information, either quantitative (numerical) or qualitative (categorical), collected from persons, organisations and so on in order to draw insights, improve sales, predict future occurrences and more.
Before drawing insights from a data set, always remember that there is no one-size-fits-all approach. Data exploration, which includes sorting (rearranging data in ascending or descending order), data filtration (creating a subset of the available data), data processing (aggregation and statistical operations), and data cleaning and preparation, takes up to 75% of the total time for the project.
The steps I took to analyse the data set are as follows:
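To make the three exploration operations above concrete, here is a minimal pandas sketch. The table is a tiny made-up example (the country names and numbers are placeholders, not values from the data set), and the variable names are my own.

```python
import pandas as pd

# Made-up placeholder values, only to illustrate the operations.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "Continents": ["Africa", "Europe", "Africa", "Asia"],
    "Total_Cases": [300, 1200, 50, 700],
})

# Sorting: rearrange rows in descending order of Total_Cases.
sorted_df = df.sort_values("Total_Cases", ascending=False)

# Filtration: keep only the subset of rows that satisfies a condition.
african_df = df[df["Continents"] == "Africa"]

# Processing: aggregate Total_Cases per continent.
cases_by_continent = df.groupby("Continents")["Total_Cases"].sum()

print(sorted_df)
print(african_df)
print(cases_by_continent)
```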
- Definitions
- Data Cleaning
- Visualization
- Drawing Insights
Introduction
Definitions of Columns:
The data set was obtained from She Code Africa and contains Covid-19 cases by country. It includes the following columns:
- Country: the country the record belongs to.
- Total_Cases: the total number of cases in a particular country.
- New_Cases: the recent cases as of the time of data collation.
- Total_Deaths: the number of deaths so far.
- New_Deaths: the recent deaths recorded.
- Total_Recovered: those who have recovered from the disease.
- Active_Cases: cases that are currently active.
- Serious/Critical: cases that are serious or critical.
- Tot_Cases: total cases.
- Deaths: number of deaths.
- Total_Tests: the total number of tests conducted.
- Tests: tests taken.
- Continents: the continents of the countries.
I started by importing modules and reading the file.
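The original code for this step is not reproduced here, so the following is only a sketch of what importing and reading the file might look like with pandas. To keep it runnable on its own, it reads a small in-memory CSV with made-up values; the file name in the comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for the real file: a tiny CSV with made-up values.
# With the actual data set you would pass its path instead, for example
# pd.read_csv("covid19.csv")  (hypothetical file name).
sample_csv = io.StringIO(
    "Country,Total_Cases,Total_Deaths,Continents\n"
    "Country A,1000,50,Africa\n"
    "Country B,2500,120,Europe\n"
    "Country C,400,8,Asia\n"
)

data = pd.read_csv(sample_csv)
print(data.columns.tolist())
```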
I viewed the head and tail of the data set to get a better understanding.
The describe() method provides a statistical summary of all the quantitative variables. I also checked the shape and information of the data. All of this helped me understand what I would be working with.
Next, I checked for null values. I found that some values were missing, so I summed the missing values in each column.
To handle white spaces, I replaced them with NaN and saved the result in a new variable called new_data.
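A sketch of these inspection steps, assuming the data is held in a pandas DataFrame called data (my name for it). The values are made up for illustration.

```python
import pandas as pd

# Made-up placeholder values standing in for the real data set.
data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "Total_Cases": [1000, 2500, 400, 90],
    "Total_Deaths": [50, 120, 8, 1],
})

print(data.head(2))   # first rows
print(data.tail(2))   # last rows

# Statistical summary of the numeric columns only.
summary = data.describe()
print(summary)

print(data.shape)     # (number of rows, number of columns)
data.info()           # column names, non-null counts and dtypes
```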
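One common way to count missing values per column in pandas is isnull() followed by sum(). This is a sketch on made-up values, not necessarily the exact code I ran.

```python
import numpy as np
import pandas as pd

# Made-up placeholder values with a few gaps.
data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "New_Cases": [10.0, np.nan, 5.0],
    "New_Deaths": [np.nan, np.nan, 1.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```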
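A sketch of the white space replacement, assuming the blanks are cells that are empty or contain only spaces. The regular expression is my assumption about how the blanks look; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up placeholder values: some cells are blank or contain only spaces.
data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "New_Cases": ["10", " ", "5"],
    "New_Deaths": ["", "2", "   "],
})

# Replace cells that are empty or whitespace-only with NaN.
# The original frame is left untouched; the result goes into new_data.
new_data = data.replace(r"^\s*$", np.nan, regex=True)
print(new_data.isnull().sum())
```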
Data Cleaning
Data cleaning is simply finding and removing inaccurate records from a table or database. One of the most common operations when cleaning data or doing exploratory data analysis is fixing column or row names and removing rows or columns.
A big challenge in data cleaning is the identification and treatment of outliers, which are observations that are different from other data points.
I like to check the information of the data as many times as needed. This also helps to confirm the data type of each column.
Yes! The null values have been replaced with zeros.
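Replacing nulls with zeros can be done with fillna(0). The sketch below assumes the cleaned frame is called new_data and uses made-up values; calling info() afterwards is how I would confirm that no nulls remain.

```python
import numpy as np
import pandas as pd

# Made-up placeholder values with missing entries.
new_data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "New_Cases": [10.0, np.nan, 5.0],
    "New_Deaths": [np.nan, 2.0, np.nan],
})

# Replace every missing value with zero.
new_data = new_data.fillna(0)

# Check the non-null counts and dtypes again.
new_data.info()
```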
Data Visualization
It is easier for the human brain to process information in the form of a picture than in numbers. Imagine presenting a summary to individuals who do not have the expertise to process numbers quickly. Difficult, right? Every data scientist should be able to present their data in the form of graphs or plots such as bar graphs, histograms and scatter plots.
Here I converted the data to float for visualization and used bar plots to represent it.
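A sketch of the float conversion, assuming the count columns were read in as text and the nulls were already replaced with zeros. The column list and values are illustrative only.

```python
import pandas as pd

# Made-up placeholder values: counts stored as text, zeros left by the cleaning step.
new_data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C"],
    "Total_Cases": ["1000", "2500", 0],
    "Total_Tests": ["20000", 0, "3500"],
})

# Convert only the count columns to float so they can be plotted.
numeric_columns = ["Total_Cases", "Total_Tests"]
new_data[numeric_columns] = new_data[numeric_columns].astype(float)
print(new_data.dtypes)
```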
Bar plots were drawn for: Total Cases, Total Deaths, Total Recovered, Active Cases, Serious/Critical Cases, Deaths Recorded and Total Tests Taken.
Drawing Insights
Here I used bar plots to show the countries with the highest number of cases, and those that have conducted the most tests and also recorded a high number of cases.
From the plot, the United States has the highest number of recorded cases.
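A bar plot of the top countries can be sketched as follows. This version uses matplotlib directly, which may differ from the plotting library behind the original figures, and the values are made up, so the ranking it prints says nothing about the real data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up placeholder values, not figures from the data set.
new_data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "Total_Cases": [1000.0, 2500.0, 400.0, 90.0, 1800.0],
})

# Keep the three countries with the highest Total_Cases.
top_cases = new_data.nlargest(3, "Total_Cases")

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(top_cases["Country"], top_cases["Total_Cases"])
ax.set_xlabel("Country")
ax.set_ylabel("Total_Cases")
ax.set_title("Countries with the highest Total_Cases")
fig.tight_layout()
```

In a notebook, plt.show() displays the figure; fig.savefig with a file name of your choice writes it to disk. Swapping "Total_Cases" for another column gives the other plots.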
Recovered Cases
The United States also had the highest number of recovered cases.
Active Cases
The highest number of active cases is also in the United States.
Serious/Critical Cases
The United States had the highest number of Serious or Critical Cases.
Deaths
San Marino recorded the most deaths, followed by Belgium, with the Netherlands recording the fewest.
Summary of our observations
I observed that, as of the time this data was collected, Africa had the highest number of Tests, Critical Cases, Active Cases, New Deaths and New Cases, followed by Europe, Asia and South America, with Australia having the fewest.
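A continent-level comparison like this can be produced by grouping on the Continents column and summing. The sketch below uses made-up values and only two of the count columns, so its output does not reproduce the observation above.

```python
import pandas as pd

# Made-up placeholder values, not figures from the data set.
new_data = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "Continents": ["Africa", "Africa", "Europe", "Europe", "Asia"],
    "Total_Tests": [100.0, 300.0, 250.0, 50.0, 120.0],
    "Active_Cases": [10.0, 5.0, 30.0, 2.0, 8.0],
})

# Sum the count columns per continent and rank by Total_Tests.
by_continent = (
    new_data.groupby("Continents")[["Total_Tests", "Active_Cases"]]
    .sum()
    .sort_values("Total_Tests", ascending=False)
)
print(by_continent)
```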
Here is the link to the source code and data set.
Some tools for visualization in data science, aside from Python tools, include:
- Tableau: Also called the master of visualization. It is fast and easy to use because no prior knowledge of programming is required.
- Power BI: Developed by Microsoft, it serves business intelligence and analytics needs.
- ChartBlocks: An online tool that also does not require any coding. It prepares visualizations from live feeds, spreadsheets, databases and so on.
- Google Charts: The charts it creates are interactive, and some can be zoomed to inspect the data. You should try using it.
- QlikView: It pulls in data with the help of associative dashboards, keeps the data in the server's RAM for users, and has features that speed up development.
Thank you for reading!