Exploratory Data Analysis on COVID-19

M. Husni Nur Fadillah
5 min read · Jan 22, 2022
Photo by Lukas Blazek on Unsplash

Hi everyone, I hope you are not bored of reading the articles I published previously 😊. This time I will again discuss data science, more precisely one of the steps that every data scientist or data analyst must go through: Exploratory Data Analysis (EDA). I’ll explain it with a case study, COVID-19 in Indonesia, using real data I got from Kaggle.

Introduction

In this article I’ll explain how to do exploratory data analysis on COVID-19 case data, more precisely on the cases recorded in every province of Indonesia.

Before practice, I’ll explain a little about EDA itself so that you know what we are practicing in this article.

EDA is an essential step in any research analysis. The primary aim of exploratory analysis is to examine the data for distribution, outliers and anomalies in order to direct specific testing of your hypothesis. It also provides tools for hypothesis generation by visualizing and understanding the data, usually through graphical representation¹.

There are many EDA techniques, and which one to use depends on the problem at hand. For example, if we have continuous univariate data, we can use visualization techniques such as line plots and histograms. If the data is categorical, we can use descriptive statistics, and so on.
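As a rough illustration of matching the technique to the variable type, the sketch below summarizes a categorical column with a frequency table and a continuous column with binned counts (a text-only histogram). The DataFrame is a made-up toy, not the Kaggle data, and the choice of three bins is arbitrary.

```python
import pandas as pd

# Toy data for illustration only; these values are invented, not taken from the Kaggle file.
df = pd.DataFrame({
    "island": ["Jawa", "Jawa", "Sumatera", "Kalimantan", "Sulawesi", "Papua"],
    "population": [10_000_000, 45_000_000, 8_000_000, 4_000_000, 6_000_000, 3_000_000],
})

# Categorical variable: a frequency table is a simple descriptive statistic.
island_counts = df["island"].value_counts()
print(island_counts)

# Continuous variable: cut the range into equal-width bins and count rows per bin,
# which is the same information a histogram draws as bars.
population_bins = pd.cut(df["population"], bins=3)
bin_counts = population_bins.value_counts(sort=False)
print(bin_counts)
```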

Of course, this article will not implement all of these EDA techniques. Maybe I’ll explain the rest another time.

Practice

Alright, I think that’s enough theory. It’s time to practice.

The code is available in my GitHub repository. The data used here is province-level COVID-19 data for Indonesia, which you can get at this link. The data contains the variables province_name, island, iso_code, capital_city, population, population per km, confirmed, deceased, released, longitude, and latitude.

the first five rows of data
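A minimal sketch of this step with pandas is shown below. It builds a small toy DataFrame in place of the Kaggle file, so every value in it is invented for illustration; only the column names follow the variables listed above. With the real file you would load it with pd.read_csv and the path to your downloaded CSV instead.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values here are made up.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B", "Province C",
                      "Province D", "Province E", "Province F"],
    "island": ["Jawa", "Jawa", "Sumatera", "Kalimantan", "Sulawesi", "Papua"],
    "population": [10_000_000, 45_000_000, 8_000_000, 4_000_000, 6_000_000, 3_000_000],
    "confirmed": [600, 100, 40, 25, 60, 10],
    "deceased": [50, 14, 3, 1, 5, 0],
    "released": [30, 5, 2, 3, 4, 1],
})

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```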

First, we need to know some basic information about the data we have: which variables it contains, what data type each variable has, and whether there are any missing values. This information is important before conducting EDA so that we do not misread the results of the EDA techniques we apply.

data information
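One way to get this information with pandas is sketched below. The toy DataFrame and its single missing value are invented so that the check has something to find; they do not reflect the real dataset.

```python
import pandas as pd

# Toy data for illustration only; the missing value in "released" is deliberate.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B", "Province C"],
    "confirmed": [600, 100, 40],
    "released": [30, None, 2],
})

# info() prints column names, non-null counts and data types.
df.info()

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```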

The object data type can be interpreted as a string. There are several cases that we can explore in this data, such as patients who have been confirmed to have COVID-19 (confirmed), patients who have recovered (released) and patients who have died (deceased). It is also important to get descriptive statistics such as the mean, median and quantiles for every numerical variable that we have.

descriptive statistics
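A table like this can be produced with the pandas describe method, as in the sketch below. The numbers are toy values made up for illustration, so the resulting statistics are not those of the real data.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
df = pd.DataFrame({
    "population": [10_000_000, 45_000_000, 8_000_000, 4_000_000, 6_000_000, 3_000_000],
    "confirmed": [600, 100, 40, 25, 60, 10],
    "deceased": [50, 14, 3, 1, 5, 0],
})

# describe() reports count, mean, std, min, quartiles and max for numeric columns.
summary = df.describe()
print(summary)

# The "50%" row is the median.
median_confirmed = summary.loc["50%", "confirmed"]
```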

Next, I will check whether there are outliers in the confirmed variable. To do that, I use a box plot.
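A box plot conventionally draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies that same rule numerically, assuming the default 1.5 multiplier; the data is a toy example with invented values, so it only shows the mechanics.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B", "Province C",
                      "Province D", "Province E", "Province F"],
    "confirmed": [600, 100, 40, 25, 60, 10],
})

# Quartiles and interquartile range of the variable.
q1 = df["confirmed"].quantile(0.25)
q3 = df["confirmed"].quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; values outside them are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = df[(df["confirmed"] < lower_fence) | (df["confirmed"] > upper_fence)]
print(outliers)

# With matplotlib installed, df.boxplot(column="confirmed") draws the plot itself.
```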

The box plot shows one outlier in the confirmed variable: the province of Jakarta, which has the highest number of people confirmed to have COVID-19, 598.

This is much more than other provinces. This is quite reasonable because Jakarta is the capital city of Indonesia, where many tourists enter the country. This could be investigated further, but let’s explore something else.

One of the goals of EDA is to check the hypotheses we have. One method that can be used for this is the Spearman rank correlation, which measures the strength and direction of the monotonic relationship between two variables.

spearman correlation
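A correlation matrix of this kind can be computed with pandas, as sketched below. The data is a toy example with invented values, so the coefficients it prints are not the ones reported in this article. The second computation shows what the method does under the hood: Spearman correlation is the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
df = pd.DataFrame({
    "population": [10_000_000, 45_000_000, 8_000_000, 4_000_000, 6_000_000, 3_000_000],
    "confirmed": [600, 100, 40, 25, 60, 10],
    "deceased": [50, 14, 3, 1, 5, 0],
})

# Spearman correlation between every pair of numeric columns.
spearman = df.corr(method="spearman")
print(spearman.round(2))

# Equivalent computation: rank each column, then take the Pearson correlation.
via_ranks = df.rank().corr(method="pearson")
```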

The correlation between the confirmed variable and population is 0.62, which is high compared with the other variables. The direction of the relationship is positive, which means that the larger the population of a province, the greater its number of confirmed COVID-19 cases tends to be.

Next, I’ll look at the handling of COVID-19 in DKI Jakarta by visualizing the proportion of confirmed, released and deceased cases, and then compare it with Jawa Barat (West Java), the province with the second-highest number of confirmed COVID-19 cases.

Percentage of each case in DKI Jakarta
Percentage of each case in Jawa Barat
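The percentages behind charts like these can be computed as in the sketch below. It assumes each slice is a count divided by the sum of the three counts for that province, which is how a pie chart normalizes its values. The two provinces and their counts are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
df = pd.DataFrame({
    "province_name": ["Province A", "Province B"],
    "confirmed": [600, 100],
    "released": [30, 5],
    "deceased": [50, 14],
})

# One row per province, one column per case type.
counts = df.set_index("province_name")[["confirmed", "released", "deceased"]]

# Divide each row by its own total to get the share of each case type in percent.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(2))
```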

In Jakarta, 8.53% of the people affected by COVID-19 died and 5.18% recovered. In Jawa Barat, 14.3% died and 5.1% recovered. From the visualizations above, we can form the hypothesis that DKI Jakarta is handling COVID-19 cases better than Jawa Barat. This hypothesis can be tested with several EDA methods, but I did not run the test in this article.

Conclusions

From the EDA techniques we applied, we got three insights. First, Jakarta is the province with the most COVID-19 cases, 598, far ahead of Jawa Barat, the province with the second-most cases, with a difference of 500 cases. Second, there is a positive correlation (0.62) between a province’s population and its number of confirmed cases. Third, regarding the handling of COVID-19 cases in DKI Jakarta and Jawa Barat, we can form the hypothesis that DKI Jakarta is handling COVID-19 better than Jawa Barat, which can be tested later.

References

[1] Yu, Chong Ho. 2010. “Exploratory data analysis in the context of data mining and resampling”. International Journal of Psychological Research. 3(1), 9–22.
