Using K-Means Clustering on COVID Data

John Lehner
May 5, 2023

Introduction

In 2020, COVID-19 took the world by storm. It spread, and it spread fast. Everyone was affected. Due in part to this rapid spread, there was little initial knowledge of which policy measures would deter its further spread. “Does masking work?”, “How long until a vaccine?”, and “Who is at risk?” were a few of the questions being asked. But which countries’ policies and public opinion actually produced the best or worst results?

Data Collection

To answer this question, I found a dataset here that holds a range of COVID statistics for each country worldwide. I downloaded the dataset as a CSV file and read it into a data frame in Python using the Pandas library. The initial data frame contained a lot of information that was extraneous to what I wanted to analyze, so I cleaned it down to three features per country: the number of deaths, active cases, and new cases. The CSV was 14.4 KB, and the file in which I performed the data manipulation and analysis came to 38.4 KB at the end.
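A minimal sketch of this loading and cleaning step might look like the following. The column names, the placeholder rows, and the choice to fill missing counts with zero are all assumptions for illustration; the real dataset's headers may differ, and an in-memory string stands in for the downloaded CSV so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV. Column names and values are made up
# for illustration and do not come from the real dataset.
raw_csv = io.StringIO(
    "Country,Total Cases,Total Deaths,Active Cases,New Cases,Population\n"
    "Country A,1000,50,200,10,1000000\n"
    "Country B,5000,,900,40,2000000\n"
    "Country C,300,5,20,,500000\n"
)

df = pd.read_csv(raw_csv)

# Keep only the country name and the three features used for clustering
# (assumed column names).
features = ["Total Deaths", "Active Cases", "New Cases"]
clean = df[["Country"] + features].copy()

# Treat missing counts as zero. This is an assumption; dropping those
# rows would be another reasonable choice.
clean[features] = clean[features].fillna(0)

print(clean)
```

With a real file, `pd.read_csv` would simply take the path to the CSV instead of the in-memory buffer.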

Analysis

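The first thing I needed to do was determine which value of K to use when clustering. To do this, I used the elbow method, which plots the within-cluster sum of squares for a range of K values and looks for the point where adding more clusters stops helping much.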

We can see from this graph that the line starts to level out significantly around K = 4 or 5. I chose 4, as I felt the resulting clusters I found later were meaningful enough.
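The elbow search described above can be sketched roughly as follows. The data here is synthetic (four well-separated blobs with three features) and only stands in for the cleaned COVID features; the range of K values, `n_init`, and `random_state` are my own assumptions, not settings taken from the original notebook.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix (deaths, active cases, new cases).
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=center, scale=1.0, size=(30, 3)) for center in (0, 10, 20, 30)]
)

k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    inertias.append(model.inertia_)

# Plot inertia against K; the "elbow" is where the curve starts to flatten.
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.savefig("elbow.png")
```

Picking the elbow is a judgment call, which is why two neighboring values of K can both look reasonable.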

Next, I imported KMeans from sklearn.cluster to fit the model and create the clusters. The result was four distinct clusters.
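As a rough sketch of this step, fitting the model and attaching a cluster label to each row could look like the code below. The feature names and the synthetic data are assumptions for illustration, as are the `n_init` and `random_state` settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the cleaned data frame (assumed column names).
rng = np.random.default_rng(0)
features = ["Total Deaths", "Active Cases", "New Cases"]
data = pd.DataFrame(
    np.vstack(
        [rng.normal(loc=center, scale=1.0, size=(30, 3)) for center in (0, 10, 20, 30)]
    ),
    columns=features,
)

# Fit K-means with K = 4 and store each row's cluster label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(data[features])

# How many rows ended up in each cluster.
print(data["cluster"].value_counts().sort_index())
```

Note that the cluster numbers themselves are arbitrary labels; only the grouping carries meaning.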

Now all that was needed was to print a sample of each cluster so I could try to gain some insight into their respective significance.
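One way to print a small sample from each cluster is to group by the label and take the first few rows of each group. The data frame below is a hypothetical, hand-made example with placeholder countries and values, not the real results.

```python
import pandas as pd

# Hypothetical clustered data; names, counts, and labels are placeholders.
clustered = pd.DataFrame(
    {
        "Country": [f"Country {c}" for c in "ABCDEFGH"],
        "Total Deaths": [5, 8, 3, 120, 150, 900, 40, 55],
        "cluster": [0, 0, 0, 1, 1, 2, 3, 3],
    }
)

# Take up to two rows from every cluster.
samples = clustered.groupby("cluster").head(2)

for label, group in samples.groupby("cluster"):
    print(f"Cluster {label}")
    print(group.to_string(index=False))
    print()
```

Using `head` gives the first rows of each group; swapping it for a random draw per group would give a less order-dependent sample.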

Finally, I wanted to create a visualization to better understand what I was seeing in these clusters. I imported TruncatedSVD from scikit-learn, which I used to reduce the dimensionality of the data and create a scatterplot.
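A sketch of this visualization, under the assumption that the data is reduced to two components and the points are colored by cluster label, might look like this. The data is again synthetic, and the output file name is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the three clustering features.
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=center, scale=1.0, size=(30, 3)) for center in (0, 10, 20, 30)]
)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Project the three features down to two components for plotting.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X)

# Scatterplot of the reduced data, colored by cluster label.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.title("Clusters after TruncatedSVD")
plt.savefig("clusters.png")
```

Unlike PCA, TruncatedSVD does not center the data first, so the axes of the plot are not directly comparable to principal components.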

Conclusion

Using the cluster information and the scatterplot, we can determine that the US had the highest deaths, active cases, and new cases. It was followed closely by India and Brazil, and then by the countries in cluster 3. Countries in cluster 0 had the fewest according to this analysis, but there are a few limitations that should be considered. First, this is only data that has been reported, and reporting may not be equally accurate from country to country. Second, the size of each country needs to be taken into account. Countries with large populations are of course expected to have higher raw counts. For a fairer measure, each category should be expressed as a proportion of that country's total population. This would better reflect the effect of COVID on each country.
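The population adjustment suggested above could be sketched like this. The `Population` column, the per-100,000 scaling, and the placeholder rows are assumptions for illustration; this step was not part of the analysis itself.

```python
import pandas as pd

# Hypothetical data: two countries with the same raw death count
# but very different populations.
df = pd.DataFrame(
    {
        "Country": ["Country A", "Country B"],
        "Total Deaths": [500, 500],
        "Population": [1_000_000, 10_000_000],
    }
)

# Express deaths per 100,000 people instead of as a raw count.
df["Deaths per 100k"] = df["Total Deaths"] * 100_000 / df["Population"]

print(df)
```

In this toy example the two countries have identical raw counts, yet the smaller one has ten times the per-capita rate, which is exactly the effect the raw clustering cannot see. The same scaling would apply to active cases and new cases before clustering.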

Here is the Git
