Clustering Based on COVID Cases Around the World

Alejandro Tarazona
INST414: Data Science Techniques
4 min read · May 12, 2022

While searching the web, I ran into a fantastic Kaggle dataset with a lot of information about COVID cases worldwide. Something so relevant to society would be fun to play around with. If you want to see the dataset, you can find it here. The dataset stores its information in the columns shown below.

For this data, I needed to import pandas, numpy, sklearn, and matplotlib. These are the four critical libraries I used to complete this assignment. Then I imported files from google.colab so that I could upload the file that I downloaded from Kaggle.
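A minimal sketch of that loading step, assuming the download is a single CSV file. The file name and the load_covid_data helper are placeholders for illustration, not names taken from the notebook, and the Colab upload lines are left as comments because they only work inside Colab.

```python
import pandas as pd

# Hypothetical file name: replace it with the name of the CSV downloaded from Kaggle.
CSV_PATH = "covid_country_data.csv"


def load_covid_data(path):
    """Read the CSV into a DataFrame and strip stray spaces from column names."""
    df = pd.read_csv(path)
    df.columns = df.columns.str.strip()
    return df


# Inside Google Colab, the upload step would look roughly like this:
# from google.colab import files
# uploaded = files.upload()   # opens a file picker in the notebook
# df = load_covid_data(CSV_PATH)
# print(df.columns.tolist())
```

Stripping the column names is a small precaution, since names such as "Deaths / 100 Cases" are easy to mistype when they carry extra spaces.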

After looking at most of the data, I noticed that the two columns that would be the most adaptable and presentable were the per-100-cases figures: “Deaths / 100 Cases” and “Recovered / 100 Cases”. Below is a quick look at what the data looks like.
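Picking out those two columns could look something like the sketch below. The select_features helper is hypothetical, and the three rows are made-up placeholder values, not figures from the Kaggle dataset.

```python
import pandas as pd

# The two columns named in the post.
FEATURES = ["Deaths / 100 Cases", "Recovered / 100 Cases"]


def select_features(df, columns=FEATURES):
    """Keep only the chosen columns as numbers and drop rows with missing values."""
    subset = df[list(columns)].apply(pd.to_numeric, errors="coerce")
    return subset.dropna()


# Made-up placeholder rows, only to show the shape of the result.
toy = pd.DataFrame({
    "Country/Region": ["A", "B", "C"],
    "Deaths / 100 Cases": [1.5, 4.0, None],
    "Recovered / 100 Cases": [80.0, 55.0, 60.0],
})

X = select_features(toy)
print(X.head())
```

Dropping incomplete rows matters here because scikit-learn's KMeans does not accept missing values.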

I wanted to find the number of clusters to use in my K-means clustering. To do this, I ran scikit-learn's KMeans for several values of K and built a line graph of the results. The inertia_ attribute gives the sum of squared distances from each point to its nearest cluster center, which I used to see how well the clustering fits my data set. The right number of clusters shows up through the elbow method: it is the point where the curve bends sharply and adding more clusters stops reducing the error by much.
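The elbow search can be sketched as follows. This is not the notebook's exact code: the helper names, the range of K values, and the synthetic two-blob data are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(X, k_values, seed=0):
    """Fit KMeans once per K and collect inertia_ (within-cluster sum of squares)."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


def plot_elbow(k_values, inertias):
    """Draw the line graph used to spot the elbow."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without a display

    fig, ax = plt.subplots()
    ax.plot(k_values, inertias, marker="o")
    ax.set_xlabel("Number of clusters (K)")
    ax.set_ylabel("Inertia (sum of squared distances)")
    ax.set_title("Elbow method")
    return ax


# Synthetic stand-in data: two well-separated blobs, NOT the real COVID columns.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[2.0, 80.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[6.0, 40.0], scale=1.0, size=(50, 2)),
])

k_values = list(range(1, 7))
inertias = inertia_by_k(X_demo, k_values)
for k, value in zip(k_values, inertias):
    print(k, round(value, 1))
```

Calling plot_elbow(k_values, inertias) draws the curve. On data like this the drop from K=1 to K=2 is far larger than any later drop, which is the bend the elbow method looks for.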

With this information, I went ahead and divided the data into two clusters, which are shown below.
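Fitting the final model with two clusters might look like this sketch. The cluster_two_groups helper is hypothetical, and the data is again a synthetic stand-in for the two per-100-cases columns.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_two_groups(X, seed=0):
    """Fit KMeans with K=2 and return each row's label plus the cluster centers."""
    model = KMeans(n_clusters=2, n_init=10, random_state=seed)
    labels = model.fit_predict(X)
    return labels, model.cluster_centers_


# Synthetic stand-in data: two well-separated blobs, NOT the real COVID columns.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[2.0, 80.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[6.0, 40.0], scale=1.0, size=(50, 2)),
])

labels, centers = cluster_two_groups(X_demo)
print(np.bincount(labels))  # number of rows assigned to each cluster
print(centers)              # one (x, y) center per cluster
```

The labels can then be passed as the color argument of a scatter plot, so that each cluster shows up in its own color.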

I went on to explore the COVID data some more. The plot below shows the number of recoveries against active cases. It is interesting to see some of the outliers within the data set. The ratio between active and recovered cases is mostly less than one. Again, this data could help explain how some countries managed to keep their rates low.

Problems

The only major problem that I ran into with this assignment was finding the right data to use. There was so much good data in this dataset. While most of it is applicable, a large chunk of it ended up unused. As you can see from the image below, there was some great information about the number of deaths and the confirmed deaths. It was interesting to see that the numbers didn’t match my expectations. Below is an example of that.

Conclusion

It wasn’t surprising to see cases all around the world. I think it is safe to say that every country got hit differently. The most important thing I noticed was that most of the data I pulled formed only two clusters, so the best K always ended up being two. I want to do more in-depth research on why that was and look for similarities among the countries, with the end goal of finding out what those countries did to keep their COVID rates low.

Link to Code: https://github.com/Atarazona11/INST-414/blob/main/Assignment_4_COVID.ipynb
