Analyze Public Data with Python

Yuli Vasiliev
Analytics Vidhya
4 min read · Apr 3, 2020


This article discusses how you can analyze official COVID-19 case data with Python, employing the Pandas library. You’ll see how you can glean insights from real datasets, discovering information that may not be so obvious at first glance. In particular, the example provided in the article illustrates how you can derive information about the rate of spread of the disease in different countries.

Preparing Your Working Environment

To follow along, you need to have the Pandas library installed in your Python environment. If you don’t have it yet, you can install it with the pip command:

pip install pandas

Then, you’ll need to pick up an actual dataset to work with. For the example provided in this article, I needed a dataset that includes information about total confirmed cases of COVID-19 by country and date. Such a dataset can be downloaded from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases as a CSV file: time_series_covid19_confirmed_global_narrow.csv
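If you prefer to download the file from code, a minimal sketch with Python’s standard library could look like this. Note that the URL below is a placeholder; copy the actual CSV link from the dataset page above:

import urllib.request

# Placeholder URL: substitute the real CSV link from the dataset page
url = "https://example.org/time_series_covid19_confirmed_global_narrow.csv"
urllib.request.urlretrieve(url, "time_series_covid19_confirmed_global_narrow.csv")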

Loading the Data and Preparing It for Analysis

Before reading the downloaded CSV file into a pandas dataframe, I manually removed the second line, which contains HXL tags rather than data:

#adm1+name,#country+name,#geo+lat,#geo+lon,#date,#affected+infected+value+num

Then I read it into a Pandas dataframe:

>>> import pandas as pd
>>> df = pd.read_csv("/home/usr/dataset/time_series_covid19_confirmed_global_narrow.csv")
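
Alternatively, you could leave the downloaded file untouched and have read_csv skip the tag line for you:

>>> # Skip the second physical line (the HXL tag row) instead of editing the file
>>> df = pd.read_csv("/home/usr/dataset/time_series_covid19_confirmed_global_narrow.csv", skiprows=[1])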

Let’s now take a closer look at the file structure. The simplest way to do it is with the head method of the dataframe object:

>>> df.head()
  Province/State Country/Region   Lat  Long        Date  Value
0            NaN    Afghanistan  33.0  65.0  2020-04-01    237
1            NaN    Afghanistan  33.0  65.0  2020-03-31    174
2            NaN    Afghanistan  33.0  65.0  2020-03-30    170
3            NaN    Afghanistan  33.0  65.0  2020-03-29    120
4            NaN    Afghanistan  33.0  65.0  2020-03-28    110

Since we are not going to perform a complex analysis that takes into account how geographically close the affected countries are to each other, we can safely remove the latitude and longitude columns from the dataset. This can be done as follows:

>>> df.drop("Lat", axis=1, inplace=True)
>>> df.drop("Long", axis=1, inplace=True)

What we have now should look as follows:

>>> df.head()
  Province/State Country/Region        Date  Value
0            NaN    Afghanistan  2020-04-01    237
1            NaN    Afghanistan  2020-03-31    174
2            NaN    Afghanistan  2020-03-30    170
3            NaN    Afghanistan  2020-03-29    120
4            NaN    Afghanistan  2020-03-28    110

It would also be interesting to learn how many rows are in the dataset before we start deleting the unnecessary ones. The dataframe’s shape attribute reports the numbers of rows and columns:

>>> df.shape
(18176, 4)

Condensing the Dataset

Looking through the rows in the dataset, you may notice that the information for some countries, such as China, is broken down by province or region. What you need, however, is consolidated data for the whole country. To perform this consolidation, you can apply a groupby operation to the dataset as follows:

>>> df = df.groupby(['Country/Region', 'Date'])['Value'].sum().reset_index()

This operation reduces the number of rows in the dataset and also eliminates the Province/State column, since only the Value column is selected for aggregation:

>>> df.shape
(12780, 3)
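
As a quick sanity check, you can verify that a country that was previously split into provinces, such as China, now has a single row per date:

>>> # Each date should now appear exactly once for China
>>> china = df[df['Country/Region'] == 'China']
>>> china['Date'].is_unique
True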

Performing the Analysis

Suppose you need to determine the rate of spread of the disease in different countries at the initial stage of the outbreak. Say, you want to know how many days it took for the disease to grow from at least 100 reported cases to roughly 1,500 cases in each country.

To begin with, you need to filter out the countries that have not been affected as severely, where the number of confirmed cases never reached large values. This can be done as follows:

>>> df = df.groupby('Country/Region').filter(lambda x: x['Value'].mean() > 1000)
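
The same filtering can also be expressed with transform, which broadcasts each country’s mean back onto its rows so you can build an ordinary boolean mask:

>>> # Equivalent approach: a boolean mask built with transform
>>> mask = df.groupby('Country/Region')['Value'].transform('mean') > 1000
>>> df = df[mask]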

Then, you can retrieve only those rows that satisfy the specified conditions:

>>> df = df.loc[(df['Value'] > 100) & (df['Value'] < 1500)]
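
An equivalent way to write this range condition is Series.between; its bounds are inclusive by default, so pass inclusive='neither' (available in pandas 1.3 and later) to reproduce the strict comparisons above:

>>> # Same filter using between(); requires pandas 1.3+ for inclusive='neither'
>>> df = df.loc[df['Value'].between(100, 1500, inclusive='neither')]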

After these operations, the number of rows should be reduced significantly.

>>> df
       Country/Region        Date  Value
685           Austria  2020-03-08    104
686           Austria  2020-03-09    131
687           Austria  2020-03-10    182
688           Austria  2020-03-11    246
689           Austria  2020-03-12    302
...               ...         ...    ...
12261  United Kingdom  2020-03-11    459
12262  United Kingdom  2020-03-12    459
12263  United Kingdom  2020-03-13    802
12264  United Kingdom  2020-03-14   1144
12265  United Kingdom  2020-03-15   1145

[118 rows x 3 columns]

At this point, you might want to look at the entire dataset. This can be done with the following line of code:

>>> print(df.to_string())
      Country/Region        Date  Value
685          Austria  2020-03-08    104
686          Austria  2020-03-09    131
687          Austria  2020-03-10    182
688          Austria  2020-03-11    246
689          Austria  2020-03-12    302
690          Austria  2020-03-13    504
691          Austria  2020-03-14    655
692          Austria  2020-03-15    860
693          Austria  2020-03-16   1018
694          Austria  2020-03-17   1332
1180         Belgium  2020-03-06    109
1181         Belgium  2020-03-07    169
...

All that is left is to count the number of rows for each country.

>>> counts = df.groupby('Country/Region').size()
>>> print(counts.to_string())
Country/Region
Austria 10
Belgium 13
China 4
France 9
Germany 10
Iran 5
Italy 7
Korea, South 7
Netherlands 11
Spain 8
Switzerland 10
Turkey 4
US 9
United Kingdom 11

The above listing answers the question of how many days it took the disease in each country to grow from at least 100 confirmed cases to roughly 1,500: each row in the filtered dataset represents one day within that range, so the row count per country is the number of days.
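
Counting rows works because the dataset contains one row per country per day. If you want to make that assumption explicit, a small sketch can derive the span from the dates themselves:

>>> # Compute the span in calendar days from the first to the last
>>> # date in the filtered range, rather than counting rows
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> days = df.groupby('Country/Region')['Date'].agg(lambda d: (d.max() - d.min()).days + 1)
>>> print(days.to_string())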

Disclaimer: It’s important to realize that the rate of officially confirmed cases may deviate from the actual situation. The dataset used in the above analysis reflects only the cases that were detected and reported, ignoring the latency between infection and confirmation.
