Working with COVID-19 data from Johns Hopkins University

Thawfeek Varusai
Jun 10, 2020

Johns Hopkins University (JHU) is the most popular source of data on confirmed COVID-19 cases and deaths. The university provides a convenient way to access the data computationally via well-documented APIs. Moreover, the data coverage is impressive because several external resources are integrated. In this article, we shall see how to use this data programmatically.

Types of data

We first need to understand the data available in this resource. Table 1 shows the different types of information provided for various countries. There is a separate category for data originating in the US. The second ‘global’ category contains all countries, including the US. For some countries, like China and Australia, there is more fine-grained information on each province.

Table 1

Path of data

JHU hosts the data in a GitHub repository, https://github.com/CSSEGISandData/COVID-19, which is organized into folders. We shall use the ‘csse_covid_19_data’ folder to access the global COVID-19 data. To use the API, we set the path in Python with the following line of code.

path = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/'

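Note that the path uses the raw.githubusercontent.com host rather than github.com, and that ‘master’ (the name of the repository branch) has been added to it. Both are required to fetch the raw files.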

Libraries required

Data is provided in comma-separated values (csv) format, and a convenient way to read csv files in Python is the pandas library. Pandas reads csv data into a handy pandas.DataFrame.

import pandas as pd

Access data

As mentioned above, there are several types of data available in this portal. You might want to access the confirmed cases just in the US, or death counts all over the world. To write generic code that you can tailor to your needs, I’ve split the API endpoint into two parts. The first part is the path, which points to the ‘csse_covid_19_data’ folder on the master branch, and the second part is the specific data file that is appended to the path.

For instance, for confirmed cases all over the world we can have a variable cases:

cases = 'csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

Or, for death counts all over the world we can have a variable deaths:

deaths = 'csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

This can then be appended to the path while reading the csv file.

Cases_DF = pd.read_csv(path+cases,index_col=1)

This line reads the csv file available at ‘path+cases’ using the read_csv function of pandas and sets the second column (the country name) as the row names. Figure 1 shows Cases_DF. To read deaths, we just need to use ‘path+deaths’.

Figure 1
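The two-part endpoint described above can be wrapped in a small helper. This is only a sketch: the function names jhu_url and load_series are hypothetical, and the ‘US’ file names are an assumption based on the repository following the same naming pattern as the global files shown above.

```python
import pandas as pd

# First part of the endpoint: the data folder on the master branch.
BASE = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/"
        "master/csse_covid_19_data/")


def jhu_url(metric, scope="global", base=BASE):
    """Build the URL of one time-series file (hypothetical helper)."""
    # Assumption: these are the metric/scope combinations we want to support.
    allowed = {"global": {"confirmed", "deaths"},
               "US": {"confirmed", "deaths"}}
    if scope not in allowed or metric not in allowed[scope]:
        raise ValueError(f"unsupported combination: {metric!r}, {scope!r}")
    # Second part of the endpoint: the specific file inside the folder.
    return (f"{base}csse_covid_19_time_series/"
            f"time_series_covid19_{metric}_{scope}.csv")


def load_series(metric, scope="global"):
    """Download one file into a DataFrame (needs network access)."""
    return pd.read_csv(jhu_url(metric, scope))


print(jhu_url("deaths"))
```

Calling load_series('confirmed') would then fetch the same file as ‘path+cases’. No index column is set here because the US files may be laid out differently from the global ones, so it is worth inspecting the columns before choosing one.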

Explore data

We have now imported the data into Python. The next step is to explore the nature of this data. We see that the countries run along the rows and the daily reports form the columns. We can also notice that some countries, such as China and the United Kingdom, have province information, whereas others, like Norway and Switzerland, don’t. Depending on our requirements, we will have to process this information.
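A quick way to check these observations is sketched below. To keep the example runnable without a download, it uses a tiny made-up table in the same column layout as the global files. The numbers are invented for illustration and are not real counts.

```python
import io

import pandas as pd

# Made-up numbers in the same column layout as the global time-series files.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Hubei,China,30.97,112.27,10,20,40
Beijing,China,40.18,116.41,1,2,3
,Norway,60.47,8.46,0,0,1
,Switzerland,46.81,8.22,0,1,2
"""
sample = pd.read_csv(io.StringIO(sample_csv), index_col=1)

# Countries run along the rows; after Province/State, Lat and Long,
# every remaining column is one daily report.
date_columns = sample.columns[3:]

# A country has province information if any of its Province/State
# entries is filled in.
has_provinces = sample["Province/State"].notna().groupby(level=0).any()
with_provinces = sorted(has_provinces[has_provinces].index)

# Number of rows per country (more than one means several provinces).
rows_per_country = sample.index.value_counts()

print(sample.shape)
print(list(date_columns))
print(with_provinces)
```

The same lines should work on Cases_DF in place of sample, since it was read with the same index_col=1 setting.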

Process data

Let’s consider a scenario where we’re interested in the global COVID-19 confirmed cases. For this, we want to rearrange the data slightly. For a country-focused view, let’s make the countries the columns.

Cases_DF = Cases_DF.transpose()

Next, let’s create a dictionary in Python with country names as keys and the corresponding time series of confirmed cases as values. If there is information on provinces, each province gets its own entry.

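import numpy as np

Country_Dict = {}

# Countries with provinces appear several times among the columns,
# so visit each country name only once.
for i in Cases_DF.columns.unique():
    # A repeated country name selects a DataFrame (one column per province),
    # a unique one selects a Series; j has one element per selected column.
    j = list(Cases_DF[i].iloc[0:1])
    if len(j) > 1:  # when there is province information
        for k in range(len(j)):
            m = str(list(Cases_DF[i].iloc[0])[k])  # province name (first row)
            n = str(i) + ':' + m  # key such as 'China:Hubei'
            # Skip the Province/State, Lat and Long rows; keep the daily counts.
            p = np.asanyarray(Cases_DF[i].iloc[3:]).T[k]
            Country_Dict[n] = p
    else:  # no province information
        n = i
        q = Cases_DF[i].keys()[3:]  # date labels only
        r = Cases_DF[i][q]
        Country_Dict[n] = r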

This piece of code creates a dictionary called Country_Dict in our desired format. Each province entry is named after its country, followed by a colon and the province name, for instance ‘China:Hubei’. We can now plot the data for any country or province of interest using the matplotlib library (Figure 2).

import matplotlib.pyplot as plt

Y = list(Country_Dict['China:Hubei'])
plt.plot(Y)
plt.xlabel('Time (days)')
plt.ylabel('#Infected')
plt.title('COVID-19 cases (China:Hubei)')
plt.show()  # needed when running as a script; optional in a notebook

Figure 2
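If country totals are more useful than separate provinces, the province rows can instead be summed per country. The sketch below is an alternative to the dictionary approach, not part of the original notebook. It again uses a small made-up table in the same layout, and it assumes the date labels follow the month/day/two-digit-year pattern seen in the column names.

```python
import io

import pandas as pd

# Made-up numbers in the same column layout as the global time-series files.
sample_csv = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
Hubei,China,30.97,112.27,10,20,40
Beijing,China,40.18,116.41,1,2,3
,Norway,60.47,8.46,0,0,1
,Switzerland,46.81,8.22,0,1,2
"""
raw = pd.read_csv(io.StringIO(sample_csv))

# Keep only the date columns and add up the province rows of each country.
totals = (raw.drop(columns=["Province/State", "Lat", "Long"])
             .groupby("Country/Region")
             .sum())

# Make the dates the rows and the countries the columns, then turn the
# date labels into real timestamps so that plots get a proper time axis.
by_date = totals.T
by_date.index = pd.to_datetime(by_date.index, format="%m/%d/%y")

print(by_date["China"].tolist())
```

A column such as by_date['China'] can then be passed to plt.plot in the same way as Y above.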

In this article, we learnt how to import and process COVID-19 data from JHU. This highly dynamic and powerful dataset can be used to perform some interesting analyses. More details on the code can be found here: https://github.com/vthawfeek/SARS_COV2_JHCSSE/blob/master/SARS_COV2_DataAnalysis.ipynb. I hope this read helps you start your own investigation!

Like this post?

If you enjoyed this read, you might also be interested in similar topics at my website: https://rokpayprsizors.wordpress.com/


Thawfeek Varusai

I’m a life science enthusiast with applied mathematics skills. I’ve a PhD in Systems Biology and currently work as a data analyst in a bioinformatics company.