DIY Data analysis of COVID-19

Ajaykumaar S
Published in Analytics Vidhya
5 min read · Apr 22, 2020

Source: Google Images

The COVID-19 pandemic has brought the world to a dystopian situation.

Governments and medical specialists all around the world are taking drastic measures to contain the spread of the virus, and every news channel reports the day-to-day status of the spread in the country using graphs and plots like the ones above. Ever thought of plotting such graphs, or analyzing and visualizing the statistics of the spread, at home?

In this post, I’ll show you how to plot graphs from the COVID-19 dataset, with step-by-step instructions and explanations. We’ll use the dataset from Kaggle, which can be downloaded from the link below, and run our code in Google Colab. So open Google Colab and create a new Python 3 notebook.

To start with, we’ll install the necessary libraries for our analysis.

!pip install fastai2 -q
from fastai2.tabular.all import *

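The first line installs the fastai2 library along with pandas and matplotlib, and the second line imports fastai2’s tabular module, which we’ll be using. That wildcard import is also what makes names such as pd, plt and array available in the code that follows.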

Now we need to upload our downloaded dataset to the Colab instance. We can do this by clicking Files -> Upload in the left menu bar, or by using the code below, which creates a file upload GUI. For now, we’ll use only the ‘complete.csv’ file.

from google.colab import files
files.upload()

Once uploaded, the data can be loaded with pandas’ read_csv function, and df.head() displays the first 5 rows of the data frame.

df=pd.read_csv('complete.csv')
df.head()
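Before plotting, it can help to check what was actually loaded. The sketch below is only an illustration: it reads a tiny made-up CSV from a string (io.StringIO stands in for the uploaded file), using the column names that appear in this post. The sample rows and the date format are assumptions, not values from the real dataset.

```python
import io
import pandas as pd

# Made-up rows that only mimic the column names used in this post.
csv_text = """Date,Name of State / UT,Total Confirmed cases,Death,Cured/Discharged/Migrated
2020-01-30,Kerala,1,0,0
2020-01-31,Kerala,1,0,0
2020-02-01,Kerala,2,0,0
"""

# parse_dates converts the Date strings into datetime64 values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(df.shape)         # number of rows and columns
print(df.dtypes)        # type of each column
print(df.isna().sum())  # missing values per column
```

With the real file, the same three checks can be run on the df loaded from ‘complete.csv’.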

We’ll now import seaborn, a library built for statistical data visualisation. With the alias below, we can refer to seaborn as ‘sns’ whenever we access the library.

import seaborn as sns

Initially, we’ll make a scatter plot between the States and the total confirmed cases.

sns.relplot(x="Total Confirmed cases",y="Name of State / UT",height=10,data=df).set(title="State-wise spread scatter plot")

An important and fairly difficult step is pre-processing: transforming the data to suit our purpose. In our dataset, positive cases detected on the same date at two different places are recorded as two separate rows, but we need a data frame in which each row corresponds to a single date. We’ll get there by grouping the data frame by date and summing all the positive cases recorded on each date. We’ll use the same procedure for the deaths and cured counts, and collect everything in a new data frame.

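# Group the rows by date; grp is reused later to sum each column per date
grp=df.groupby(by="Date")
dates=df['Date'].unique()
date_list=[]
for i in dates:
    date_list.append(i)
# Start the new data frame with one row per unique date
ddf=pd.DataFrame(date_list, columns=['Date'])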

Now that we’ve created a data frame containing the dates only, we can add the required data to this data frame.

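n=1  # position at which the next column is inserted
count_list=[]
labels=['Death','Total Confirmed cases','Cured/Discharged/Migrated']
for l in labels:
    for i in dates:
        # Sum column l over all rows recorded on date i
        d=grp.get_group(i)[l]
        count=d.sum()
        count_list.append(count)
    # Add the per-date totals as a new column, then reset for the next label
    ddf.insert(n,l,count_list)
    n+=1
    count=0
    count_list=[]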

And now our new data frame (ddf) is ready to be viewed.
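The loops above can also be written as a single pandas groupby call. The sketch below is a compact alternative, not the code used in this post; the rows are made up and only the column names are taken from the dataset.

```python
import pandas as pd

# Made-up rows: two states reporting on each of two dates.
df = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
    "Name of State / UT": ["Kerala", "Delhi", "Kerala", "Delhi"],
    "Total Confirmed cases": [10, 5, 12, 9],
    "Death": [0, 1, 1, 1],
    "Cured/Discharged/Migrated": [2, 0, 3, 1],
})

labels = ["Death", "Total Confirmed cases", "Cured/Discharged/Migrated"]

# sort=False keeps the dates in order of first appearance, like unique() does.
ddf = df.groupby("Date", sort=False)[labels].sum().reset_index()
print(ddf)
```

Each row of the result holds one date and the sum of every selected column over all states on that date.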

Also, I happened to notice that the records for 13/04/2020 are wrong, so we’ll replace each of that day’s values with the average of the previous and next days’ values.

def avg_val(l,val=74):
    # Average of the values in the rows just before and after row val
    new_val=(array(ddf[l])[val-1] + array(ddf[l])[val+1])/2
    return int(new_val)
change_label=['Death','Total Confirmed cases','Cured/Discharged/Migrated']
for l in change_label:
    # Overwrite only row 74; DataFrame.replace would change every cell holding the same value
    ddf.loc[74,l]=avg_val(l)
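A hard-coded row number such as 74 breaks as soon as the dataset gains or loses rows. The sketch below shows one way to look the row up by its date instead. The numbers are made up, the date format is an assumption, and the code assumes a default integer index with the bad row being neither the first nor the last.

```python
import pandas as pd

# Made-up daily totals; the third row is an obvious recording error.
ddf = pd.DataFrame({
    "Date": ["11/04/2020", "12/04/2020", "13/04/2020", "14/04/2020"],
    "Total Confirmed cases": [100, 120, 5, 160],
})

col = "Total Confirmed cases"

# Find the row label of the bad date instead of hard-coding it.
bad = ddf.index[ddf["Date"] == "13/04/2020"][0]

# Replace only that cell with the mean of its two neighbours.
ddf.loc[bad, col] = int((ddf.loc[bad - 1, col] + ddf.loc[bad + 1, col]) / 2)
print(ddf)
```

The same assignment can be repeated for the other two columns.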

Now our data frame is ready to be plotted. We’ll use Seaborn’s relplot() function to create a line plot with Date on the x-axis and total cases on the y-axis.

plot=sns.relplot(x="Date",y="Total Confirmed cases",data=ddf,kind="line",height=10,estimator=None).set(title="Total confirmed cases")
plot.fig.autofmt_xdate()

Similarly, we can plot graphs of Deaths data and Cured data versus Dates by changing the ‘y’ value.

From the graph, it’s evident that despite the 21-day lockdown the curve continues to rise steadily, which necessitates the extension of the lockdown in India to effectively contain the spread of the virus.
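One way to look at the trend more closely is to turn the running totals into daily increases. The sketch below assumes that ‘Total Confirmed cases’ is a cumulative count; the numbers are made up, and the ‘New cases’ column name is introduced here only for illustration.

```python
import pandas as pd

# Made-up cumulative totals for four consecutive days.
ddf = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-04"],
    "Total Confirmed cases": [100, 120, 150, 190],
})

# diff() subtracts the previous row; the first row has no predecessor, so fill it with 0.
ddf["New cases"] = ddf["Total Confirmed cases"].diff().fillna(0).astype(int)
print(ddf)
```

A rising ‘New cases’ column means the total is not just growing but growing faster each day.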

To compare the trends of all three parameters, we can plot them in a single graph by calling lineplot() once for each of them.

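plt.figure(figsize=(15,15))
# label= adds a legend entry so the three lines can be told apart
sns.lineplot(x="Date",y="Death",data=ddf,estimator=None,label="Death")
sns.lineplot(x="Date",y="Cured/Discharged/Migrated",data=ddf,estimator=None,label="Cured/Discharged/Migrated")
sns.lineplot(x="Date",y="Total Confirmed cases",data=ddf,estimator=None,label="Total Confirmed cases")
plt.ylabel("Count")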

The size of the graph can be adjusted by changing the x and y values in figsize.
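An alternative to three separate calls is to reshape the data frame into long format, with one row per date and metric. The sketch below shows only the reshaping step with pandas; the numbers are made up, and the ‘Metric’ and ‘Count’ column names are introduced here for illustration.

```python
import pandas as pd

# Made-up per-date totals in the same wide layout as ddf.
ddf = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-02"],
    "Death": [1, 2],
    "Total Confirmed cases": [15, 21],
    "Cured/Discharged/Migrated": [2, 4],
})

labels = ["Death", "Total Confirmed cases", "Cured/Discharged/Migrated"]

# melt stacks the three columns into (Metric, Count) pairs for each date.
long_df = ddf.melt(id_vars="Date", value_vars=labels,
                   var_name="Metric", value_name="Count")
print(long_df)
```

The long frame can then be passed to a single call such as sns.lineplot(x="Date", y="Count", hue="Metric", data=long_df), which draws one coloured line per metric with a legend.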

To plot the state-wise spread, we need a new data frame that lists every state along with its number of confirmed cases.

To build it, we’ll take the latest record for each state and collect the values in a new data frame.

state_group=df.groupby(by='Name of State / UT')
states=df['Name of State / UT'].unique()
state_case,state_death=[],[]
for s in states:
    # The last row for a state holds its latest recorded figures
    latest=df.loc[df['Name of State / UT']==s].iloc[-1]
    state_case.append(int(latest['Total Confirmed cases']))
    state_death.append(int(latest['Death']))

Now that we have our data, we can create a new data frame, state_df.

state_df=pd.DataFrame(states, columns=['States'])
state_df.insert(1,'case_count',state_case)
state_df.insert(2,'Deaths',state_death)
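The same state-wise table can be sketched in one chain with groupby and last(). This is an alternative under stated assumptions, not the code used in this post: the rows are made up, they are assumed to be in date order, and last() returns the last non-missing value per column, which matches iloc[-1] only when no values are missing.

```python
import pandas as pd

# Made-up rows in date order: two states reporting on two dates.
df = pd.DataFrame({
    "Date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
    "Name of State / UT": ["Kerala", "Delhi", "Kerala", "Delhi"],
    "Total Confirmed cases": [10, 5, 12, 9],
    "Death": [0, 1, 1, 1],
})

state_df = (
    df.groupby("Name of State / UT", sort=False)[["Total Confirmed cases", "Death"]]
      .last()          # latest row per state
      .reset_index()
      .rename(columns={
          "Name of State / UT": "States",
          "Total Confirmed cases": "case_count",
          "Death": "Deaths",
      })
)
print(state_df)
```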

With this data frame, we will create a bar graph using the catplot() function.

sns.catplot(y="States",x="case_count",data=state_df, kind="bar", height=10).set(title="state-wise spread")

Our data set has a few glitches. For one, it contains multiple copies of the same state/UT. To plot the spread intensity on the map of India, we have to delete these duplicates.

for d in (7,11,18): state_df.drop(index=d,inplace=True)
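Row numbers like 7, 11 and 18 are specific to one version of the file. The sketch below shows one way to find such duplicates by name instead. It assumes the copies differ only in stray spaces or capitalisation, which may not hold for the real dataset, and the rows are made up.

```python
import pandas as pd

# Made-up table where two states appear twice under slightly different spellings.
state_df = pd.DataFrame({
    "States": ["Kerala", "Delhi", "Kerala ", "delhi"],
    "case_count": [10, 5, 12, 9],
})

# Normalised key used only for comparison: no outer spaces, lower case.
key = state_df["States"].str.strip().str.lower()

# Keep the last occurrence of each state and drop the earlier copies.
deduped = state_df.loc[~key.duplicated(keep="last")].copy().reset_index(drop=True)

# Tidy up the names that are kept.
deduped["States"] = deduped["States"].str.strip().str.title()
print(deduped)
```

Whether to keep the first copy, the last copy, or the one with the larger count depends on how the duplicates arose, so it is worth inspecting them before dropping anything.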

To plot this data on the map of India, we need a shapefile of the map. It can be downloaded from the link below.

https://www.arcgis.com/home/item.html?id=cf9b387de48248a687aafdd4cdff1127

We will import the geopandas library to plot the data on the map.

!pip install geopandas 
import geopandas as gpd
import matplotlib.pyplot as plt

Upload the downloaded .shp file to the Colab instance the same way we uploaded the ‘complete.csv’ file, together with the .shx and .dbf files that accompany it, since a shapefile is read as a set of files. Once uploaded, save the path of the .shp file in a variable and open it with the read_file() function.

gpath="/content/INDIA.shp"
map_df=gpd.read_file(gpath)
map_df.head()
map_df.plot(figsize=(5,5))

This should output an image of the India map. Next, we’ll create a copy of our data frame containing only the list of states and case_count.

gdf=state_df[['States','case_count']]

The shapefile we downloaded contains a list of states along with the geometry needed to draw each one. The state names in our data frame must match the names in the shapefile exactly so that the two can be merged. Hence, we’ll replace the mismatched state names in our data frame with the spellings used in the shapefile.

# Row labels in gdf whose names differ from the shapefile, and the shapefile spellings
index=(22,1,14,20,28,15)
value=['Jammu And Kashmir','Nct Of Delhi','Orissa','CHANDIGARH','ANDAMAN AND NICOBAR ISLANDS','Pondicherry']
for i,v in zip(index,value):
    gdf=gdf.replace(to_replace=gdf['States'][i], value=v)
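Renaming by row number works only while the rows stay in the same order. A name-to-name mapping is a sturdier option, and a quick set difference shows which names still have no match. In the sketch below the left-hand spellings in name_map and the shape_names set are assumptions made for illustration; only the right-hand spellings come from the list above.

```python
import pandas as pd

# Made-up case counts; the spellings on the left are assumed, not taken from the dataset.
gdf = pd.DataFrame({
    "States": ["Kerala", "Delhi", "Odisha"],
    "case_count": [12, 9, 7],
})

# Map data frame spellings to the spellings used in the shapefile.
name_map = {"Delhi": "Nct Of Delhi", "Odisha": "Orissa"}
gdf["States"] = gdf["States"].replace(name_map)

# Hypothetical set of names found in the shapefile's ST_NAME column.
shape_names = {"Kerala", "Nct Of Delhi", "Orissa", "Goa"}

# Any name listed here would be lost in the merge.
unmatched = sorted(set(gdf["States"]) - shape_names)
print(unmatched)
```

With the real data, shape_names could be built from set(map_df['ST_NAME']).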

Now the shapefile and our data frame can be merged.

merge=map_df.set_index('ST_NAME').join(gdf.set_index('States'))
merge.head()
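The join keeps every row of the shapefile and fills in NaN wherever no state with the same name exists in our data frame. The sketch below illustrates this with plain pandas data frames standing in for the GeoDataFrame; the rows and the placeholder geometry strings are made up.

```python
import pandas as pd

# Stand-in for the shapefile table: one row per state with a placeholder geometry.
map_df = pd.DataFrame({
    "ST_NAME": ["Kerala", "Orissa", "Goa"],
    "geometry": ["polygon-1", "polygon-2", "polygon-3"],
})

# Case counts for only two of the three states.
gdf = pd.DataFrame({"States": ["Kerala", "Orissa"], "case_count": [12, 7]})

# Left join on the index: all map rows are kept, unmatched ones get NaN.
merge = map_df.set_index("ST_NAME").join(gdf.set_index("States"))

# States that received no case count.
missing = merge[merge["case_count"].isna()].index.tolist()
print(merge)
print(missing)
```

A non-empty list here usually means a state name is spelled differently in the two sources.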

Our merged data frame is all ready to be plotted!!

# Colour each state by its confirmed case count
merge.plot(column='case_count', cmap='Reds', legend=True, figsize=(10,10), linewidth=5,
           legend_kwds={'label': "Corona spread intensity", 'orientation': "horizontal"})

And our final output should be like the one below:

You can get the full code from the GitHub link below.

Also, to learn more about the Fastai library, check out Jeremy Howard’s course, ‘Deep Learning for Coders’, from the link below.

Thanks and keep learning ;)
