What I have learnt so far about COVID-19 — Part 1

An exploratory data analysis (EDA) for awareness.

Olaniyan Oluwasegun
Analytics Vidhya
6 min read · Mar 27, 2020

A pneumonia of unknown cause detected in Wuhan, China, was first reported to the WHO office in China on 31 December 2019. The outbreak was declared a Public Health Emergency of International Concern on 30 January 2020. WHO later named the disease caused by the new coronavirus COVID-19.

According to Wikipedia, coronaviruses are a group of related viruses that cause diseases in mammals and birds. In humans, coronaviruses cause respiratory tract infections that can be mild, such as some cases of the common cold (among other possible causes, predominantly rhinoviruses), and others that can be lethal, such as SARS, MERS, and COVID-19. Symptoms in other species vary: in chickens, they cause an upper respiratory tract disease, while in cows and pigs they cause diarrhea. At the time of writing, there are no vaccines or antiviral drugs to prevent or treat human coronavirus infections.

The world is currently facing an outbreak of COVID-19, which the WHO has declared a pandemic.

Let’s go straight to the data analysis. The dataset was created by Johns Hopkins CSSE, and can be downloaded here. The link to the code can be found on my GitHub page.

Note: The dataset is updated daily. The version I’ll be using covers 2020-01-22 to 2020-03-23.

Importing the Libraries

# linear algebra
import numpy as np
# data processing
import pandas as pd
# for visualization
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_dark"
%matplotlib inline

Getting the data

# only the train data will be used
df = pd.read_csv('train.csv', na_filter=False)

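Setting na_filter=False turns off pandas' missing-value detection, so empty cells are read as empty strings instead of NaN. This matters for the Province/State column, which is empty for many countries.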
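
To make the effect of na_filter concrete, here is a minimal sketch on a tiny in-memory CSV. The two rows and their numbers are made up for illustration and are not taken from train.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the first row has an empty Province/State.
csv_text = (
    "Province/State,Country/Region,ConfirmedCases\n"
    ",Italy,10\n"
    "Hubei,China,20\n"
)

# Default behaviour: empty cells are detected and stored as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# na_filter=False: no missing-value detection, empty cells stay as ''.
raw_df = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default_df["Province/State"].isna().sum())  # count of NaN values
print(repr(raw_df["Province/State"].iloc[0]))     # the empty cell as read
```

One consequence worth keeping in mind: with na_filter=False, methods such as isna() or dropna() will not flag those empty cells, so they have to be handled as empty strings.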

Exploratory Data Analysis

df.columns.tolist()
The list of columns in the data

There are 8 features/columns: Id, an identifier for each record; Province/State, the state or province within a country; Country/Region, the country or region the record belongs to; Lat and Long, the latitude and longitude of each region; Date, the day the counts were recorded; ConfirmedCases, the number of confirmed cases; and Fatalities, the number of deaths.

df=df.drop(['Id'],axis=1)
df.head(10)
Top ten data

From the table above, we can note a few things. Firstly, the Id column has been removed using the .drop() method because it is irrelevant for the analysis. Secondly, the empty values in the Province/State column appear as blanks rather than NaN.

df['Country/Region'].unique().tolist()
# rename 'Gambia, The' as 'The Gambia'
df['Country/Region'] = df['Country/Region'].replace('Gambia, The','The Gambia')

The first line shows the list of all affected countries. The second replaces 'Gambia, The' with 'The Gambia'.

# number of affected countries
affected_country = df['Country/Region'].nunique()
earliest_entry = f"{df['Date'].min()}"
last_entry = f"{df['Date'].max()}"
print('There are {a} affected countries between {b} and {c}'.format(a=affected_country,b=earliest_entry, c=last_entry))

According to the snippet above, there are 162 affected countries between 2020-01-22 and 2020-03-23.
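
As a side note, the date range can also be computed on real datetime values rather than strings. The sketch below assumes a small made-up frame (the rows are illustrative, not from the dataset) and converts the Date column with pd.to_datetime before taking the minimum and maximum.

```python
import pandas as pd

# Illustrative rows only; the real data has one row per region per day.
sample = pd.DataFrame({
    "Country/Region": ["China", "China", "Italy", "Italy", "US"],
    "Date": ["2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23", "2020-01-23"],
})

# Parse the strings into timestamps so min/max compare real dates.
sample["Date"] = pd.to_datetime(sample["Date"])

n_countries = sample["Country/Region"].nunique()
first_day = sample["Date"].min().date()
last_day = sample["Date"].max().date()

summary = f"{n_countries} affected countries between {first_day} and {last_day}"
print(summary)
```

Calling .date() drops the 00:00:00 time component from the printed timestamps.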

Going Deeper

Let’s analyse the confirmed cases.

1. Which country has the highest number of confirmed cases?
# confirmed cases as at 23-03-2020
cc = df.drop('Province/State', axis=1)
# keep only the rows for the latest date in the data
current = cc[cc['Date'] == cc['Date'].max()].reset_index()
# double brackets are needed to select several columns from a groupby
current_case = current.groupby('Country/Region')[['ConfirmedCases', 'Fatalities']].sum().reset_index()
highest_case = current.groupby('Country/Region')['ConfirmedCases'].sum().reset_index()
fig = px.bar(highest_case.sort_values('ConfirmedCases', ascending=False)[:10][::-1],
    x='ConfirmedCases', y='Country/Region',
    title='Confirmed Cases Worldwide (23-03-2020)', text='ConfirmedCases', height=900, orientation='h')
fig.show()
Top Ten Countries with confirmed cases.

China has the highest number of confirmed cases, and Iran is the most affected Asian country after China.
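
The ranking relies on two steps: keeping only the latest date (the counts are cumulative, so summing over every date would count the same cases many times) and then adding up the provinces of each country. Here is a minimal sketch of those steps on made-up numbers; the figures are illustrative and not from the dataset.

```python
import pandas as pd

# Illustrative cumulative counts for two dates; China is split by province.
sample = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", "", "Hubei", "Beijing", ""],
    "Country/Region": ["China", "China", "Italy", "China", "China", "Italy"],
    "Date": ["2020-03-22"] * 3 + ["2020-03-23"] * 3,
    "ConfirmedCases": [100, 10, 50, 110, 12, 70],
})

# Step 1: keep only the snapshot for the latest date.
latest = sample[sample["Date"] == sample["Date"].max()]

# Step 2: add up the provinces of each country.
per_country = latest.groupby("Country/Region")["ConfirmedCases"].sum().reset_index()

# Step 3: rank the countries and keep the top ten.
top = per_country.sort_values("ConfirmedCases", ascending=False).head(10)
print(top)
```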

2. Worldwide cases over time.

ww_case = df.groupby('Date')['ConfirmedCases'].sum().reset_index()
fig = px.line(ww_case, x="Date", y="ConfirmedCases",
    title="Worldwide Confirmed Cases Over Time")
fig.show()
Displaying Worldwide Confirmed cases over time

The number of confirmed cases is still climbing steeply, with no sign of levelling off, and that is not good at all.
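
One way to check whether growth is slowing is to look at the daily increase rather than the cumulative total. This is not part of the original analysis; the sketch below assumes a small made-up series of worldwide totals and uses pandas' diff() to get the day-to-day change.

```python
import pandas as pd

# Illustrative cumulative worldwide totals, one row per day.
totals = pd.DataFrame({
    "Date": ["2020-03-20", "2020-03-21", "2020-03-22", "2020-03-23"],
    "ConfirmedCases": [10, 15, 25, 45],
})

# diff() subtracts the previous row; the first day has no predecessor (NaN).
totals["NewCases"] = totals["ConfirmedCases"].diff()
print(totals)
```

If NewCases keeps growing from one day to the next, the cumulative curve is still accelerating; a flattening curve would show NewCases shrinking towards zero.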

3. Confirmed cases in selected countries.
Let's look at a few selected countries: China, Italy, and the USA.

# China
ch = df.loc[df['Country/Region'] == 'China'].reset_index()
# sum over provinces to get one total per date
ch_group = ch.groupby('Date')['ConfirmedCases'].sum().reset_index()
# Italy
it = df.loc[df['Country/Region'] == 'Italy'].reset_index()
it_group = it.groupby('Date')['ConfirmedCases'].sum().reset_index()
# USA
us = df.loc[df['Country/Region'] == 'US'].reset_index()
us_group = us.groupby('Date')['ConfirmedCases'].sum().reset_index()
# plotting confirmed cases of the random countries
plot_titles = ['China', 'Italy', 'USA']
# China
fig = px.line(ch_group, x="Date", y="ConfirmedCases",
title=f"Confirmed Cases in {plot_titles[0].upper()} Over Time",
color_discrete_sequence=['#F61067'],
height=500
)
fig.show()
# Italy
fig = px.line(it_group, x="Date", y="ConfirmedCases",
title=f"Confirmed Cases in {plot_titles[1].upper()} Over Time",
color_discrete_sequence=['#91C4F2'],
height=500
)
fig.show()
# USA
fig = px.line(us_group, x="Date", y="ConfirmedCases",
title=f"Confirmed Cases in {plot_titles[2].upper()} Over Time",
color_discrete_sequence=['#6F2DBD'],
height=500
)
fig.show()
Displaying China confirmed cases over time.
Displaying Italy Confirmed cases over time
Displaying USA Confirmed cases over time

Looking at the plot of China’s cases, it is clear that the spread of the disease has slowed since March, which is good news.
Italy, by contrast, is being hit badly. We should also take note of the USA’s situation, where confirmed cases have been increasing over the past weeks.
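
The filter-then-group step is the same for all three countries, so it can be folded into a small helper. The function name country_series is hypothetical and the sample rows are made up; this is only a sketch of the idea.

```python
import pandas as pd

def country_series(frame, country):
    """Return one row per date with the country's total confirmed cases."""
    subset = frame[frame["Country/Region"] == country]
    # Sum over provinces so each date appears once.
    return subset.groupby("Date")["ConfirmedCases"].sum().reset_index()

# Illustrative rows: China is reported per province, Italy as a whole.
sample = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", "", "Hubei", "Beijing", ""],
    "Country/Region": ["China", "China", "Italy", "China", "China", "Italy"],
    "Date": ["2020-03-22"] * 3 + ["2020-03-23"] * 3,
    "ConfirmedCases": [100, 10, 50, 110, 12, 70],
})

china = country_series(sample, "China")
italy = country_series(sample, "Italy")
print(china)
print(italy)
```

Each returned frame has the Date and ConfirmedCases columns that px.line expects, so the plotting calls can stay as they are.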

4. Map of all countries affected by the virus as at 23-03-2020.

Note: the current_case variable holds the per-country totals as at 23-03-2020.

fig = px.choropleth(current_case, locations="Country/Region", 
locationmode='country names', color="ConfirmedCases",
hover_name="Country/Region", range_color=[1,5000],
color_continuous_scale="peach",
title='Countries with Confirmed Cases')
# fig.update(layout_coloraxis_showscale=False)
fig.show()
Displaying geographical map of confirmed cases.

The map above shows how the virus has spread across different countries. Countries with more than 20,000 cases are China, Germany, Iran, Italy, Spain, and the USA.

5. Which country has the highest number of deaths as at 23-03-2020?

highest_death = current.groupby('Country/Region')['Fatalities'].sum().reset_index()
fig = px.bar(highest_death.sort_values('Fatalities',ascending=False)[:10][::-1],
    x='Fatalities',y='Country/Region',
    title='Death Cases Worldwide (23-03-2020)', text='Fatalities', height=900, orientation='h')
fig.show()
Displaying Death case worldwide.

Italy has the highest number of deaths, overtaking China and the rest of the world.

6. Deaths over time.

death_case = df.groupby('Date')['Fatalities'].sum().reset_index()
fig = px.line(death_case, x="Date", y="Fatalities",
    title="Worldwide Fatalities Over Time")
fig.show()
Displaying worldwide death over time.

The plot above shows the rapid growth in the number of deaths globally.

7. Geographical map of death cases globally.

fig = px.choropleth(current_case, locations="Country/Region", 
locationmode='country names', color="Fatalities",
hover_name="Country/Region", range_color=[1,5000],
color_continuous_scale="peach",
title='Countries with Confirmed Deaths')
# fig.update(layout_coloraxis_showscale=False)
fig.show()
Geographical map showing the death rate of different countries

Italy has the highest death toll, at 6,077, and the countries with over 1,000 deaths are China, Iran, Italy, and Spain.

Let’s check the number of deaths in African countries.

# all affected African countries as at 23-03-2020
africa = ['Nigeria','Ethiopia','Egypt','Republic of the Congo','Tanzania','South Africa','Kenya','Algeria','Sudan','Morocco',
'Ghana','Cameroon','Cote d\'Ivoire','Burkina Faso','Zambia','Senegal','Somalia','Guinea','Rwanda','Benin',
'Tunisia','Togo','Congo (Brazzaville)','Congo (Kinshasa)','Liberia','Central African Republic','Mauritania',
'Namibia','The Gambia','Gabon','Equatorial Guinea','Mauritius','Eswatini','Djibouti','Seychelles']
africa_death_rate = current[current['Country/Region'].isin(africa)]
# africa_death_rate.head()
# plotting the the death rate geographically
fig = px.choropleth(africa_death_rate, locations="Country/Region",
locationmode='country names', color="Fatalities",
hover_name="Country/Region", range_color=[1,2000],
color_continuous_scale='peach',
title='African Countries with Confirmed Death Cases', scope='africa', height=800)
# fig.update(layout_coloraxis_showscale=False)
fig.show()
# the death cases of African countries as at 23-03-2020
africa_dr_group=africa_death_rate.groupby('Country/Region')[['ConfirmedCases', 'Fatalities']].sum().reset_index()
africa_dr_group.sort_values('Fatalities', ascending=False)[['Country/Region', 'Fatalities']][:15].style.background_gradient(cmap='Reds')
Displaying Geographical map of Confirmed death cases in Africa.
Fatality rate in Africa.

Egypt has the highest number of deaths in Africa as at 23-03-2020, with 19 deaths and 366 confirmed cases.
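
The table reports death counts. To express them as a rate, one option is the share of confirmed cases that ended in death. The sketch below uses Egypt's figures quoted above plus one hypothetical row with zero cases (named Exampleland) to show how division by zero can be avoided; the column name FatalityRatioPct is an assumption for illustration.

```python
import pandas as pd

# Egypt's figures come from the text above; Exampleland is hypothetical.
africa_sample = pd.DataFrame({
    "Country/Region": ["Egypt", "Exampleland"],
    "ConfirmedCases": [366, 0],
    "Fatalities": [19, 0],
})

# Replace zero case counts with NaN so the division does not blow up.
cases = africa_sample["ConfirmedCases"].where(africa_sample["ConfirmedCases"] > 0)

# Deaths as a percentage of confirmed cases, rounded to two decimals.
africa_sample["FatalityRatioPct"] = (africa_sample["Fatalities"] / cases * 100).round(2)
print(africa_sample)
```

Keep in mind that this ratio depends heavily on how many cases a country has tested and confirmed, so it is only a rough indicator.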

Let's check how deaths have spread globally over time.

# create death_spread variable
death_spread = df.groupby(['Date', 'Country/Region'])[['ConfirmedCases', 'Fatalities']].max()
death_spread = death_spread.reset_index()
death_spread['Date'] = pd.to_datetime(death_spread['Date'])
death_spread['Date'] = death_spread['Date'].dt.strftime('%m/%d/%Y')
death_spread['Size'] = death_spread['Fatalities'].pow(0.3)
fig = px.scatter_geo(death_spread, locations="Country/Region", locationmode='country names',
    color="Fatalities", size='Size', hover_name="Country/Region", range_color=[0, 100],
    projection="natural earth", animation_frame="Date",
    title='COVID-19: Deaths Spread Over Time Globally (2020-01-22 to 2020-03-23)',
    color_continuous_scale="peach")
fig.show()
Displaying death spread caused by the virus globally.

The plot above shows that the virus started in China, where it caused a growing number of deaths, and that deaths began to appear outside China on 02-02-2020, first in the Philippines and then in other countries.
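
The date on which each country first reported a death can also be read directly from the data instead of from the animation. The sketch below assumes a few illustrative rows (the counts are placeholders, not taken from the dataset) and keeps, per country, the earliest date with at least one fatality.

```python
import pandas as pd

# Illustrative rows only; counts are placeholders.
sample = pd.DataFrame({
    "Country/Region": ["China", "China", "Philippines", "Philippines", "Philippines"],
    "Date": ["2020-01-22", "2020-01-23", "2020-02-01", "2020-02-02", "2020-02-03"],
    "Fatalities": [17, 18, 0, 1, 1],
})
sample["Date"] = pd.to_datetime(sample["Date"])

# Keep rows with at least one death, then take the earliest date per country.
first_death = (
    sample[sample["Fatalities"] > 0]
    .groupby("Country/Region")["Date"]
    .min()
    .sort_values()
    .reset_index()
)
print(first_death)
```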

Although COVID-19 mostly affects the old, it can also affect the young. Even if you don’t get sick, the choices you make about where you go could be the difference between life and death for someone else. Please let's stay safe and follow the WHO's preventive measures to contain the virus. More updates will be posted.

To be continued…

Reference

  1. https://en.wikipedia.org/wiki/Coronavirus
  2. https://www.kaggle.com/c/covid19-global-forecasting-week-1
  3. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/events-as-they-happen
  4. https://www.kaggle.com/abhinand05/covid-19-digging-a-bit-deeper

I’m an aspiring data scientist with a background in Computer Science and Mathematics.