COVID-19 Data Analysis with Plotly

Published in Data Science Student Society @ UC San Diego

8 min read · Jun 9, 2020

Introduction

Hello, everyone! Today we are going to cover the Plotly library and use it to do some data analysis on COVID-19. To begin with, Plotly is an interactive graphing library with a lot of features: users can zoom in and out, save a screenshot, and reveal hidden data by hovering the mouse over the plot.

Some may ask: “So what?” Sooner or later you will need to put so much data into a matplotlib figure that it turns messy and unorganized. Imagine you want to plot US city populations on a US map and label every city with its number. With matplotlib, the labels would simply overlap one another. And sometimes you want to view only part of the image instead of the whole thing; then you have to rewrite the Python code and regenerate the figure, which is time-consuming and makes for an inefficient workflow.

Well, with the help of the Plotly library, all of this gets resolved! The plot can be zoomed in and out to see either the details or the general picture. And you don’t need to worry about overlapping labels either, as each one shows only when you place the mouse over its point.

Another advantage of Plotly is that it is portable to any website, because Plotly graphs are ultimately represented by JavaScript code. You can ship that JavaScript wherever you want: for example, you can generate the Plotly graph in the backend and send the JavaScript to the frontend of, say, a Flask website.

Tutorial

Great, so let’s dive into the amazing Plotly library. The COVID-19 data will be from the Kaggle website. The link is here.

First, let’s take a look at the data and do some basic data exploration (not shown here).

Box Plot

Let’s visualize the relationship between gender and age among those infected and see what the outcome is.

import plotly.express as px
import numpy as np
import pandas as pd

# Load the dataset
data = pd.read_csv('./novel-corona-virus-2019-dataset/COVID19_open_line_list.csv')
sub_data = data[['age','sex','outcome']]
sub_data = sub_data.dropna()

# Clean these three columns.
age_col = list(sub_data['age'])
for i,a in enumerate(age_col):
    age = str(a).split('-')
    try:
        if len(age) > 1:
            low,high = age
            # take the midpoint of an age range such as "20-29"
            age_col[i] = (float(low)+float(high))/2
        else:
            age_col[i] = float(age[0])
    except ValueError:
        # anything that cannot be parsed becomes N/A
        age_col[i] = np.nan
sub_data['age'] = age_col

# Drop rows with N/A values
sub_data = sub_data.dropna()

# Combine the uppercase and lowercase versions of sex
sub_data['sex']  = [s.lower() for s in sub_data['sex']]

# Collapse the different spellings of the outcome into three groups
outcome_list = list(sub_data['outcome'])
for i,o in enumerate(outcome_list):
    if o.lower() in ['died','death']:
        outcome_list[i] = 'death'
    elif o.lower() in ['discharged','discharge','recovered']:
        outcome_list[i] = 'discharged'
    else:
        outcome_list[i] = 'isolated'
sub_data['outcome'] = outcome_list

# Draw the plot
fig = px.box(sub_data, x="sex", y="age",title='Coronavirus confirmed cases until May 2020',points='all',color ='outcome')
fig.show()

Great! We have finished our first interactive plot. As we see here, when we place the mouse over a point, the plot clearly shows its exact values, and they disappear once the mouse moves away.
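As a side note on the data cleaning, the age clean-up in the code above can also be written as a small standalone function, which makes it easier to test on a few values. This is only a sketch: parse_age is a hypothetical helper name, and treating every unparseable entry as N/A is an assumption about how such entries should be handled.

```python
import math

def parse_age(raw):
    """Return an age as a float, or NaN when the entry cannot be parsed.

    A range such as "20-29" is replaced by its midpoint.
    """
    parts = str(raw).strip().split("-")
    try:
        values = [float(p) for p in parts]
    except ValueError:
        return math.nan
    if len(values) == 1:
        return values[0]
    if len(values) == 2:
        return (values[0] + values[1]) / 2
    # More than one dash: not a shape this sketch knows how to read.
    return math.nan

examples = [parse_age("20-29"), parse_age("35"), parse_age("N/A")]
```

With pandas, a function like this could be applied to the whole column with sub_data['age'].map(parse_age) before dropping the N/A rows.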

Another great thing about a Plotly plot is that you can select which groups to show. Right now all three outcomes are shown. If you want to better compare the distributions of discharged and isolated people, just click the death label in the legend on the right to hide that group.

A brief intro to Plotly’s top-right buttons

As you may have noticed, there are some buttons at the top right of the Plotly plot, and you might wonder what they are for.

The first one is the screenshot button, which downloads the plot as an image. The second one zooms in on a specific part of the plot. The third one, the cross sign, lets you drag the plot around. The fourth one, the dashed box, selects part of the plot without zooming in. The plus and minus signs zoom in and zoom out, respectively. The next one resets the scale back to what it was when the plot was first generated. The home button resets all settings. The remaining three aren’t used frequently. If you are interested, you can try them yourself! :)

Pie Chart

We can also generate a pie chart using this library. In this example, I am going to count how often each COVID-19 symptom is reported and represent the counts in one pie chart.

from collections import defaultdict
import plotly.graph_objects as go

sub_data = data['symptoms'].dropna()
symp_counts = defaultdict(int)
for s in sub_data:
    symps = str(s).split(',')
    for symp in symps:
        # drop the single space left after each comma
        if symp.startswith(' '):
            symp = symp[1:]
        # merge the different wordings of the most common symptoms
        if 'fever' in symp.lower():
            symp = 'fever'
        elif 'cough' in symp.lower():
            symp = 'cough'
        elif 'weak' in symp.lower() or 'fatigue' in symp.lower():
            symp = 'weak/fatigue'
        elif 'headache' in symp.lower():
            symp = 'headache'
        symp_counts[symp] += 1

# Make the plot. The slices drop the last 28 entries of the dictionary.
labels = list(symp_counts.keys())[:-28]
values = list(symp_counts.values())[:-28]
fig = go.Figure(data=[go.Pie(labels=labels, values=values,title = "Symptoms after infected by coronavirus")])
fig.show()

This is pretty straightforward. As I mentioned before, with Plotly you don’t need to worry about having too much data to squeeze into one visual, because you can always move the mouse around and look only at the details of one specific part.
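If the pie still ends up with too many thin slices, one alternative to cutting entries by position (not what the code above does) is to keep the most frequent labels and sum everything else into a single slice. The sketch below uses made-up counts, and top_n_with_other is a hypothetical helper name.

```python
from collections import Counter

def top_n_with_other(counts, n):
    """Keep the n most frequent labels and sum the remainder into 'other'."""
    ranked = Counter(counts).most_common()
    top = dict(ranked[:n])
    remainder = sum(count for _, count in ranked[n:])
    if remainder:
        top["other"] = top.get("other", 0) + remainder
    return top

# Made-up counts, for illustration only.
sample = {"fever": 50, "cough": 30, "headache": 4, "chills": 2, "nausea": 1}
grouped = top_n_with_other(sample, 2)
labels, values = list(grouped.keys()), list(grouped.values())
```

The resulting labels and values lists can be passed to go.Pie in the same way as before.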

Package Difference

Some may have noticed that I have used two different packages so far: plotly.express and plotly.graph_objects. As you can tell by the name, express is a quick, high-level way to draw plots. It is easy because you only need to pass in the data, but it gives you less direct control over the layout up front (the figure it returns can still be adjusted afterwards, for example with update_layout). graph_objects provides a more primitive workflow: you can specify how far the label and plot should be from each other, where the title should be, and so on. It requires more work but is more flexible.

Bar Plot

Great! Now we have covered two classic plots. How about we plot another one, the bar plot? In this example, I am going to plot the cumulative confirmed cases across all the states of the U.S.

confirmed = pd.read_csv('./novel-corona-virus-2019-dataset/time_series_covid_19_confirmed_US.csv')
# numeric_only=True sums only the numeric columns and leaves out text columns such as Admin2
confirmed = confirmed.groupby('Province_State').sum(numeric_only=True)
# As the first 60 columns didn't have too many cases, I select the dates after them to make the plot look better.
dates = confirmed.columns[60:]
states = list(confirmed.index)
fig = go.Figure(data=[ go.Bar(name=states[i], x=dates, y=list(confirmed.iloc[i,60:])) for i in range(len(confirmed))])
# Change the bar mode
fig.update_layout(barmode='stack',title_text='Time Series Confirmed Cases across States')
fig.show()

Bubble Plot on the map

Some may also like to view confirmed COVID-19 cases from a geographical lens. Plotly provides us with this feature.

df = pd.read_csv('./novel-corona-virus-2019-dataset/time_series_covid_19_confirmed_US.csv')

# Use the last date as the total confirmed cases
df['total'] = df['5/17/20']

# Sort the total cases from low to high (sort_values is ascending by default)
df = df.sort_values(by=['total'],ignore_index=True)

# Text that will be shown once you place your mouse over a point. The text here is basically HTML code.
df['text'] = df['Province_State'] + ' ' +df['Admin2'] + '<br>Confirmed Cases ' + (df['total']).astype(str)+' cases'

# Divide the cities into 5 severity levels (by row position in the sorted dataframe).
limits = [(0,600),(600,1200),(1200,1800),(1800,2400),(2400,len(df))]

# Specify the color for each severity level
colors = ["white","lightblue","lightseagreen","royalblue","darkblue"]

fig = go.Figure()

# For each severity level, plot the bubbles
for i in range(len(limits)):
    lim = limits[i]
    df_sub = df[lim[0]:lim[1]]
    # Each trace adds all the points of one group, one of the five groups shown in the legend on the right.
    fig.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
        lon = df_sub['Long_'],
        lat = df_sub['Lat'],
        text = df_sub['text'],
        marker = dict(
            size = np.log2(df_sub['total']+1)*5,
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = '{0} - {1}'.format(lim[0],lim[1])))

fig.update_layout(
        title_text = '2020 COVID-19 US city Infected Cases',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)',
        )
    )
fig.show()

The deeper the color, the more confirmed cases. You may notice that the five labels on the right are actually the start and end index positions in the dataframe. They are confusing, but there is a way to fix this. I will leave this as a challenge to you.
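As a hint for that challenge, one possible approach (certainly not the only one) is to build each label from the smallest and largest case count inside the slice rather than from the slice positions. The sketch below works on a plain list of made-up totals; case_range_labels is a hypothetical helper name.

```python
def case_range_labels(sorted_totals, limits):
    """Label each (start, end) index slice by the case counts it contains.

    sorted_totals must be in the same order as the rows being sliced.
    """
    labels = []
    for start, end in limits:
        chunk = sorted_totals[start:end]
        if len(chunk) == 0:
            labels.append("no data")
        else:
            labels.append("{0} - {1} cases".format(min(chunk), max(chunk)))
    return labels

# Made-up, ascending totals, for illustration only.
totals = [0, 2, 5, 9, 40, 120, 800, 15000]
labels = case_range_labels(totals, [(0, 3), (3, 6), (6, 8)])
```

In the bubble plot, you could call it with list(df['total']) and the limits list, then use the i-th label as the name of the i-th trace.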

Animation

Moreover, Plotly can also generate an animated graph for you. For example, suppose you want to look at how confirmed coronavirus cases grew around the world over time. Well, that is also feasible in Plotly.

# Map country names to ISO alpha-3 codes using the gapminder sample data
df = px.data.gapminder()
iso_alpha_mapping = dict(zip(df['country'], df['iso_alpha']))

timeline = pd.read_csv('./novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')
# numeric_only=True sums only the numeric columns
timeline = timeline.groupby('Country/Region').sum(numeric_only=True)
timeline['total'] = timeline[timeline.columns[4:]].sum(axis = 1)
num_countries = len(timeline)

# There are hundreds of countries around the world, each with ~300 data points. In total that would be too much data for the plot and it would take a long time to generate. So I only keep countries whose total is at least 30000.
threshold = 30000
rows = []
for i in range(num_countries):
    row = timeline.iloc[i]
    if row['total'] < threshold:
        continue
    if row.name not in iso_alpha_mapping:
        continue

    for j in timeline.columns[4:-1]:
        rows.append(
            {'Country/Region':row.name,
             'Date':j,
             'Confirmed':int(row[j]),
             'iso_alpha':iso_alpha_mapping[row.name]})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
timeline_confirmed = pd.DataFrame(rows, columns=['Country/Region','Date','Confirmed','iso_alpha'])
timeline_confirmed['Confirmed'] = pd.to_numeric(timeline_confirmed['Confirmed'])
timeline_confirmed['Text'] = 'Cases: '+timeline_confirmed['Confirmed'].astype(str)
timeline_confirmed['Size'] = timeline_confirmed['Confirmed']+1

fig = px.scatter_geo(timeline_confirmed, locations="iso_alpha",color = 'Country/Region',
                     hover_name="Country/Region", size="Size",
                     animation_frame="Date",text='Text',
                     projection="natural earth")
fig.show()

The slider at the bottom lets you see how the numbers change over time.

A close look at the EU:

Discussion

There’s one thing you should be aware of before we wrap up today: Plotly does not scale very well. The reason is that all these plots are HTML and JavaScript code running in your browser. Therefore, when you put hundreds of thousands of floats onto a page, you might find that it does not run smoothly. How well it runs depends on how powerful your hardware is.
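One common workaround, which the analysis above does not use, is to thin the data before handing it to the plot. Here is a minimal sketch that keeps evenly spaced points; thin is a hypothetical helper name, and the limit of 1,000 points is an arbitrary choice for illustration.

```python
def thin(points, max_points):
    """Return at most max_points items, sampled at an even stride."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if len(points) <= max_points:
        return list(points)
    stride = -(-len(points) // max_points)  # ceiling division
    return list(points[::stride])

values = list(range(100_000))
reduced = thin(values, 1_000)
```

Even striding is crude and can skip short spikes, so it suits smooth series such as cumulative counts better than noisy ones.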

End

That’s it! You are more than welcome to check out other plots on the Plotly website and use them in your web app!