Choropleth Map with JHU COVID-19 Dataset

Dangtvivian
4 min read · Apr 8, 2020

Over 370,000 COVID-19 cases have been confirmed in the US, and more than 12,000 people have died from the disease. The spread of this highly infectious respiratory disease has not yet slowed. To help track COVID-19 trends, the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) created a GitHub repository earlier this year to share updated COVID-19 case reports for research and educational purposes. In this blog, I will provide a quick guide to their COVID-19 repository and introduce some Python tools for data analysis and visualization.

JHU CSSE COVID-19 Repository

Let’s begin by taking a look at the repo at https://github.com/CSSEGISandData/COVID-19. To obtain daily reports and time series CSV files, click on the folder ‘csse_covid_19_data’. In this folder, you will find one folder for daily reports and another for time series data.

In the time series folder, there are five CSV files covering:

  • Confirmed cases in the US and globally
  • Deaths in the US and globally
  • Recovered cases globally

The time series dataset provides the following information:

  • Country/Region, Province/State, Latitude and Longitude for location identification
  • Dates ranging from January 22, 2020 to the present
  • Each row represents a unique location, with one column per date holding the cumulative case count reported as of that date
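
To make that wide layout easier to work with, here is a minimal sketch of reshaping it with pandas. The sample rows and place names are made up for illustration, and I am assuming the date columns hold running totals, so the day-over-day difference gives the new cases.

```python
import io
import pandas as pd

# Made-up sample in the wide layout described above (values are illustrative only).
sample = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20\n"
    ",Examplia,10.0,20.0,1,3,6\n"
    ",Otherland,30.0,40.0,0,2,2\n"
)
wide = pd.read_csv(sample)

# Turn the date columns into rows: one row per location per date.
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
tidy = wide.melt(id_vars=id_cols, var_name="Date", value_name="Confirmed")
tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y")

# Assuming running totals, the per-location difference is the daily new count.
tidy = tidy.sort_values(["Country/Region", "Date"])
daily_change = tidy.groupby("Country/Region")["Confirmed"].diff()
tidy["New"] = daily_change.fillna(tidy["Confirmed"])  # first day: total so far

print(tidy[["Country/Region", "Date", "Confirmed", "New"]])
```

The long format is usually easier to filter, group, and plot than one column per date.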

In the daily report folder, there will be a new CSV file every day, from January 22, 2020 to the present:

  • Each CSV reports the number of confirmed cases, deaths, and recovered cases by location (province, state, country, or region) for that date.
  • The columns used in the map below are ‘FIPS’ (the county code) and ‘Confirmed’ (the number of confirmed cases).
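
If you want to check the columns yourself, a quick sketch like the one below works. The header row is my recollection of the April 5 report, so treat it as an assumption, and the two data rows are partly invented placeholders. The sketch also keeps only the US rows, since the daily report mixes US counties with other countries and regions.

```python
import io
import pandas as pd

# Header as I understand the 04-05-2020 report (assumption); the case counts,
# coordinates, and the "Examplia" row are invented placeholders.
sample = io.StringIO(
    "FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,"
    "Confirmed,Deaths,Recovered,Active,Combined_Key\n"
    '45001,Abbeville,South Carolina,US,2020-04-05 23:06:45,34.2,-82.4,'
    '6,0,0,0,"Abbeville, South Carolina, US"\n'
    ",,,Examplia,2020-04-05 23:06:45,10.0,20.0,3,0,0,0,Examplia\n"
)
report = pd.read_csv(sample)

# List the column names, then keep only the US rows.
print(report.columns.tolist())
us_rows = report[report["Country_Region"] == "US"]
print(us_rows[["FIPS", "Admin2", "Confirmed"]])
```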

Map Time!

Let’s grab a daily report CSV file and create a choropleth map of confirmed COVID-19 cases across the United States!

Step 1: In a Jupyter notebook, import the tools

import pandas as pd
import plotly.express as px
from urllib.request import urlopen
import json

Step 2: Using pandas (pd), import daily report CSV file. Here, I am using April 5th’s report.

df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-05-2020.csv')

Step 3: Using pandas methods, assess the data frame (df)

df.head()
#shows the top 5 rows
df.info()
#shows the number of columns and rows, data types, and non-null value counts
df['column_name'].value_counts()
#shows the unique values in the column and how often each occurs

Step 4: Process the FIPS column. For mapping, FIPS must be a 5-character string, with leading zeros where needed and without the trailing ".0" left by the float conversion.

df['FIPS'] = df['FIPS'].astype(str)
#convert FIPS from a float to a string
FIPS = []
for value in df['FIPS']:
    FIPS.append(value.replace(".0",""))
df['FIPS'] = FIPS
#remove the trailing ".0" left over from the float
df['FIPS'] = df['FIPS'].apply(lambda x: '{0:0>5}'.format(x))
#add leading zeros for FIPS with length less than 5
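
The same clean-up can be written more compactly. This is a sketch of an alternative, not a replacement for the steps above: it drops rows that have no FIPS code, converts the float to an integer, and pads with zfill. The rows below are made up, and the name counties_df is just for illustration.

```python
import pandas as pd

# Made-up rows; FIPS arrives as a float because the column has missing values.
raw = pd.DataFrame({
    "FIPS": [1001.0, 36061.0, None],
    "Confirmed": [12, 345, 6],
})

# Keep only rows with a county code, then format it as a 5-character string.
counties_df = raw.dropna(subset=["FIPS"]).copy()
counties_df["FIPS"] = counties_df["FIPS"].astype(int).astype(str).str.zfill(5)

print(counties_df)
```

Dropping the rows without a code also avoids turning a missing value into a padded string such as "00nan", which could never match a county on the map anyway.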

Step 5: Using urlopen and json, load the US county boundaries (a GeoJSON file) from Plotly’s datasets repository

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
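
It helps to know what this file looks like. As far as I can tell, each county is a GeoJSON feature whose "id" is the 5-character FIPS code, and px.choropleth matches the locations column against that id by default. The sketch below uses a tiny hand-written stand-in (not the real file) to check which FIPS values in the data have no matching feature. The names counties_sample and fips_in_data are hypothetical.

```python
# Hand-written stand-in for the counties GeoJSON; the real file has one
# feature per county, with real geometry instead of None.
counties_sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "01001", "properties": {}, "geometry": None},
        {"type": "Feature", "id": "36061", "properties": {}, "geometry": None},
    ],
}

# Hypothetical FIPS values as they might look after the formatting step;
# "00nan" is what a missing FIPS turns into after str conversion and padding.
fips_in_data = ["01001", "36061", "00nan", "99999"]

# Any code without a matching feature id cannot be drawn on the map.
feature_ids = {feature["id"] for feature in counties_sample["features"]}
unmatched = [code for code in fips_in_data if code not in feature_ids]

print(unmatched)
```

Running the same check against the real counties object is a quick way to see which rows of the data frame will be left off the map.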

Step 6: Using plotly (px), create a choropleth map with the columns ‘FIPS’ and ‘Confirmed’ from the data frame.

fig = px.choropleth(df, geojson=counties, locations='FIPS', color='Confirmed',
                    color_continuous_scale="Viridis", range_color=(0, 1000),
                    scope="usa", labels={'Confirmed': 'Number of Confirmed Cases'})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Source: https://viviandng.github.io/COVID-19-Map/index.html

The map above shows the number of confirmed cases by county. If you hover over a county, you’ll see its FIPS code (a Federal Information Processing Standard Publication 6-4 code, which uniquely identifies counties and county equivalents in the United States) and the corresponding number of confirmed cases.

In this choropleth map, areas shaded in purple have between 0 and 200 confirmed cases, which is relatively low compared to the COVID-19 hotspots shown in yellow. Because the color range is capped at 1,000, areas shaded in yellow have 1,000 or more confirmed cases. These areas may need more support from the government: they may face shortages of ventilators, personal protective equipment, and hospital beds, and their local healthcare systems may be at or over capacity. Note that some areas are unshaded because the dataset did not include any information for those FIPS codes.
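
One caveat when reading the colors: with range_color=(0, 1000), every county at or above 1,000 cases gets the same top color, so the map cannot tell 1,000 cases apart from 25,000. Here is a quick sketch of that effect, using made-up counts:

```python
import pandas as pd

# Made-up confirmed counts for five hypothetical counties.
confirmed = pd.Series([5, 150, 800, 1000, 25000], name="Confirmed")

# What the color scale effectively "sees" when the range stops at 1,000.
color_value = confirmed.clip(upper=1000)

# Number of counties that end up sharing the top color.
at_top_color = int((confirmed >= 1000).sum())

print(color_value.tolist())
print(at_top_color)
```

If the differences among hotspots matter for your analysis, one option is to raise the upper bound of range_color.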

Conclusion

This visualization can draw attention to the counties most affected by the COVID-19 pandemic. With the help of this dataset, the federal government can identify high-impact counties and direct more aid to them. It can also make residents of those counties more aware of the spread of the disease, so that they know to exercise more caution and follow CDC guidelines more closely. I hope this blog has shed some light on some of the capabilities of the Johns Hopkins COVID-19 dataset. I’m excited to see what others can do with this publicly available data!
