Published in Analytics Vidhya

COVID-19 Data Analysis

Visualizations and Predictions on COVID-19 pandemic data

SARS-CoV-2 Structure (Source: Scientific Animations under CC License)


The COVID-19 pandemic, also known as the coronavirus pandemic, is the ongoing outbreak of coronavirus disease 2019 (COVID-19). It is caused by a coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2).

The outbreak was identified in Wuhan, China, in December 2019. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on 30 January 2020, and a pandemic on 11 March 2020.

It is a respiratory disease and is thought to spread mainly through close person-to-person contact, via respiratory droplets from someone who is infected. People who are infected often have symptoms of illness. Some people without symptoms may be able to spread the virus. People may also become infected by touching a contaminated surface and then touching their face.

Common symptoms include fever, cough, fatigue, shortness of breath, and loss of smell. Complications may include pneumonia and acute respiratory distress syndrome. The time from exposure to onset of symptoms is typically around five days, but may range from two to fourteen days. At the time of writing (May 2020), there is no known vaccine or specific antiviral treatment. Primary treatment is symptomatic and supportive therapy.

This blog presents analysis, visualizations and predictions on COVID-19 pandemic data.

This analysis is divided into three parts:

  1. Data Preparation
  2. Visualization
  3. Prediction

To follow along, see the code on my GitHub profile.

Data Preparation

Many data sources are available online. The one used in this blog is provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). This data is updated daily, so for current visualizations, please run the notebook.

Because the data is updated daily, the files are loaded directly from the online source into pandas DataFrames instead of being downloaded first.

Loading Data from online source
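A minimal sketch of this loading step is given here. The base URL and file names are assumptions based on the public JHU CSSE GitHub repository layout, and the helper names `load_time_series` and `load_all` are hypothetical. None of them are taken from the notebook.

```python
import pandas as pd

# Assumed location of the JHU CSSE time-series files (not stated in this
# article); adjust if the repository layout differs.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_time_series/"
)
FILES = {
    "confirmed": "time_series_covid19_confirmed_global.csv",
    "deaths": "time_series_covid19_deaths_global.csv",
    "recovered": "time_series_covid19_recovered_global.csv",
}


def load_time_series(source):
    """Read one time-series CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


def load_all(base_url=BASE_URL):
    """Load the three global time series directly from the online source."""
    return {name: load_time_series(base_url + fname) for name, fname in FILES.items()}
```

Calling `frames = load_all()` would fetch the latest files each time, which is the point of reading from the URL rather than from a saved copy.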

After loading, the data is sorted by column and stored in lists for visualization and prediction.

Storing Data into various data structures for visualization and prediction
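One way this step could look is sketched below. It assumes the wide JHU layout, in which the first four columns describe the location and every later column is a date. The helper `totals_by_date` is hypothetical, and the sample values are made up for illustration.

```python
import pandas as pd


def totals_by_date(df, first_date_col=4):
    """Return (dates, totals): the date labels and the sum over all rows per date."""
    date_cols = df.columns[first_date_col:]
    totals = df[date_cols].sum(axis=0)
    return list(date_cols), totals.astype(int).tolist()


# Tiny illustrative frame in the assumed layout (not real data).
confirmed_df = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["A", "B"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "1/22/20": [1, 0],
    "1/23/20": [3, 2],
})

dates, world_cases = totals_by_date(confirmed_df)
```

The same helper can be applied to the deaths and recoveries frames, giving three parallel lists indexed by day.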

Of these lists, one stores the mortality rate and another stores the recovery rate.

Mortality rate can be defined as the ratio of the number of deaths recorded to the total number of cases recorded. It is calculated using the following formula:

Mortality Rate = (Number of deaths) / (Total number of cases)

Mortality Rate Formula

Recovery rate can be defined as the ratio of the number of recovered patients recorded to the total number of cases recorded. It is calculated using the following formula:

Recovery Rate = (Number of recoveries) / (Total number of cases)

Recovery Rate Formula
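As a sketch, both ratios can be computed day by day from the stored lists. The function name `rate` and the sample numbers are illustrative assumptions. Whether the notebook expresses the rates as fractions or as percentages is not stated here, so fractions are used.

```python
def rate(numerator, confirmed):
    """Element-wise ratio; returns 0.0 for days with no recorded cases."""
    return [n / c if c else 0.0 for n, c in zip(numerator, confirmed)]


# Made-up cumulative counts for three days (illustrative only).
world_cases = [100, 200, 400]
total_deaths = [2, 5, 12]
total_recovered = [10, 40, 100]

mortality_rate = rate(total_deaths, world_cases)      # deaths / cases
recovery_rate = rate(total_recovered, world_cases)    # recoveries / cases
```

Guarding against a zero denominator matters for the earliest dates in some regions, where no cases had been recorded yet.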

Now that the data is sorted into lists and stored, it is time for visualizations.


Visualization

With the data loaded, prepared, and stored, the worldwide stats are plotted first. The data used here starts on January 22, 2020 and is updated daily.

Worldwide statistics

Let’s start with a couple of graphs showing the overall situation of the total cases.

The images reflect the data as of 5 May 2020.

Worldwide plots: (a) total cases (b) current active cases (c) deaths (d) recoveries
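A four-panel figure like this could be produced with matplotlib roughly as follows. This is a sketch and not the notebook's plotting code. The function `plot_worldwide` is hypothetical, and active cases are assumed to be confirmed cases minus deaths minus recoveries, which this article does not state explicitly.

```python
import matplotlib.pyplot as plt


def plot_worldwide(days, confirmed, deaths, recovered):
    """Draw total, active, death, and recovery counts in a 2x2 grid."""
    # Assumption: active = confirmed - deaths - recovered.
    active = [c - d - r for c, d, r in zip(confirmed, deaths, recovered)]
    series = [
        ("Total cases", confirmed),
        ("Active cases", active),
        ("Deaths", deaths),
        ("Recoveries", recovered),
    ]
    fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharex=True)
    for ax, (title, values) in zip(axes.flat, series):
        ax.plot(days, values)
        ax.set_title(title)
        ax.set_xlabel("Days since 22 January 2020")
        ax.set_ylabel("Count")
    fig.tight_layout()
    return fig
```

Returning the figure instead of calling `plt.show()` inside the function leaves it to the caller to display it or save it with `fig.savefig(...)`.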

This is what a pandemic looks like: huge growth in positive cases and their outcomes. Fortunately, the count of deaths is lower than the count of recoveries.

Now, let’s see a breakdown of the day-wise counts.

Day-wise worldwide plots: (a) confirmed cases (b) deaths (c) recoveries
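Since the source data is cumulative, day-wise counts have to be derived from differences between consecutive days. A minimal sketch with NumPy follows. The helper `daily_increase` is a hypothetical name and the input values are made up.

```python
import numpy as np


def daily_increase(cumulative):
    """Convert cumulative totals into per-day counts (the first day is kept as-is)."""
    cumulative = np.asarray(cumulative)
    # prepend=0 makes the first difference equal to the first day's total.
    return np.diff(cumulative, prepend=0).tolist()


daily_cases = daily_increase([1, 5, 9, 20])  # [1, 4, 4, 11]
```

Occasional negative values can appear in practice when a source revises its cumulative totals downward, so it is worth checking for them before plotting.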

This shows how irregular the day-wise growth in cases and outcomes is. Let’s look at how the mortality rate and recovery rate of COVID-19 have changed over time.

(a) Mortality Rate of COVID-19 (b) Recovery Rate of COVID-19

Now, let’s plot the US pandemic data.

Day-wise plots for the USA: (a) confirmed cases (b) deaths (c) recoveries

Now that we have the plots for day-wise counts in the USA, let’s see the top 10 states/regions with the most confirmed cases. The remaining states are grouped into an “others” category.

10 states in the USA with the most confirmed cases
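The grouping into a top-N list plus an “others” bucket can be sketched with pandas as below. The function `top_n_with_others` is a hypothetical helper, and the per-state totals in the example are made-up numbers, not real counts.

```python
import pandas as pd


def top_n_with_others(totals, n=10):
    """Keep the n largest entries and sum everything else into 'Others'."""
    ordered = totals.sort_values(ascending=False)
    top = ordered.iloc[:n]
    if len(ordered) > n:
        rest = ordered.iloc[n:].sum()
        top = pd.concat([top, pd.Series({"Others": rest})])
    return top


# Illustrative totals only; n=2 keeps the example short.
state_totals = pd.Series({"NY": 300, "CA": 50, "NJ": 120, "TX": 30})
grouped = top_n_with_others(state_totals, n=2)
```

With `n=10` the result has at most eleven entries, which keeps a bar or pie chart readable while still accounting for every case.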

Now we move on to predicting future cases.


Prediction

In this section, we’ll predict the rise in cases for the next fifteen days using variants of the Linear Regression algorithm from Python’s scikit-learn library.

The first one we use is a basic Linear Regression model.

Linear Regression Predictions

From the plot, it can be seen that the basic Linear Regression model’s predictions are nowhere near the test data.
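A minimal sketch of such a baseline is shown here, using synthetic data in place of the real series. The chronological train/test split and the quadratic stand-in curve are assumptions made for illustration. The article does not say how the notebook splits the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, made-up cumulative counts standing in for the real series.
day_index = np.arange(30)
days = day_index.reshape(-1, 1)          # scikit-learn expects a 2-D feature matrix
cases = 5.0 * day_index ** 2 + 100.0

# Chronological split (an assumption): train on the first 24 days, test on the rest.
split = 24
X_train, X_test = days[:split], days[split:]
y_train, y_test = cases[:split], cases[split:]

model = LinearRegression().fit(X_train, y_train)
test_pred = model.predict(X_test)

# Forecast the fifteen days after the last observed day.
future_days = np.arange(30, 45).reshape(-1, 1)
future_pred = model.predict(future_days)

# Mean absolute error on the held-out days.
mae = float(np.mean(np.abs(test_pred - y_test)))
```

On a curve that bends upward like this one, a straight line fitted to the early days falls below the later values, which is the kind of mismatch described above.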

The next variant we’ll try is Polynomial Regression. That is, we convert the features into polynomial features.

For this, we use scikit-learn’s PolynomialFeatures class. This class generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a², ab, b²].
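The [a, b] example can be checked directly. The sketch below uses the arbitrary values a = 2 and b = 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])                 # one sample with a = 2, b = 3
poly = PolynomialFeatures(degree=2)
features = poly.fit_transform(X)       # columns: 1, a, b, a^2, ab, b^2
print(features)                        # [[1. 2. 3. 4. 6. 9.]]
```

In the prediction task there is only one input feature (the day number), so degree 2 yields the columns [1, x, x²] and degree 3 adds x³.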

(a) Polynomial Regression of Degree-2 (b) Polynomial Regression of Degree-3

Of the three models, Polynomial Regression of Degree-3 performed best on the test data. However, this model’s future predictions grow very large. Hence, we average the predictions of the Degree-2 and Degree-3 Polynomial Regression models.

Average of Polynomial Regression models of Degree-2 and Degree-3
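The averaging step might be sketched as follows. It assumes each polynomial model is a PolynomialFeatures transform followed by LinearRegression, again on synthetic data. The helper `fit_poly` is hypothetical, and the notebook may wire the models differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_poly(degree, X, y):
    """Fit a polynomial regression of the given degree."""
    return make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)


# Synthetic, made-up counts (illustrative only).
day_index = np.arange(30)
days = day_index.reshape(-1, 1)
cases = 0.5 * day_index ** 3 + 4.0 * day_index ** 2 + 50.0

# Predict the next fifteen days with both models, then average them.
future_days = np.arange(30, 45).reshape(-1, 1)
pred_deg2 = fit_poly(2, days, cases).predict(future_days)
pred_deg3 = fit_poly(3, days, cases).predict(future_days)
average_pred = (pred_deg2 + pred_deg3) / 2
```

A simple mean gives both models equal weight, so each averaged value always lies between the two individual forecasts for that day.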

The future predictions by the three algorithms and the average of polynomial models can be seen in the graph below.

Future Predictions of worldwide COVID-19 cases
Number of Cases Predicted by the Average Model

Note: This is just a simple model and the results are not accurate. For more accurate predictions, try using other regression or deep learning techniques.


In this article, we discussed the COVID-19 pandemic and analyzed the data provided by Johns Hopkins University.

We also looked at some visualizations of the currently available data and tried to predict the number of cases likely to occur in the future.

These are just simple examples of reports that can make the magnitude of what is happening easier to comprehend.

Many factors can change these predictions. How the data is collected could affect them, and this dataset currently lacks features that could help, such as statistics on the age ranges of positive cases, recoveries, and deaths.

I hope you gained some knowledge from reading this article.

Code can be found in my GitHub profile.




Chaitanya Krishna Kasaraneni


Software Engineer — Data at Egen
