Data Visualization through Tabulation on India’s COVID-19 Daily Case Count

Soumyajit Pal
Published in Analytics Vidhya
8 min read · Jul 30, 2020
Visualizing the latest trends of the pandemic that has changed the world. Forever.

Data Science is an amalgamation of the following steps: collecting data, storing data, processing data, describing data and modeling data. Data Visualization deals with describing data.

Describing data means taking the entire data set and narrowing our focus to exactly those portions we want to deal with. There are two ways to go about it: computing statistics (mean, median, mode and standard deviation, among others) and plotting graphs to gain visual insights into the data.

The ability to compress data and communicate the same using meaningful symbols is a measure of intelligence of not only humans but also AI agents.

Goals of Data Visualization

Reducing perceptual error: encoding information visually makes it harder to misread than raw numbers

Discovering insights: plots often reveal key information that remains hidden even from statistical inferences

Communicating insights: graphs and charts help convey discovered insights effectively to naive users

Packages for Data Visualization

matplotlib: inspired by the plotting tools of MATLAB

seaborn: a Python data visualization library built on top of matplotlib

Both plotting libraries integrate with pandas, enabling computation and plotting to go hand in hand.

Let us import the necessary packages:

Packages imported and renamed appropriately
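The import cell appears in the original post only as a screenshot. A likely reconstruction, using the community-conventional aliases:

```python
import pandas as pd               # data handling and tabulation
import numpy as np                # numerical routines
import matplotlib.pyplot as plt   # plotting, inspired by MATLAB
import seaborn as sns             # statistical plots built on matplotlib
```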

Collecting Data

To visualize and draw insights, we first need to get hold of the data. On initial inspection, we find the state-wise daily case counts stored in a JSON file organized as a dictionary with a single key, “states_daily”. Each day’s report, starting from March 14, 2020, is stored as a collection of three dictionaries, with the abbreviated state / Union Territory name as key and the counts of infected, recovered and deceased patients as values. Procuring this data starts with defining the URL of the webpage containing the file:

Storing the URL as a string object

Next, we need to import the request module from the urllib library:

Helps to retrieve web data

With the following line of code, we will be able to not only fetch the required data, but also store it in a file for future visualization:

Get all data from webpage whose link is found in string url and store it in data.json

The semicolon at the end of the statement shown above suppresses the return value that the notebook would otherwise print.
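The retrieval code itself is shown only as screenshots in the post. A sketch follows; the exact endpoint is an assumption, inferred from the “states_daily” key described above (the covid19india.org API served data in exactly that shape):

```python
from urllib import request

# Assumed endpoint, inferred from the "states_daily" structure described above
url = "https://api.covid19india.org/states_daily.json"

def download(url, filename="data.json"):
    # urlretrieve fetches the resource at `url` and writes it to `filename`;
    # it returns a (filename, headers) tuple, which a trailing semicolon
    # hides in a notebook cell
    return request.urlretrieve(url, filename)

# download(url)   # uncomment to actually fetch the file
```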

Storing Data

We can now read the JSON file using the following one-liner:

read_json( ) is from the pandas library

Just to be sure that the data has been read successfully and that all our observations regarding its structure are correct, it would be wise to print the contents:

Indeed it is a collection of dictionaries as stated earlier
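Since the original one-liner appears only as a screenshot, here is a runnable sketch of the read-and-check step. The data.json contents below are a toy stand-in (one day, two regions) invented to mirror the structure described above:

```python
import json
import pandas as pd

# Miniature stand-in for data.json: same structure, one day, two regions
sample = {
    "states_daily": [
        {"date": "14-Mar-20", "status": "Confirmed", "mh": "14", "dl": "7"},
        {"date": "14-Mar-20", "status": "Recovered", "mh": "0",  "dl": "1"},
        {"date": "14-Mar-20", "status": "Deceased",  "mh": "1",  "dl": "0"},
    ]
}
with open("data.json", "w") as fh:
    json.dump(sample, fh)

covid_json = pd.read_json("data.json")   # the one-liner from the post
print(covid_json)                        # each row holds one dictionary
```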

Preprocessing Data

The format of the data as shown above is difficult to grasp and work with. We come up with an alternative approach where we use the json library to load the procured data:

import json

First, we need to open the JSON file in read mode:

fh = open("data.json", mode = "r")    # fh is the file handler

Next, we link the file handler to load our data from file:

data = json.load(fh)

A snapshot of the loaded data is as shown below:

Snapshot : Nested dictionary

The “states_daily” key can be used to extract the content as a single-level dictionary:

Snapshot : Single-level dictionary

Working with dictionary data as shown above is cumbersome. A viable alternative is to convert the JSON dictionary into pandas DataFrame:

DataFrame object covid_data created by flattening the JSON file stored in data
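A sketch of the flattening step, again on a toy version of the loaded data dictionary; passing the list stored under “states_daily” to pd.DataFrame is one way to obtain the flat table the post shows:

```python
import pandas as pd

# `data` as produced by json.load(fh): a nested dictionary (toy version)
data = {
    "states_daily": [
        {"date": "14-Mar-20", "status": "Confirmed", "mh": "14", "dl": "7", "tt": "21", "un": "0"},
        {"date": "14-Mar-20", "status": "Recovered", "mh": "0",  "dl": "1", "tt": "1",  "un": "0"},
        {"date": "14-Mar-20", "status": "Deceased",  "mh": "1",  "dl": "0", "tt": "1",  "un": "0"},
    ]
}

states_daily = data["states_daily"]      # single-level: a list of dictionaries
covid_data = pd.DataFrame(states_daily)  # flatten into a DataFrame
print(covid_data.head())
```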

Inspecting Data

The path to understanding data can begin with finding the number of rows and columns. For each date, we have three entries: one each for confirmed, deceased and recovered patients:

The shape attribute returns 414 rows and 41 columns

414 entries amount to 138 days’ data (414 / 3 = 138). For every state and union territory in India, there is an abbreviated column name. To broaden our outlook on the nature of data, we take a look at the columns:

columns attribute returns a list of column names
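The two inspection calls might look as follows; the frame here is a toy stand-in, so the printed shape is (3, 4) rather than the (414, 41) of the real data:

```python
import pandas as pd

# Toy stand-in for covid_data; the real shape is (414, 41)
covid_data = pd.DataFrame({
    "date":   ["14-Mar-20"] * 3,
    "status": ["Confirmed", "Recovered", "Deceased"],
    "mh":     ["14", "0", "1"],
    "tt":     ["21", "1", "1"],
})

print(covid_data.shape)           # (rows, columns)
print(list(covid_data.columns))   # abbreviated region names plus date, status, tt, un
```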

With a little research, we can build a table that maps region abbreviations to the corresponding names:

States and Union Territories of India

The above table contains names of 37 regions. The remaining four columns are:
status — label defining a row (confirmed, deceased, recovered)

Categorical variable status with its unique set of 3 values

date — string denoting particular date of reporting

138 days data

tt — total count of confirmed / deceased / recovered patients

String data which is converted to numeric data later

un — appears to contain anomalous data points and needs to be dropped:

Case count cannot be negative!

Since the column un contains spurious values, we remove it altogether:

The drop method removes the un column (axis = 1), and the change is reflected in covid_data because inplace is set to True

Just to be doubly sure of this change, we query the shape of covid_data:

We are down to 40 columns now!

Converting the date attribute from a string to a timestamp aids in better data handling:

to_datetime( ) aids in this process
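The two clean-up steps can be sketched as follows on a toy frame; column names and values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for covid_data
covid_data = pd.DataFrame({
    "date":   ["14-Mar-20", "14-Mar-20", "14-Mar-20"],
    "status": ["Confirmed", "Recovered", "Deceased"],
    "mh":     ["14", "0", "1"],
    "tt":     ["21", "1", "1"],
    "un":     ["-2", "0", "0"],   # spurious values such as negative counts
})

# Remove the un column in place (axis=1 means "drop a column")
covid_data.drop("un", axis=1, inplace=True)
print(covid_data.shape)           # one column fewer

# Parse the date strings into proper timestamps
covid_data["date"] = pd.to_datetime(covid_data["date"])
print(covid_data["date"].dtype)   # datetime64[ns]
```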

We want to focus only on the set of infected patients. As a result, we subset the entire DataFrame on the condition that the status attribute equals “Confirmed”:

Narrowing our focus is what constitutes describing data

Let us be sure that the above operation executed correctly:

One-third of the original data set containing only 138 rows

Since we have managed to extract only the information pertaining to infected patients, we have no use for the status column. All 138 rows have the same status value:

The above claim is reasserted: status returns only one unique value, Confirmed

We can drop it now:

The trailing semicolon was an attempt to suppress the (pink) warning; strictly, a semicolon hides printed output, not warnings

It is always good to confirm our claims:

Left with 39 columns after dropping status attribute

Getting an overview of the data at hand by examining first five rows:

The head( ) returns first five rows by default

It is clear we are dealing with day-to-day data. Thus, it would be wise to index the rows based on the date column:

The date column acts as an index for covid_data
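The subsetting and re-indexing steps can be sketched as follows (toy data again; the inplace drop on a filtered frame is also where a pink SettingWithCopyWarning can appear):

```python
import pandas as pd

# Toy covid_data: two days, three status rows per day
covid_data = pd.DataFrame({
    "date":   pd.to_datetime(["2020-03-14"] * 3 + ["2020-03-15"] * 3),
    "status": ["Confirmed", "Recovered", "Deceased"] * 2,
    "mh":     ["14", "0", "1", "11", "2", "0"],
    "tt":     ["21", "1", "1", "18", "3", "0"],
})

# Keep only the rows of infected (confirmed) patients
covid_data = covid_data[covid_data["status"] == "Confirmed"]

# Every remaining row has status == "Confirmed" ...
print(covid_data["status"].unique())

# ... so the column carries no information and can be dropped
covid_data.drop("status", axis=1, inplace=True)

print(covid_data.head())               # first five rows (two in this toy)

# Day-to-day data is naturally indexed by the date column
covid_data.set_index("date", inplace=True)
```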

By the looks of it, case counts are positive integers. However, the data type of each column is contrary to our expectations:

Snapshot of output produced by info( )

In order to be able to apply visualization tools, our data values have to be numeric:

The to_numeric( ) function is passed to the apply( ) method, thereby changing the data type of every attribute to int
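A sketch of the type conversion on a toy frame:

```python
import pandas as pd

covid_data = pd.DataFrame(
    {"mh": ["14", "11"], "tt": ["21", "18"]},
    index=pd.to_datetime(["2020-03-14", "2020-03-15"]),
)
print(covid_data.dtypes)                 # object (strings), not numbers

# Apply pd.to_numeric column by column to get integer counts
covid_data = covid_data.apply(pd.to_numeric)
print(covid_data.dtypes)                 # int64
```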

For the uninitiated, it is always best to start visualizing data on a small scale. We narrow our focus to the latest week’s data (July 23 to July 29):

The tail(7) method returns data for the last 7 days

Before going ahead with styling the tabulation, let us drop the total case count column:

tt denoting total count is done away with
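The narrowing-down steps might look like this; the counts below are invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame: 10 days of counts for two regions plus the total column tt
dates = pd.date_range("2020-07-20", periods=10)
covid_data = pd.DataFrame(
    {"mh": np.arange(10), "dl": np.arange(10) * 2},
    index=dates,
)
covid_data["tt"] = covid_data.sum(axis=1)

last_week = covid_data.tail(7)              # last 7 rows: July 23 to July 29
last_week = last_week.drop("tt", axis=1)    # totals not needed for styling
print(last_week.shape)                      # (7, 2)
```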

Styling Tabulation

We write a function which colours cells with a count of 0 green and the rest red:

Please note that it is color and not colour

The style.applymap( ) method applies the above function to every cell and colour codes it accordingly:

Snapshot showing Daman and Diu (dd) and Lakshadweep (ld) reporting zero cases last week

We can proceed to drop these columns:

35 of the 37 regions reported fresh cases up to July 29
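Dropping the zero-reporting columns can be sketched as follows; computing them, rather than hard-coding dd and ld, keeps the step reproducible:

```python
import pandas as pd

last_week = pd.DataFrame(
    {"mh": [9000, 9500], "dd": [0, 0], "ld": [0, 0]},
    index=pd.to_datetime(["2020-07-28", "2020-07-29"]),
)

# Columns that are zero throughout the week add nothing to the picture
zero_cols = [c for c in last_week.columns if (last_week[c] == 0).all()]
print(zero_cols)                                 # ['dd', 'ld']
last_week.drop(zero_cols, axis=1, inplace=True)
```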

If we colour code the maximum case counts, we get a picture of which states have been reporting lower counts lately and are thus in the process of recovery. It also reveals which states have been reporting higher counts lately and need special attention:

highlight_max( ) does the trick for us
Snapshot : Maximum values are in red

Andhra Pradesh (ap), Arunachal Pradesh (ar) and Madhya Pradesh (mp), among others, have been reporting higher counts lately, whereas Jammu and Kashmir (jk), Odisha (or) and Nagaland (nl), among others, seem to be in recovery mode

The minimum case count can also be highlighted:

Snapshot: reporting minimum values of late is a good sign

We can highlight minimum as well as maximum values at the same time:

Snapshot : Green for minimum and Red for maximum

A function to display maximum values in bold:

font-weight property is set to bold for maximum values in columns and left unchanged for others

Let us highlight maximum values in bold and red and minimum values in green:

Snapshot : Max values are bold and red whereas Min values are green

It is also important to know which region reports the highest number of cases each day:

Maharashtra (mh) seems to be in an alarming situation — reporting highest number of cases on a daily basis
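Finding the daily leader can be sketched with idxmax( ), which returns the column label of each row's maximum:

```python
import pandas as pd

last_week = pd.DataFrame(
    {"mh": [9000, 9500], "tn": [6000, 6400], "ap": [8000, 8100]},
    index=pd.to_datetime(["2020-07-28", "2020-07-29"]),
)

# For each date (row), the column with the largest count
daily_leader = last_week.idxmax(axis=1)
print(daily_leader)      # 'mh' on every day in this toy example
```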

Instead of a two-colour coding, we can colour the background in a gradient according to the data: the deeper the shade, the higher the value and the more alarming the situation:

Snapshot : Distribution of shade according to regions

Shades of colour can also be applied along rows (date-wise):

Snapshot : Maharashtra (mh), Andhra Pradesh (ap) and Tamil Nadu (tn) [shown below] report substantial number of cases on a daily basis
Tamil Nadu (tn) is not far behind

Since mh, ap and tn appear to be COVID hot-spots, we focus on these three regions using in-cell bar charts, where the length of a bar is proportional to the number of cases:

The greater the length of the bar, the higher the number of cases recorded that day

The visual appeal can be enhanced by using three different colours, one for each state:

Clear picture distinguishing the three most affected regions

I hope you found this post helpful. Please feel free to leave your comments, feedback, criticism, thoughts and everything else that comes along with them. See you soon!
