Data Visualization through Tabulation on India’s COVID-19 Daily Case Count
Data Science is an amalgamation of the following steps : collecting data, storing data, processing data, describing data and modeling data. Data Visualization deals with describing data.
Taking the entire data set and narrowing down our focus on exactly those portions of data which we want to deal with is what constitutes describing data. There are two ways to go about it : computing statistics (mean, median, mode and standard deviation among others) and plotting graphs to gain visual insights into data.
The ability to compress data and communicate the same using meaningful symbols is a measure of intelligence of not only humans but also AI agents.
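As a small illustration of the first route, computing statistics, the sketch below summarizes a short list of numbers with Python's standard statistics module. The list daily_counts is made up for the example and is not taken from the COVID-19 data set, and the population standard deviation is only one of the available choices.

```python
import statistics

# Made-up sample values, used only to illustrate the summary statistics.
daily_counts = [2, 4, 4, 4, 5, 5, 7, 9]

mean_count = statistics.mean(daily_counts)      # arithmetic average
median_count = statistics.median(daily_counts)  # middle value of the sorted list
mode_count = statistics.mode(daily_counts)      # most frequent value
spread = statistics.pstdev(daily_counts)        # population standard deviation

print(mean_count, median_count, mode_count, spread)
```

Four numbers now stand in for eight, which is the kind of compression that describing data aims for.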
Goals of Data Visualization
Reducing perceptual error : Encode information visually
Discovering insights : Plots often reveal key information that would otherwise remain hidden, even from statistical inference
Communicating insights : Graphs and charts help convey the discovered insights effectively to non-expert users
Packages for Data Visualization
matplotlib : Inspired by the plotting tools of MATLAB
seaborn : Python data visualization library built on top of matplotlib
There is an integration of the above plotting libraries with pandas, thus enabling computation and plotting to go hand in hand.
Let us import the necessary packages:
Collecting Data
To visualize the data and draw insights from it, we first need to get hold of it. Initial inspection shows that the state-wise daily case counts are stored in a JSON file organized as a dictionary with a single key, “states_daily”. Each day’s report, starting from March 14, 2020, is stored as three dictionaries (one each for infected, recovered and deceased patients), with the abbreviated names of the states and Union Territories as keys and the corresponding patient counts as values. Procuring this data starts with defining the URL of the webpage containing the file:
Next, we need to import the request module from the urllib package:
With the following line of code, we will be able to not only fetch the required data, but also store it in a file for future visualization:
The semicolon at the end of the statement shown above stops the notebook from echoing the return value of the call.
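A minimal sketch of the download step described above, assuming it relies on urllib.request.urlretrieve. The helper name download_json and the file names are illustrative, and the real URL from the post is not reproduced here: a small local file exposed through a file:// URL stands in for it so that the sketch runs offline.

```python
import json
import tempfile
import urllib.request
from pathlib import Path


def download_json(url, filename):
    """Fetch `url`, save the response to `filename` and return the saved path."""
    saved_path, _headers = urllib.request.urlretrieve(url, filename)
    return saved_path


# Offline stand-in for the remote file (assumption: same top-level key).
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.json"
source.write_text(json.dumps({"states_daily": []}))

# With the real data, the first argument would be the URL defined earlier.
saved = download_json(source.as_uri(), str(workdir / "data.json"))
print(saved)
```

Wrapping the call in a function keeps the URL and the target file name in one place if the data needs to be refreshed later.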
Storing Data
We can now read the JSON file using the following one-liner:
Just to be sure that the data has been read successfully and that all our observations regarding its structure are correct, it would be wise to print the contents:
Preprocessing Data
The format of the data as shown above is difficult to grasp and work with. We come up with an alternative approach where we use the json library to load the procured data:
import json
First, we need to open the JSON file in read mode:
fh = open("data.json", mode="r")  # fh is the file handle
Next, we pass the file handle to json.load to read the data from the file:
data = json.load(fh)
A snapshot of the loaded data is as shown below:
The “states_daily” key can be used to extract the content as a flat list of single-level dictionaries:
Working with dictionary data as shown above is cumbersome. A viable alternative is to convert the JSON records into a pandas DataFrame:
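The loading and conversion steps above can be sketched as follows. The records are a made-up stand-in with the layout described earlier (the date format and the counts are assumptions, not the real values), and an in-memory buffer plays the role of the opened data.json file.

```python
import io
import json

import pandas as pd

# Stand-in for data.json: same layout as described above, made-up counts.
sample = {
    "states_daily": [
        {"date": "14-Mar-20", "status": "Confirmed", "mh": "14", "tn": "1", "tt": "81", "un": "0"},
        {"date": "14-Mar-20", "status": "Recovered", "mh": "0", "tn": "0", "tt": "9", "un": "0"},
        {"date": "14-Mar-20", "status": "Deceased", "mh": "0", "tn": "0", "tt": "2", "un": "0"},
    ]
}
fh = io.StringIO(json.dumps(sample))  # behaves like an open file handle

data = json.load(fh)                             # dictionary with one key
covid_data = pd.DataFrame(data["states_daily"])  # one row per record
print(covid_data.shape)
```

Each dictionary becomes a row and each key becomes a column, which is why the real data ends up with three rows per reporting date.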
Inspecting Data
The path to understanding data can begin by finding the number of rows and columns. For each date, we have 3 entries — one each for confirmed, deceased and recovered patients:
414 entries amount to 138 days’ data (414 / 3 = 138). For every state and union territory in India, there is an abbreviated column name. To broaden our outlook on the nature of data, we take a look at the columns:
With a little research, we can build a table that maps the region abbreviations to the corresponding names:
The above table contains names of 37 regions. The remaining four columns are:
status — label defining a row (confirmed, deceased, recovered)
date — string denoting particular date of reporting
tt — total count of confirmed / deceased / recovered patients
un — appears to contain anomalous data points and needs to be dropped:
Since the column un contains spurious values, we remove it altogether:
Just to be doubly sure of this change, we query the shape of covid_data:
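Dropping a column and checking the shape before and after might look like the sketch below. The small covid_data frame is a made-up stand-in for the real one.

```python
import pandas as pd

# Made-up stand-in for the full DataFrame.
covid_data = pd.DataFrame({
    "date": ["14-Mar-20", "14-Mar-20", "14-Mar-20"],
    "status": ["Confirmed", "Recovered", "Deceased"],
    "mh": ["14", "0", "0"],
    "un": ["0", "0", "0"],
})

shape_before = covid_data.shape
covid_data = covid_data.drop(columns=["un"])  # returns a new frame without "un"
shape_after = covid_data.shape

print(shape_before, shape_after)
```

The row count stays the same and the column count falls by one, which is the change the shape query is meant to confirm.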
Converting date attribute from string to timestamp aids in better data handling:
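A sketch of the conversion with pd.to_datetime. The date strings and the format "%d-%b-%y" are assumptions about how the dates are written in the file; the format would need adjusting if the real strings differ.

```python
import pandas as pd

# Made-up rows; the "14-Mar-20" style of date string is an assumption.
covid_data = pd.DataFrame({
    "date": ["14-Mar-20", "15-Mar-20"],
    "mh": ["14", "18"],
})

# Parse the strings into timestamps so that dates sort and slice correctly.
covid_data["date"] = pd.to_datetime(covid_data["date"], format="%d-%b-%y")
print(covid_data["date"].dtype)
```

Once the column holds timestamps, rows can be selected by date range instead of by string comparison.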
We want to focus only on the set of infected patients. As a result, we subset the entire DataFrame on the condition that status attribute is “Confirmed”:
Let us be sure that the above operation executed correctly:
Since we have extracted the information pertaining only to infected patients, we have no further use for the status column. All 138 rows have the same status value:
We can drop it now:
It is always good to confirm our claims:
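The subsetting and clean-up steps above could be written as in the following sketch. The frame is again a made-up stand-in, and the name covid_confirmed is illustrative.

```python
import pandas as pd

# Made-up stand-in: two reporting dates, three status rows each.
covid_data = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-14"] * 3 + ["2020-03-15"] * 3),
    "status": ["Confirmed", "Recovered", "Deceased"] * 2,
    "mh": ["14", "0", "0", "18", "0", "0"],
})

# Keep only the rows describing confirmed (infected) patients.
covid_confirmed = covid_data[covid_data["status"] == "Confirmed"]

# Every remaining row carries the same label, so the column adds nothing.
assert covid_confirmed["status"].nunique() == 1
covid_confirmed = covid_confirmed.drop(columns=["status"])

print(covid_confirmed.shape)
```

Checking the number of distinct status values before dropping the column guards against a filter that silently matched nothing or too much.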
We get an overview of the data at hand by examining the first five rows:
It is clear we are dealing with day-to-day data. Thus, it would be wise to index the rows based on the date column:
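Indexing by date can be sketched with set_index. The two rows below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the confirmed-cases frame.
covid_confirmed = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-14", "2020-03-15"]),
    "mh": ["14", "18"],
})

# Move the date column into the index so that rows are addressed by day.
covid_confirmed = covid_confirmed.set_index("date")

# A single day can now be looked up directly by its timestamp.
march_15 = covid_confirmed.loc[pd.Timestamp("2020-03-15")]
print(march_15["mh"])
```

With a date index in place, selecting a week of data later becomes a simple label-based slice.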
By the looks of it, case counts are positive integers. However, the data type of each column is contrary to our expectations:
In order to be able to apply visualization tools, our data values have to be numeric:
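One way to make every column numeric is to apply pd.to_numeric column by column, as in this sketch. The values are made up, and the original post may have used a different call such as astype.

```python
import pandas as pd

# Made-up stand-in: counts stored as strings, as they arrive from JSON.
covid_confirmed = pd.DataFrame(
    {"mh": ["14", "18"], "tn": ["1", "0"]},
    index=pd.to_datetime(["2020-03-14", "2020-03-15"]),
)

# Convert each column of digit strings into integers.
numeric = covid_confirmed.apply(pd.to_numeric)

print(numeric.dtypes)
print(numeric["mh"].sum())
```

Summing the string column would have concatenated the text instead of adding the counts, which is why the conversion has to come before any arithmetic or styling.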
For the uninitiated, it is best to start visualizing data on a small scale. We narrow our focus to the latest week’s data (July 23 to July 29):
Before going ahead with styling tabulation, let us drop the total case column:
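Selecting the latest week and dropping the total column might look like the sketch below. The ten days of counts are generated placeholders, not real figures, and the name week is illustrative.

```python
import pandas as pd

# Placeholder counts for ten consecutive days ending on July 29, 2020.
dates = pd.date_range("2020-07-20", "2020-07-29", freq="D")
covid_confirmed = pd.DataFrame(
    {
        "mh": list(range(10)),
        "tn": list(range(10, 20)),
        "tt": list(range(10, 30, 2)),  # total = mh + tn in this toy frame
    },
    index=dates,
)
covid_confirmed.index.name = "date"

# Label-based slicing on a date index includes both end points.
week = covid_confirmed.loc["2020-07-23":"2020-07-29"]

# The total would dwarf the per-region counts, so it is left out.
week = week.drop(columns=["tt"])
print(week.shape)
```

Because both ends of the slice are included, the result holds exactly seven rows.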
Styling Tabulation
We write a function which colours cells having 0 count with green and others with red:
The style.applymap() method applies the above function to every cell and colour-codes each one accordingly (newer pandas releases name this method Styler.map):
Columns that are green on every day of the week report no cases at all, so we can proceed to drop them:
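A sketch of the cell-colouring function and of dropping uninformative columns. It assumes that the columns to drop are the ones that are zero on every day. The function name colour_zero and the counts are made up, and the CSS strings are computed with plain pandas here so that the result can be inspected without rendering a styled table.

```python
import pandas as pd


def colour_zero(count):
    """Return a CSS rule: green text for a zero count, red otherwise."""
    return "color: green" if count == 0 else "color: red"


# Made-up counts for two days and three regions.
week = pd.DataFrame(
    {"mh": [9, 7], "ld": [0, 0], "tn": [0, 6]},
    index=pd.to_datetime(["2020-07-28", "2020-07-29"]),
)

# One CSS string per cell: what the styler would receive for each value.
css = week.apply(lambda column: column.map(colour_zero))

# Keep only the columns with at least one non-zero count.
active = week.loc[:, (week != 0).any(axis=0)]

print(css)
print(list(active.columns))
```

In a notebook, passing colour_zero to week.style.applymap (or to week.style.map in newer pandas releases) renders the same colours directly on the table.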
If we colour-code the maximum case count in each column, we can see which states have been reporting lower counts lately and may therefore be recovering, as well as which states have been reporting higher counts lately and need special attention:
Andhra Pradesh (ap), Arunachal Pradesh (ar) and Madhya Pradesh (mp), among others, have been reporting higher counts lately, whereas Jammu and Kashmir (jk), Odisha (or) and Nagaland (nl), among others, seem to be in recovery mode.
The minimum case count can also be highlighted:
We can highlight minimum as well as maximum values at the same time:
A function to display maximum values in bold:
Let us highlight maximum values in bold and red and minimum values in green:
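A column-wise styling function for this combination could be sketched as follows. The name mark_extremes and the counts are made up, and the CSS strings are again computed with plain pandas so that they can be checked without rendering.

```python
import pandas as pd


def mark_extremes(column):
    """CSS per cell: bold red for the column maximum, green for the minimum."""
    styles = pd.Series("", index=column.index)
    styles[column == column.max()] = "color: red; font-weight: bold"
    styles[column == column.min()] = "color: green"
    return styles


# Made-up counts for three days and two regions.
week = pd.DataFrame(
    {"mh": [9, 7, 8], "tn": [5, 6, 4]},
    index=pd.to_datetime(["2020-07-27", "2020-07-28", "2020-07-29"]),
)

# Apply the function to each column and collect the resulting CSS strings.
css = week.apply(mark_extremes)
print(css)
```

In a notebook, week.style.apply(mark_extremes) renders the result, since Styler.apply calls the function once per column by default. A maximum near the top of a column suggests that counts are falling, while one near the bottom suggests that they are still rising.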
It becomes important to know which region is reporting highest number of cases on a daily basis:
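Besides highlighting the largest value in each row of the styled table, the same question can be answered as a plain tabulation with idxmax. This is an alternative to the styling approach rather than the method used in the post, and the counts are made up.

```python
import pandas as pd

# Made-up counts for two days and two regions.
week = pd.DataFrame(
    {"mh": [9, 7], "tn": [5, 8]},
    index=pd.to_datetime(["2020-07-28", "2020-07-29"]),
)

# For each date (row), the column label holding the largest count.
top_region = week.idxmax(axis=1)
print(top_region)
```

The result is a Series indexed by date, which is easy to scan or to count with value_counts when one region dominates.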
Instead of a two-colour code, we can shade each cell’s background along a gradient, where a deeper shade means a higher value and a more alarming situation:
The gradient can also be applied along rows (date-wise):
Since mh, ap and tn appear to be COVID-19 hot-spots, we focus on these three regions using bar plots in which the length of each bar is proportional to the number of cases:
The visual appeal can be enhanced if we use three different colours — one for each state:
I hope you found this post helpful. Please feel free to leave your comments, feedback, criticism, thoughts and everything else that comes along with them. See you soon!