Make Your Own Covid-19 Dataset

Abhinav · Published in Analytics Vidhya · Apr 23, 2020 · 7 min read

The world is facing an unprecedented threat from coronavirus, and we need to stay well informed about the latest developments in the spread of this virus. Data is an important resource, and retrieving it is easy these days: we just have to google it. But getting involved in the process of extracting it is an activity that can develop insights and put things into perspective.

Although there are many websites with datasets readily available to us, you can make your own dataset by web scraping relevant tables containing covid-19 related information. This is my first post, and together we will try to build our dataset!


Following are the important Python libraries you will require (you can visit the links if you want to find out more about each library):

  1. Pandas- You will require it to manipulate the data frames we will generate after web scraping (yes, we will get our dataset from a reliable website).
  2. Plotly- A more interactive and comprehensive way to visualize our data.
  3. BeautifulSoup- A Python library for pulling data out of HTML and XML files.
  4. urllib- A package that collects several modules for working with URLs.

But wait, what is web scraping?

Web scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which is what a browser does when you view the page). Web crawling, fetching pages for later processing, is therefore the main component of web scraping. Once a page is fetched, extraction can take place: the content of the page may be parsed, searched, reformatted, its data copied into a spreadsheet, and so on.

You can read more about web scraping using BeautifulSoup here.

It is recommended to use a Jupyter Notebook for writing the code because of the flexibility it offers.

Now that we have our setup, let us embark on our journey…


Step 1: Importing what’s necessary

As always, our first step will be importing the crucial libraries.

After importing necessary modules, you should specify the URL containing the dataset and pass it to urlopen() to get the HTML of the page.
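A minimal sketch of this step; the URL below is just a placeholder, so point it at whichever page carries the table you want to scrape:

```python
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen

# placeholder URL: substitute any page with a regularly
# updated covid-19 table
url = "https://www.example.com/covid-19-cases"
html = urlopen(url)
```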

Step 2: Make a Beautiful Soup object

This is done by passing the html to the BeautifulSoup() function. The Beautiful Soup package is used to parse the html, that is, to take the raw html text and break it into Python objects. The second argument, "html.parser", is the html parser whose details you do not need to worry about at this point.

The soup object allows you to extract interesting information about the website you’re scraping, such as the title of the page.
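A short sketch, continuing from the html fetched above:

```python
# parse the raw html into a soup object
soup = BeautifulSoup(html, "html.parser")

# for example, print the title of the page
print(soup.title.get_text())
```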

You can use the find_all() method of soup to extract useful html tags within a webpage. Examples of useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, <th> for table headers, and <td> for table cells. For example, the code below shows how to extract all the hyperlinks within the webpage.
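```python
# print the destination of every hyperlink on the page
for link in soup.find_all("a"):
    print(link.get("href"))
```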

Step 3: Extract table rows from a website

The table contains official records of covid-19 cases in India. You can google for websites containing a regularly updated table with relevant information regarding covid-19 cases.

Table HTML tags

The table I used has information about the state-wise cases of covid-19 in India.

You should get all the table rows in list form first and then convert that list into a data frame. Below is a ‘for’ loop that iterates through the table rows and prints out the cells of each row.
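A minimal sketch of that loop (the variable names are my own):

```python
# collect every table row on the page
rows = soup.find_all("tr")

# print the cells of each row, html tags still attached
for row in rows:
    row_td = row.find_all("td")
    print(row_td)
```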

The output above shows that each row is printed with html tags embedded in it. This is not what we want. You can remove the html tags using Beautiful Soup or regular expressions.

using beautiful soup

The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. Pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags.
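A sketch of that cleanup, reusing the rows collected above:

```python
# str(row_td) looks like "[<td>...</td>, <td>...</td>]";
# get_text() strips the tags and keeps the text in between
list_rows = []
for row in rows:
    row_td = row.find_all("td")
    cleantext = BeautifulSoup(str(row_td), "html.parser").get_text()
    list_rows.append(cleantext)
```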

Using regular expressions is highly discouraged since it requires several lines of code and one can easily make mistakes. It requires importing the re module (for regular expressions). The code below shows how to build a regular expression that finds all the characters inside the <td> html tags and replaces them with an empty string for each table row.

First, you compile a regular expression by passing a string to match to re.compile(). The dot, star, and question mark (.*?) will match an opening angle bracket followed by anything and followed by a closing angle bracket. It matches the text in a non-greedy fashion, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket. After compiling a regular expression, you can use the re.sub() method to find all the substrings where the regular expression matches and replace them with an empty string. The full code below generates an empty list, extracts text in between html tags for each row, and appends it to the assigned list.
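For comparison, a sketch of the regular-expression route:

```python
import re

# "<.*?>" matches one html tag at a time (non-greedy)
clean = re.compile("<.*?>")

list_rows = []
for row in rows:
    str_cells = str(row.find_all("td"))
    clean_text = re.sub(clean, "", str_cells)
    list_rows.append(clean_text)
```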

using regular expressions makes the code lengthy

Step 4: Convert the list into a data frame
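A sketch of the conversion, continuing with the list_rows built above:

```python
# one-column data frame; each entry is one comma-separated row string
df = pd.DataFrame(list_rows)
print(df.head())
```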

The data frame is not in the format we want. To clean it up, you should split the “0” column into multiple columns at the comma position. This is accomplished by using the str.split() method.
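For example:

```python
# split column 0 into separate columns at each comma
df1 = df[0].str.split(",", expand=True)
```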

This looks much better, but there is still work to do. The data frame has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket on column “0.”
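For example:

```python
# strip the leading "[" that str() left behind
df1[0] = df1[0].str.strip("[")
```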

The table is missing table headers. You can use the find_all() method to get the table headers. Similar to table rows, you can use Beautiful Soup to extract text in between html tags for table headers.

You can then convert the list of headers into a pandas data frame. Similarly, you can split column “0” into multiple columns at the comma position for all rows.
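A sketch of both steps, reusing the same tag-stripping trick on the <th> tags (the variable names are my own):

```python
# grab the table headers and strip their html tags
col_labels = soup.find_all("th")
all_header = []
all_header.append(BeautifulSoup(str(col_labels), "html.parser").get_text())

# one-row data frame of headers, split at the commas
df_header = pd.DataFrame(all_header)
df_header1 = df_header[0].str.split(",", expand=True)
```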

The two data frames can be concatenated into one using the concat() method.
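For example:

```python
# stack the header frame on top of the data frame; reset the
# index so every row gets a unique label
frames = [df_header1, df1]
df_concat = pd.concat(frames).reset_index(drop=True)
```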

The code below shows how to assign the first row to be the table header.
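```python
# promote the first row (the scraped headers) to column labels
df_concat.columns = df_concat.iloc[0]
```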

Also, notice how the table header is replicated as the first row in the data frame. It can be dropped using the drop() method.
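For example:

```python
# drop the duplicated header row (row 0 after the reset_index above)
df_final = df_concat.drop(0)
```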

At this point, your data set should be looking fine.

sample final table

Step 5: Data Visualization

This is the last step, as we now have our data and only have to present it beautifully on our machine. There are many tutorials and articles that you can refer to at this point to make use of your extracted data and present it the way you like. I used Plotly for a simple bar graph and pie chart representation. More information can be found here.
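As a minimal sketch, here is a Plotly Express bar chart; the column names "State" and "Confirmed" are placeholders for whatever headers your scraped table actually has:

```python
import plotly.express as px

# scraped values arrive as strings; convert the counts to numbers
df_final["Confirmed"] = pd.to_numeric(df_final["Confirmed"], errors="coerce")

fig = px.bar(df_final, x="State", y="Confirmed",
             title="Covid-19 cases in India by state")
fig.show()
```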

My final representation of the data looks like this:

With this, we have reached the end of our little experiment. I hope this article helped to enhance your learning.

Stay home, stay safe!
