A Beginner’s Guide to Web Scraping and Data Visualization with Python

Prabhat Pathak · Published in Analytics Vidhya · 4 min read · Jun 23, 2020

In this tutorial, I am using libraries such as BeautifulSoup and texttable.


Why Web Scraping?

Web scraping is used to collect large amounts of information from websites for analysis.

Imagine you have to pull a large amount of data from websites and you want to do it as quickly as possible. How would you do it without manually going to each website and getting the data? Well, “Web Scraping” is the answer. Web Scraping just makes this job easier and faster.

Libraries used for Web Scraping

  • BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that are helpful to extract the data easily.
  • Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format.
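As a quick illustration of the parse tree idea, here is a minimal sketch; the HTML snippet and its numbers are made up for the example:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML table standing in for a real page
html = """
<table>
  <tr><td>India</td><td>456,183</td></tr>
  <tr><td>Spain</td><td>246,752</td></tr>
</table>
"""

# Build the parse tree, then pull the text out of every table cell
soup = BeautifulSoup(html, 'html.parser')
cells = [td.text for td in soup.find_all('td')]
print(cells)  # ['India', '456,183', 'Spain', '246,752']
```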

Libraries used for Data visualization

  • Matplotlib
  • Seaborn
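A minimal sketch of how these two libraries work together, using purely illustrative numbers:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the example runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up case counts, just to show the plotting call
cases = {'Asia': 100, 'Europe': 80, 'Africa': 30}
ax = sns.barplot(x=list(cases.keys()), y=list(cases.values()))
ax.set_ylabel('Number of cases')
plt.savefig('cases.png')  # seaborn draws on matplotlib figures, so plt saves it
```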

In this tutorial, I am using Worldometers to extract the number of COVID-19 cases, and then we will do some data analysis and create visualizations.

Let’s start.

# importing modules 
import requests
from bs4 import BeautifulSoup

# URL for scraping data
url = 'https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/'

# get URL html
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

data = []

# soup.find_all('td') will scrape every element in the url's table
data_iterator = iter(soup.find_all('td'))
# data_iterator is the iterator of the table

# This loop will keep repeating till there is data available in the iterator
while True:
    try:
        country = next(data_iterator).text
        confirmed = next(data_iterator).text
        deaths = next(data_iterator).text
        continent = next(data_iterator).text

        # For 'confirmed' and 'deaths', remove the commas so the values can be converted to int
        data.append((
            country,
            confirmed.replace(',', ''),
            deaths.replace(',', ''),
            continent
        ))
    # StopIteration is raised when there are no more elements left to iterate through
    except StopIteration:
        break
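The iterator pattern above is easier to see on a plain list; this toy example (with made-up values) walks the same four-at-a-time loop:

```python
# Four fields per record, flattened into one list, like the <td> cells are
cells = ['India', '456183', '14531', 'Asia', 'Spain', '246752', '28752', 'Europe']
it = iter(cells)
records = []

while True:
    try:
        # next() pulls one element at a time; four calls consume one record
        records.append((next(it), next(it), next(it), next(it)))
    except StopIteration:
        # Raised once the iterator is exhausted, ending the loop
        break

print(records)
# [('India', '456183', '14531', 'Asia'), ('Spain', '246752', '28752', 'Europe')]
```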

After running the code above, you will have extracted the data from the website. Next, we will create a pandas data frame for further analysis.

# Sort the data by the number of confirmed cases
# (convert to int so the sort is numeric, not alphabetical)
data.sort(key=lambda row: int(row[1].replace(',', '')), reverse=True)

import pandas as pd

dff = pd.DataFrame(data, columns=['country', 'Number of cases', 'Deaths', 'Continent'])
# Convert the numeric columns from strings to numbers
dff['Number of cases'] = pd.to_numeric(dff['Number of cases'].str.replace(',', ''))
dff['Deaths'] = pd.to_numeric(dff['Deaths'].str.replace(',', ''))
dff.head()
dff
Data frame
dff.info()
Data type

Creating a new column Death_rate :

dfff = dff.sort_values(by='Number of cases', ascending=False)
dfff['Death_rate'] = (dfff['Deaths'] / dfff['Number of cases']) * 100
dfff.head()

Data Visualization

import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = 15, 10
from matplotlib.pyplot import figure
figure(num=None, figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
sns.pairplot(dfff, hue='Continent')
Pair plot
sns.barplot(x='country',y='Number of cases',data=dfff.head(10))
Bar plot
sns.regplot(x='Deaths',y='Number of cases',data=dfff)
Regplot
sns.scatterplot(x='Number of cases', y='Deaths', hue='Continent', data=dfff)
Scatterplot
sns.boxplot(x='country', y='Deaths', data=dfff.head(10), hue='Continent')
Box plot
dfg = dfff.groupby(by='Continent', as_index=False).agg({'Number of cases': 'sum', 'Deaths': 'sum'})
dfgg = dfg[1:]
df1 = dfgg.sort_values(by='Number of cases', ascending=False)
df1['Death_rate'] = (df1['Deaths'] / df1['Number of cases']) * 100
df1.sort_values(by='Death_rate', ascending=False)
sns.barplot(x='Continent', y='Death_rate', data=df1.sort_values(by='Death_rate', ascending=False))
import texttable as tt

# create a texttable object
table = tt.Texttable()
table.add_rows([(None, None, None, None)] + data) # Add an empty row at the beginning for the headers
table.set_cols_align(('c', 'c', 'c', 'c')) # 'l' denotes left, 'c' denotes center, and 'r' denotes right
table.header((' Country ', ' Number of cases ', ' Deaths ', ' Continent '))

print(table.draw())
Table

Conclusion

Web scraping is really helpful for pulling large amounts of data from websites as quickly as possible, so you can then do data analysis and create insightful visualizations based on your business needs.

I hope this article will help you and save a good amount of time. Let me know if you have any suggestions.

HAPPY CODING.

Prabhat Pathak (Linkedin profile) is an Associate Analyst.

