Web Scraping made easy using Python

Mamtha
Published in Analytics Vidhya
7 min read · Nov 26, 2019


In this article, we will learn web scraping techniques that allow us to extract useful data from any website using Python’s BeautifulSoup library.

What is web scraping?

Web scraping is a technique for collecting large amounts of data from web pages and storing it in any required format, which further helps us to perform analysis on the extracted data. BeautifulSoup, Python’s package for parsing HTML and XML documents, helps us to extract the data easily.

Below are the steps we will follow to extract the data using Python:

  • We will first find the website URL to scrape.
  • Inspect the page.
  • Look for the data which we want to extract.
  • Write our Python code and run it.
  • Store data in the required format.

BeautifulSoup is a popular Python library for web scraping that parses the HTML/XML content of any webpage. Check out the official BeautifulSoup documentation for details.

In this walkthrough, we will go over the two web scraping sections below, exploring more of BeautifulSoup’s functionality along the way:

1. Scraping a retail webpage (the Bigbasket grocery website) to extract product information and store the data in CSV/JSON files.

2. Scraping tabular data from a website and loading it into a pandas DataFrame.

Before moving forward, I would like to explain the basic concepts of HTML (the building blocks of a webpage) and how to inspect a webpage manually.

HTML is a markup language used to structure a web page. It provides tags like <li> for lists, <div> for divisions, <p> for paragraphs, etc.

Inspecting a Webpage: HTML Code

Sample HTML Document :
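The sample document itself appeared as an image in the original post; the fragment below is a small stand-in with the same kinds of tags, parsed with BeautifulSoup to show how each tag can be reached in code.

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the sample HTML document shown in the original post
sample_html = """
<html>
  <body>
    <div>
      <p>A paragraph of text.</p>
      <ul>
        <li>First list item</li>
        <li>Second list item</li>
      </ul>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.p.text)                              # text inside the <p> tag
print([li.text for li in soup.find_all('li')])  # text of every <li> tag
```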

Steps to inspect webpage:

  • First, we will open the website URL in a browser (Chrome for this example).
  • Right-click on the page and then select ‘Inspect’.
  • The ‘Chrome DevTools’ window opens at the side of the page, where we can look at the HTML content of the webpage.
Right-Click on Bigbasket website to see HTML content

In our code, we will use BeautifulSoup to download this HTML content from the above website and then do our data extraction.

Let’s get started!

1. Scraping the Bigbasket Website:

In this section, I will walk you through the step-by-step process of extracting product details (Product Name, Brand Name, Product Quantity, Price and Product Description) from this website using BeautifulSoup, and finally storing the data in a CSV file in a readable format.

Note: I have gathered the EAN codes (unique product codes) for the grocery items; we will use them in our code to get the product details above.

Step 1: Installing and importing the required libraries in a Jupyter notebook.

pip install beautifulsoup4

pip install requests

from bs4 import BeautifulSoup as bs 
import requests # importing requests module to open a URL

Step 2: Define our EAN code list for which we need to extract data and assign it to a variable called ‘eanCodeLists’

eanCodeLists = [126906,40139631,40041188,40075201,40053874,1204742,40046735,40100963,40067874,40045943]

Let’s first check one EAN code, 40053874, to get the Product Name, Brand Name, Product Quantity, Price and Product Description, and then later use a for loop to iterate over the above list and get the details for all products.

Step 3: Open the URL using the requests.get() method, which makes an HTTP request to the web page.

urlopen = requests.get('https://www.bigbasket.com/pd/40053874').text

Step 4: Use BeautifulSoup to parse the HTML and assign the result to a variable called ‘soup’.

soup = bs(urlopen,'html.parser')

Output:

A few lines of the parsed HTML code

Step 5: Now let’s open the URL (https://www.bigbasket.com/pd/40053874) in our browser, right-click on the content we need, and note the corresponding HTML tags. We will use these tags in our code to get the required data.

Here, let’s right-click on the field ‘Weikfield Chilli Vinegar, 200 g’ to get the tag names. This single <h1> element gives us the brand name, product name, and quantity:

<h1 class="GrE04" style="-webkit-line-clamp:initial">Weikfield Chilli Vinegar, 200 g </h1>

Now let’s use BeautifulSoup to look up this tag and assign the result to a variable ‘ProductInfo’:

ProductInfo = soup.find("h1", {"class": "GrE04"}).text  # .text will give us the text underlying that HTML element

Step 6: Now we can use the split() method to separate these fields.

Here, split(' ', 1)[1] gives ‘Chilli Vinegar, 200 g ’ and split(',')[0] then splits on ‘,’ to give ‘Chilli Vinegar’.

The strip() method trims the surrounding whitespace.
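The split code itself appeared as an image in the original post; here is a sketch of what Step 6 describes, applied to the heading text extracted in Step 5 (variable names match the surrounding steps):

```python
# Heading text extracted in Step 5 (note the trailing space, as on the page)
ProductInfo = 'Weikfield Chilli Vinegar, 200 g '

BrandName = ProductInfo.split(' ', 1)[0]   # 'Weikfield'
rest = ProductInfo.split(' ', 1)[1]        # 'Chilli Vinegar, 200 g '
ProductName = rest.split(',')[0]           # 'Chilli Vinegar'
ProductQty = rest.split(',')[1].strip()    # '200 g' (strip() trims whitespace)
```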

Step 7: To get the Price and Product Description.

Price field tag: <td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>

Product description field tag: <div class="_26MFu "><style …

Thus we now have,

ProductName = Chilli Vinegar
BrandName = Weikfield
ProductQty = 200 g
ProductPrice = Rs 35
ProductDesc = The spiciness of a fresh green chilli diffusing its heat into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour notes.
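The extraction code for these two fields was shown as an image; the sketch below applies soup.find() to a hard-coded fragment mirroring the tags above, so it runs without fetching the page. The class names are the ones quoted in this article and may have changed on the live site.

```python
from bs4 import BeautifulSoup

# Hard-coded fragment mirroring the tags quoted above (class names from the
# article; the live site may have changed them)
html = '''
<td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
<div class="_26MFu ">The spiciness of a fresh green chilli diffusing its heat
into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour
notes.</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Normalize whitespace around the stray HTML comment inside the price cell
ProductPrice = ' '.join(soup.find('td', {'data-qa': 'productPrice'}).text.split())
ProductDesc = soup.find('div', {'class': '_26MFu'}).text.strip()
print(ProductPrice)   # Rs 35
```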

Step 8: We can now get the info for all EAN codes using the above code and a for loop.
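The loop itself appeared as an image; below is one possible sketch that bundles the find/split logic from Steps 5–7 into a helper and maps it over the EAN list. The tag and class names are taken from the earlier steps and may have changed on the live site; the network call is left commented out so the snippet runs offline.

```python
import requests
from bs4 import BeautifulSoup as bs

def parse_product(html):
    """Extract the product fields of Steps 5-7 from one product page's HTML."""
    soup = bs(html, 'html.parser')
    info = soup.find('h1', {'class': 'GrE04'}).text
    brand, rest = info.split(' ', 1)   # the brand is the first word
    return {
        'BrandName': brand,
        'ProductName': rest.split(',')[0].strip(),
        'ProductQty': rest.split(',')[1].strip(),
        'ProductPrice': soup.find('td', {'data-qa': 'productPrice'}).text.strip(),
        'ProductDesc': soup.find('div', {'class': '_26MFu'}).text.strip(),
    }

def scrape_all(ean_codes):
    """Fetch and parse each EAN code's product page (requires network access)."""
    products = []
    for code in ean_codes:
        page = requests.get(f'https://www.bigbasket.com/pd/{code}').text
        products.append(parse_product(page))
    return products

# Usage (network required):
# products = scrape_all([126906, 40139631, 40041188, 40075201, 40053874])
```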

Step 9: Using pandas to store the data in a DataFrame.

DataFrame data

Step 10: Finally, saving the data to CSV and JSON files (in our local directory).

JSON format
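Steps 9 and 10 were shown as images; a minimal pandas sketch follows, with one hard-coded product dict standing in for the list built in Step 8 so the snippet runs on its own. The file names are illustrative.

```python
import pandas as pd

# One hard-coded product dict standing in for the list built in Step 8
products = [{
    'BrandName': 'Weikfield',
    'ProductName': 'Chilli Vinegar',
    'ProductQty': '200 g',
    'ProductPrice': 'Rs 35',
    'ProductDesc': 'A unique fusion of spicy and sour notes.',
}]

df = pd.DataFrame(products)

# Save to CSV and JSON in the current directory (illustrative file names)
df.to_csv('products.csv', index=False)
df.to_json('products.json', orient='records')
```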

2. Scraping data in tabular format:

In this section, we will scrape the website (https://www.ssa.gov/OACT/babynames/decades/names2010s.html), which presents in tabular format the 200 most popular names for male and female babies born during the period 2010–2018 in the United States. (This is sample data based on Social Security card application data as of March 2019.)

Step 1: Importing libraries and using BeautifulSoup to parse HTML content

import requests
from bs4 import BeautifulSoup as bs
url = requests.get('https://www.ssa.gov/OACT/babynames/decades/names2010s.html').text
soup = bs(url,'html.parser')

Step 2: Let’s use the <table class="t-stripe"> tag to extract the table data.

table_content = soup.find('table',{'class':'t-stripe'})

Here, we see the tag ‘td’, which represents table data (a data cell), along with ‘th’ (table header) and ‘tr’ (table row). We will now use the ‘tr’ tags from ‘table_content’, each of which contains a combination of ‘td’ and ‘th’ elements.

data = table_content.findAll('tr')[0:202] # the header row plus the 200 data rows

Step 3: Now let’s use a for loop to iterate over these rows and collect the data into a list variable called ‘rows_data’.

First, let’s check the length of the ‘data’ using len(data)

and here is the code…
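The original snippet was shown as an image; here is a sketch of what Step 3 describes, demonstrated on a small hard-coded stand-in for the SSA table so it runs without fetching the page (the counts shown are illustrative, not real SSA figures). On the live page, ‘data’ comes from the findAll('tr') call above.

```python
from bs4 import BeautifulSoup

# Small stand-in for the SSA table (same column layout; counts are illustrative)
table_html = '''
<table class="t-stripe">
  <tr><th>Rank</th><th>Male name</th><th>Number</th>
      <th>Female name</th><th>Number</th></tr>
  <tr><td>1</td><td>Noah</td><td>12345</td><td>Emma</td><td>23456</td></tr>
  <tr><td>2</td><td>Liam</td><td>11111</td><td>Olivia</td><td>22222</td></tr>
</table>
'''
data = BeautifulSoup(table_html, 'html.parser').findAll('tr')

rows_data = []
for row in data[1:]:                                   # skip the header row
    cells = [cell.text.strip() for cell in row.findAll('td')]
    if cells:                                          # keep only rows with data cells
        rows_data.append(cells)

print(rows_data[0])   # ['1', 'Noah', '12345', 'Emma', '23456']
```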

Step 4: Now let’s use pandas to store the data in a DataFrame.
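The DataFrame step was also shown as an image; the sketch below assumes ‘rows_data’ holds the scraped rows, and uses column names chosen to match the query in Step 5 (the two hard-coded rows and their counts are illustrative).

```python
import pandas as pd

# Stand-in for the rows_data list built in Step 3 (counts are illustrative)
rows_data = [['1', 'Noah', '12345', 'Emma', '23456'],
             ['2', 'Liam', '11111', 'Olivia', '22222']]

df = pd.DataFrame(rows_data,
                  columns=['Rank', 'Male_Name', 'Male_Number',
                           'Female_Name', 'Female_Number'])
print(df.head())
```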

Step 5: We can now perform some operations on this data to get a few insights.

Let’s check how many times the name ‘Samuel’ was used:

df[df['Male_Name'] == 'Samuel'][['Male_Name','Male_Number','Rank']]

Conclusion

In this way, we can use web scraping with Python to extract useful data from almost any website for analysis. Some of the use cases for web scraping are:

☑ Weather forecasting
☑ Marketing
☑ Businesses / eCommerce: market analysis, price comparison, competition monitoring
☑ Media companies
☑ Gathering data from multiple sources for analysis
☑ Getting updated news reports
☑ Travel companies collecting live tracking details

and many more…

Thanks for reading and happy web scraping! 🙂

I look forward to your comments; please feel free to leave them below.

Mamtha

Data warehouse technologies | Big data | Data Science