Web Scraping made easy using Python
In this article, we will learn about web scraping techniques that allow us to extract useful data from any website using Python's BeautifulSoup library.
What is web scraping?
Web scraping is a technique for collecting large amounts of data from web pages and storing it in a required format, which then lets us perform analysis on the extracted data. BeautifulSoup, Python's package for parsing HTML and XML documents, helps us extract the data easily.
Below are the steps we will follow to extract the data using python:
- First, find the URL of the website to scrape.
- Inspect the page.
- Identify the data we want to extract.
- Write and run our Python code.
- Store the data in the required format.
BeautifulSoup is one of the most popular Python libraries for web scraping; it parses the HTML/XML content of any webpage. Check out the official BeautifulSoup page here.
In this learning process, we will go through the two web scraping sections below, exploring more of BeautifulSoup's functionality along the way:
1️. Scraping a retail webpage (the Bigbasket grocery website) to extract product information and store the data in a CSV/JSON file.
2️. Scraping tabular data from a website and turning it into a DataFrame using pandas.
Before moving forward, I would like to explain the basic concepts of HTML (the components of a webpage) and how to inspect a webpage manually.
HTML is a markup language used to structure web pages. It provides tags such as <li> for lists, <div> for divisions, and <p> for paragraphs.
Sample HTML document:
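A minimal, illustrative HTML document (all tags and text here are placeholders, not from any real page) looks like this:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph of text.</p>
    <div>
      <ul>
        <li>First list item</li>
        <li>Second list item</li>
      </ul>
    </div>
  </body>
</html>
```

The <head> holds metadata such as the title, while everything visible on the page lives inside <body>.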
Steps to inspect webpage:
- First, open the website URL in a browser (Chrome for this example).
- Right-click on the page and select ‘Inspect’.
- The ‘Chrome DevTools’ panel opens beside the page, where we can look at the HTML content of the webpage.
In our code, we will use requests to download this HTML content from the website and BeautifulSoup to parse it for data extraction.
Let’s get started!
1️. Scraping the Bigbasket website:
In this section, I will walk you through the step-by-step process of extracting product details such as Product Name, Brand Name, Product Quantity, Price, and Product Description from this website using BeautifulSoup, and finally storing the data in a CSV file in a readable format.
Note: I have already gathered the EAN codes (unique product codes) for the grocery items, which we will use in our code to fetch the product details above.
Step 1: Installing and importing the required libraries into a Jupyter notebook.
pip install beautifulsoup4
pip install requests
from bs4 import BeautifulSoup as bs
import requests # importing requests module to open a URL
Step 2: Define the EAN code list for which we need to extract data, and assign it to a variable called ‘eanCodeLists’.
eanCodeLists = [126906,40139631,40041188,40075201,40053874,1204742,40046735,40100963,40067874,40045943]
Let’s first check one EAN code, 40053874, to get the Product Name, Brand Name, Product Quantity, Price, and Product Description, and then use a for loop to iterate over the list above and get the details for all products.
Step 3: Open the URL using the requests.get() method, which makes an HTTP request to a web page.
urlopen = requests.get('https://www.bigbasket.com/pd/40053874').text
Step 4: Use BeautifulSoup to parse the HTML and assign the result to a variable called ‘soup’.
soup = bs(urlopen,'html.parser')
Step 5: Now let’s open the URL (https://www.bigbasket.com/pd/40053874) in our browser and right-click on the content we need to find the corresponding HTML tags. We will use these tags in our code to get the required data.
Here, let’s right-click on the field ‘Weikfield Chilli Vinegar, 200 g’ to get its tag name. This element gives us the Brand Name, Product Name, and Quantity:
<h1 class="GrE04" style="-webkit-line-clamp:initial">Weikfield Chilli Vinegar, 200 g </h1>
Now let’s use BeautifulSoup to refer to this tag and assign the result to a variable ‘ProductInfo’:
ProductInfo = soup.find("h1", {"class": "GrE04"}).text # .text will give us the text underlying that HTML element
Step 6: Now we can use the split() method to separate these fields. Here, split(' ', 1)[1] gives ‘Chilli Vinegar, 200 g ’, and applying split(',')[0] to that splits on ‘,’ and gives ‘Chilli Vinegar’.
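Putting those split() calls together, a minimal sketch (assuming ProductInfo holds the string extracted from the <h1> tag above):

```python
# 'ProductInfo' as extracted from the <h1> tag above
ProductInfo = "Weikfield Chilli Vinegar, 200 g"

BrandName = ProductInfo.split(' ', 1)[0]   # first word: 'Weikfield'
rest = ProductInfo.split(' ', 1)[1]        # everything after it: 'Chilli Vinegar, 200 g'
ProductName = rest.split(',')[0]           # text before the comma: 'Chilli Vinegar'
ProductQty = rest.split(',')[1].strip()    # text after the comma: '200 g'
```

The second argument to split(' ', 1) limits the split to the first space, so multi-word product names stay intact.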
Step 7: To get the Price and Product Description
Price field tag: <td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
Product description field tag: <div class="_26MFu "><style …
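A sketch of extracting those two fields with the class names shown above. The inline HTML here is a trimmed stand-in for the real product page, so the snippet is self-contained:

```python
from bs4 import BeautifulSoup as bs

# Trimmed stand-in for the real product page HTML
html = '''
<td data-qa="productPrice" class="IyLvo">Rs <!-- -->35</td>
<div class="_26MFu ">The spiciness of a fresh green chilli diffusing its heat
into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour notes.</div>
'''
soup = bs(html, 'html.parser')

# Locate the price cell by its data-qa attribute, the description by its class
ProductPrice = soup.find('td', {'data-qa': 'productPrice'}).text
ProductDesc = soup.find('div', {'class': '_26MFu'}).text.strip()
```

Matching on the data-qa attribute is often more stable than matching on auto-generated class names like IyLvo, which sites tend to change.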
Thus we now have:
ProductName = Chilli Vinegar
BrandName = Weikfield
ProductQty = 200 g
ProductPrice = Rs 35
ProductDesc = The spiciness of a fresh green chilli diffusing its heat into sharp vinegar makes this spicy vinegar a unique fusion of spicy and sour notes.
Step 8: We can now get the details for all EAN codes by combining the code above with a for loop.
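One way to sketch that loop. The helper names parse_product and scrape_product are my own, not from the original; the parsing is kept separate from the network call so it can be tested without hitting the site:

```python
import requests
from bs4 import BeautifulSoup as bs

eanCodeLists = [126906, 40139631, 40041188, 40075201, 40053874,
                1204742, 40046735, 40100963, 40067874, 40045943]

def parse_product(html):
    """Extract brand, name, and quantity from a product page's HTML."""
    soup = bs(html, 'html.parser')
    info = soup.find('h1', {'class': 'GrE04'}).text.strip()
    brand, rest = info.split(' ', 1)
    return {
        'BrandName': brand,
        'ProductName': rest.split(',')[0],
        'ProductQty': rest.split(',')[1].strip(),
    }

def scrape_product(ean):
    """Download one product page by EAN code and parse it."""
    html = requests.get(f'https://www.bigbasket.com/pd/{ean}').text
    return parse_product(html)

# Uncomment to fetch live data for every EAN code:
# products = [scrape_product(ean) for ean in eanCodeLists]
```

The Price and Description fields would be added to parse_product the same way, using the tags from Step 7.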
Step 9: Using pandas to store the data in a DataFrame.
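A sketch of that step, assuming the scraping loop collects one dict per product (the second record here is purely illustrative):

```python
import pandas as pd

# Records in the shape produced by the scraping loop (sample values)
products = [
    {'BrandName': 'Weikfield', 'ProductName': 'Chilli Vinegar',
     'ProductQty': '200 g', 'ProductPrice': 'Rs 35'},
    {'BrandName': 'SampleBrand', 'ProductName': 'Sample Product',
     'ProductQty': '1 kg', 'ProductPrice': 'Rs 40'},
]

# pandas turns a list of dicts into a table, one column per key
df = pd.DataFrame(products)
print(df)
```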
Step 10: Finally, saving the data to CSV and JSON files (in our local directory).
2️. Scraping data in tabular format:
In this section, we will scrape a website (https://www.ssa.gov/OACT/babynames/decades/names2010s.html) that has tabular data on the 200 most popular names for male and female babies born during 2010–2018 in the United States. (This is sample data based on Social Security card application data as of March 2019.)
Step 1: Importing libraries and using BeautifulSoup to parse HTML content
import requests
from bs4 import BeautifulSoup as bs

url = requests.get('https://www.ssa.gov/OACT/babynames/decades/names2010s.html').text
soup = bs(url,'html.parser')
Step 2: Let’s use the <table class="t-stripe"> tag to extract the table data
table_content = soup.find('table',{'class':'t-stripe'})
Here, we see the tags ‘td’ (table data cell), ‘th’ (table header), and ‘tr’ (table row). We will now use the ‘tr’ tags from ‘table_content’, each of which contains a combination of ‘td’ and ‘th’ elements.
data = table_content.findAll('tr')[0:202] # slice out the header row and the 200 name rows
Step 3: Now let’s use a for loop to iterate over these rows and collect the data into a list variable called ‘rows_data’. First, let’s check the length of ‘data’ using len(data).
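A sketch of that loop. The inline table is a two-row stand-in for the SSA page with illustrative counts, assuming each data row has five cells (rank, male name, male count, female name, female count):

```python
from bs4 import BeautifulSoup as bs

# Two-row stand-in for the SSA names table (illustrative counts)
html = '''
<table class="t-stripe">
  <tr><th>Rank</th><th>Male name</th><th>Number</th>
      <th>Female name</th><th>Number</th></tr>
  <tr><td>1</td><td>Noah</td><td>163,661</td><td>Emma</td><td>164,207</td></tr>
  <tr><td>2</td><td>Liam</td><td>155,286</td><td>Olivia</td><td>146,084</td></tr>
</table>
'''
table_content = bs(html, 'html.parser').find('table', {'class': 't-stripe'})
data = table_content.findAll('tr')

rows_data = []
for row in data[1:]:                                    # skip the header row
    cells = [cell.text.strip() for cell in row.findAll('td')]
    if cells:                                           # ignore rows with no data cells
        rows_data.append(cells)
```

Against the real page, the same loop runs over the `data` variable built in Step 2.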
Step 4: Now let’s use pandas to store the data in a DataFrame.
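A sketch of that step, assuming rows_data holds lists of strings as built above (sample values shown; the column names match the query in Step 5):

```python
import pandas as pd

# rows_data as produced by the loop above (illustrative values)
rows_data = [
    ['1', 'Noah', '163,661', 'Emma', '164,207'],
    ['2', 'Liam', '155,286', 'Olivia', '146,084'],
]

df = pd.DataFrame(rows_data,
                  columns=['Rank', 'Male_Name', 'Male_Number',
                           'Female_Name', 'Female_Number'])

# Convert the comma-formatted counts to integers for analysis
for col in ['Male_Number', 'Female_Number']:
    df[col] = df[col].str.replace(',', '').astype(int)
```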
Step 5: We can now perform some operations on this data to get a few insights.
Let’s check how many times the name ‘Samuel’ was used, and its rank.
df[df['Male_Name'] == 'Samuel'][['Male_Name','Male_Number','Rank']]
Conclusion
In this way, we can use web scraping with Python to extract useful data from almost any website and use it for all kinds of analysis. Some use cases for web scraping are:
☑ Weather forecasting
☑ Marketing
☑ Businesses / eCommerce: market analysis, price comparison, competition monitoring
☑ Media companies
☑ Gathering data from multiple sources for analysis
☑ Getting updated news reports
☑ Travel companies collecting live tracking details
and many more…
Thanks for reading and happy web scraping! 🙂
I look forward to your comments; please feel free to leave them below!