Master Web Scraping Completely From Zero To Hero 🕸

Using Beautiful Soup and Requests Library with One Project

Abhay Parashar
Nov 7, 2020 · 14 min read

Web scraping is a technique for extracting data from websites. The data can be text, links, tables, or images. Scraping private data from websites is illegal, but we can scrape anything that is publicly available and use it in our projects. Public data is data that anyone can see and use, for example, a list of books available on an e-store. Private data is data that belongs to a user or a company, for example, login credentials or cart details.

Web scraping is very useful in scenarios where we need to compare the prices of a product across different websites, or collect the reviews of a product from one or more websites.

We are going to cover web scraping completely, from zero to hero, in a total of five sections.

  1. Installation and Understanding
  2. Find vs Find_all vs Select
  3. Scraping Links, Tables, and Images From the web
  4. Pagination and Interaction With Files
  5. Project: Amazon Book Records Scraper

Installation and Understanding

In Python, we have many libraries and packages for web scraping, but the simplest and easiest to use is Beautiful Soup.

Beautiful Soup :

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for the parsed page that can be used to extract data from the source (HTML), which is exactly what we need for web scraping. It is available for both Python 2 and 3. Beautiful Soup is usually paired with the Requests module, which sends the connection request to the target server.

Requests :

Requests is a Python HTTP library, released under the Apache License 2.0. It is also one of the most downloaded packages in Python. The goal of this library is to make sending HTTP requests simpler and more human-friendly. With a single line of code we can establish a connection between our Python script and the target server.

Let's start by installing all the libraries we are going to need for web scraping. First, install the latest version of Python from python.org. Now open your terminal and create a folder; name it anything you like, but if you want to follow along exactly, call it Bs4. Inside the Bs4 folder we are going to create a virtual environment so that we can install all our packages in it. Type

python -m venv env

in the terminal. Here env is the environment name. Next, activate the environment: on Windows type env\scripts\activate (on macOS/Linux, source env/bin/activate). With the environment activated, install the packages using pip.

pip install requests
pip install beautifulsoup4

To check the setup, type python in the terminal and then import requests. If everything went well you will not see any errors, but if you do see errors, let me know below and I will try to help you out.
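A quick sanity check might look like this in the interactive interpreter, run from the activated environment; if both imports come back without a traceback, the installation worked:

python
>>> import requests
>>> import bs4
>>> exit()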

Structure of Web Page

HTML, which stands for HyperText Markup Language, is used to define the structure of a web page. It contains several tags in which the whole content of the page resides. Different tags do different jobs: for images there is the img tag, for headings there are the heading tags (h1-h6), and so on. We can also nest tags in HTML by putting one tag inside another. Tags can be given a unique id or a class, and these ids and classes are what CSS selectors target. Using these CSS selectors we perform scraping. Let's look at an example of a simple HTML web page.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- meta tags for seo -->
  <title>Sample Page</title>
  <!-- title -->
  <link rel="stylesheet" href="style.css">
  <!-- link to css -->
  <link rel="icon" href="#">
  <!-- other links -->
</head>
<body>
  <h1>This is Heading h1</h1>
  <!-- very big heading -->
  <p>This is paragraph text</p>
  <!-- paragraph -->
  <a href="#">link</a>
  <!-- link to an external page -->
  <div>
    <h2></h2>
    <!-- big heading -->
    <p></p>
  </div>
  <div id="main" class="main-div">
    <!-- container with an id and a class -->
    <p class="title-paragraph"></p>
    <!-- paragraph with a class -->
    <p id="body">HELLO</p>
    <!-- paragraph with an id -->
  </div>
</body>
</html>

In the HTML code above, the whole document is divided into two main tags, head and body. In the head tag we have meta tags, the title, and links to different files and APIs. In the body tag we have heading tags, paragraph tags, image tags, links, and div tags. A div tag is like a container; we can put anything inside it. The second div also has a unique id and a class name. Using these class names and ids, we are going to scrape data from the web page.

How Web Scraping Works


First, we send an HTTP connection request to the server, and the server replies with a response. If the response code is 200, we have successfully established a connection between our machine and the web server. The response already contains the page's source code; we then hand that source code to the Beautiful Soup library and, using its functions, extract the content we need.

The code below shows how to send a request to the server and check the response.

import requests
url = "http://quotes.toscrape.com/"
res = requests.get(url)
print(res) ## <Response [200]>

Now let's fetch the source code of the web page using the Beautiful Soup library.

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text,'html.parser')
print(soup) ##It will Print out all the source code

Find vs Find_all vs Select

find, find_all, and select are all used to extract content from the source code using tag names and CSS selectors. Let's look at each one.

Find

It finds the first occurrence of an element or selector. It is mainly used to fetch headings, titles, and product names from a web page. Let's understand it with an example. Below is the source code of a web page.

<body>
  <div id="main-div">
    <div id="secondary-div">
      <h1 id="first-h1">Python</h1>
    </div>
    <div id="third-div">
    </div>
  </div>
  <h1 id="second-h1">Web Scraping</h1>
  <p id="first-p"></p>
</body>

In the source code above, if we call soup.find('div'), the result is the first div, the one with the id main-div. Similarly, soup.find('h1') returns the first h1, which sits inside the div with the id secondary-div. To get the text out of the result we use .text, so to fetch the text Python we run soup.find('h1').text.
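Here is a minimal, self-contained sketch of this, parsing the sample snippet above from a string instead of fetching it from a server:

from bs4 import BeautifulSoup

html = """
<body>
  <div id="main-div">
    <div id="secondary-div">
      <h1 id="first-h1">Python</h1>
    </div>
    <div id="third-div"></div>
  </div>
  <h1 id="second-h1">Web Scraping</h1>
  <p id="first-p"></p>
</body>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div').get('id'))  ## main-div (the first div in the document)
print(soup.find('h1').text)        ## Python (the first h1 in the document)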

We can also use CSS selectors like class and id to narrow down what we fetch. If you need the text Web Scraping using find, you would use the selector soup.find('h1',{'id':'second-h1'}).text. Let's see it working on a real-world example.

from bs4 import BeautifulSoup
import requests

url = "http://quotes.toscrape.com/"
res = requests.get(url)
soup = BeautifulSoup(res.text,'html.parser')

print(soup.find('h1').text)
## Text of the first heading (h1)
print(soup.find('a',{'class':'tag'}).text)
## Text of the first anchor tag with the class `tag`
print(soup.find('div', {'class': 'tags'}))
## The first div tag with the class `tags`

Find All

It finds all the occurrences of an element or selector. It is mainly used to scrape data from tables, product reviews, details, and product listings on a web page. Let's understand it with the same example we saw above.

<body>
  <div id="main-div">
    <div id="secondary-div">
      <h1 id="first-h1">Python</h1>
    </div>
    <div id="third-div">
    </div>
  </div>
  <h1 id="second-h1">Web Scraping</h1>
  <p id="first-p"></p>
</body>

In the source code above, soup.find_all('div') returns every div present in the source, and soup.find_all('h1') returns every h1 tag. One thing to remember is that find_all always returns its output as a list, so to fetch the text we either index into the list or loop over its elements with a for loop. Let's see it in a real-world example.

from bs4 import BeautifulSoup
import requests

url = "http://quotes.toscrape.com/"
res = requests.get(url)
soup = BeautifulSoup(res.text,'html.parser')

print(soup.find_all('h1'))
## Returns all the h1 tags in a list
print(soup.find_all('div', {'class': 'tags'}))
## Returns all the divs with the class `tags` in a list

If we want to fetch all the tags from the web page, we run a for loop over the list of tag elements and, for each element, clean it up and fetch its text.

tags = soup.find_all(class_="tags")
lst = []
for tag in tags:
    lst.append(tag.text.replace("\n"," ").strip())
lst2 = [tag.replace(" ","") for tag in lst]
print(lst2) ## Prints all the tags on the web page
“All The Tags On The Web Page”

Select

select returns the matching elements directly as a list. It accepts both CSS selectors and tag selectors, which makes it more powerful than find and find_all, and like find_all it returns its output as a list. To select something, we just pass a tag name, a class name, or an id to select. Let's understand it with the same HTML code as above.

<body>
  <div id="main-div">
    <div id="secondary-div">
      <h1 id="first-h1">Python</h1>
    </div>
    <div id="third-div">
    </div>
  </div>
  <h1 id="second-h1">Web Scraping</h1>
  <p id="first-p"></p>
</body>

Using select, if we want all the divs we just pass div as the parameter and every div gets selected: soup.select('div'). If we want the text Python, we select by id and call get_text() on the first match: soup.select('#first-h1')[0].get_text() (remember that select returns a list). As in CSS, we use # for an id and . for a class. Let's try it in a real-world example.
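Before that, a minimal sketch of select on the sample snippet above, showing the list indexing:

from bs4 import BeautifulSoup

html = '<body><h1 id="first-h1">Python</h1><h1 id="second-h1">Web Scraping</h1></body>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('h1'))                        ## list containing both h1 tags
print(soup.select('#first-h1')[0].get_text())   ## Python
print(soup.select('#second-h1')[0].get_text())  ## Web Scraping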

from bs4 import BeautifulSoup
import requests

url = "https://webscraper.io/test-sites/e-commerce/allinone"
res = requests.get(url)
soup = BeautifulSoup(res.text,'html.parser')

print(soup.select('p'))
## Returns all p tags in a list
print(soup.select('.title')[0].get_text())
## Returns the text of the first element with the class `title`

Now, if we want to scrape all the product details using select, the code looks like this:

names = soup.select(".title")
for i in range(0,len(names)):
    name = soup.select(".title")[i].get_text()
    price = soup.select(".price")[i].text
    description = soup.select(".description")[i].get_text()
    print(name)
    print(description)
    print(price,end="\n\n")

Scraping Links, Tables, and Images From the web

Let's start by setting up our environment, importing all the libraries, and fetching the source code from the server.

import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone"
res = requests.get(url)
soup = BeautifulSoup(res.text,'html.parser')

Websites used in this section:
https://webscraper.io/test-sites/e-commerce/allinone

Links

Scraping links can be very useful and handy whenever we are trying to fetch product details from a website. In HTML, every link lives in the href attribute of an anchor tag. To scrape a link, we first find the anchor tag and then read its href attribute.

Scraping all the links from the web page: To scrape every link on the page we can use either find_all or select and simply look for all the anchor (a) tags.

for a_tag in soup.find_all("a"):
    href = a_tag.attrs.get("href")
    if href != "":
        print(href)

The output will be all the Links available on the web page.

Scraping links from a particular selector: To scrape a link from a particular element, we first select that element and then fetch the value of its href attribute.

div = soup.find('div',{'class':'col-sm-4 col-lg-4 col-md-4'})
a = div.find('a')
link = a.attrs.get("href")
print(link)

Tables

Scraping tables is one of the most common use cases for web scraping. Whenever we are creating a dataset for a data science or machine learning project, we often need to fetch the data from a third-party resource. In general, to scrape the data from a table we first find the table tag, then scrape the head and body of the table one by one. There are many kinds of tables on the web, and the scraping process differs for each. Below we will look at three types of tables and see how to scrape data from them.

  1. Table with a proper thead and tbody: First we fetch the table, then the table header, and after that we combine it with the table body. We are also going to save the table data into a CSV file.
import csv

table = soup.select('.table')[0]
table_header = table.find('thead').find_all('th')
with open('table.csv','a',newline='') as f:
    writer = csv.writer(f)
    header = []
    for th in table_header:
        header.append(th.text)
    print(header)
    writer.writerow(header)
    for row in table.find_all('tr'):
        body = []
        for data in row.find_all('td'):
            body.append(data.text)
        print(body)
        writer.writerow(body)

2. Table without thead: This scenario is even simpler than the previous one because we only have to scrape the table body. We scrape all the tr elements and then, in each tr, scrape all the td elements and append them to a list.

table = soup.select('.table')[0]
with open('table.csv','a',newline='') as f:
    writer = csv.writer(f)
    for row in table.find_all('tr'):
        body = []
        for data in row.find_all('td'):
            body.append(data.text)
        print(body)
        writer.writerow(body)

3. Multiple headers and rows with an empty row

table = soup.select('.table')[3]
with open('table.csv','a+',newline='') as f:
    writer = csv.writer(f)
    for row in table.find_all('tr'):
        csvRow = []
        for data in row.find_all('th'):
            csvRow.append(data.text)
        for data in row.find_all('td'):
            csvRow.append(data.text)
        print(csvRow)
        writer.writerow(csvRow)

Images

Scraping images can be very handy when we are preparing a dataset for a deep learning model. It can also be tricky, because the process differs from website to website. In this blog, I will show you how to scrape images from Shutterstock. To scrape an image, we first fetch the img tag, then read its src attribute, and finally use the requests module to download and save the image as a PNG or JPG file.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.shutterstock.com/search/fruit")
soup = BeautifulSoup(res.text,"html.parser")

## Collect the src of every img tag that actually has one
links = []
tags = soup.find_all("img")
for tag in tags:
    src = tag.attrs.get('src')
    if src is not None:
        links.append(src)
print(links)

## Download each image and save it as a numbered .jpg file
image_count = 1
for image in links:
    with open('image_'+str(image_count)+'.jpg', 'wb') as f:
        res = requests.get(image)
        f.write(res.content)
    print("Saving image_"+str(image_count))
    image_count += 1
“Saved Images Using Web Scraping”

Pagination and Interaction With Files

Pagination, or paging, is the process of dividing content across multiple discrete pages. It is extremely common on websites today. Most websites don't show all their data on one page, both for SEO reasons and to keep pages loading quickly. You will see different kinds of pagination on different websites: some only have next and previous buttons, some have page numbers, and some, like Amazon, have both.

Different Types of Pagination

Whenever we click to the next page, something changes in the URL. Suppose our starting URL is https://www.xyz.com; after clicking through to the next page, the URL becomes something like https://www.xyz.com/page=2 or https://www.xyz.com/page/2. The exact change differs from website to website, but the common part is the page number.

Now, to scrape the data from all the following pages, we simply run a loop and change the page number in each iteration. Let's walk through it step by step using the test site http://quotes.toscrape.com/.

Step 1: Scraping the data from the home page

## Importing libraries
import requests
from bs4 import BeautifulSoup

## Fetching the whole source code
res = requests.get("http://quotes.toscrape.com/")
soup = BeautifulSoup(res.text,"html.parser")

## Finding the total number of quotes on the page
length = len(soup.select(".text"))

## Running a loop and fetching each quote and author name
for i in range(0,length):
    quote = soup.select(".text")[i].get_text().strip()
    author = soup.select('.author')[i].get_text().strip()
    print(quote)
    print(author)

Step 2: Preparing the Custom URL

### Number of pages to scrape
page = 10

### Custom URL
URL = 'http://quotes.toscrape.com/page/2/'
### If we replace the 2 with a 3 we are redirected to page 3, and so on.
### So we just need to replace the 2 with i and change the value of i in a for loop.

### Running a loop over the page numbers and generating the URLs
for i in range(1,page+1):
    URL = f"http://quotes.toscrape.com/page/{i}/"
    print(URL)

Step 3: Using the Custom URL fetching all the Quotes

page = 10  ## Number of pages you want to scrape

for i in range(1,page+1):
    res = requests.get(f"http://quotes.toscrape.com/page/{i}/")
    soup = BeautifulSoup(res.text,"html.parser")
    ### Finding the number of quotes on this page
    length = len(soup.select(".text"))
    ### Scraping all the quotes along with the author name
    for j in range(0,length):
        quote = soup.select(".text")[j].get_text().strip()
        author = soup.select('.author')[j].get_text().strip()
        print(quote)
        print(author)
“Quotes From All the pages”

It is always a better idea to save all the data that we scrape into a file.

Let's use a CSV file and save all the quotes in it.
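Here is a minimal sketch of that, reusing the pagination loop from step 3 and Python's csv module (the file name quotes.csv is just an example):

import csv
import requests
from bs4 import BeautifulSoup

page = 10  ## Number of pages to scrape
with open('quotes.csv','a',newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["quote","author"])  ## header row
    for i in range(1,page+1):
        res = requests.get(f"http://quotes.toscrape.com/page/{i}/")
        soup = BeautifulSoup(res.text,"html.parser")
        length = len(soup.select(".text"))
        for j in range(0,length):
            quote = soup.select(".text")[j].get_text().strip()
            author = soup.select('.author')[j].get_text().strip()
            writer.writerow([quote,author])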

Project: Amazon Book Records Scraper

The project we are going to build can be very handy whenever we are looking to buy books and want a quick comparison of their details.

In this project, we are going to build a web scraper that can scrape book records from Amazon, including the book name, price, and a link to the book. We will also save all the records in a CSV file. Let's start.

  1. Importing all the required Libraries
from bs4 import BeautifulSoup
import requests
import csv

2. Defining headers: Some websites, like Amazon, block incoming requests that look like they come from a bot, so we send headers that make our request look like it comes from a browser, and Amazon doesn't block it.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

3. Generating URL Based on a Book Name

def amazon(book):
    book = book.replace(" ",'+')
    url = f'https://www.amazon.in/s?k={book}&ref=nb_sb_noss_2'
    print(url)

amazon(input("Enter the book name\n"))

4. Scraping Book Names and Prices From The Generated URL

def amazon(book):
    book = book.replace(" ",'+')
    url = f'https://www.amazon.in/s?k={book}&ref=nb_sb_noss_2'
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'html.parser')
    length = len(soup.select(".a-size-medium"))
    for i in range(length):
        price = soup.select(".a-spacing-top-small .a-price-whole")[i].get_text().strip()
        names = soup.select(".a-color-base.a-text-normal")[i].get_text().strip()
        link = soup.select("h2 .a-link-normal")[i].attrs.get("href")

amazon(input("Enter the book name\n"))

5. Removing the records that don't have a price, and handling the deal price
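One way to do this, as a sketch that reuses the imports, headers, and selectors from the earlier steps (not necessarily the exact handling in the final code): guard the price lookup with a try/except and skip records whose price can't be found. If names and prices drift out of sync on some pages, a more robust variant would select each result card first and read the price inside that card.

def amazon(book):
    book = book.replace(" ",'+')
    url = f'https://www.amazon.in/s?k={book}&ref=nb_sb_noss_2'
    res = requests.get(url,headers=headers)
    soup = BeautifulSoup(res.text,'html.parser')
    length = len(soup.select(".a-size-medium"))
    for i in range(length):
        names = soup.select(".a-color-base.a-text-normal")[i].get_text().strip()
        link = soup.select("h2 .a-link-normal")[i].attrs.get("href")
        try:
            price = soup.select(".a-spacing-top-small .a-price-whole")[i].get_text().strip()
        except IndexError:
            continue  ## no price found for this record, skip it
        print(names, price, link)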

6. Asking the user to enter the number of pages and then looping over all the pages
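A sketch of the two-argument version of the function (the &page={i} query parameter is an assumption about how Amazon's search pagination works):

def amazon(book, pages):
    book = book.replace(" ",'+')
    for i in range(1, pages+1):
        ## &page={i} is assumed to step through the search result pages
        url = f'https://www.amazon.in/s?k={book}&page={i}&ref=nb_sb_noss_2'
        res = requests.get(url,headers=headers)
        soup = BeautifulSoup(res.text,'html.parser')
        length = len(soup.select(".a-size-medium"))
        ## ...scrape names, prices, and links from `soup` exactly as in steps 4 and 5...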

amazon(input("Enter the book name\n"),int(input("Number of pages")))

7. Saving All the Records in a CSV file

-
-
-
with open('books.csv','a',newline='') as f:
    writer = csv.writer(f)
    lst = []
    -
    -
    -
    lst = [names,price,link]
    writer.writerow(lst)

8. Final Code
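Putting all the steps together, a minimal sketch of the full scraper might look like the following. It is only a sketch built on the assumptions above: the CSS selectors reflect Amazon's markup at the time of writing, the &page= parameter is assumed to drive pagination, and records without a price are simply skipped.

from bs4 import BeautifulSoup
import requests
import csv

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

def amazon(book, pages):
    book = book.replace(" ",'+')
    with open('books.csv','a',newline='') as f:
        writer = csv.writer(f)
        for i in range(1, pages+1):
            ## assumed pagination parameter
            url = f'https://www.amazon.in/s?k={book}&page={i}&ref=nb_sb_noss_2'
            res = requests.get(url,headers=headers)
            soup = BeautifulSoup(res.text,'html.parser')
            length = len(soup.select(".a-size-medium"))
            for j in range(length):
                names = soup.select(".a-color-base.a-text-normal")[j].get_text().strip()
                link = soup.select("h2 .a-link-normal")[j].attrs.get("href")
                try:
                    price = soup.select(".a-spacing-top-small .a-price-whole")[j].get_text().strip()
                except IndexError:
                    continue  ## skip records without a price
                writer.writerow([names, price, link])
                print(names, price, link)

amazon(input("Enter the book name\n"), int(input("Number of pages")))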


Thanks for reading. If you have any issue with any of the code, let me know below.

If you found this blog useful, then 👋

Thanks For Reading😀😀

About the Author

I am Abhay Parashar, a computer science student with an interest in data science and AI. I write articles related to data science, machine learning, and AI. If you want, you can connect with me on LinkedIn.

Find all the source code with 2 more projects here : GitHub

“The More You Learn The More You Earn”


Pythoneers

Everything You Can Do With Python Lives Here
