Concepts of Web Scraping with Python, Requests & Beautiful Soup — Part 3

Srijeet Chatterjee
9 min read · Sep 11, 2021


Hi readers, it's time to shoot through the concepts…

(Scene from "Die Another Day": Q's briefing, the "shoot through" scene)

Hopefully, after the theory and introduction in Part 1, you were able to understand and completely solve the major task in Part 2. Just to make sure that wasn't a fluke, let's solve a few small conceptual problems. Once you have solved all of them, you can confidently tell yourself you have understood web scraping with Beautiful Soup conceptually. So let's start.

We will be solving 4 medium tasks here. But remember, the rule for maximum benefit is: "Complete the tasks on your own before going through the solution/article."

You will find the entire code used in this article in my GitHub repo, mentioned below.

Task 0 : Extract the title from the page "https://www.example.com/"

(Screenshot: the example.com webpage)

We will start with importing the libraries.

# import the libraries
import lxml
import bs4
import requests

Then fetch the page content and verify its data type.

# crawl the complete content
content = requests.get("https://www.example.com/")
print(type(content))
O/P : <class 'requests.models.Response'>
print(content)
O/P : <Response [200]>
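
A quick sanity check: <Response [200]> means the request succeeded. In a real script you may also want to fail loudly on bad responses; a minimal sketch using requests' built-in helper:

# raise an HTTPError for a 4xx/5xx response instead of silently parsing an error page
content = requests.get("https://www.example.com/")
content.raise_for_status()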

We would like to keep only the text and get the contents as a string.

textContent = content.text
print(type(textContent))
O/P : <class 'str'>
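
A side note: .text gives a decoded str (requests guesses the encoding from the response headers), while .content gives the raw bytes. We will use the bytes form later when saving an image.

print(type(content.content))
O/P : <class 'bytes'>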

Let's see what is in textContent.

textContent
(Output: the raw HTML of the page as one string)

Next, let's parse the string with Beautiful Soup:

formattedText = bs4.BeautifulSoup(textContent,'lxml')
print(type(formattedText))
O/P : <class 'bs4.BeautifulSoup'>

Now let's look at formattedText.

formattedText
(Output: the parsed HTML)

This task is very easy: we just want to grab the element with the "h1" tag. So let us do that.

selectedContent = formattedText.select('h1')
print(type(selectedContent))
O/P : <class 'bs4.element.ResultSet'>

Let's see the content of the grabbed element.

selectedContent
O/P : [<h1>Example Domain</h1>]

We will select the first element of the ResultSet using list indexing, then keep just its text and drop everything else.

selectedContent[0]
O/P : <h1>Example Domain</h1>
selectedContent[0].getText()
O/P : 'Example Domain'

BINGO Task 0 done !!!
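
For reference, the whole of Task 0 condenses to just a few lines — a minimal recap sketch:

import bs4
import requests

content = requests.get("https://www.example.com/")
formattedText = bs4.BeautifulSoup(content.text, 'lxml')
print(formattedText.select('h1')[0].getText())
O/P : Example Domain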

Task 1 : Extract table of contents from a webpage.

The URL of the webpage is given below. It is the Wikipedia page of the famous American computer scientist Grace Hopper.

url = 'https://en.wikipedia.org/wiki/Grace_Hopper'

The task is to extract the table of contents from the page, as depicted in the picture below:

(Screenshot: the table of contents on the page)

The first step never changes: import the libraries and get the HTML content using the requests module. I am not loading the libraries again here; I hope you are all fine with that.

scrappedtext = requests.get(url)
scrappedtextStr = scrappedtext.text
processedScrapped = bs4.BeautifulSoup(scrappedtextStr,'lxml')
print(type(processedScrapped))
O/P : <class 'bs4.BeautifulSoup'>

Now, once you look through the HTML page source, you will find that (as per my inspection) the table-of-contents entries are under elements with the class "toclevel-1". So let us grab the HTML elements matching that selector, get the text out of them, and put it in a container, finalList.

elementLists = processedScrapped.select('.toclevel-1')
finalList = []
for ele in elementLists:
    finalList.append(ele.getText())
    print(ele.getText())

The output of the above code is the text of each top-level entry, with its nested sub-entries included.

If you inspected differently, you may have realized an even better way to get the table of contents: selecting the 'toctext' class.

processedScrapped.select('.toctext')
(Output: the list of Tag elements matching '.toctext')

But before going ahead, let us print the types and size of the result, just to ensure we are on the right track.

print(type(processedScrapped.select('.toctext')))
O/P : <class 'bs4.element.ResultSet'>
print(len(processedScrapped.select('.toctext')))
O/P : 25
print(type(processedScrapped.select('.toctext')[0]))
O/P : <class 'bs4.element.Tag'>

So let us put the code together, get the table of contents via the '.toctext' elements, and collect it in an output list.

elementLists = processedScrapped.select('.toctext')
finalList = []
for ele in elementLists:
    finalList.append(ele.getText())

    # .getText() on a bs4.element.Tag object
    print(ele.getText())

    # .text on a bs4.element.Tag object gives the same result
    print(ele.text)
(Output: the 25 section titles)
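
As an aside, the same collection step collapses into a one-line list comprehension:

# equivalent to the loop above, minus the debug prints
finalList = [ele.getText() for ele in processedScrapped.select('.toctext')]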

Bingo 😃

Task 2 : Scrape particular images from a specific URL

url : https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)

We want to scrape two images: (1) the Deep Blue computer and (2) Kasparov playing chess.

(Screenshot: the Deep Blue Wikipedia page)

url = 'https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)'
scrappedtext = requests.get(url)
scrappedtextStr = scrappedtext.text
processedScrapped = bs4.BeautifulSoup(scrappedtextStr,'lxml')
print(type(processedScrapped))
O/P : <class 'bs4.BeautifulSoup'>

Now let's look at processedScrapped.

processedScrapped

Let's first try selecting every 'img' tag:

selectedContent = processedScrapped.select('img')
print(len(selectedContent))
O/P : 9
print(selectedContent[1])
O/P : <img alt="Chess Programming.svg" data-file-height="60" data-file-width="60" decoding="async" height="150" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/150px-Chess_Programming.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/225px-Chess_Programming.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/300px-Chess_Programming.svg.png 2x" width="150"/>

This selector is too generic: it selects more than the 2 images we want (it matched 9). I want to select only the images inside the main article.

Let us inspect more closely. Let me try the '.image' class, as a lot of images seem to share it.

selectedContent = processedScrapped.select('.image')
print(len(selectedContent))
O/P : 5
print(selectedContent[1])
O/P :
<a class="image" href="/wiki/File:Chess_Programming.svg"><img alt="Chess Programming.svg" data-file-height="60" data-file-width="60" decoding="async" height="150" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/150px-Chess_Programming.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/225px-Chess_Programming.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/52/Chess_Programming.svg/300px-Chess_Programming.svg.png 2x" width="150"/></a>

As you can see, this set of elements is also not the right one, as it returns 5 image objects. So let us inspect again.

(Screenshot: the page with its page source open)

selectedContent = processedScrapped.select('.thumbimage')
print(len(selectedContent))
O/P : 1

And yes, finally, this time it has rightly selected the Kasparov image at the bottom of the page.

print(selectedContent[0].text)
O/P : [empty as there is no text]
print(selectedContent[0].getText())
O/P : [empty as there is no text]
print(selectedContent[0]['src'])
O/P : //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Kasparov_Magath_1985_Hamburg-2.png/220px-Kasparov_Magath_1985_Hamburg-2.png

Basically, I want a specific part of this element, and .getText()/.text returned nothing because the tag carries no text. The thing to notice here is that a Tag is a special object you can treat like a dictionary, grabbing its attributes such as 'src' in our scenario.
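
A quick illustration of that dictionary-like interface (the attributes shown are simply whatever this particular tag carries):

tag = selectedContent[0]
print(tag.attrs)          # every attribute of the tag, as a dict
print(tag['src'])         # raises a KeyError if 'src' were missing
print(tag.get('srcset'))  # returns None instead of raising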

Note that the src is protocol-relative (it starts with //), so we prefix it with https: and fetch the raw image bytes:

image = requests.get("https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Kasparov_Magath_1985_Hamburg-2.png/220px-Kasparov_Magath_1985_Hamburg-2.png")
print(type(image.content))
O/P : <class 'bytes'>

So let us just save it; you can check it locally later. Then we move on to the next task.

with open('myImage.png','wb') as f:
    f.write(image.content)
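
If you end up saving several images, a small helper keeps things tidy. A sketch — the function name and the urljoin handling of protocol-relative URLs are my own choices, not part of the original code:

from urllib.parse import urljoin

def download_image(src, filename, base='https:'):
    # turn a protocol-relative '//upload...' src into a full 'https://upload...' URL
    full_url = urljoin(base, src)
    response = requests.get(full_url)
    response.raise_for_status()   # fail loudly on a bad response
    with open(filename, 'wb') as f:
        f.write(response.content)

download_image(selectedContent[0]['src'], 'myImage.png')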

And now comes the final task, which is a complete end-to-end task as well.

Task 3 : Get the title of every book with a 2-star rating from a site

url = 'https://books.toscrape.com/'
(Screenshot: the books.toscrape.com home page)

# let's load the libraries one last time in this article series
import bs4,lxml,requests
scrapped = requests.get('https://books.toscrape.com/')
soup = bs4.BeautifulSoup(scrapped.text,'lxml')

After some investigation, let us check how many "h3" elements there are.

len(soup.select('h3'))
O/P : 20

Seems good, right? It's 20, and if you count the number of books on the page, there are 20 books per page. So we're good to go.

But when we try to extract the text, we realize the book names are truncated in the selected elements, as you can see in the code below.

soup.select('h3')[0].getText()
O/P : 'A Light in the ...'

So I tried the ".product_pod" class, and this selector also returned 20 elements.

len(soup.select('.product_pod'))
O/P : 20

And then I took the first element and printed the output.

listBooks = soup.select('.product_pod')
firstBook = listBooks[0]
firstBook

The output is the full HTML of the first .product_pod element, including its star-rating class and its links.

And I wanted to see whether selecting '.star-rating.Three' would give me any fruitful result.

firstBook.select('.star-rating.Three')

The output on its own was not what I wanted. The point is: how will I filter for books with a rating of 2, 3, or 4?

What I can do is check whether the selection comes back empty or not: a book matches '.star-rating.Two' only if it actually has a two-star rating.

firstBook.select('.star-rating.Two')
O/P : [ ]
firstBook.select('.star-rating.Two') == []
O/P : True
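
A slightly more direct way to express the same check — select_one returns None when nothing matches (my preference, not the article's original approach):

# True when the book carries the two-star rating class
has_two_stars = firstBook.select_one('.star-rating.Two') is not None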

So now we know how to check the rating. Next, to get the full book name, we will look at the 'a' tags inside each element.

firstBook.select('a')
O/P :
[<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

firstBook.select('a')[1]
O/P : <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

Looking at this, I feel we can get the full name as well. Let's check the type and try to extract it.

type(firstBook.select('a')[1])
O/P : bs4.element.Tag
firstBook.select('a')[1].getText()
O/P : 'A Light in the ...'

But this is still not the full book name; the full name is not in the tag's text. It is, however, in the tag's 'title' attribute:

firstBook.select('a')[1]['title']
O/P : 'A Light in the Attic'

Bingo.
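
With the rating check and the title attribute together, a whole page can be processed in one pass — a small sketch:

# titles of all two-star books on the current page
two_star_titles = [
    book.select('a')[1]['title']
    for book in soup.select('.product_pod')
    if book.select('.star-rating.Two') != []
]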

Now the question that remains is: what is the logic to check how many pages there are?

So, let's try something which I know is not there: page number 100.

url = 'https://books.toscrape.com/catalogue/page-100.html'

(Screenshot: the message on the page that does not exist)

If you analyze this page, you will find its title text is "404 Not Found", and we will use exactly that!!!

i = 100
scrapped = requests.get(f"https://books.toscrape.com/catalogue/page-{i}.html")
soup = bs4.BeautifulSoup(scrapped.text,'lxml')
soup
soup.select('title')[0].getText()
O/P : '404 Not Found'
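
Equivalently, and arguably more robustly, you could check the HTTP status code instead of parsing the title — a minimal alternative sketch:

scrapped = requests.get("https://books.toscrape.com/catalogue/page-100.html")
print(scrapped.status_code)   # expect 404 for a missing page, 200 for a real one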

Let's now combine both components to get the final result:

i = 1
scrapped = requests.get(f"https://books.toscrape.com/catalogue/page-{i}.html")
soup = bs4.BeautifulSoup(scrapped.text,'lxml')
responsepageTitle = soup.select('title')[0].getText()

list_of_book_names = []

while responsepageTitle != '404 Not Found':

    # scrape the data from the current page
    list_of_book_products = soup.select('.product_pod')

    for book in list_of_book_products:
        if book.select('.star-rating.Two') != []:
            # append the scraped title to the final list
            list_of_book_names.append(book.select('a')[1]['title'])

    # move on to the next page
    i = i + 1
    scrapped = requests.get(f"https://books.toscrape.com/catalogue/page-{i}.html")
    soup = bs4.BeautifulSoup(scrapped.text,'lxml')
    responsepageTitle = soup.select('title')[0].getText()

Job done !!! Let us see the output.

list_of_book_names[:10]
O/P :
['Starving Hearts (Triangular Trade Trilogy, #1)',
'Libertarianism for Beginners',
"It's Only the Himalayas",
'How Music Works',
'Maude (1883-1993):She Grew Up with the country',
"You can't bury them all: Poems",
'Reasons to Stay Alive',
'Without Borders (Wanderlove #1)',
'Soul Reader',
'Security']

Perfect !!!!
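
If you prefer the whole crawl packaged up, here is a sketch that wraps it in a function; the name get_titles_by_rating and its rating parameter are my own additions, not part of the original walkthrough:

def get_titles_by_rating(rating='Two'):
    # crawl books.toscrape.com and collect titles carrying the given star-rating class
    titles = []
    i = 1
    while True:
        page = requests.get(f"https://books.toscrape.com/catalogue/page-{i}.html")
        if page.status_code == 404:   # no more pages
            break
        soup = bs4.BeautifulSoup(page.text, 'lxml')
        for book in soup.select('.product_pod'):
            if book.select_one(f'.star-rating.{rating}') is not None:
                titles.append(book.select('a')[1]['title'])
        i += 1
    return titles

two_star_books = get_titles_by_rating('Two')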

So with this, we come to the end of this article series, and I hope you have learned web scraping with Beautiful Soup to a good extent. There are multiple other ways of scraping, and maybe I will write a separate article series on those.

Please go back and clap for all my articles in case you liked the series 👌

Once again, the full code is in my Git repo. Clone it and play along :)
