Concepts of Web Scraping with Python, Requests & Beautiful Soup — Part 2

Srijeet Chatterjee
9 min read · Sep 4, 2021


Now that we have gone through all the necessary concepts, it is time to get our hands dirty by solving an end-to-end web scraping task.

The best way to learn a programming language is to code in it; similarly, the best way to learn a tool is to use it to solve tasks until you master it!

We will solve one such end-to-end task in this article, with a live coding walkthrough so that you understand the whole story. For any additional questions, you can always use the comments section.

Task 1: “Go to the page https://en.wikipedia.org/wiki/Android_version_history and scrape the table content present there. Put the content, in an organised way, into a CSV file. Read the CSV file as a pandas data-frame and display the first 8 records of the data frame.”

The web page of interest from Wikipedia

The entire task can be divided into six steps, which we will work through one by one. The first two have no order preference, so both are labelled as step 1.

STEP 1.0 : Getting the HTML page data using GET request

STEP 1.1 : Inspect the HTML page and make observations about the "data of interest": which tag it falls under, what its properties are, and how many similar elements are present in the page.

Once your investigation is complete, design a solution algorithm.

STEP 2 : Parse the HTML data to Create the soup object

STEP 3 : Filter the relevant information

STEP 4 : Create a CSV file using the information.

STEP 5 : Load the data as pandas data-frame and display

Let's do it.

Step 1.0 : Getting the HTML page data

Let's get the HTML page first:

Approach 1 : use requests library
Approach 2 : use urllib.request's urlopen() function
Approach 3 : download the page manually as a .html file and then read it

We will implement all three approaches. But before that, let's keep the link of the website, that is the base_url, in a variable.

base_url = "https://en.wikipedia.org/wiki/Android_version_history"

Approach 1 :

>> import requests
>> osVersionData = requests.get(base_url)
>> print(type(osVersionData))
o/p : <class 'requests.models.Response'>

So the data type is "requests.models.Response". Let's try to print it.

>> print(osVersionData)
o/p : <Response [200]>
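The 200 inside <Response [200]> is the HTTP status code. If you want to fail fast on a bad response before parsing anything, you can check it explicitly (an optional check, not part of the original walkthrough):

>> osVersionData.status_code
o/p : 200
>> osVersionData.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses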

Not very informative. To get the actual content, we can use the .text attribute of "requests.models.Response".

>> osVersionData = osVersionData.text
>> type(osVersionData)
o/p : str

So now it's a string. Let's print it:

osVersionData

Perfect, right? Let's try the second approach.

Approach 2 :

>> from urllib.request import urlopen
>> osVersionData = urlopen(base_url)

URL open (SSL certificate) error raised by the urlopen() call

This is a common issue for macOS users, where Python's SSL certificates are not set up. A quick workaround is to fall back to an unverified SSL context:

>> import ssl
>> ssl._create_default_https_context = ssl._create_unverified_context

Done, now we are good to go.

>> osVersionData = urlopen(base_url)
>> print(type(osVersionData))
o/p : <class 'http.client.HTTPResponse'>
>> osVersionData
o/p : <http.client.HTTPResponse at 0x7fae0ebc9a60>

So this is an "HTTPResponse" object, which you cannot read directly. All you have to do is call the .read() method on it.

>> osVersionData = osVersionData.read()

Now let's print it:

osVersionData
HTML page data as a bytes object

Now this is more like it: this is the raw HTML we are used to seeing. Time to pull in our friend "Beautiful Soup".
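One small aside: urlopen(...).read() returns a bytes object rather than a str. Beautiful Soup accepts both, but if you prefer a plain string like in Approach 1, you can decode it (an optional sketch; osVersionDataText is just an illustrative name):

>> osVersionDataText = osVersionData.decode('utf-8')
>> print(type(osVersionDataText))
o/p : <class 'str'>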

Approach 3 :

Save the entire page as a .html file in the local repo.
HTML file saved locally

So basically we saved it locally, and we can use this as our HTML document in Step 2.
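As a rough sketch of what Approach 3 looks like in code (the filename android_version_history.html is just an assumption about what you named the saved page):

with open('android_version_history.html', 'r', encoding='utf-8') as f:
    osVersionData = f.read()

print(type(osVersionData))
o/p : <class 'str'>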

Step 1.1 : Investigation

Inspect the HTML page and make observations about the "data of interest": which tag it falls under, what its properties are, and how many similar elements are present in the page.

From a few screenshots of the page source (below), we can conclude that:

1. All the table contents are under a <table> tag with class 'wikitable'.
2. Every entry in the table, i.e. each row, is placed under a <tr> tag, which stands for table row.
3. For the first row, the contents, which are the table headers, are placed under <th> tags.
4. For the rest of the rows, the contents are under <td> tags.

Let’s now design our solution approach as we have the knowledge of the page architecture.

Solution approach :

Step 1. We will find/select all the tables in the page and then select the first one. This is the table of interest for us.

Step 2. Then we will filter out the feature names/headers from the selected first table using the <th> tag { soupObject.find_all('th') }.

Step 3. We will also get all the rows from the same selected first table (this again includes the header row) using the <tr> tag { soupObject.find_all('tr') }.

Step 4. Then we will loop through all the rows, selecting each row.
Sub-step 4.1. For each row we will filter all of its column values using { row.find_all('td') }, loop through them, and add them to row_elements.

Step 5. Keep adding each row's elements to build the final solution (a list of lists).

We will implement these same solution steps in the coming sections.
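Before working through the steps in detail, here is a compact sketch of the scraping part of the plan (it ignores the merged-cell and CSV handling that we deal with later, and assumes osVersionData from Step 1):

from bs4 import BeautifulSoup

soupObj = BeautifulSoup(osVersionData, 'lxml')                         # Step 2: parse the page

firstTable = soupObj.find_all('table', {'class': 'wikitable'})[0]      # the first (relevant) table
features = [th.text.strip('\n') for th in firstTable.find_all('th')]   # headers via <th>

table_data = []
for row in firstTable.find_all('tr')[1:]:                              # skip the header row
    table_data.append([td.getText().strip('\n') for td in row.find_all('td')])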

Step 2 : Parse the HTML data to get the soup object

>> from bs4 import BeautifulSoup
>> soupObj = BeautifulSoup(osVersionData, 'lxml')
>> type(soupObj)
o/p : bs4.BeautifulSoup
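A quick note: the 'lxml' parser relies on the third-party lxml package. If it is not installed, Beautiful Soup's built-in Python parser works as a drop-in replacement:

soupObj = BeautifulSoup(osVersionData, 'html.parser')   # built-in parser, no extra package needed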

Let's print the soup object once.

Perfect! Let's now select all the <table> elements.

Step 3: Filter the information of interest.

>> tableRows = soupObj.find_all('table',{'class':'wikitable'})
>> len(tableRows)
o/p : 32

But we have to deal with just the first table.

>> firstTable = tableRows[0]
>> type(firstTable)
o/p : bs4.element.Tag

Let's see the object:

>> len(firstTable)
o/p : 2

Let's now get the useful data from this table.

>> features = firstTable.find_all('th',{})
>> len(features)
o/p : 7

And we know that the table has only 7 columns, so that looks right. Let's just print it.

We would like to get the contents as a list of feature names.

features = [x.text for x in features]

Now there is a newline character ('\n') at the end of each name. We can remove it by slicing, or by using strip().

newfeatures = [x.strip('\n') for x in features]
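As an aside, Beautiful Soup can do this cleanup for us: get_text(strip=True) removes surrounding whitespace (including the trailing '\n') in one go. An equivalent alternative, not the code used above:

newfeatures = [th.get_text(strip=True) for th in firstTable.find_all('th')]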

So now the feature names are properly extracted. What we will do next is:
(a) loop through the rest of the elements and get the data row by row, and
(b) start looping from the 2nd row, since the 1st row has already been used to extract the feature names.

>> rowData = firstTable.find_all('tr',{})
>> len(rowData)
o/p : 32
>> type(rowData)
o/p : bs4.element.ResultSet

Now let’s capture the entire data as list of lists.

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})
    for data in celldata:
        row_data.append(data.getText())

    table_data.append(row_data)

As you can see, a lot of cleaning is required. Every cell ends with a '\n', so let's strip it.

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})
    for data in celldata:
        row_data.append(data.getText().strip('\n'))

    table_data.append(row_data)

As you may have observed, not every row has the same length. The reason is that some rows in the original table have merged cells (rowspan), so the first two columns are not repeated.

In scenarios like this, the first entry has length 7 and the following entries have length 5, so let's handle it:

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})

    # merged (rowspan) cells: reuse the first two values of the previous row
    if len(celldata) == 5:
        row_data.append(table_data[-1][0])
        row_data.append(table_data[-1][1])

    for data in celldata:
        row_data.append(data.getText().strip('\n'))

    table_data.append(row_data)

So now we have a list of lists, and we have the features.
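An optional sanity check that the merged-row handling worked: every row should now have as many values as there are feature names.

print(len(features))                          # 7 feature names
print(set(len(row) for row in table_data))    # should be {7} if every short row was fixed up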

Step 4 : Save the data in a CSV file

with open('mydata.csv','w') as f:
    featureString = ','.join(features)

    f.write(featureString)

    for data in table_data:
        rowString = ','.join(data)
        f.write(rowString)
        break    # stops after writing the first data row

Now the problem is that the rows are not being written on new lines in the CSV, so let's append a '\n' after the header and after each row:

with open('mydata.csv','w') as f:
    featureString = ','.join(features)
    featureString = featureString + '\n'
    f.write(featureString)

    for data in table_data:
        rowString = ','.join(data)
        rowString = rowString + '\n'
        f.write(rowString)

with open('mydata.csv','r') as f:
    print(f.readlines())

Step 5 : Load the data as pandas dataframe and display.

import pandas as pd
df = pd.read_csv('mydata.csv')
df.head()

Do you see the problem? The name column has become the index. The root cause is that some dates contain a "," and the CSV reader treats it as a value separator. So let's take care of it while writing the CSV.

with open('mydata.csv','w') as f:
    featureString = ','.join(features)
    featureString = featureString + '\n'

    f.write(featureString)

    for data in table_data:
        rowString = ','.join([x.replace(',','') for x in data])
        rowString = rowString + '\n'
        f.write(rowString)
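As an aside, Python's built-in csv module quotes any field that contains a comma, so with it you would not need to strip the commas from the dates at all. A sketch of that alternative (not the approach used above):

import csv

with open('mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(newfeatures)    # header row
    writer.writerows(table_data)    # data rows; fields containing ',' are quoted automatically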

Looks fine now, so let's just load it as a data-frame.

import pandas as pd
df = pd.read_csv('mydata.csv')
df.head(8)

Perfect, right?

The complete code repo :

https://github.com/chatterjeesrijeet/Web-Scraping-with-Python

In the future I am planning to come up with a few more articles of 4-5 small tasks each, all related to web scraping; I will share the links in this article in case I do.

In the same git repo you can find a file named part_4_web_scrapping_major_assignment_2, which has an unsolved problem that you can try on your own. Remember, you are your own evaluator, so no cheating!


Srijeet Chatterjee

Cognitive Data Scientist @ IBM | IIT-Delhi alumnus | Machine Learning, Deep Learning and AI enthusiast.