Concepts of Web Scraping with Python, Requests & Beautiful Soup — Part 2

Srijeet Chatterjee
9 min read · Sep 4, 2021


Now that we have gone through all the necessary concepts, it is time to get our hands dirty by solving an end-to-end web scraping task.

The best way to learn a programming language is to code in it; similarly, the best way to learn a tool is to use it to solve tasks until you master it!

We will solve one such end-to-end task in this article, with a live coding walkthrough so that you understand the whole story. For any additional questions, you can always use the comments section.

Task 1: “Go to the page https://en.wikipedia.org/wiki/Android_version_history and scrape the table content present there. Put the content, in an organised way, into a CSV file. Read the CSV file as a pandas data-frame and display the first 8 records of the data frame.”

The web page of interest from Wikipedia

The entire task can be divided into six steps, which we will work through one by one. The first two have no order preference, so both are labelled as step 1.

STEP 1.0 : Getting the HTML page data using GET request

STEP 1.1 : Inspect the HTML page and make observations about the "data of interest": which tag it falls under, what its properties are, and how many similar elements are present in the page.

Once your investigation is complete, design a solution algorithm.

STEP 2 : Parse the HTML data to Create the soup object

STEP 3 : Filter the relevant information

STEP 4 : Create a CSV file using the information.

STEP 5 : Load the data as pandas data-frame and display

Let's do it.

Step 1.0 : Getting the HTML page data

Let's get the HTML page first:

Approach 1 : use requests library
Approach 2 : use urllib.request's urlopen() function
Approach 3 : download the page manually as a .html file and then read it

We will implement all three approaches. But before that, let's keep the link of the website, that is the base_url, in a variable.

base_url = "https://en.wikipedia.org/wiki/Android_version_history"

Approach 1 :

>> import requests
>> osVersionData = requests.get(base_url)
>> print(type(osVersionData))
o/p : <class 'requests.models.Response'>

So the data type is "requests.models.Response". Let's try to print it.

>> print(osVersionData)
o/p : <Response [200]>
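The 200 inside <Response [200]> is the HTTP status code. If you want to fail fast on a bad response before parsing anything, you can check it explicitly (an optional check, not part of the original walkthrough):

>> osVersionData.status_code
o/p : 200
>> osVersionData.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses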

Not very informative. To get the actual content, we can use the .text attribute of "requests.models.Response".

>> osVersionData = osVersionData.text
>> type(osVersionData)
o/p : str

So now it's a string. Let's print it:

osVersionData

Perfect, right? Let's try the second approach.

Approach 2 :

>> from urllib.request import urlopen
>> osVersionData = urlopen(base_url)

URL open (SSL certificate) error raised by the urlopen() call

This is a common issue for macOS users, where Python's SSL certificates are not set up. A quick workaround is to fall back to an unverified SSL context:

>> import ssl
>> ssl._create_default_https_context = ssl._create_unverified_context

Done, now we are good to go.

>> osVersionData = urlopen(base_url)
>> print(type(osVersionData))
o/p : <class 'http.client.HTTPResponse'>
>> osVersionData
o/p : <http.client.HTTPResponse at 0x7fae0ebc9a60>

So this is an "HTTPResponse" object, which you cannot read directly. All you have to do is call the .read() method on it.

>> osVersionData = osVersionData.read()

Now let's print it:

osVersionData
HTML page data as a bytes object

Now this is more like it: this is the raw HTML we are used to seeing. Time to pull in our friend "Beautiful Soup".
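One small aside: urlopen(...).read() returns a bytes object rather than a str. Beautiful Soup accepts both, but if you prefer a plain string like in Approach 1, you can decode it (an optional sketch; osVersionDataText is just an illustrative name):

>> osVersionDataText = osVersionData.decode('utf-8')
>> print(type(osVersionDataText))
o/p : <class 'str'>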

Approach 3 :

Save the entire page as a .html file in the local repo.
HTML file saved locally

So basically we saved it locally, and we can use this as our HTML document in Step 2.
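As a rough sketch of what Approach 3 looks like in code (the filename android_version_history.html is just an assumption about what you named the saved page):

with open('android_version_history.html', 'r', encoding='utf-8') as f:
    osVersionData = f.read()

print(type(osVersionData))
o/p : <class 'str'>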

Step 1.1 : Investigation

Inspect the HTML page and make observations about the "data of interest": which tag it falls under, what its properties are, and how many similar elements are present in the page.

From a few screenshots of the page source (below), we can conclude that:

1. All the table contents are under a <table> tag with class 'wikitable'.
2. Every entry in the table, i.e. each row, is placed under a <tr> tag, which stands for table row.
3. For the first row, the contents, which are the table headers, are placed under <th> tags.
4. For the rest of the rows, the contents are under <td> tags.

Let’s now design our solution approach as we have the knowledge of the page architecture.

Solution approach :

Step 1. We will find/select all the tables in the page and then select the first one. This is the table of interest for us.

Step 2. Then we will filter out the feature names/headers from the selected first table using the <th> tag { soupObject.find_all('th') }.

Step 3. We will also get all the rows from the same selected first table (this again includes the header row) using the <tr> tag { soupObject.find_all('tr') }.

Step 4. Then we will loop through all the rows, selecting each row.
Sub-step 4.1. For each row we will filter all of its column values using { row.find_all('td') }, loop through them, and add them to row_elements.

Step 5. Keep adding each row's elements to build the final solution (a list of lists).

We will implement these same solution steps in the coming sections.
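Before working through the steps in detail, here is a compact sketch of the scraping part of the plan (it ignores the merged-cell and CSV handling that we deal with later, and assumes osVersionData from Step 1):

from bs4 import BeautifulSoup

soupObj = BeautifulSoup(osVersionData, 'lxml')                         # Step 2: parse the page

firstTable = soupObj.find_all('table', {'class': 'wikitable'})[0]      # the first (relevant) table
features = [th.text.strip('\n') for th in firstTable.find_all('th')]   # headers via <th>

table_data = []
for row in firstTable.find_all('tr')[1:]:                              # skip the header row
    table_data.append([td.getText().strip('\n') for td in row.find_all('td')])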

Step 2 : Parse the HTML data to get the soup object

>> from bs4 import BeautifulSoup
>> soupObj = BeautifulSoup(osVersionData, 'lxml')
>> type(soupObj)
o/p : bs4.BeautifulSoup
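A quick note: the 'lxml' parser relies on the third-party lxml package. If it is not installed, Beautiful Soup's built-in Python parser works as a drop-in replacement:

soupObj = BeautifulSoup(osVersionData, 'html.parser')   # built-in parser, no extra package needed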

Let's print the soup object once.

Perfect! Let's now select all the <table> elements.

Step 3: Filter the information of interest.

>> tableRows = soupObj.find_all('table',{'class':'wikitable'})
>> len(tableRows)
o/p : 32

But we have to deal with just the first table.

>> firstTable = tableRows[0]
>> type(firstTable)
o/p : bs4.element.Tag

Let's see the object:

>> len(firstTable)
o/p : 2

Let's now get the useful data from this table.

>> features = firstTable.find_all('th',{})
>> len(features)
o/p : 7

And we know that the table has only 7 columns, so that looks right. Let's just print it.

We would like to get the contents as a list of feature names.

features = [x.text for x in features]

Now there is a newline character ('\n') at the end of each name. We can remove it by slicing, or by using strip().

newfeatures = [x.strip('\n') for x in features]
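As an aside, Beautiful Soup can do this cleanup for us: get_text(strip=True) removes surrounding whitespace (including the trailing '\n') in one go. An equivalent alternative, not the code used above:

newfeatures = [th.get_text(strip=True) for th in firstTable.find_all('th')]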

So now the feature names are properly extracted. What we will do next is:
(a) loop through the rest of the elements and get the data row by row, and
(b) start looping from the 2nd row, since the 1st row has already been used to extract the feature names.

>> rowData = firstTable.find_all('tr',{})
>> len(rowData)
o/p : 32
>> type(rowData)
o/p : bs4.element.ResultSet

Now let’s capture the entire data as list of lists.

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})
    for data in celldata:
        row_data.append(data.getText())

    table_data.append(row_data)

As you can see, a lot of cleaning is required. Every cell ends with a '\n', so let's strip it.

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})
    for data in celldata:
        row_data.append(data.getText().strip('\n'))

    table_data.append(row_data)

As you may have observed, not every row has the same length. The reason is that some rows in the original table have merged cells (rowspan), so the first two columns are not repeated.

In scenarios like this, the first entry has length 7 and the following entries have length 5, so let's handle it:

table_data = []
for ele in rowData[1:]:
    row_data = []

    celldata = ele.find_all('td',{})

    # merged (rowspan) cells: reuse the first two values of the previous row
    if len(celldata) == 5:
        row_data.append(table_data[-1][0])
        row_data.append(table_data[-1][1])

    for data in celldata:
        row_data.append(data.getText().strip('\n'))

    table_data.append(row_data)

So now we have a list of lists, and we have the features.
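An optional sanity check that the merged-row handling worked: every row should now have as many values as there are feature names.

print(len(features))                          # 7 feature names
print(set(len(row) for row in table_data))    # should be {7} if every short row was fixed up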

Step 4 : Save the data in a CSV file

with open('mydata.csv','w') as f:
    featureString = ','.join(features)

    f.write(featureString)

    for data in table_data:
        rowString = ','.join(data)
        f.write(rowString)
        break    # stops after writing the first data row

Now the problem is that the rows are not being written on new lines in the CSV, so let's append a '\n' after the header and after each row:

with open('mydata.csv','w') as f:
    featureString = ','.join(features)
    featureString = featureString + '\n'
    f.write(featureString)

    for data in table_data:
        rowString = ','.join(data)
        rowString = rowString + '\n'
        f.write(rowString)

with open('mydata.csv','r') as f:
    print(f.readlines())

Step 5 : Load the data as pandas dataframe and display.

import pandas as pd
df = pd.read_csv('mydata.csv')
df.head()

Do you see the problem? The name column has become the index. The root cause is that some dates contain a "," and the CSV reader treats it as a value separator. So let's take care of it while writing the CSV.

with open('mydata.csv','w') as f:
    featureString = ','.join(features)
    featureString = featureString + '\n'

    f.write(featureString)

    for data in table_data:
        rowString = ','.join([x.replace(',','') for x in data])
        rowString = rowString + '\n'
        f.write(rowString)
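As an aside, Python's built-in csv module quotes any field that contains a comma, so with it you would not need to strip the commas from the dates at all. A sketch of that alternative (not the approach used above):

import csv

with open('mydata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(newfeatures)    # header row
    writer.writerows(table_data)    # data rows; fields containing ',' are quoted automatically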

Looks fine now, so let's just load it as a data-frame.

import pandas as pd
df = pd.read_csv('mydata.csv')
df.head(8)

Perfect, right?

The complete code repo :

https://github.com/chatterjeesrijeet/Web-Scraping-with-Python

In the future I am planning to come up with a few more articles of 4-5 small tasks each, all related to web scraping; I will share the links in this article in case I do.

In the same git repo you can find a file named part_4_web_scrapping_major_assignment_2, which has an unsolved problem that you can try on your own. Remember, you are your own evaluator, so no cheating!


Srijeet Chatterjee

Cognitive Data Scientist @ IBM | IIT-Delhi alumnus | Machine Learning, Deep Learning and AI enthusiast.