Data Acquisition Using Web Scraping, Web Crawlers and APIs (Part 1)

Aryan Chugh
Published in Analytics Vidhya
Jun 11, 2020 · 4 min read

Introduction

This article covers the basic techniques of scraping data from the web using different methods, such as crawlers and libraries like BeautifulSoup, urllib and requests, to acquire and parse data efficiently.

All the code is provided in a GitHub repository; please click here to see it.

Web Scraping Using BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

In this exercise, we will extract the Android version history table (the first table) from the page: https://en.wikipedia.org/wiki/Android_version_history

Getting HTML Page data using urllib

from urllib.request import urlopen

android_url = "https://en.wikipedia.org/wiki/Android_version_history"

# android_data is an HTTP response object
android_data = urlopen(android_url)

# we can get the whole HTML of the page by calling the read() method on the response object
android_html = android_data.read()

android_data.close()

We use urlopen(url) to send an HTTP request to the given URL and get back a response object, from which we extract the page HTML with the read() method. Remember to close the connection at the end.
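As an aside, the requests library mentioned in the introduction does the same job with a slightly friendlier interface. Here is a minimal sketch (the rest of this article keeps using the urllib result above):

import requests

android_url = "https://en.wikipedia.org/wiki/Android_version_history"

# requests.get() sends the HTTP request and returns a Response object
response = requests.get(android_url)
print(response.status_code)  # 200 means the request succeeded

# response.text holds the decoded page HTML as a string
android_html = response.text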

Now we will use the BeautifulSoup library to parse the HTML we have acquired.

from bs4 import BeautifulSoup as soup
android_soup = soup(android_html, 'html.parser')
tables = android_soup.findAll('table', {'class':'wikitable'})
print("Number of Tables: {}".format(len(tables)))

Output:

Number of Tables: 31

We first have to make a soup object from the HTML and specify the parser as “html.parser”. We then use the findAll() method to find a specific element, passing any further search attributes as the second parameter in the form of a dictionary; in our case we want tables that belong to the class “wikitable”.

We can use the browser’s built-in Inspect Element tool to find the element’s tag and any attribute to search by, such as a class name or a specific id, and pass them as a dictionary.

Example of findAll():

We will try to find all the elements with “h1” tag:

a = android_soup.findAll('h1', {})
# it returns a ResultSet of matching objects
print(a)
print(type(a))
print(len(a))

Output:

[<h1 class="firstHeading" id="firstHeading" lang="en">Android version history</h1>]
<class 'bs4.element.ResultSet'>
1

We get only one result from our search for h1 tags, since this webpage contains a single h1 tag: the heading of the page.
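Since the output above shows that this h1 carries the id “firstHeading”, we could just as well locate it by passing that attribute in the search dictionary. A short illustrative snippet:

heading = android_soup.findAll('h1', {'id': 'firstHeading'})
print(heading[0].text)

Output:

Android version history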

Before going forward, let’s have a look at the table we are trying to scrape.

Android version history table, from the first version to the 11th

As we can see, the first two rows cannot be used because no codename is given to those Android versions.

Similarly, the last row is unusable because it contains missing values, and we can also drop the last column (References), as it doesn’t contain any useful information.

We will take the first table and extract the column headers from it:

android_table = tables[0]
headers = android_table.findAll('th')
# We will use these headers as column names, slicing off the trailing newline ('\n') from each one
column_titles = [ct.text[:-1] for ct in headers]
# We won't be using the last column ('References'), so we remove it from our column names:
column_titles = column_titles[:-1]
print(len(column_titles))
print(column_titles)

Output:

4
['Name', 'Version number(s)', 'Initial release date', 'API level']
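Slicing with [:-1] works because every heading ends with a newline. A slightly more robust alternative (not the one used in the rest of this article) is to let Beautiful Soup strip the surrounding whitespace for us:

# get_text(strip=True) removes leading and trailing whitespace, including the trailing '\n'
column_titles = [ct.get_text(strip=True) for ct in headers][:-1]
print(column_titles)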

We will now get the rows of the table and keep only the ones we can use; the cleaned data will later be stored in a table_rows variable:

rows_data = android_table.findAll('tr')[1:]
print("Total number of rows: {}".format(len(rows_data)))
# We will start with the third row as the first two rows have no name for the software versions
rows_data = rows_data[2:]
print("Number of rows we are going to display: {}".format(len(rows_data)))

Output:

Total number of rows: 18
Number of rows we are going to display: 16

Final code snippet to bring everything together:

table_rows = []

for row_id, row in enumerate(rows_data):

    # Skip the last row as it contains missing values
    if row_id == 15:
        continue

    current_row = []
    row_data = row.findAll('td', {})

    for col_id, data in enumerate(row_data):

        # Skip the last column (References) as it does not contain any useful information
        if col_id == 4:
            continue

        # Replace any commas (',') since we will store the data in a CSV file,
        # and slice off the trailing newline ('\n') that every element contains
        text = data.text
        text = text.replace(",", "")[:-1]
        current_row.append(text)

    table_rows.append(current_row)

The variable table_rows is a nested list containing the data of all the rows. We can save this data in a CSV file and later display it as a data frame (using pandas):

import pandas as pd

# Save the scraped rows to a CSV file
pd.DataFrame(table_rows, columns=column_titles).to_csv("Data/android_version_history_pandas.csv", index=False)

# Read the CSV file back and display the data
data_1 = pd.read_csv("Data/android_version_history_pandas.csv")
data_1.head(10)

The final result looks like this:

Output of the CSV file of the data scraped from the web
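If you would rather not depend on pandas for the writing step, the same nested list can be saved with Python’s built-in csv module. A minimal sketch (the file name here is illustrative, and the Data folder is assumed to exist):

import csv

with open("Data/android_version_history.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(column_titles)  # header row
    writer.writerows(table_rows)    # one row per Android version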

Conclusion

This article shows the basics of using Beautiful Soup as an HTML parser to clean and retrieve the information we need.

In the second part of this article, I will discuss how to retrieve data from APIs: “Data Acquisition Using Web Scraping, Web Crawlers and APIs (Part 2)”.
