Web Scraping With Python — Part 1
Understanding and Building Your First Web Scraper
Web scraping is wizardry, and you will gain its foundational power once you go through the content below.
My goal in writing this article is to get anyone started with the basics of web scraping. Web scraping with Python is definitely a useful skill to have. Let's say you find data on the web and there is no direct way to download it; you can extract that data into a useful form by scraping the web with Python.
Tools setup:
First, I recommend installing the Anaconda Python distribution (download Anaconda for Python 3.7), which is available from the following link: https://www.anaconda.com/download/
Next, run the code using Jupyter Notebook.
Jupyter notebook installation steps:
# Python 3: open a command prompt and type the following
pip3 install jupyter
# Once you have installed it, use the following command to open it
jupyter notebook
# Once you run the command, the application will open in your web browser at
http://localhost:8888
# On the menu bar, you will see options to create folders and files.
Goal:
Remember, our goal is to scrape the results table on this website and store it in a csv for easier manipulation with Python: http://www.hubertiming.com/results/2018Resolution
So, what does scraping the results table mean? We will be doing the following steps.
- Retrieving HTML data from the specified website
- Parsing that data for target information
- Storing the target information
As a starter, we are only focusing on building a scraper to download the data from the table. There are more complex real-world cases where you need to automate user interactions on a website before downloading the required data. Fortunately, we can accomplish those with the Selenium package in Python.
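For reference, here is a minimal Selenium sketch of that more advanced case. It is only an illustration, not part of this tutorial, and it assumes you have installed the selenium package and have a Chrome driver available; it simply opens the page in a real browser and hands the rendered html back to you.
from selenium import webdriver

# Launch a browser session (assumes the selenium package and a Chrome driver are installed)
driver = webdriver.Chrome()
driver.get("http://www.hubertiming.com/results/2018Resolution")

# page_source holds the rendered html, which can then be handed to Beautiful Soup
html = driver.page_source
driver.quit()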
Now you are set to start writing your first web scraper.
Step 1: Import the necessary libraries and get the html of the page.
- Use the urllib.request module to open URLs.
- The Beautiful Soup package is used to extract data from html files. The Beautiful Soup library's import name is bs4, which stands for Beautiful Soup, version 4.
from urllib.request import urlopen
from bs4 import BeautifulSoup
After importing the necessary modules, we can specify the URL containing the data and pass it to urlopen() to get the html of the page.
html = urlopen("http://www.hubertiming.com/results/2018Resolution")
Step 2: Create a Beautiful Soup object from the html. The Beautiful Soup package takes the raw html text and breaks it into Python objects (parsing the data). It helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects.
bsobj = BeautifulSoup(html, 'lxml')
print(bsobj)
There are two things to note in the code above: 1) we are converting the html data into a Beautiful Soup object, and 2) lxml is a very useful XML/HTML processing library that Beautiful Soup uses here as its parser.
output:
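A side note: lxml is a separate library (it ships with Anaconda). If it is not available in your environment, one option is to fall back to the parser that comes with Python itself.
# Fallback if lxml is not installed: use the parser that ships with Python
bsobj = BeautifulSoup(html, 'html.parser')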
Step 3: Store the target information. Our target is to take each row of data and store it in a csv file.
- Retrieve the table rows
table_rows = bsobj.find_all('tr')
print(table_rows)
output: here we have all the table rows <tr>
- But what we need is all the table rows in a list, so that we can convert the list into a data frame.
for row in table_rows:
    each_row = row.find_all('td')
    print(each_row)
output: The output below shows that each row is printed with html tags.
- The above is not what you want. You want the text without the html tags. You can remove the html tags using Beautiful Soup.
Note: You are doing two things here: 1) removing the html tags with Beautiful Soup and extracting only the text, and 2) creating an empty list and appending each row of text to it.
lists_of_rows = []
for row in table_rows:
    each_row = row.find_all('td')
    # convert the list of <td> tags to a string, then re-parse it and keep only the text
    str_row = str(each_row)
    row_text = BeautifulSoup(str_row, "lxml").get_text()
    lists_of_rows.append(row_text)
print(lists_of_rows)
output:
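As an aside, the str()-and-reparse trick above works, but Beautiful Soup can also return the text of each cell directly. Here is an alternative sketch of the same loop; it produces a list of cell values per row instead of one long string, which means each value lands in its own column when you later build the data frame.
lists_of_rows = []
for row in table_rows:
    # get_text() on each <td> cell returns its text with the html tags already stripped
    cells = row.find_all('td')
    lists_of_rows.append([cell.get_text() for cell in cells])
print(lists_of_rows)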
Step 4: The next step is to convert the list into a data frame and view the rows using Pandas.
- import the required libraries
import pandas as pd
import numpy as np
# You can import matplotlib if you plan to create visualizations, but the goal of this article is just to scrape the data from the website and convert it into a dataframe.
- convert the list created in step 3 into a data frame
data = pd.DataFrame(lists_of_rows[5:])
print(data)
output:
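One more optional step: the column names live in <th> cells, so they are not part of the <td> rows we collected. If you want named columns, a sketch like the following can help; it assumes the <th> cells on the page line up with the columns of your results table, which you should verify by printing them first.
# Collect the header cells; check that they match the results table before using them
header_cells = bsobj.find_all('th')
column_names = [cell.get_text() for cell in header_cells]
print(column_names)
If they do line up, and you built lists_of_rows as lists of cell values, you can pass them as the columns= argument when creating the data frame.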
- The output is saved to a csv, so we can clean and manipulate the data whenever we want. Note: create a csv at the specified path so that we can write the data retrieved from the website to it.
data.to_csv(r'filepath\Book1.csv', encoding='utf-8', index=False)
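Once the csv exists, loading it back into Pandas for cleaning and plotting is straightforward; using the same placeholder path as above:
# Read the saved results back into a data frame for later cleaning and analysis
results = pd.read_csv(r'filepath\Book1.csv')
print(results.head())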
Conclusion
Great start! You performed web scraping using Python. We used the Beautiful Soup library to parse html data and convert it into a form that can be used for analysis. With the data stored in a csv, you can perform data cleaning and create useful plots to extract insights.
Before you go, you might ask: why would we want to use a web scraper if APIs are available?
In general, it is preferable to use an API (if one exists) rather than build a scraper to get the same data. However, there are several reasons why an API might not exist:
• You are gathering data across a collection of sites that do not have a cohesive API.
• The data you want is a fairly small, finite set that the webmaster did not think warranted an API.
• The source does not have the infrastructure or technical ability to create an API.