Step-by-step guide to building a web scraper with Selenium

Yuhao Cooper
Published in The Startup
7 min read · Jun 24, 2020


Photo by Safar Safarov on Unsplash

Introduction:

This is a step-by-step introduction to building a working web scraper with Selenium. There are other Python web scraping libraries, such as BeautifulSoup and Scrapy, and I encourage you to try them out too, as they are both much faster than Selenium. However, I’m focusing on Selenium here because I found that many websites block requests from those libraries or from the urlfetch module, preventing the use of either of those tools.

What is Selenium

From Selenium’s front page

In short, Selenium is a tool that lets you program a browser. And as it turns out, when you have that ability, you can use it to automate the extraction of HTML DOM element data (aka web scraping). Thus, when you use Selenium, you will see it open up a normal browser on your screen to carry out the programmed actions. This is the reason why Selenium is much slower than BeautifulSoup and Scrapy (it needs to render all the HTML/CSS/JS code on a website rather than just extracting data directly from the servers), but it also allows us to get past some websites’ blocks against web scraping tools (since we’re accessing the websites like a normal user).

Steps to building your Web Scraper

There are three general steps to building your web scraper with Selenium, and I’ve broken them down below:

1. Downloading python/pip/selenium

If you have never used Python or pip before, this step is for you; otherwise, feel free to skip ahead to the parts relevant to you.

To use Selenium, you will need to install Python on your computer first. You can do that simply by heading to the Python website here and downloading and installing it. I recommend downloading the latest 3.7.x version (3.7.7) of Python instead of 3.8.x, as there are some packages which have yet to be updated for 3.8.x and thus become a source of issues. The installation is self-explanatory, but it is important to add Python to PATH during installation so that you can call the program easily and don’t have to deal with PATH issues.

Add Python to PATH during installation and install including IDLE, pip and documentation

Once that is done, pip (the package installer for python), which you will use to install Selenium, will also be installed on your computer. The next thing will be to install Selenium itself. There are a couple of ways to do it, but the easiest way is to simply use pip to install it. To do that, you will need to open your command prompt (search for cmd in Windows, or terminal on Mac), and type:

pip install selenium
You should see this upon a successful install of Selenium

We also have to download ChromeDriver (if you are using Chrome). If your default browser is another browser such as Firefox, you will have to download its webdriver instead. You can download ChromeDriver here. There is no installation involved, but you will need to extract the file and put it in a known folder, such as:

C:\Program Files (x86)\chromedriver.exe

And that’s it for the files we need to install. We’re ready to move on to step 2 and writing the code for our web scraper.

2. Boilerplate code for selenium

We’ll need to start the code with some boilerplate and import some of the tools we’ll use from the Selenium package. You can copy this code into your Python file.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

Next, we’ll need to set the PATH for the webdriver so that Selenium can use it. If you have extracted the webdriver to the same location as I did above, you can just copy this code.

PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are kept literal
driver = webdriver.Chrome(PATH)

And lastly, you use the driver to open up and manipulate the web page you want to scrape. You can replace ‘URL_Link’ with the link of the web page you are trying to scrape.

driver.get('URL_Link')

3. Scraping

Now that we’ve installed the required tools and packages, and set up the driver and boilerplate code for Selenium, we’re finally ready to start scraping data from the ‘URL_Link’ that we want.

But before we dive into the code, I want to give a brief introduction to the key principle of web scraping, which is based on the fundamental structure in which a web page is displayed. This refers to the Document Object Model (DOM), which represents the underlying XML or HTML document; data is usually stored as part of an HTML element, and we can extract the required data by targeting the specific element that holds it. I highly encourage you to read more about this on your own if you’re interested.
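To make this concrete, here is a minimal standard-library sketch (illustrative only, not part of the Selenium workflow) of how data lives inside an HTML element and can be pulled out once you know which element to target. The fragment mirrors the structure we will meet on books.toscrape.com, where each book title is stored in the "title" attribute of an anchor tag:

```python
from html.parser import HTMLParser

# A fragment shaped like the book listings on books.toscrape.com:
# the title is stored in the "title" attribute of the <a> tag.
FRAGMENT = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
</article>
"""

class TitleCollector(HTMLParser):
    """Collects the 'title' attribute of every <a> element it encounters."""
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "title":
                    self.titles.append(value)

parser = TitleCollector()
parser.feed(FRAGMENT)
print(parser.titles)  # ['A Light in the Attic']
```

Selenium gives us the same kind of access to elements and their attributes, except on the live, fully rendered page.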

Ok, so onto the example — We’ll be using books.toscrape.com in our example here. To get started, we simply need to direct our Selenium driver towards the URL by replacing ‘URL_Link’ above with the exact URL.

driver.get('http://books.toscrape.com/')

The next thing we want to do is scrape some data from the page. In this case, we’ll be scraping the titles. To do that, we’ll need to head to books.toscrape.com first and find the DOM element that holds the title data. We do this by pressing `F12` on the website to bring up the Chrome console and filtering down to the exact element that holds the titles. The console helps you by highlighting the element that you are currently choosing. You can see how this works in the gif below.

What searching for the specific element looks like

With the above, we found that the titles are held by the elements matching “article.product_pod > h3 > a”, and that the button to click to the next page is an element with a link text of “next”. These are the two things we need to find manually on the website in order to program our script.

Now, to get all the titles on books.toscrape.com, we want our script to first load the site, then extract all the titles of the books on the current page, before moving to the next page and repeating this process until the last page. To do this, we need two parts to the script. The first part is a function that extracts all the titles from the page by using the Selenium driver to find all the elements containing the titles, then looping through them and adding each one to a .txt file.

filename = "book_titles.txt"  # File to store the scraped data

# Get all the titles on the current page
def title_extraction():
    titles = driver.find_elements_by_css_selector('article.product_pod > h3 > a')
    with open(filename, 'a+') as f:
        for i in titles:
            title = i.get_attribute("title")
            f.write(title + "\n")

And the second part runs the title_extraction function on every page up to the last page, so that we extract all the data in the list. We know we have reached the last page because there is no “next” button on it. We handle this by setting up a timeout to stop the script when it is unable to find and click the ‘next’ button within 5 seconds, which will only happen on the last page.

# Loop through all the pages
while True:
    title_extraction()
    try:
        WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.LINK_TEXT, 'next'))
        ).click()
    except TimeoutException:
        print('No more pages available')
        driver.quit()
        break

And that’s it! We have created a web scraper that scrapes all the book titles from books.toscrape.com and puts them in a .txt file called “book_titles.txt”.

Here’s the full code:

Next steps:

We have only gone through the steps of setting up your first web scraping script and the basic foundations of how it works, and we have used just a couple of the functions that web scraping tools like Selenium and Scrapy offer for scraping more complicated websites.

I hope this guide helps you avoid the frustrations I faced when building my first web scraper, and accelerates your own learning in web scraping and data analysis!

Thank you for reading.

Bonus:

If you are trying to scrape a website with infinite scrolling, you can use a scroll_down() function with Selenium to scroll to the end of the page before scraping the required data.
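The original scroll_down() snippet was an embedded gist that is missing here; a common pattern for this (the sketch below is my reconstruction, not the author’s exact code) is to keep scrolling with execute_script until the page height stops growing:

```python
import time

def scroll_down(driver, pause=2.0):
    """Scroll to the bottom of an infinitely scrolling page.

    Repeatedly scrolls to the current bottom, waits for new content
    to load, and stops once the page height no longer grows.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared, so we are at the end
        last_height = new_height
```

The `pause` value may need tuning for slower sites; if new content takes longer than the pause to load, the function will stop early.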

