Web Scraping with Selenium in Python

Robert Proner · Published in CodeX · Jul 1, 2021

Often, data is publicly available to us, but not in a form that is readily useable. That is where web scraping comes in. Web scraping is the process of extracting data from a website. We can use web scraping to get our desired data into a convenient format that can then be used. In this tutorial, I will show how you can extract information of interest from a website using the selenium package in Python. Selenium is extremely powerful. It allows us to drive a browser window and interact with the website programmatically. Selenium also has several methods which make extracting data very easy.

In this tutorial I will be developing in a Jupyter Notebook using Python 3 on Windows 10.

Firstly, we will need to download a driver. In this tutorial, I will use ChromeDriver for Google Chrome. For a full list of supported drivers and platforms, refer to https://www.selenium.dev/downloads/. If you want to use Google Chrome, head over to https://chromedriver.chromium.org/ and download the driver that corresponds to your current version of Google Chrome.

Once you’ve installed the driver, you can begin writing the code. Let’s start by importing the libraries that we will be using:

from selenium import webdriver
import urllib3
import re
import time
import pandas as pd

Now that we’ve got our libraries imported, we need to initialize our Chrome webdriver object. You’ll need to specify the path to your driver:

# Create driver object. Opens browser window.
path = "C:/Users/Robpr/OneDrive/Documents/chromedriver.exe"
driver = webdriver.Chrome(executable_path=path)
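Note: if you're running Selenium 4 or later, the executable_path argument shown above is deprecated in favour of a Service object. A minimal sketch of the equivalent setup (the path here is just a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point the Service at your chromedriver executable (placeholder path).
service = Service("C:/path/to/chromedriver.exe")
driver = webdriver.Chrome(service=service)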

You should see a blank Chrome window appear, as shown below.

Figure 1: A Chrome browser opened by selenium.

Next we’ll need to navigate to our site of interest. Recently, I’ve been doing some work scraping insolvencyinsider.ca for filing data, so I will use that. Navigate to https://insolvencyinsider.ca/filing/ with the get() method:

# Navigates browser to insolvency insider.
driver.get("https://insolvencyinsider.ca/filing/")

You should see your browser navigate to Insolvency Insider.

Figure 2: Navigating the browser to insolvencyinsider.ca/filing/

If you scroll down to the bottom of the page, you’ll notice a pesky “Load more” button. Without selenium, we would be limited to the first page of data.

Selenium provides several methods for locating elements on the webpage. We'll use the find_element_by_xpath() method to create a button object that we can then interact with:

# Creates "load more" button object.
loadMore = driver.find_element_by_xpath(xpath="/html/body/div[2]/div/main/div/div/div/button")
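A quick aside: newer Selenium releases deprecate the find_element_by_* helpers in favour of the generic find_element method with a By locator. If your version complains, the same lookup looks roughly like this:

from selenium.webdriver.common.by import By

# Equivalent lookup using the newer locator style.
loadMore = driver.find_element(By.XPATH, "/html/body/div[2]/div/main/div/div/div/button")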

Before we go any further, we'll need to know how many pages there are so we know how many times we need to click the button. We'll need a way of extracting the website's source code. Luckily, this process is relatively pain-free with the urllib3 and re libraries.

url = "https://insolvencyinsider.ca/filing/"
http = urllib3.PoolManager()
r = http.request("GET", url)
text = str(r.data)

text is now a string. Now, we need a way of extracting total_pages from our text string. Print text to see how we can extract this using RegEx with the re package. We can extract total_pages like so:

totalPagesObj = re.search(pattern=r'"total_pages":\d+', string=text)
totalPagesStr = totalPagesObj.group(0)
totalPages = int((re.search(pattern=r"\d+", string=totalPagesStr)).group(0))

The search method takes a pattern and a string. In this case our pattern is '"total_pages":\d+' . If you’re not familiar with RegEx, all this means is that we are looking for the string "total_pages": with one or more digits after the colon. The \d refers to a digit between 0 and 9, while the + indicates that Python should look for one or more of the preceding regular expression. You can read more about the re package here. The search() method returns a Match object, and the Match object’s group() method returns one or more subgroups of the match. We pass 0 as an argument to indicate that we want the entire match. The third line just extracts the integer corresponding to total_pages from the string.
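To make the pattern concrete, here is a small, self-contained example run against a made-up fragment of page source (the snippet and the value 7 are invented purely for illustration):

import re

# A made-up fragment of page source containing the field we care about.
sample = '"per_page":10,"total_pages":7,"current_page":1'

match = re.search(pattern=r'"total_pages":\d+', string=sample)
print(match.group(0))                                    # '"total_pages":7'
print(int(re.search(r"\d+", match.group(0)).group(0)))   # 7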

With that complete, we can now load every page of Insolvency Insider. We can click the Load more button by accessing the click() method of the object. We wait three seconds in between clicks so that we’re not overwhelming the website.

# Clicks the Load more button (totalPages - 1) times with a three-second delay.
for i in range(totalPages - 1):
    loadMore.click()
    time.sleep(3)

Once you run this, you should see the Load more button being clicked and remaining pages being loaded.
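If you'd rather not guess at a fixed delay, Selenium's explicit waits can pause only until the button is clickable again. Here is a rough sketch of that variation (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

buttonXpath = "/html/body/div[2]/div/main/div/div/div/button"
wait = WebDriverWait(driver, 10)  # wait at most 10 seconds per click

for i in range(totalPages - 1):
    # Wait until the button is clickable again, then click it.
    wait.until(EC.element_to_be_clickable((By.XPATH, buttonXpath))).click()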

Once every page is loaded, we can begin to scrape the content. Now, scraping certain elements like the filing name, the date, and the hyperlink is pretty straightforward. We can use selenium’s find_elements_by_class_name() and find_elements_by_xpath() methods (notice the extra s after element ):

# Creates a list of filing name elements and a list of filing date elements.
filingNamesElements = driver.find_elements_by_class_name("filing-name")
filingDateElements = driver.find_elements_by_class_name("filing-date")
filingHrefElements = driver.find_elements_by_xpath("//*[@id='content']/div[2]/div/div[1]/h3/a")

We’d also like the filing metadata, i.e., the filing type, the industry of the filing company, and the province in which they operate. Extracting this data takes a little bit more work.

filingMetas = []
# XPath indices start at 1, so loop from 1 up to the number of filings.
for i in range(1, len(filingNamesElements) + 1):
    filingMetai = driver.find_elements_by_xpath("//*[@id='content']/div[2]/div[%d]/div[2]/div[1]" % i)
    for element in filingMetai:
        filingMetaTexti = element.text
        filingMetas.append(filingMetaTexti)

The text property returns the element’s text as a string. The above code snippet results in a list like this:

['Filing Type: NOI\nCompany Counsel: Loopstra Nixon\nTrustee: EY\nTrustee Counsel: DLA Piper\nIndustry: Food & Accommodation\nProvince: Alberta', ... ]

From each element of filingMetas we can extract the filing type, the industry, and the province, like so:

metaDict = {"Filing Type": [], "Industry": [], "Province": []}

for filing in filingMetas:
    filingSplit = filing.split("\n")

    for item in filingSplit:
        itemSplit = item.split(": ")

        if itemSplit[0] == "Filing Type":
            metaDict["Filing Type"].append(itemSplit[1])
        elif itemSplit[0] == "Industry":
            metaDict["Industry"].append(itemSplit[1])
        elif itemSplit[0] == "Province":
            metaDict["Province"].append(itemSplit[1])

    # If a field is missing from this filing, pad its list with "NA"
    # so all three lists stay the same length.
    if "Filing Type" not in filing:
        metaDict["Filing Type"].append("NA")
    if "Industry" not in filing:
        metaDict["Industry"].append("NA")
    if "Province" not in filing:
        metaDict["Province"].append("NA")

The second block of if statements ensures that all of our lists in metaDict have the same length, padding with "NA" whenever a field is missing from a filing. This is necessary if we want to put this data into a pandas DataFrame. You can verify that this is the case:

for key in metaDict:
    print(len(metaDict[key]))

Now, we still need to put our filing names and dates into lists. We do this by appending each element’s text to a list using the text property from before:

# Initiates a list for filing names, a list for filing dates, and a list for filing links.
filingName = []
filingDate = []
filingLink = []

# For each element in the filing name elements list, appends the
# element's text to the filing names list.
for element in filingNamesElements:
    filingName.append(element.text)

# For each element in the filing date elements list, appends the
# element's text to the filing dates list.
for element in filingDateElements:
    filingDate.append(element.text)

# For each href element, appends the link (if present) to the filing links list.
for link in filingHrefElements:
    if link.get_attribute("href"):
        filingLink.append(link.get_attribute("href"))

You can also do this in just two lines with list comprehensions.
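For instance, the name and date loops above collapse to:

# The same name and date lists, built with list comprehensions.
filingName = [element.text for element in filingNamesElements]
filingDate = [element.text for element in filingDateElements]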

Once we have that, we are ready to put everything into one dictionary and then create a pandas DataFrame:

# Creates a final dictionary with filing names and dates.
fullDict = {
    "Filing Name": filingName,
    "Filing Date": filingDate,
    "Filing Type": metaDict["Filing Type"],
    "Industry": metaDict["Industry"],
    "Province": metaDict["Province"],
    "Link": filingLink
}
# Creates a DataFrame.
df = pd.DataFrame(fullDict)
df["Filing Date"] = pd.to_datetime(df["Filing Date"], infer_datetime_format=True)

And voilà! Now we have a database of all kinds of insolvency filings.

Figure 3: A table of insolvency filing data.
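From here, you might want to save the table and close the browser; a short sketch, with an arbitrary output filename:

# Saves the DataFrame to a CSV file and closes the browser window.
df.to_csv("insolvency_filings.csv", index=False)
driver.quit()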

I hope you have found this tutorial useful. Now, you can use selenium to extract data from a variety of websites.
