Using selenium and Pandas in Python to get table data from a JavaScript website

Michael Hodge
6 min read · Jun 23, 2022


This story follows on from my two previous posts about creating a Twitter bot that posts when the urgent UK passport Fast Track and Premium services come online. You can find out about those here:

The ukpassportcheck Twitter account (https://twitter.com/ukpassportcheck) now has over 11,000 followers at the time of writing. And it’s helped thousands of people get urgent Fast Track and Premium appointments.

However, for a long time I'd wanted to figure out something: could I grab the number of appointments available each time the service went online? To do this, I would need something more capable than what I was already using, because the website uses JavaScript elements that the user interacts with by clicking or entering data.

Take the page below. This is the first page shown when the service is online. It asks how many applications the user is making, followed by a submit (Continue) button. A user would normally click one of the boxes and then the green Continue button. I needed a script to do this for me.

Figure 1. Book an appointment first page on the website.

For this, I used selenium for Python. First I needed to make sure Chrome was installed on my machine (it was), and then I used the pip package chromedriver_autoinstaller to automatically install the ChromeDriver version matching the Chrome on my machine.

Next, it was a case of importing selenium in Python, initialising a web driver and calling a URL:

import chromedriver_autoinstaller
from selenium import webdriver

chromedriver_autoinstaller.install()
this_driver = webdriver.Chrome()
this_driver.get(the_url)

After this, I wanted some checks in place to make sure the webpage was showing the page I expected (and not an error page), so I just parsed the body text:

body = get_body(this_driver)
if "message" in body.text:
    print("Success")
    # ...run the rest of the code
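The get_body helper isn't shown in the snippets above; a minimal version, assuming it simply returns the page's <body> element so its text can be checked, might look like this:

from selenium.webdriver.common.by import By

def get_body(the_driver):
    """Return the <body> element of the current page so its text can be inspected."""
    return the_driver.find_element(by=By.TAG_NAME, value="body")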

Next, I needed to set up two main functions: one for clicking elements, and one for inputting data into elements (such as names and dates). For clicking, it is:

import time
from selenium.webdriver.common.by import By

def click_page_element(the_driver, path_value, wait_time, by_what="xpath"):
    """
    Click the page element
    :param the_driver: <Selenium.webdriver> The selenium webdriver
    :param path_value: <string> the path value
    :param wait_time: <int> the wait time in seconds
    :param by_what: <string> how to locate the element ("xpath" or "class")
    """

    time.sleep(wait_time)
    if by_what == "xpath":
        element = the_driver.find_element(by=By.XPATH, value=path_value)
    elif by_what == "class":
        element = the_driver.find_element(by=By.CLASS_NAME, value=path_value)
    element.click()

I’ve added a simple if statement so I can select on either XPATH (preferred) or CLASS_NAME. Next is the input function:

def enter_page_element(the_driver, path_value, value_to_enter, wait_time, by_what="xpath"):
    """
    Enter a value on the page
    :param the_driver: <Selenium.webdriver> The selenium webdriver
    :param path_value: <string> the path value
    :param value_to_enter: <string> the value to enter
    :param wait_time: <int> the wait time
    :param by_what: <string> what you want to select
    """

    time.sleep(wait_time)
    if by_what == "xpath":
        element = the_driver.find_element(by=By.XPATH, value=path_value)
    elif by_what == "class":
        element = the_driver.find_element(by=By.CLASS_NAME, value=path_value)
    element.send_keys(value_to_enter)

This sends values to the element, which are then entered as if you were typing them out yourself.

After these functions have been built, it's just a case of walking through the process yourself, logging the XPath of each element you want to click as you go (Inspect > right-click the element > Copy XPath), and then calling the functions above in the relevant order; a rough sketch of this follows Figure 2.

Figure 2. Using Inspect to get the element's XPath.
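As a rough illustration, driving the first page from Figure 1 might look something like this (the XPaths below are made-up placeholders, not the real ones from the booking service):

# Hypothetical walkthrough of the first page; the XPaths are placeholders
click_page_element(this_driver, '//*[@id="numberOfApplications-1"]', 2)     # tick the "1 application" box
click_page_element(this_driver, '//button[text()="Continue"]', 2)           # green Continue button
enter_page_element(this_driver, '//*[@id="applicant-surname"]', "Smith", 2) # entering data on a later page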

After getting through the input pages you reach the appointments pages, where I wanted to scrape the values.

Figure 3. The appointments page to get the data from.

As you see above, there is more than one page, and I will discuss how to move between them shortly. But first, to scrape the above data I used pandas:

import pandas as pd

html = the_driver.page_source
table = pd.read_html(html)
df = table[0]

This creates a nice pandas DataFrame based on the values in the HTML table. All I had to do was clean it up a little: set a consistent index of office locations and keep just the appointment counts in the DataFrame cells.
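The exact clean-up depends on how the table is rendered; a sketch, assuming the first column holds the office names and the other cells mix text with the counts, could be:

# Sketch of the clean-up: office names become the index, cells become integer counts
df = (
    df.set_index(df.columns[0])               # use the office-name column as the index
      .replace(r"\D+", "", regex=True)        # strip any non-digit text from each cell
      .apply(pd.to_numeric, errors="coerce")  # convert what's left to numbers
      .fillna(0)
      .astype(int)
)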

To move to the next page, I simply used the click_page_element function defined above after finding the XPath for the Next button. I then merge the DataFrames from each page and remove duplicates.
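With page_one_df and page_two_df as placeholder names for the DataFrames read from each appointments page, that step is essentially:

# Combine the tables from each page and drop any rows that appear twice
df = pd.concat([page_one_df, page_two_df], ignore_index=True).drop_duplicates()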

Lastly, it was a little more tidying up and then creating a useful visual. Here I chose seaborn and a heatmap, as I had tabular data that benefits from a colour bar given the differences in the number of appointments per office, per day. A resulting image is shown below.

Figure 4. The seaborn heatmap from the data.
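A heatmap along these lines could be produced with something like the following sketch (not the bot's exact code; the output path latest_appointments.png is just an example), assuming df has offices as the index and days as the columns:

import matplotlib.pyplot as plt
import seaborn as sns

# Draw the appointments-per-office-per-day heatmap and save it as an image
fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(df, annot=True, fmt="g", cmap="YlGn", cbar=True, ax=ax)
ax.set_title("Appointments per office, per day")
fig.savefig("latest_appointments.png", bbox_inches="tight")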

I then used tweepy to post to Twitter (where authenticate_twitter is a function I made in the previous work to authenticate with Twitter using my access tokens, and filename is the path of the saved image):

def post_media(filename):
    """
    Upload the saved image and post it to Twitter.
    """

    api = authenticate_twitter()

    # Posts status to Twitter
    media = api.media_upload(filename=filename)
    api.update_status(status=f'The latest slots', media_ids=[media.media_id])

    print("Posted update to Twitter")

And… there we go!

Figure 5. The tweeted photo!

OK, come on, it can't have been that easy?!

Ok, you got me. So here were the challenges:

XPATH errors

If you choose the wrong XPath, this won't work. I did this a lot.

Element not found (when parsing appointment tables)

So I used a really ropey loop to keep trying to go to the next page of the appointments table. But when it couldn't find that Next page element, it would crash. Sometimes there was only one further page, sometimes many, so I couldn't hard-code the number of clicks into the loop. Instead I just used an exception handler:

from selenium.common.exceptions import NoSuchElementException

try:
    the_driver.find_element(by=By.XPATH, value='')
    click_page_element(the_driver, '', 4, by_what="xpath")
except NoSuchElementException:
    pass  # ...go to next step
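Wrapped in that ropey loop, the paging logic looks roughly like this (next_page_xpath and scrape_current_page are placeholder names, not from the real bot):

# Keep clicking Next until the button no longer exists, scraping each page as we go
while True:
    scrape_current_page(the_driver)  # hypothetical helper that reads the table with pd.read_html
    try:
        click_page_element(the_driver, next_page_xpath, 4, by_what="xpath")
    except NoSuchElementException:
        break  # no Next page button found, so this was the last page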

Automating this with GitHub Actions

The Twitter account relies on automation with GitHub Actions, so I don't have to run the code myself. This is great. But whenever I try something new like this, I hit plenty of errors in GitHub Actions.

The issue was that when testing and developing locally I was using Chrome and watching the steps selenium was taking. However, on GitHub Actions it needed to be a headless browser. So I added some options to my code:

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1920,1080')
this_driver = webdriver.Chrome(options=options)

This meant it ran headless on GitHub Actions but with a window size equivalent to my local machine, so the pages rendered the same way.

Getting selenium working on GitHub Actions

This wasn’t as hard as I thought it would be. I used a bash script and saved this in the scripts folder:

#!/bin/bash
set -ex
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

And then called this in my GitHub Action using:

- name: Install Google Chrome # Using shell script to install Google Chrome
  run: |
    chmod +x ./scripts/InstallChrome.sh
    ./scripts/InstallChrome.sh

Any questions, just shout.

Michael
