Realtime Data Scraping with Python

Leverage Selenium and Beautifulsoup for live updates

Gioele Monopoli
CodeX
6 min read · Oct 9, 2022


Photo by Aron Visuals on Unsplash

Context

Supply chain resiliency is an essential topic for companies that rely on their supply chain for their goods. Imagine a situation in which your Supply Chain Manager requests real-time updates on the current weather conditions in various parts of the world, along with possible climate warnings and alerts, available here. You decide to tackle this problem by scraping the website in real time: you create a Python script to scrape all the data needed and then schedule it to run every 30 minutes to receive live updates.

This article is best suited for programmers familiar with Python.

Scraping

  1. The first thing we need to do is install the necessary libraries for scraping, i.e., BeautifulSoup and Selenium.
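A minimal install, assuming pip; webdriver-manager is an extra helper (not named above) that downloads a matching ChromeDriver for Selenium:

```
pip install selenium beautifulsoup4 webdriver-manager
```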

To give a simple distinction: we need Selenium to go to a website, interact with the browser by clicking buttons, and wait for elements to be present. BeautifulSoup is then used to iterate over the HTML and extract the actual data (i.e., what you see).

2. We now explore the website. As you can see in the picture below, a waiting time of ~ 5 seconds is needed before the data is correctly loaded.

Data is loading (icon)
Loaded data in HTML

Because of this, scraping directly with BeautifulSoup would return no entries, since the data is not yet in the HTML. We solve this problem by making Selenium wait explicitly for the element that gets created once the data is fetched.

By right-clicking and pressing the “Inspect Element” button on the website, we see in the inspection interface that the element we need to wait for is the <div> with the class dataTables_scrollBody.

Inspect Element result

To scrape a website, Selenium requires a browser to drive; we use Google Chrome here (you can also use another browser). We thus tell Selenium to spin up Google Chrome:
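A sketch of the driver setup, assuming Selenium 4 and the webdriver-manager helper to fetch ChromeDriver:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download a matching ChromeDriver (if needed) and start a Chrome session
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```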

and tell the driver where our website is by passing its URL:
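The URL below is a placeholder; substitute the weather-warnings page you want to scrape:

```python
URL = "https://example.com/weather-warnings"  # placeholder for the page linked above
driver.get(URL)
```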

Now we can add the explicit wait mentioned above, letting the driver wait for the <div> element with the dataTables_scrollBody class to be present in the HTML:
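Selenium's explicit wait does this for us; here we give the page up to 20 seconds (an arbitrary timeout) to render the table:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until the <div class="dataTables_scrollBody"> exists in the DOM
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
)
```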

We define our scraping function as scrapeWeather and our code at this point should be similar to this:
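A sketch of scrapeWeather so far, putting the pieces above together (the URL and timeout are assumptions):

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

URL = "https://example.com/weather-warnings"  # placeholder

def scrapeWeather():
    # Spin up Chrome and open the page
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(URL)

    # Wait until the data table has been rendered
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
    )
    return driver

if __name__ == "__main__":
    scrapeWeather()
```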

3. Now that the data is in the HTML, we can select the entries we want to scrape with BeautifulSoup.

As we can see from the inspection, all the data is inside the <tbody> tag, and each <tr> tag contains one entry (row) of the table. Thus, we must find the correct <tbody> and loop over all its <tr> tags. We do this with BeautifulSoup's findAll function, which finds all occurrences of a given HTML tag.

Since we will save the entries to a CSV file, we will:

  • create an empty array that we will populate with the data of each row of the table,
  • iterate over each row (i) and over each column (j) of that row, and
  • save the info to the correct variable.

The code will look like this:
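A sketch of the extraction loop, continuing inside scrapeWeather after the wait; the exact columns (event, area, severity, date, …) depend on the table, so the cell handling below is kept generic:

```python
from bs4 import BeautifulSoup

rows = []  # one list of cell values per table row

# Hand the rendered HTML over to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
tbody = soup.find("div", class_="dataTables_scrollBody").find("tbody")

# i: one <tr> per warning, j: its <td> cells
for i in tbody.findAll("tr"):
    cells = [j.get_text(strip=True) for j in i.findAll("td")]
    if cells:              # skip empty placeholder rows
        rows.append(cells)
```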

Now that we have saved the info in the list, let's push it to a CSV.
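One way to do it with the standard csv module, appending each run to a file (the weather.csv name and the timestamp column are my own choices):

```python
import csv
from datetime import datetime, timezone

# Append this run's rows, tagging each with the time of scraping
with open("weather.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    scraped_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        writer.writerow([scraped_at] + row)
```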

The CSV file should look as follows:

Data scraped from the website

Congratulations! You have scraped the website. Now let's look at how to automate the process.

Real-time Automation

To schedule the scraping every X minutes (depending on your needs), we will need to use a scheduler. Here are two of the many options available:

  • GitHub Actions
  • Google Cloud Scheduler

For this tutorial, we will use GitHub Actions, as I think it is the most straightforward and accessible.

  1. First of all, we need to slightly change the code so that Selenium can open Google Chrome on a GitHub Actions runner, which has no display attached. We need to install the module pyvirtualdisplay:
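```
pip install pyvirtualdisplay
```

PyVirtualDisplay drives Xvfb under the hood, so the runner also needs the xvfb system package; the workflow further down installs it.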

Then we need to make the following changes to the existing code:
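A sketch of those changes: start a virtual display before creating the driver, and pass a couple of Chrome flags that are commonly needed in CI (the specific flags are my choice, not from the article):

```python
from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual X display so Chrome has a screen to render to
display = Display(visible=0, size=(1920, 1080))
display.start()

# Options that help Chrome run inside a CI container
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
```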

and in the scrapeWeather function, we no longer need to call the ChromeDriver installer:
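Since Chrome and a matching ChromeDriver are already available on GitHub's Ubuntu runners, the driver can be created directly with the options above:

```python
# ChromeDriver is already on the runner's PATH, so no installer call is needed
driver = webdriver.Chrome(options=chrome_options)
```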

2. We are ready to deploy the code to GitHub and schedule it. For this we need to:

  • create a repository
  • push the python script
  • create and push a requirements.txt file (pip install pipreqs, then run pipreqs in the terminal in the folder where your script is located)
  • create a workflow: in your GitHub repository -> Actions -> “New workflow”. In the workflow, we will need to add the following code (copy-paste it and change it according to your setup):
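A workflow along these lines (the Python version, the script name scraper.py, and the commit step are assumptions to adapt to your setup):

```yaml
name: scrape-weather

on:
  schedule:
    - cron: "*/30 * * * *"   # every 30 minutes (UTC)
  workflow_dispatch:          # allow manual runs as well

permissions:
  contents: write             # needed to push the updated CSV

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install system and Python dependencies
        run: |
          sudo apt-get update && sudo apt-get install -y xvfb
          pip install -r requirements.txt

      - name: Run the scraper
        run: python scraper.py   # adjust to your script's name

      - name: Commit the updated CSV
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add weather.csv
          git commit -m "Update weather data" || echo "No changes to commit"
          git push
```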

Perfect. Now your script will run every 30 minutes and append the data to the CSV. You could now fetch this CSV file hosted on GitHub from another endpoint and get real-time weather updates!
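For example, assuming the repository is public, the raw CSV can be read directly (the user and repo placeholders are yours to fill in):

```python
import pandas as pd

# Hypothetical raw URL of the CSV committed by the workflow
csv_url = "https://raw.githubusercontent.com/<user>/<repo>/main/weather.csv"
latest = pd.read_csv(csv_url, header=None)  # the file has no header row
print(latest.tail())
```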

This was an example of scraping data from the web using BeautifulSoup, Selenium, and GitHub Actions. I used this script in my project for HackZurich, the biggest hackathon in Europe, which I took part in last weekend. In 48 hours we built a supply chain warning application, which won its challenge. You can see our app and the GitHub repository with all the code here.

Thank you for your precious time spent reading this article. Remember to follow me on Medium and contact me on LinkedIn if you have any questions. See you next time!

Data Science student and Software Engineer. Sport Lover. Follow me on Linkedin: https://www.linkedin.com/in/gioele-monopoli/