Realtime Data Scraping with Python

Leverage Selenium and Beautifulsoup for live updates

Gioele Monopoli
CodeX
6 min read · Oct 9, 2022



Context

Supply chain resiliency is an essential topic for companies whose business depends on the timely flow of goods. Imagine your Supply Chain Manager requests real-time updates on the current weather conditions in different parts of the world, along with possible climate warnings and alerts, available at https://severeweather.wmo.int/v2/list.html. You decide to tackle this by scraping the website in real time: you create a Python script that scrapes all the data needed, and then schedule it to run every 30 minutes to receive live updates.

This article is best suited for programmers familiar with Python.

Scraping

  1. The first thing we need to do is install the libraries required for the scraping, i.e., BeautifulSoup and Selenium, plus webdriver-manager, which we will use below to download a matching ChromeDriver:
pip install bs4
pip install selenium
pip install webdriver-manager

To give a simple distinction: Selenium is used to go to a website, interact with the browser by clicking buttons, and wait for elements to be present; BeautifulSoup is then used to iterate over the HTML and extract the actual data (i.e., what you see on the page).
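To make this concrete, here is a minimal sketch of that division of labor, assuming the same Selenium and webdriver-manager setup used in the rest of this article (the URL is a placeholder, not the weather site yet):

from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())  # Selenium drives a real browser
driver.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")  # BeautifulSoup parses the rendered HTML
print(soup.title.text)
driver.quit()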

2. We now explore the website. As you can see in the picture below, a waiting time of ~ 5 seconds is needed before the data is correctly loaded.

Data is loading (icon)
Loaded data in HTML

Because of this, scraping directly with BeautifulSoup would return no entries, since the data is not yet in the HTML. We solve this by making Selenium wait explicitly for the element that gets created once the data is fetched.

By right-clicking and pressing the “Inspect Element” button on the website, we see in the inspection interface that the element we need to wait for is the <div> with the class dataTables_scrollBody.

Inspect Element result

To scrape a website, Selenium needs a browser to drive; we use Google Chrome here (you can also use another browser). We thus tell Selenium to spin up Google Chrome

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

and tell the driver where our website is by passing its URL

driver.get("https://severeweather.wmo.int/v2/list.html")

Now, we can set the listener mentioned above to let the driver wait for the <div> element with the dataTables_scrollBody class to be present in the HTML

try:
    elem = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
    )
finally:
    print('loaded')

We define our scraping function as scrapeWeather and our code at this point should be similar to this:

### imports
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
###

def scrapeWeather(): # Our function for scraping
    driver = webdriver.Chrome(ChromeDriverManager().install())
    # url request
    driver.get("https://severeweather.wmo.int/v2/list.html")
    try:
        elem = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
        )
    finally:
        print('loaded')
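One thing to be aware of: WebDriverWait(...).until(...) raises a TimeoutException if the element never appears within the 30 seconds, and the bare try/finally above prints 'loaded' either way. If you prefer the script to fail cleanly in that case, an optional variant (not required for the rest of the article) could look like this:

from selenium.common.exceptions import TimeoutException

try:
    elem = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
    )
    print('loaded')
except TimeoutException:
    print('the table did not load within 30 seconds')
    driver.quit()
    raise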

3. Now that the data is in the HTML, we can select the entries we want to scrape with BeautifulSoup.

As we can see from the inspection, all the data is inside a <tbody> tag, and each <tr> tag contains one entry (row) of the table. Thus, we must find the correct <tbody> and loop over all its <tr> tags. We do this with BeautifulSoup's findAll method, which returns all occurrences of a given tag.

# Scraper getting each row
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.findAll("tbody")[2] # the <tbody> we want is the third one
rows = table.findAll('tr')
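Before writing the full loop, it can be worth a quick throwaway check that we picked the right <tbody> and that the column indices used below line up:

print(len(rows))  # number of warning rows found
for index, cell in enumerate(rows[0].findAll('td')):  # peek at the columns of the first row
    print(index, cell.text.strip()[:40])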

Since we will save the entries to a CSV file, we will:

  • create an empty array that we will populate with the data of each row of the table,
  • iterate over each row (i) and, within it, over each column (j) of that row, and
  • save the info to the correct variable.

The code will look like this:

rest_info = [] # empty list populated with the info of each row

for i in rows: # i is a row
    infos_row = i.findAll('td') # get the cells of a single row
    for index, j in enumerate(infos_row): # j is a column of row i
        info = None
        if index == 0: # the first column has the event information
            info = j.find('span') # the info is within a span
            event = info.text # we extract the text from the span
        if index == 4:
            info = j.find('span')
            areas = info.text
        if index == 1:
            issued_time = j.text
        if index == 3:
            country = j.text
        if index == 5:
            regions = j.text
        if index == 2:
            continue
    # finally we append the info of this row to the list
    rest_info.append([event, issued_time, country, areas, regions])

Now that we have saved the info in the list, let's push it to a CSV.

df = pd.DataFrame(rest_info,
    columns=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
df.to_csv("scraped_weather.csv", mode='a', index=False, header=False)
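Because the rows are appended without a header (header=False), pass the column names yourself when reading the file back, for example:

df = pd.read_csv("scraped_weather.csv",
    names=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
print(df.tail())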

The CSV file should look as follows:

Data scraped from the website

Congratulations! You have scraped the website. Now let's look at how to automate the process.

2. Real-time Automation

To schedule the scraping every X minutes (depending on your needs), we need a scheduler. Here are two of the many options available:

  • GitHub Actions
  • Google Cloud Scheduler

For this tutorial, we will use GitHub Actions, as I think it is the most straightforward and accessible.

  1. First of all, we need to slightly change the code so that Selenium can open Google Chrome on a GitHub Actions runner. We need to install the modules pyvirtualdisplay and chromedriver-autoinstaller (the latter is imported in the code below):
pip install PyVirtualDisplay chromedriver-autoinstaller

Then we need to make the following changes to the existing code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import chromedriver_autoinstaller
from pyvirtualdisplay import Display

display = Display(visible=0, size=(800, 800))
display.start()

chromedriver_autoinstaller.install() # check that the current version of chromedriver exists, install it if not

chrome_options = webdriver.ChromeOptions()
options = [
    # Define window size here
    "--window-size=1200,1200",
    "--ignore-certificate-errors"
]
for option in options:
    chrome_options.add_argument(option)

driver = webdriver.Chrome(options=chrome_options)

and in the scrapeWeather function, we no longer need to call the ChromeDriver installer:

def scrapeWeather():
    # driver = webdriver.Chrome(ChromeDriverManager().install()) # not needed anymore!
    driver.get("https://severeweather.wmo.int/v2/list.html")
    ....
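Since the GitHub Actions runner simply executes the file once per run, the script should also call the function and release the browser and the virtual display afterwards. A minimal sketch, assuming the driver and display are created at module level as above:

if __name__ == "__main__":
    scrapeWeather()  # run one scraping pass
    driver.quit()    # close the browser started by Selenium
    display.stop()   # stop the virtual display from pyvirtualdisplay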

2. We are ready to deploy the code to GitHub and schedule it. For this we need to:

  • create a repository
  • push the Python script
  • create and push a requirements.txt file (pip install pipreqs, then run pipreqs in the folder where your script lives)
  • create a workflow: in your GitHub repository -> Actions -> “New workflow”. In the workflow, add the following code (copy-paste it and adapt it to your setup):
name: scrap3
on:
  schedule:
    - cron: '*/30 * * * *' # the schedule, in this case every 30 minutes, in cron syntax

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repo content
        uses: actions/checkout@v2

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7.7' # install the Python version needed

      - name: install python packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: execute py script
        run: python scrape.py # NAME OF YOUR FILE HERE!!

      - name: commit files
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add -A
          git commit -m "update data" -a

      - name: push changes
        uses: ad-m/github-push-action@v0.6.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: main

Perfect. Now your script will run every 30 minutes and append the data to the CSV. You could now fetch this CSV file hosted on GitHub from another endpoint and get real-time weather updates!
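For instance, the raw file can be read straight from GitHub with pandas (the <user>/<repo> part is a placeholder for your own repository):

import pandas as pd

url = "https://raw.githubusercontent.com/<user>/<repo>/main/scraped_weather.csv"  # placeholder repository
df = pd.read_csv(url,
    names=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
print(df.tail())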

This was an example of scraping data from the web using BeautifulSoup, Selenium, and GitHub Actions. I used this script in the project I built at HackZurich, which I took part in last weekend. It is the biggest hackathon in Europe, and in 48 hours we built a supply chain warning application that won us the challenge. You can see our app and the GitHub repository with all the code here.

Thank you for taking the time to read this article. Remember to follow me on Medium and contact me on LinkedIn if you have any questions. See you next time!
