Realtime Data Scraping with Python

Leverage Selenium and Beautifulsoup for live updates

Gioele Monopoli
CodeX
6 min read · Oct 9, 2022



Context

Supply chain resiliency is an essential topic for companies whose business depends on the timely flow of goods. Imagine your Supply Chain Manager requests real-time updates on the current weather conditions in different parts of the world, along with possible climate warnings and alerts, available at https://severeweather.wmo.int/v2/list.html. You decide to tackle this by scraping the website in real time: you create a Python script that scrapes all the data needed, and then schedule it to run every 30 minutes to receive live updates.

This article is best suited for programmers familiar with Python.

Scraping

  1. The first thing we need to do is install the libraries required for the scraping, i.e., BeautifulSoup and Selenium, plus webdriver-manager, which we will use below to download a matching ChromeDriver:
pip install bs4
pip install selenium
pip install webdriver-manager

To give a simple distinction: Selenium is used to go to a website, interact with the browser by clicking buttons, and wait for elements to be present; BeautifulSoup is then used to iterate over the HTML and extract the actual data (i.e., what you see on the page).
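To make this concrete, here is a minimal sketch of that division of labor, assuming the same Selenium and webdriver-manager setup used in the rest of this article (the URL is a placeholder, not the weather site yet):

from selenium import webdriver
from bs4 import BeautifulSoup
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())  # Selenium drives a real browser
driver.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")  # BeautifulSoup parses the rendered HTML
print(soup.title.text)
driver.quit()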

2. We now explore the website. As you can see in the picture below, a waiting time of ~ 5 seconds is needed before the data is correctly loaded.

Data is loading (icon)
Loaded data in HTML

Because of this, scraping directly with BeautifulSoup would return no entries, since the data is not yet in the HTML. We solve this by making Selenium wait explicitly for the element that gets created once the data is fetched.

By right-clicking and pressing the “Inspect Element” button on the website, we see in the inspection interface that the element we need to wait for is the <div> with the class dataTables_scrollBody.

Inspect Element result

To scrape a website, Selenium needs a browser to drive; we use Google Chrome here (you can also use another browser). We thus tell Selenium to spin up Google Chrome

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

and tell the driver where our website is by passing its URL

driver.get("https://severeweather.wmo.int/v2/list.html")

Now, we can set the listener mentioned above to let the driver wait for the <div> element with the dataTables_scrollBody class to be present in the HTML

try:
    elem = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
    )
finally:
    print('loaded')

We define our scraping function as scrapeWeather and our code at this point should be similar to this:

### imports
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
###

def scrapeWeather(): # Our function for scraping
    driver = webdriver.Chrome(ChromeDriverManager().install())
    # url request
    driver.get("https://severeweather.wmo.int/v2/list.html")
    try:
        elem = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
        )
    finally:
        print('loaded')
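One thing to be aware of: WebDriverWait(...).until(...) raises a TimeoutException if the element never appears within the 30 seconds, and the bare try/finally above prints 'loaded' either way. If you prefer the script to fail cleanly in that case, an optional variant (not required for the rest of the article) could look like this:

from selenium.common.exceptions import TimeoutException

try:
    elem = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scrollBody"))
    )
    print('loaded')
except TimeoutException:
    print('the table did not load within 30 seconds')
    driver.quit()
    raise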

3. Now that the data is in the HTML, we can select the entries we want to scrape with BeautifulSoup.

As we can see from the inspection, all the data is inside a <tbody> tag, and each <tr> tag contains one entry (row) of the table. Thus, we must find the correct <tbody> and loop over all its <tr> tags. We do this with BeautifulSoup's findAll method, which returns all occurrences of a given tag.

# Scraper getting each row
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.findAll("tbody")[2] # the <tbody> we want is the third one
rows = table.findAll('tr')
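Before writing the full loop, it can be worth a quick throwaway check that we picked the right <tbody> and that the column indices used below line up:

print(len(rows))  # number of warning rows found
for index, cell in enumerate(rows[0].findAll('td')):  # peek at the columns of the first row
    print(index, cell.text.strip()[:40])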

Since we will save the entries to a CSV file, we will:

  • create an empty array that we will populate with the data of each row of the table,
  • iterate over each row (i) and, within it, over each column (j) of that row, and
  • save the info to the correct variable.

The code will look like this:

rest_info = [] # empty list populated with the info of each row

for i in rows: # i is a row
    infos_row = i.findAll('td') # get the cells of a single row
    for index, j in enumerate(infos_row): # j is a column of row i
        info = None
        if index == 0: # the first column has the event information
            info = j.find('span') # the info is within a span
            event = info.text # we extract the text from the span
        if index == 4:
            info = j.find('span')
            areas = info.text
        if index == 1:
            issued_time = j.text
        if index == 3:
            country = j.text
        if index == 5:
            regions = j.text
        if index == 2:
            continue
    # finally we append the info of this row to the list
    rest_info.append([event, issued_time, country, areas, regions])

Now that we have saved the info in the list, let's push it to a CSV.

df = pd.DataFrame(rest_info,
    columns=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
df.to_csv("scraped_weather.csv", mode='a', index=False, header=False)
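Because the rows are appended without a header (header=False), pass the column names yourself when reading the file back, for example:

df = pd.read_csv("scraped_weather.csv",
    names=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
print(df.tail())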

The CSV file should look as follows:

Data scraped from the website

Congratulations! You have scraped the website. Now let's look at how to automate the process.

2. Real-time Automation

To schedule the scraping every X minutes (depending on your needs), we need a scheduler. Here are two of the many options available:

  • GitHub Actions
  • Google Cloud Scheduler

For this tutorial, we will use GitHub Actions, as I think it is the most straightforward and accessible.

  1. First of all, we need to slightly change the code so that Selenium can open Google Chrome on a GitHub Actions runner. We need to install the modules pyvirtualdisplay and chromedriver-autoinstaller (the latter is imported in the code below):
pip install PyVirtualDisplay chromedriver-autoinstaller

Then we need to make the following changes to the existing code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import chromedriver_autoinstaller
from pyvirtualdisplay import Display

display = Display(visible=0, size=(800, 800))
display.start()

chromedriver_autoinstaller.install() # check that the current version of chromedriver exists, install it if not

chrome_options = webdriver.ChromeOptions()
options = [
    # Define window size here
    "--window-size=1200,1200",
    "--ignore-certificate-errors"
]
for option in options:
    chrome_options.add_argument(option)

driver = webdriver.Chrome(options=chrome_options)

and in the scrapeWeather function, we no longer need to call the ChromeDriver installer:

def scrapeWeather():
    # driver = webdriver.Chrome(ChromeDriverManager().install()) # not needed anymore!
    driver.get("https://severeweather.wmo.int/v2/list.html")
    ....
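Since the GitHub Actions runner simply executes the file once per run, the script should also call the function and release the browser and the virtual display afterwards. A minimal sketch, assuming the driver and display are created at module level as above:

if __name__ == "__main__":
    scrapeWeather()  # run one scraping pass
    driver.quit()    # close the browser started by Selenium
    display.stop()   # stop the virtual display from pyvirtualdisplay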

2. We are ready to deploy the code to GitHub and schedule it. For this we need to:

  • create a repository
  • push the Python script
  • create and push a requirements.txt file (pip install pipreqs, then run pipreqs in the folder where your script lives)
  • create a workflow: in your GitHub repository -> Actions -> “New workflow”. In the workflow, add the following code (copy-paste it and adapt it to your setup):
name: scrap3
on:
  schedule:
    - cron: '*/30 * * * *' # the schedule, in this case every 30 minutes, in cron syntax

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repo content
        uses: actions/checkout@v2

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7.7' # install the Python version needed

      - name: install python packages
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: execute py script
        run: python scrape.py # NAME OF YOUR FILE HERE!!

      - name: commit files
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add -A
          git commit -m "update data" -a

      - name: push changes
        uses: ad-m/github-push-action@v0.6.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: main

Perfect. Now your script will run every 30 minutes and append the data to the CSV. You could now fetch this CSV file hosted on GitHub from another endpoint and get real-time weather updates!
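For instance, the raw file can be read straight from GitHub with pandas (the <user>/<repo> part is a placeholder for your own repository):

import pandas as pd

url = "https://raw.githubusercontent.com/<user>/<repo>/main/scraped_weather.csv"  # placeholder repository
df = pd.read_csv(url,
    names=['Event_type', 'Issued_time', 'Country', 'Areas', 'Regions'])
print(df.tail())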

This was an example of scraping data from the web using BeautifulSoup, Selenium, and GitHub Actions. I used this script in the project I built at HackZurich, which I took part in last weekend. It is the biggest hackathon in Europe, and in 48 hours we built a supply chain warning application that won us the challenge. You can see our app and the GitHub repository with all the code here.

Thank you for taking the time to read this article. Remember to follow me on Medium and contact me on LinkedIn if you have any questions. See you next time!
