Two Data Scraping Examples
Using Python's BeautifulSoup to extract data from the web
Intro
Data scraping, also known as web scraping or data extraction, is the process of extracting data from websites and using it for data analysis or other purposes. While data scraping can be a valuable technique for collecting and analysing information, it is also important to recognise that it's not always legal or ethical.
Legal and ethical concerns: Some websites explicitly forbid data scraping in their terms of service, and scraping such sites could lead to legal consequences. Additionally, data scraping can raise privacy concerns, especially when dealing with personal or sensitive information.
Impact on servers: Data scraping can cause server strain and increased bandwidth usage, especially when done at high frequency or in large volumes with data crawling models. In some cases, excessive data scraping can even cause servers to crash or become temporarily unavailable.
How to proceed with data scraping: To engage in responsible and ethical data scraping, consider the following:
- Check for permissions: Always review a website’s terms of service or robots.txt file to determine if data scraping is allowed.
- Limit your scraping rate: Don’t overload servers by sending too many requests in a short amount of time.
- Use APIs when available: Many websites provide APIs that allow structured access to their data. Using an API is often a more efficient and legal way to access data compared to scraping. Many APIs are free, while others are subscription-based.
- Respect user privacy: Be mindful of the data you’re collecting, especially if it includes personal or sensitive information.
- Stay informed of legal developments: Keep up to date with changes to relevant laws and website policies to ensure you remain in compliance with the latest rules.
Always ensure that your data scraping activities are both legal and ethical, and minimise the risk of negative consequences for both yourself and the website you scrape.
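As a starting point, Python's standard library can check a site's robots.txt rules before you send any requests. The sketch below uses a made-up robots.txt and the hypothetical user agent "MyScraper"; in practice you would fetch the real file from the site's domain:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only; a real site
# serves its rules at https://<domain>/robots.txt.
robots_txt = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() applies the rules for the given user agent and URL.
allowed = rp.can_fetch("MyScraper", "https://example.com/data")
blocked = rp.can_fetch("MyScraper", "https://example.com/private/report")
print(allowed, blocked)  # True False

# When scraping is permitted, pause between requests to limit server
# load, e.g. time.sleep(1) inside your request loop.
```

This also addresses the rate-limiting point above: a short sleep between requests keeps your scraper from hammering the server.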
Extracting data from Yahoo Finance
For this example we are going to scrape USD/AUD exchange rate data from Yahoo Finance and save it to a .csv file. You can also automate this request to send yourself daily updates of the current USD/AUD exchange rate. For more information on automation and emailing through Python, see my other project: https://medium.com/@ioannidis.au/creating-an-automated-news-roundup-c6b3642c38ab
Importing libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Defining the URL and headers for this request:
url = "https://au.finance.yahoo.com/quote/AUDUSD%3DX/history?p=AUDUSD%3DX"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
Finding the table and extracting columns:
table = soup.find('table', {'class': 'W(100%) M(0)'})  # class name taken from Yahoo's markup at the time of writing; it may change with site redesigns
column_names = []
for th in table.thead.tr.find_all('th'):
    column_names.append(th.get_text())
Extracting the data:
data = []
for row in table.tbody.find_all('tr'):
    cells = row.find_all('td')
    row_data = []
    for cell in cells:
        row_data.append(cell.get_text())
    if len(row_data) > 0:
        data.append(row_data)
Printing and saving data:
df = pd.DataFrame(data, columns=column_names)
print(df)
df.to_csv('usd_aud_historical_data.csv', index=False)
Date Open High Low Close* Adj. close** Volume
0 16 Apr 2023 0.6711 0.6711 0.6711 0.6711 0.6711 -
1 14 Apr 2023 0.6784 0.6794 0.6697 0.6784 0.6784 -
2 13 Apr 2023 0.6699 0.6781 0.6686 0.6699 0.6699 -
3 12 Apr 2023 0.6654 0.6722 0.6652 0.6654 0.6654 -
4 11 Apr 2023 0.6648 0.6680 0.6645 0.6648 0.6648 -
.. ... ... ... ... ... ... ...
95 05 Dec 2022 0.6813 0.6851 0.6710 0.6813 0.6813 -
96 02 Dec 2022 0.6811 0.6835 0.6747 0.6811 0.6811 -
97 01 Dec 2022 0.6800 0.6841 0.6792 0.6800 0.6800 -
98 30 Nov 2022 0.6684 0.6740 0.6676 0.6684 0.6684 -
99 29 Nov 2022 0.6652 0.6749 0.6641 0.6652 0.6652 -
[100 rows x 7 columns]
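Because the class name used to locate the table is generated by Yahoo's front end and can change at any time, it helps to verify the thead/tbody extraction pattern itself offline. This sketch runs the same logic against a small, made-up HTML table, so nothing here depends on the live site:

```python
from bs4 import BeautifulSoup

# Minimal, hypothetical HTML mirroring the thead/tbody layout scraped above.
html = """
<table>
  <thead><tr><th>Date</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>16 Apr 2023</td><td>0.6711</td></tr>
    <tr><td>14 Apr 2023</td><td>0.6784</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Same pattern as the scraper: headers from thead, rows from tbody.
headers = [th.get_text() for th in table.thead.tr.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

print(headers)  # ['Date', 'Close']
print(rows)     # [['16 Apr 2023', '0.6711'], ['14 Apr 2023', '0.6784']]
```

If the scraper ever returns nothing, comparing the live page's markup against a known-good snippet like this is a quick way to see whether the table structure or class names have changed.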
The data has been successfully extracted and is ready for analysis. Assuming df is the DataFrame with the scraped data, you can use
first_row = df.iloc[0]
send_email(first_row)
to create an automated daily price-update email, where send_email is a helper function you define yourself.
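Since send_email is not defined in this article, here is a minimal sketch of what such a helper might look like using the standard library's email module. The recipient address and SMTP host are placeholders, and the actual sending step is left commented out:

```python
from email.message import EmailMessage

def send_email(row, to_addr="you@example.com"):
    """Hypothetical helper: build (and, in practice, send) a daily update.

    `row` is expected to look like df.iloc[0] from the scraper above,
    i.e. a mapping with 'Date' and 'Close*' entries.
    """
    msg = EmailMessage()
    msg["Subject"] = f"AUD/USD update for {row['Date']}"
    msg["To"] = to_addr
    msg.set_content(f"Close: {row['Close*']}")
    # To actually send, connect to your own SMTP server, e.g.:
    # import smtplib
    # with smtplib.SMTP_SSL("smtp.example.com") as s:
    #     s.send_message(msg)
    return msg

# Example with a plain dict standing in for df.iloc[0]:
msg = send_email({"Date": "16 Apr 2023", "Close*": "0.6711"})
print(msg["Subject"])  # AUD/USD update for 16 Apr 2023
```

Scheduling this with cron or Windows Task Scheduler gives you the automated daily update described above.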
Extracting Australian Bureau of Statistics data
In this example we will download all linked files from the balance of payments page of the Australian Bureau of Statistics. The data are already provided in .xlsx and .zip format across 32 different links on the page. We will use BeautifulSoup to download all 32 in one go, instead of manually downloading each file and folder.
Importing libraries:
import requests
from bs4 import BeautifulSoup
import os
Defining the URL and creating the folder for the output from this request:
url = 'https://www.abs.gov.au/statistics/economy/international-trade/balance-payments-and-international-investment-position-australia/dec-2022' #your website url
output_directory = 'downloads'
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
Finding links and specific file extensions for download:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print('Found link:', href)  # see the links being parsed
    if href and href.startswith('http') and href.endswith(('.xlsx', '.zip', '.rar', '.tar.gz')):
        print('Downloading:', href)
        response = requests.get(href)
        filename = os.path.join(output_directory, href.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(response.content)
        print('Saved:', filename)
    else:
        print('Skipping link:', href)  # see which links are skipped - you can use this to fine-tune your scraper
Viewing output:
print('All files downloaded.')
The code successfully downloaded 32 .xlsx and .zip files from the website in 1.1 seconds.
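The extension check in the download loop can be factored into a small helper, which makes it easy to test and fine-tune which links get downloaded. Note that str.endswith accepts a tuple of suffixes, and .xlsx belongs in the list because the ABS page serves spreadsheet files:

```python
def is_download_link(href, extensions=(".xlsx", ".zip", ".rar", ".tar.gz")):
    """Return True if href is an absolute URL ending in a wanted extension.

    bool(href) guards against <a> tags with no href attribute (None).
    """
    return bool(href) and href.startswith("http") and href.endswith(extensions)

# The ABS URL below is illustrative, not a real file on the site.
print(is_download_link("https://www.abs.gov.au/files/example.xlsx"))  # True
print(is_download_link("/statistics/economy"))                        # False
print(is_download_link(None))                                         # False
```

Keeping the filter in one place also makes it simple to add or remove extensions as you adapt the scraper to other pages.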
Conclusion
Web scraping techniques are widely used in applications such as data mining, data analysis, sentiment analysis, price comparison, job-search automation, and more. They allow users to gather large amounts of data from the internet in a structured and efficient manner.
However, beyond Python skills, the ability to recognise the HTML elements of a web page is also crucial for successful web scraping. By understanding the HTML structure and its various elements, you can effectively navigate, identify, and extract the desired data from a web page.