Two Data Scraping Examples
Using Python's BeautifulSoup to extract data from the web
Intro
Data scraping, also known as web scraping or data extraction, is the process of extracting data from websites and using it for data analysis or other purposes. While data scraping can be a valuable technique for collecting and analysing information, it is also important to recognise that it's not always legal or ethical.
Legal and ethical concerns: Some websites explicitly forbid data scraping in their terms of service, and scraping such sites could lead to legal consequences. Additionally, data scraping can raise privacy concerns, especially when dealing with personal or sensitive information.
Impact on servers: Data scraping can cause server strain and increased bandwidth usage, especially when done at high frequency or in large volumes with data crawling models. In some cases, excessive data scraping can even cause servers to crash or become temporarily unavailable.
How to proceed with data scraping: To engage in responsible and ethical data scraping, consider the following:
- Check for permissions: Always review a website’s terms of service or robots.txt file to determine if data scraping is allowed.
- Limit your scraping rate: Don’t overload servers by sending too many requests in a short amount of time.
- Use APIs when available: Many websites provide APIs that allow structured access to their data. Using an API is often a more efficient and legal way to access data compared to scraping. Many APIs are free, while others are subscription-based.
- Respect user privacy: Be mindful of the data you’re collecting, especially if it includes personal or sensitive information.
- Stay informed of legal developments: Keep up to date with changes to relevant laws and website policies to ensure you remain in compliance with the latest rules.
Always ensure that your data scraping activities are both legal and ethical, and minimise the risk of negative consequences for both yourself and the website you scrape.
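As a starting point, Python's standard library can check a site's robots.txt rules before you send any requests. The sketch below uses a made-up robots.txt and the hypothetical user agent "MyScraper"; in practice you would fetch the real file from the site's domain:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only; a real site
# serves its rules at https://<domain>/robots.txt.
robots_txt = """User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() applies the rules for the given user agent and URL.
allowed = rp.can_fetch("MyScraper", "https://example.com/data")
blocked = rp.can_fetch("MyScraper", "https://example.com/private/report")
print(allowed, blocked)  # True False

# When scraping is permitted, pause between requests to limit server
# load, e.g. time.sleep(1) inside your request loop.
```

This also addresses the rate-limiting point above: a short sleep between requests keeps your scraper from hammering the server.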
Extracting data from Yahoo Finance
For this example we are going to scrape USD/AUD exchange rate data from Yahoo Finance and save it to a .csv file. You can also automate this request to send yourself daily updates of the current USD/AUD exchange rate. For more information on automation and emailing through Python, see my other project: https://medium.com/@ioannidis.au/creating-an-automated-news-roundup-c6b3642c38ab
Importing libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
Defining the URL and headers for this request:
url = "https://au.finance.yahoo.com/quote/AUDUSD%3DX/history?p=AUDUSD%3DX"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
Finding the table and extracting columns:
table = soup.find('table', {'class': 'W(100%) M(0)'})  # class name taken from Yahoo's markup at the time of writing; it may change with site redesigns
column_names = []
for th in table.thead.tr.find_all('th'):
    column_names.append(th.get_text())
Extracting the data:
data = []
for row in table.tbody.find_all('tr'):
    cells = row.find_all('td')
    row_data = []
    for cell in cells:
        row_data.append(cell.get_text())
    if len(row_data) > 0:
        data.append(row_data)
Printing and saving data:
df = pd.DataFrame(data, columns=column_names)
print(df)
df.to_csv('usd_aud_historical_data.csv', index=False)
Date Open High Low Close* Adj. close** Volume
0 16 Apr 2023 0.6711 0.6711 0.6711 0.6711 0.6711 -
1 14 Apr 2023 0.6784 0.6794 0.6697 0.6784 0.6784 -
2 13 Apr 2023 0.6699 0.6781 0.6686 0.6699 0.6699 -
3 12 Apr 2023 0.6654 0.6722 0.6652 0.6654 0.6654 -
4 11 Apr 2023 0.6648 0.6680 0.6645 0.6648 0.6648 -
.. ... ... ... ... ... ... ...
95 05 Dec 2022 0.6813 0.6851 0.6710 0.6813 0.6813 -
96 02 Dec 2022 0.6811 0.6835 0.6747 0.6811 0.6811 -
97 01 Dec 2022 0.6800 0.6841 0.6792 0.6800 0.6800 -
98 30 Nov 2022 0.6684 0.6740 0.6676 0.6684 0.6684 -
99 29 Nov 2022 0.6652 0.6749 0.6641 0.6652 0.6652 -
[100 rows x 7 columns]
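Because the class name used to locate the table is generated by Yahoo's front end and can change at any time, it helps to verify the thead/tbody extraction pattern itself offline. This sketch runs the same logic against a small, made-up HTML table, so nothing here depends on the live site:

```python
from bs4 import BeautifulSoup

# Minimal, hypothetical HTML mirroring the thead/tbody layout scraped above.
html = """
<table>
  <thead><tr><th>Date</th><th>Close</th></tr></thead>
  <tbody>
    <tr><td>16 Apr 2023</td><td>0.6711</td></tr>
    <tr><td>14 Apr 2023</td><td>0.6784</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Same pattern as the scraper: headers from thead, rows from tbody.
headers = [th.get_text() for th in table.thead.tr.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.tbody.find_all("tr")]

print(headers)  # ['Date', 'Close']
print(rows)     # [['16 Apr 2023', '0.6711'], ['14 Apr 2023', '0.6784']]
```

If the scraper ever returns nothing, comparing the live page's markup against a known-good snippet like this is a quick way to see whether the table structure or class names have changed.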
The data has been successfully extracted and is ready for analysis. Assuming df is the DataFrame with the scraped data, you can use
first_row = df.iloc[0]
send_email(first_row)
to create an automated daily price-update email, where send_email is a helper function you define yourself.
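Since send_email is not defined in this article, here is a minimal sketch of what such a helper might look like using the standard library's email module. The recipient address and SMTP host are placeholders, and the actual sending step is left commented out:

```python
from email.message import EmailMessage

def send_email(row, to_addr="you@example.com"):
    """Hypothetical helper: build (and, in practice, send) a daily update.

    `row` is expected to look like df.iloc[0] from the scraper above,
    i.e. a mapping with 'Date' and 'Close*' entries.
    """
    msg = EmailMessage()
    msg["Subject"] = f"AUD/USD update for {row['Date']}"
    msg["To"] = to_addr
    msg.set_content(f"Close: {row['Close*']}")
    # To actually send, connect to your own SMTP server, e.g.:
    # import smtplib
    # with smtplib.SMTP_SSL("smtp.example.com") as s:
    #     s.send_message(msg)
    return msg

# Example with a plain dict standing in for df.iloc[0]:
msg = send_email({"Date": "16 Apr 2023", "Close*": "0.6711"})
print(msg["Subject"])  # AUD/USD update for 16 Apr 2023
```

Scheduling this with cron or Windows Task Scheduler gives you the automated daily update described above.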
Extracting Australian Bureau of Statistics data
In this example we will download all linked files from the balance of payments page of the Australian Bureau of Statistics. The data are already provided in .xlsx and .zip format across 32 different links on the page. We will use BeautifulSoup to download all 32 in one go, instead of manually downloading each file and folder.
Importing libraries:
import requests
from bs4 import BeautifulSoup
import os
Defining the URL and creating the folder for the output from this request:
url = 'https://www.abs.gov.au/statistics/economy/international-trade/balance-payments-and-international-investment-position-australia/dec-2022' #your website url
output_directory = 'downloads'
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
Finding links and specific file extensions for download:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print('Found link:', href)  # see the links being parsed
    if href and href.startswith('http') and href.endswith(('.xlsx', '.zip', '.rar', '.tar.gz')):
        print('Downloading:', href)
        response = requests.get(href)
        filename = os.path.join(output_directory, href.split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(response.content)
        print('Saved:', filename)
    else:
        print('Skipping link:', href)  # see which links are skipped - you can use this to fine-tune your scraper
Viewing output:
print('All files downloaded.')
The code successfully downloaded 32 .xlsx and .zip files from the website in 1.1 seconds.
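The extension check in the download loop can be factored into a small helper, which makes it easy to test and fine-tune which links get downloaded. Note that str.endswith accepts a tuple of suffixes, and .xlsx belongs in the list because the ABS page serves spreadsheet files:

```python
def is_download_link(href, extensions=(".xlsx", ".zip", ".rar", ".tar.gz")):
    """Return True if href is an absolute URL ending in a wanted extension.

    bool(href) guards against <a> tags with no href attribute (None).
    """
    return bool(href) and href.startswith("http") and href.endswith(extensions)

# The ABS URL below is illustrative, not a real file on the site.
print(is_download_link("https://www.abs.gov.au/files/example.xlsx"))  # True
print(is_download_link("/statistics/economy"))                        # False
print(is_download_link(None))                                         # False
```

Keeping the filter in one place also makes it simple to add or remove extensions as you adapt the scraper to other pages.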
Conclusion
Web scraping techniques are widely used in applications such as data mining, data analysis, sentiment analysis, price comparison, job-search automation, and more. They allow users to gather large amounts of data from the internet in a structured and efficient manner.
However, beyond Python skills, the ability to recognise the HTML elements of a web page is also crucial for successful web scraping. By understanding the HTML structure and its various elements, you can effectively navigate, identify, and extract the desired data from a web page.