ChatGPT Web Scraper

Build a Python-based web scraper with ChatGPT in minutes

Wayne Berry
Tech News & Articles
4 min read · Jul 4, 2023


Photo by Kim Kuperkova on Shutterstock

Preface

First thing, a disclaimer… With any web-scraping activity, make sure you have permission from the website owner to scrape! Check the terms and conditions to ensure there are no clauses that prohibit scraping, and check the copyright and legal requirements both where you live and where the site you're scraping is based.

Respect website owners' wishes! Don't infringe copyright or break the law!
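
One courtesy check you can even do programmatically. This snippet is my own addition (not part of the ChatGPT exchange) and uses a placeholder URL: Python's standard library can read a site's robots.txt and tell you whether a path is fair game.

import urllib.robotparser

# Placeholder URL; substitute the site you have permission to scrape
base_url = "https://example.com"

# Fetch and parse the site's robots.txt
parser = urllib.robotparser.RobotFileParser(base_url + "/robots.txt")
parser.read()

# Ask whether a generic crawler may fetch a given path
if parser.can_fetch("*", base_url + "/articles"):
    print("robots.txt allows this path")
else:
    print("robots.txt disallows this path; don't scrape it")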

The Web Scraper

Disclaimers aside, everyone needs to scrape some web content occasionally (legally and ethically, of course!). I decided to try building a scraper with ChatGPT, and I found it simple and effective.

Here's my journey (note I've redacted any reference to the web address; this is one of my own websites and I don't want everyone scraping it!):

write a python web scraper using beautiful soup. scrape from this site: https://xxxxxx.com. locate the element with tag "li" and class "article". find all the "a" elements nested below this. add "https://xxxxxx.com" to the front of each link. go to each page for the link and scrape all of the elements with "div" and class "article-body". Save everything in a single html file at: /Users/wayneberry/Downloads/

import requests
from bs4 import BeautifulSoup

# URL of the main page
url = "https://xxxxxxxxxxxxx.com"

# Create a session
session = requests.Session()
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Find the 'li' elements with class 'article'
articles = soup.find_all("li", class_="article")

# List to store scraped data
scraped_data = []

# Loop through each 'li' element
for article in articles:
    # Find all 'a' elements nested below the 'li' element
    links = article.find_all("a")
    for link in links:
        # Get the URL and prepend the base URL
        article_url = "https://xxxxxxxxxxxx.com" + link["href"]
        article_response = session.get(article_url)
        article_soup = BeautifulSoup(article_response.content, "html.parser")
        # Find the 'div' elements with class 'article-body'
        divs = article_soup.find_all("div", class_="article-body")
        # Append the scraped data to the list
        scraped_data.extend(divs)

# Create an HTML file to save the scraped data
file_path = "/Users/wayneberry/Downloads/scraped_data.html"
with open(file_path, "w") as file:
    # Write the HTML structure and the scraped data to the file
    file.write("<html><body>")
    for data in scraped_data:
        file.write(str(data))
    file.write("</body></html>")

print("Scraping complete. Data saved at:", file_path)

An empty file! Time to dig a bit deeper. It looks like I messed up in my initial request: there's no "div" element with class "article-body". The element is "article".
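
Before re-prompting, a quick diagnostic (my own addition, with a placeholder article URL) confirms the problem: the selector in the generated script simply matches nothing, while the class actually lives on an "article" tag.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute one of the article links the scraper visits
article_url = "https://example.com/some-article"
soup = BeautifulSoup(requests.get(article_url).content, "html.parser")

# The selector from the generated script finds nothing
print(len(soup.find_all("div", class_="article-body")))  # 0

# Which tags actually carry the class?
for element in soup.find_all(class_="article-body"):
    print(element.name)  # 'article', not 'div'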

This is wrong: divs = article_soup.find_all("div", class_="article-body"). change element to article not div

import requests
from bs4 import BeautifulSoup

# URL of the main page
url = "https://xxxxxxxxxxxxx.com"

# Create a session
session = requests.Session()
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Find the 'li' elements with class 'article'
articles = soup.find_all("li", class_="article")

# List to store scraped data
scraped_data = []

# Loop through each 'li' element
for article in articles:
    # Find all 'a' elements nested below the 'li' element
    links = article.find_all("a")
    for link in links:
        # Get the URL and prepend the base URL
        article_url = "https://xxxxxxxxxxxx.com" + link["href"]
        print("Visiting:", article_url)
        article_response = session.get(article_url)
        article_soup = BeautifulSoup(article_response.content, "html.parser")
        # Find the 'article' elements with class 'article-body'
        article_bodies = article_soup.find_all("article", class_="article-body")
        # Append the scraped data to the list
        scraped_data.extend(article_bodies)

# Create an HTML file to save the scraped data
file_path = "/Users/wayneberry/Downloads/scraped_data.html"
with open(file_path, "w") as file:
    # Write the HTML structure and the scraped data to the file
    file.write("<html><body>")
    for data in scraped_data:
        file.write(str(data))
    file.write("</body></html>")

print("Scraping complete. Data saved at:", file_path)

Success! I have an output. But I found that every &nbsp; was being displayed as Â. This is an encoding issue and easily fixed:

set <meta charset="utf-8"> in the head

import requests
from bs4 import BeautifulSoup

# URL of the main page
url = "https://xxxxxxxxxxxx.com"

# Create a session
session = requests.Session()
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Find the 'li' elements with class 'article'
articles = soup.find_all("li", class_="article")

# List to store scraped data
scraped_data = []

# Loop through each 'li' element
for article in articles:
    # Find all 'a' elements nested below the 'li' element
    links = article.find_all("a")
    for link in links:
        # Get the URL and prepend the base URL
        article_url = "https://xxxxxxxxxxxxx.com" + link["href"]
        print("Visiting:", article_url)
        article_response = session.get(article_url)
        article_soup = BeautifulSoup(article_response.content, "html.parser")
        # Find the 'article' elements with class 'article-body'
        article_bodies = article_soup.find_all("article", class_="article-body")
        # Append the scraped data to the list
        scraped_data.extend(article_bodies)

# Create an HTML file to save the scraped data
file_path = "/Users/wayneberry/Downloads/scraped_data.html"
with open(file_path, "w", encoding="utf-8") as file:
    # Write the HTML structure and the scraped data to the file
    file.write("<html><head><meta charset='utf-8'></head><body>")
    for data in scraped_data:
        file.write(str(data))
    file.write("</body></html>")

print("Scraping complete. Data saved at:", file_path)

Success. I have an HTML file with the scraped data. It retained all of the HTML markup and formatting, which is exactly what I wanted.

I could keep going with this and get it to find and replace text, add some navigation links, or perhaps write a CSS file for me (a sketch of that kind of post-processing follows below). It's easy, and it saved me a lot of time compared with coding it all myself.
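
As a taste of that, here's a minimal sketch of the kind of post-processing I mean. It's my own code, not ChatGPT output, and it assumes the file produced above: it re-opens the saved HTML and inserts a simple navigation list linking to each scraped article body.

from bs4 import BeautifulSoup

file_path = "/Users/wayneberry/Downloads/scraped_data.html"

with open(file_path, encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# Build a <ul> of anchor links, one per scraped article body
nav = soup.new_tag("ul")
for i, body in enumerate(soup.find_all("article", class_="article-body")):
    body["id"] = f"article-{i}"  # give each article an anchor target
    item = soup.new_tag("li")
    link = soup.new_tag("a", href=f"#article-{i}")
    link.string = f"Article {i + 1}"
    item.append(link)
    nav.append(item)

# Insert the navigation at the top of the body and save
soup.body.insert(0, nav)
with open(file_path, "w", encoding="utf-8") as file:
    file.write(str(soup))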

The only real work I had to do was inspect the webpage elements and identify which ones the scraper needed to focus on. Anyone who knows a little HTML would have that job done in less than a minute.
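
If you'd rather not leave the terminal, you can let BeautifulSoup do the inspecting too. This little helper is my own suggestion (placeholder URL again): it tallies the tag-and-class combinations on a page, which quickly surfaces repeating structures like li.article and article.article-body.

import requests
from collections import Counter
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you're inspecting
soup = BeautifulSoup(requests.get("https://example.com").content, "html.parser")

# Count (tag, class) pairs across every element that declares a class
pairs = Counter(
    (element.name, css_class)
    for element in soup.find_all(class_=True)
    for css_class in element.get("class", [])
)

# The most common pairs are usually the repeating structures worth scraping
for (tag, css_class), count in pairs.most_common(10):
    print(f"{tag}.{css_class}: {count}")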

Also be sure to check out OpenAI's scraper plugin, which is available to Plus users and can be used for well-known websites, as detailed in this article by The PyCoach.

