Building a Web Scraper and Content Classifier with Python and ttkbootstrap

sutton · Published in CodeX · 3 min read · Jul 3, 2024

[Figure: Operating Diagram]

Web scraping is a powerful technique for extracting data from websites. In this tutorial, we’ll create a Python application that scrapes web content, classifies it into themes, and displays the results in a graphical user interface (GUI) using ttkbootstrap. This project combines web scraping with data organization and visualization, providing a comprehensive example of how these techniques can be applied.

Tools and Libraries

To build this application, we need several libraries:

  • requests: For making HTTP requests to fetch web pages.
  • BeautifulSoup: For parsing HTML content.
  • pandas: For organizing data into a DataFrame.
  • ttkbootstrap: For creating a modern, themeable GUI.
  • tkinter: The standard Python interface to the Tk GUI toolkit.
  • PIL: For image handling.

Install these libraries using pip:

pip install requests beautifulsoup4 pandas ttkbootstrap pillow

Steps to Create the Web Scraper and GUI

Step 1: Web Scraping and Data Organization

The scrape_to_df function handles the web scraping and data organization:

  1. Fetching the Web Page: Send an HTTP GET request to the specified URL and check if the request was successful.
  2. Parsing HTML Content: Use BeautifulSoup to parse the HTML content of the page.
  3. Extracting Paragraphs: Find all paragraph (<p>) elements and store their text content.
  4. Classifying Content: Define themes and associated keywords, then classify each paragraph based on these keywords.
  5. Creating a DataFrame: Organize the classified data into a pandas DataFrame.

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_to_df(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Error accessing the page: {response.status_code}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')
    paragraphs = soup.find_all('p')

    if not paragraphs:
        print("No paragraphs found on the page.")
        return None

    data = [p.text.strip() for p in paragraphs]

    # Themes and the keywords that assign a paragraph to them.
    themes = {
        'Launch': ['launch', 'release', 'announce', 'global'],
        'Features': ['feature', 'update', 'new', 'mode', 'character'],
        'Community': ['community', 'player', 'feedback', 'event'],
        'Company': ['company', 'team', 'development']
    }

    organized_data = {theme: [] for theme in themes}

    # Assign each paragraph to the first theme whose keywords it contains;
    # anything unmatched goes into an 'Others' bucket.
    for paragraph in data:
        assigned = False
        for theme, keywords in themes.items():
            if any(keyword in paragraph.lower() for keyword in keywords):
                organized_data[theme].append(paragraph)
                assigned = True
                break
        if not assigned:
            organized_data.setdefault('Others', []).append(paragraph)

    # Flatten the theme buckets into one row per paragraph.
    rows = [{'Theme': theme, 'Paragraph': paragraph}
            for theme, paragraphs in organized_data.items()
            for paragraph in paragraphs]
    return pd.DataFrame(rows)
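To see the classification step in isolation, here is a minimal sketch that runs the same keyword-matching logic on hand-written sample paragraphs instead of scraped content (the sample text and the reduced theme set are illustrative, not from a real page):

```python
import pandas as pd

# Two of the tutorial's themes, plus hand-written sample paragraphs.
themes = {
    'Launch': ['launch', 'release', 'announce', 'global'],
    'Features': ['feature', 'update', 'new', 'mode', 'character'],
}

samples = [
    "The global launch happens next week.",
    "A new co-op mode joins the character roster.",
    "Unrelated closing remarks.",
]

organized = {theme: [] for theme in themes}
for paragraph in samples:
    for theme, keywords in themes.items():
        if any(k in paragraph.lower() for k in keywords):
            organized[theme].append(paragraph)
            break
    else:  # no theme matched this paragraph
        organized.setdefault('Others', []).append(paragraph)

rows = [{'Theme': t, 'Paragraph': p}
        for t, ps in organized.items() for p in ps]
df = pd.DataFrame(rows)
print(df['Theme'].tolist())  # ['Launch', 'Features', 'Others']
```

Because a paragraph stops at the first matching theme, the order of the `themes` dictionary decides ties: a paragraph mentioning both "launch" and "feature" lands in Launch.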

Step 2: Creating the GUI

The display_data function interacts with the GUI elements:

  1. User Input: Prompt the user to enter a URL.
  2. Displaying Data: Fetch and classify the data, then display it in a tree view table.
import ttkbootstrap as tb
from ttkbootstrap.constants import *
from tkinter import messagebox
from PIL import Image, ImageTk

def display_data():
    url = url_entry.get()
    if not url:
        messagebox.showerror("Error", "Please enter a URL.")
        return

    df = scrape_to_df(url)
    if df is None:
        messagebox.showerror("Error", "Could not scrape the page.")
        return

    # Clear any rows from a previous scrape before inserting new ones.
    for item in tree.get_children():
        tree.delete(item)

    for _, row in df.iterrows():
        tree.insert("", "end", values=(row['Theme'], row['Paragraph']))

root = tb.Window(themename="darkly")
root.title("Web Scraper")

# Optional window icon; requires a logo.png file next to the script.
logo_image = Image.open("logo.png")
logo_photo = ImageTk.PhotoImage(logo_image)
root.iconphoto(False, logo_photo)

frame = tb.Frame(root, padding="10")
frame.grid(row=0, column=0, sticky=(N, S, E, W))

tb.Label(frame, text="URL:").grid(row=1, column=0, sticky=W)
url_entry = tb.Entry(frame, width=50)
url_entry.grid(row=1, column=1, sticky=(E, W))

scrape_button = tb.Button(frame, text="Start Scraping",
                          command=display_data, bootstyle=SECONDARY)
scrape_button.grid(row=1, column=2, padx=10)

columns = ("Theme", "Paragraph")
tree = tb.Treeview(frame, columns=columns, show="headings", bootstyle=DARK)
tree.heading("Theme", text="Theme")
tree.heading("Paragraph", text="Paragraph")
tree.column("Theme", width=50)
tree.grid(row=2, column=0, columnspan=3, sticky=(N, S, E, W))

# Let the URL entry and the table expand with the window.
frame.columnconfigure(1, weight=1)
frame.rowconfigure(2, weight=1)

scrollbar = tb.Scrollbar(frame, orient=VERTICAL, command=tree.yview)
tree.configure(yscrollcommand=scrollbar.set)
scrollbar.grid(row=2, column=3, sticky=(N, S))

root.columnconfigure(0, weight=1)
root.rowconfigure(0, weight=1)

root.mainloop()
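As a quick standalone check of how DataFrame rows map onto the table, the `values=` tuples that `display_data` passes to `tree.insert` can be built without any GUI at all (the DataFrame below is hand-built to mirror `scrape_to_df`'s columns, not real scraped output):

```python
import pandas as pd

# Hypothetical classified output with the same columns as scrape_to_df.
df = pd.DataFrame([
    {"Theme": "Launch", "Paragraph": "The global release is here."},
    {"Theme": "Features", "Paragraph": "A new mode was added."},
])

# One (Theme, Paragraph) tuple per row, the shape tree.insert expects.
values = [(row["Theme"], row["Paragraph"]) for _, row in df.iterrows()]
print(values)
```

This is also a convenient way to unit-test the scraping side of the app without opening a window.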

Running the Application

To run the application, save the above code in a Python file and execute it. Enter the URL of a webpage you want to scrape and click “Start Scraping” to see the classified content displayed in the table.

Conclusion

This tutorial demonstrates how to combine web scraping, data classification, and GUI creation in Python. Using requests, BeautifulSoup, and pandas for data handling, along with ttkbootstrap for a modern interface, you can build versatile applications for various data-driven tasks. Experiment with different themes and customize the application to suit your needs!
