Exploring Web Scraping Techniques: A Deep Dive with Python
Introduction
In the digital age, data is abundant and diverse, yet accessing it efficiently can be a challenge. Web scraping, the process of extracting information from websites, offers a solution. In this article, we’ll delve into the art of web scraping using Python, exploring different tools and approaches through three distinct websites: Politifact, Altnews, and Mastodon. For this project, I started from a basic code template generated by ChatGPT, then updated it to my requirements and applied my own approaches in the code.
Packages Required
Ensure you have these packages installed in your Python environment using pip, Python’s package manager. You can install them with the following command:
pip install selenium webdriver_manager requests beautifulsoup4
Scraping Politifact with BeautifulSoup
Politifact is a treasure trove of articles and fact-checks, making it an ideal candidate for our scraping endeavor. We’ll employ BeautifulSoup, a Python library for parsing HTML and XML, to navigate through the website’s structure.
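Before diving into the full scraper, here is a minimal sketch of how BeautifulSoup parsing works. The HTML snippet is made up for illustration; it simply mirrors the teaser structure the real script targets later, so don’t treat it as Politifact’s actual markup.
from bs4 import BeautifulSoup

# Illustrative HTML only -- not Politifact's real page source
html = '<div class="m-teaser"><h3 class="m-teaser__title"><a href="/article/1/">Example headline</a></h3></div>'
soup = BeautifulSoup(html, "html.parser")

teaser = soup.find("div", class_="m-teaser")                     # first element matching tag + class
title_link = teaser.find("h3", class_="m-teaser__title").find("a")
print(title_link.text.strip())   # "Example headline"
print(title_link["href"])        # "/article/1/"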
Challenges Faced
- Large Volume of Articles: Politifact’s website contains a large volume of articles and fact-checks, making it challenging to efficiently scrape all the data.
- Sequential Scraping: Initially scraping each article sequentially led to significant time consumption, especially due to the need to retrieve all the links before scraping individual articles.
- Efficiency and Speed: Given the vast amount of data, ensuring the scraping process is efficient and fast becomes crucial.
- Parsing Complex HTML: Politifact’s website may have complex HTML structures, requiring careful parsing to extract the desired information accurately.
Approach
- Gathering Article and Fact-Check Links: We utilize BeautifulSoup to extract links from the list pages of articles and fact-checks (a rough sketch for the fact-check listing follows this list).
- Parallel Execution: To expedite the scraping process, we employ ThreadPoolExecutor for fetching links and scraping articles concurrently.
- Extracting Data: Using BeautifulSoup, we retrieve various details from each article, including title, content, images, author information, and sources.
- Saving data: Finally, we store the scraped data in a JSON file for further analysis.
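The full script under “Code” walks only the article listing. A fact-check listing could be handled the same way; the sketch below is a rough outline that assumes the listing lives at /factchecks/list/?page=N and that each fact-check teaser exposes its link inside a .m-statement__quote element — both assumptions should be confirmed in the browser’s inspector before relying on them.
import requests
from bs4 import BeautifulSoup

def fetch_factcheck_links(page_number):
    # Assumed URL pattern -- verify in the browser before use
    url = f"https://www.politifact.com/factchecks/list/?page={page_number}"
    response = requests.get(url)
    if response.status_code != 200:
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    # Assumed selector: each fact-check teaser's quote contains the link
    for quote in soup.find_all("div", class_="m-statement__quote"):
        a = quote.find("a")
        if a and a.get("href"):
            links.append("https://www.politifact.com" + a["href"])
    return links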
Code
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures

# Function to fetch article details
def fetch_article_details(link):
    response = requests.get(link)
    if response.status_code == 200:
        article = {}
        soup = BeautifulSoup(response.content, "html.parser")
        # Extract relevant details such as title, content, images, author, etc.
        article["title"] = soup.find("div", class_="m-statement__quote").text.strip()
        article["content"] = soup.find("article", class_="m-textblock").text.strip()
        article["images"] = [img["src"] for img in soup.find("article", class_="m-textblock").find_all("img")]
        # Extract author details
        authors = []
        for author in soup.find_all("div", class_="m-author"):
            a = {}
            if author.find("img") is not None:
                a["avatar"] = author.find("img")["src"]
            a["name"] = author.find("div", class_="m-author__content copy-xs u-color--chateau").find("a").text.strip()
            a["profile"] = "https://www.politifact.com" + author.find("div", class_="m-author__content copy-xs u-color--chateau").find("a")["href"]
            authors.append(a)
        article["authors"] = authors
        return article
    else:
        print("Failed to retrieve the page. Status code:", response.status_code)

# Function to fetch article links
def fetch_article_links(page_number):
    url = f"https://www.politifact.com/article/list/?page={page_number}"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("div", class_="m-teaser")
        return ["https://www.politifact.com" + article.find("h3", class_="m-teaser__title").find("a")["href"] for article in articles]
    else:
        print(f"Failed to retrieve the page {url}. Status code:", response.status_code)
        return []

# Fetch article links in parallel
with concurrent.futures.ThreadPoolExecutor() as executor:
    article_links = []
    results = executor.map(fetch_article_links, range(1, 10))  # Adjust range as needed
    for result in results:
        article_links.extend(result)

articles = []
# Fetch article details concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(fetch_article_details, article_links)
    for result in results:
        if result:
            articles.append(result)

# Save the results to a JSON file
with open('politifact_articles.json', 'w') as json_file:
    json.dump(articles, json_file, indent=4)
print("Data saved to politifact_articles.json")
Dynamic Scraping with Selenium: Altnews Case Study
Altnews presents a unique challenge with dynamically loaded content. To overcome this hurdle, we turn to Selenium, a powerful automation tool for web browsing.
Challenges Faced
- Dynamically Loaded Content: The Altnews website loads content dynamically as the user scrolls down the page, making it challenging to extract all the relevant data.
- Scrolling and Page Interaction: Selenium needs to simulate user interactions such as scrolling to trigger the loading of additional content, which requires careful scripting.
- Parsing Dynamic HTML: Extracting data from dynamically loaded HTML elements can be challenging, as the structure may change dynamically.
- Efficiency and Speed: Efficiently handling dynamic content while ensuring the scraping process is fast and reliable is crucial.
Approach
- Dynamic Content Handling: Selenium allows us to mimic human interaction by scrolling through the webpage to load additional content.
- Extracting Links: Once the page is fully loaded, we use Selenium to extract links to individual articles.
- Scraping Articles: Leveraging BeautifulSoup, we scrape each article to retrieve pertinent information such as title, author, content, images, and videos.
- Parallelism for Efficiency: Employing ThreadPoolExecutor, we fetch and process multiple articles simultaneously, enhancing efficiency.
Code
# Import necessary libraries
import json
import re
import pprint
import requests
import concurrent.futures
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.chrome.options import Options

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://www.altnews.in/'

# Open the website
driver.get(url)

try:
    num_scrolls = 10
    # Scroll the page multiple times to load more content
    for _ in range(num_scrolls):
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for some time for the content to load
        time.sleep(2)

    links = driver.find_elements(By.CSS_SELECTOR, 'h4 a')
    # Filter links that match the specified pattern
    filtered_links = [link.get_attribute("href") for link in links if re.match(r'^https://www\.altnews\.in/.+$', link.get_attribute("href"))]

    # Function to fetch and parse a single article
    def fetch_post_data(link):
        response = requests.get(link)
        if response.status_code == 200:
            post = {}
            soup = BeautifulSoup(response.content, 'html.parser')
            author_detail = soup.find('span', class_='author vcard')
            if author_detail is not None:
                post["title"] = soup.find('h1', class_='headline entry-title e-t').get_text()
                element = soup.find('div', class_='entry-content e-c m-e-c guten-content')
                post["author_avatar"] = author_detail.find('img')['src']
                post["author_name"] = author_detail.find('a', class_="url fn n").get_text()
                post["author_profile"] = author_detail.find('a', class_="url fn n")['href']
                images = [img['src'] for img in element.find_all('img')]
                if images:
                    post["images"] = images
                videos = [vid['src'] for vid in element.find_all('video')]
                if videos:
                    post["videos"] = videos
                post["content"] = element.get_text()
                post["link"] = link
                pprint.pprint(post)
                return post

    # Concurrently fetch data from multiple links
    with concurrent.futures.ThreadPoolExecutor() as executor:
        posts = list(executor.map(fetch_post_data, filtered_links))
    # Drop links that failed to load or had no author block
    posts = [post for post in posts if post]

    # Write to JSON file
    with open('altnews.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)
finally:
    driver.quit()
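The script above scrolls a fixed ten times, which may under- or over-shoot depending on how much content Altnews serves. One common refinement is to keep scrolling until the page height stops growing and to use Selenium’s explicit waits instead of a blind sleep for the initial load. The sketch below is an optional variation, not part of the original script, and assumes the driver, By, and time names from the code above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one headline link before scrolling
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h4 a'))
)

# Scroll until the page height stops growing, i.e. no more content is loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give lazy-loaded content a moment to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height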
Tackling Mastodon’s Dynamic Interface
Mastodon presents a unique challenge due to its dynamic interface and continuous content updates. We again employ Selenium for this task, but with additional considerations.
Challenges Faced
- Infinite Scroll: Mastodon’s interface employs infinite scrolling, loading new posts dynamically as the user scrolls down, complicating the scraping process.
- Duplicate Content: Due to the dynamic nature of the page, repeated scrolling may lead to duplicate content being scraped, requiring careful handling.
- Parsing HTML Attributes: Identifying unique attributes for each post becomes necessary to avoid scraping the same content repeatedly.
- Data Extraction and Storage: Extracting various data elements such as images, videos, and post details while ensuring efficient storage and processing presents a challenge.
Approach
- Incremental Scrolling: To handle Mastodon’s dynamic interface, we break down scrolling into smaller increments, allowing us to capture new posts while avoiding redundancy.
- Unique Post Identification: Mastodon assigns a unique data-id attribute to each post, enabling us to differentiate between new and previously captured posts.
- Extracting Post Data: Using Selenium, we extract various details from each post, including author details, content, media files (images/videos), status cards, and engagement metrics.
- Data Export: We export the scraped data both in CSV and JSON formats for further analysis and visualization.
Code
# Import necessary libraries
import csv
import json
import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://mastodon.social/'

# Open the website
driver.get(url)

try:
    visited = []
    num_scrolls = 30
    # Find the <body> element
    body = driver.find_element(By.TAG_NAME, 'body')
    posts = []
    # Scroll using keyboard keys
    for _ in range(num_scrolls):
        body.send_keys(Keys.PAGE_DOWN)  # Scroll down one page
        time.sleep(1)  # Adjust sleep time as needed
        for article in driver.find_elements(By.CSS_SELECTOR, '.status.status-public'):
            if article.get_attribute("data-id") is not None and article.get_attribute("data-id") not in visited:
                curr_post = {}  # Start a fresh record for each new post
                # Extract post details
                curr_post["avatar"] = article.find_elements(By.CSS_SELECTOR, '.status__avatar img')[0].get_attribute("src")
                curr_post["name"] = article.find_elements(By.CSS_SELECTOR, '.display-name__html')[0].text
                curr_post["account"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.display-name__account')])
                curr_post["content"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.status__content__text.status__content__text--visible.translate')])
                # Extract media (images and videos)
                media = article.find_elements(By.CSS_SELECTOR, 'img')
                if media:
                    curr_post["pictures"] = [i.get_attribute("src") for i in media]
                media = article.find_elements(By.CSS_SELECTOR, 'video')
                if media:
                    curr_post["videos"] = [i.get_attribute("src") for i in media]
                # Extract hashtags, if available (class selector)
                hashtags = article.find_elements(By.CSS_SELECTOR, '.hashtag-bar')
                if hashtags:
                    curr_post["hashtags"] = [i.text for i in hashtags]
                # Extract status card content and link
                cards = article.find_elements(By.CSS_SELECTOR, '.status-card__content')
                if cards:
                    curr_post["status_card"] = "\n".join([i.text for i in cards])
                    expanded = article.find_elements(By.CSS_SELECTOR, '.status-card.expanded')
                    if expanded:
                        curr_post["status_card_link"] = expanded[0].get_attribute("href")
                # Extract engagement metrics (boosts, favorites, replies)
                counters = article.find_elements(By.CSS_SELECTOR, '.icon-button__counter')
                if len(counters) >= 3:
                    curr_post["boosts"] = counters[1].text
                    curr_post["favs"] = counters[2].text
                    curr_post["replies"] = counters[0].text
                pprint.pprint(curr_post)
                if len(visited) > 10:
                    visited.pop(0)
                visited.append(article.get_attribute("data-id"))
                posts.append(curr_post)

    # Write to CSV file
    with open('mastodon_posts.csv', mode='w', newline='', encoding='utf-8') as file:
        fieldnames = ['avatar', 'name', 'account', 'content', 'pictures', 'videos', 'hashtags', 'status_card', 'status_card_link', 'boosts', 'favs', 'replies']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for post in posts:
            writer.writerow(post)

    # Write to JSON file
    with open('mastodon_posts.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)
finally:
    driver.quit()
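A note on the deduplication above: visited is a plain list trimmed to roughly its last ten entries, so once older posts fall out of that window their data-id values can be collected again if they are still present in the DOM. If memory is not a concern, a set of all seen IDs is a simpler and faster guard. The sketch below is a minimal variation of the inner loop, reusing the driver and By names from the script above:
seen_ids = set()

for article in driver.find_elements(By.CSS_SELECTOR, '.status.status-public'):
    data_id = article.get_attribute("data-id")
    if not data_id or data_id in seen_ids:
        continue  # skip posts we have already captured
    seen_ids.add(data_id)
    # ... extract the post fields exactly as in the loop above ...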
Conclusion
As our journey through web scraping draws to a close, we can reflect on the challenges met along the way. From the structured pages of Politifact to the dynamically loaded feeds of Altnews and Mastodon, each website posed its own obstacles. With the right mix of Python libraries, browser automation, and concurrency, we worked around dynamic content and came away with a wealth of useful data. May these experiences serve as a starting point for fellow adventurers in the ever-expanding universe of web scraping.