Exploring Web Scraping Techniques: A Deep Dive with Python
Introduction
In the digital age, data is abundant and diverse, yet accessing it efficiently can be a challenge. Web scraping, the process of extracting information from websites, offers a solution. In this article, we’ll delve into the art of web scraping using Python, exploring different tools and approaches through three distinct websites: Politifact, Altnews, and Mastodon. For this project, I started from a basic code template generated by ChatGPT, then updated it to my requirements and applied my own approaches in the code.
Packages Required
Ensure you have these packages installed in your Python environment using pip, Python’s package manager. You can install them with the following command:
pip install selenium webdriver_manager requests beautifulsoup4
Scraping Politifact with BeautifulSoup
Politifact is a treasure trove of articles and fact-checks, making it an ideal candidate for our scraping endeavor. We’ll employ BeautifulSoup, a Python library for parsing HTML and XML, to navigate through the website’s structure.
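Before diving into the full scraper, here is a minimal sketch of how BeautifulSoup parsing works. The HTML snippet is made up for illustration; it simply mirrors the teaser structure the real script targets later, so don’t treat it as Politifact’s actual markup.
from bs4 import BeautifulSoup

# Illustrative HTML only -- not Politifact's real page source
html = '<div class="m-teaser"><h3 class="m-teaser__title"><a href="/article/1/">Example headline</a></h3></div>'
soup = BeautifulSoup(html, "html.parser")

teaser = soup.find("div", class_="m-teaser")                     # first element matching tag + class
title_link = teaser.find("h3", class_="m-teaser__title").find("a")
print(title_link.text.strip())   # "Example headline"
print(title_link["href"])        # "/article/1/"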
Challenges Faced
- Large Volume of Articles: Politifact’s website contains a large volume of articles and fact-checks, making it challenging to efficiently scrape all the data.
- Sequential Scraping: Initially scraping each article sequentially led to significant time consumption, especially due to the need to retrieve all the links before scraping individual articles.
- Efficiency and Speed: Given the vast amount of data, ensuring the scraping process is efficient and fast becomes crucial.
- Parsing Complex HTML: Politifact’s website may have complex HTML structures, requiring careful parsing to extract the desired information accurately.
Approach
- Gathering Article and Fact-Check Links: We utilize BeautifulSoup to extract links from the list pages of articles and fact-checks (a rough sketch for the fact-check listing follows this list).
- Parallel Execution: To expedite the scraping process, we employ ThreadPoolExecutor for fetching links and scraping articles concurrently.
- Extracting Data: Using BeautifulSoup, we retrieve various details from each article, including title, content, images, author information, and sources.
- Saving data: Finally, we store the scraped data in a JSON file for further analysis.
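The full script under “Code” walks only the article listing. A fact-check listing could be handled the same way; the sketch below is a rough outline that assumes the listing lives at /factchecks/list/?page=N and that each fact-check teaser exposes its link inside a .m-statement__quote element — both assumptions should be confirmed in the browser’s inspector before relying on them.
import requests
from bs4 import BeautifulSoup

def fetch_factcheck_links(page_number):
    # Assumed URL pattern -- verify in the browser before use
    url = f"https://www.politifact.com/factchecks/list/?page={page_number}"
    response = requests.get(url)
    if response.status_code != 200:
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    links = []
    # Assumed selector: each fact-check teaser's quote contains the link
    for quote in soup.find_all("div", class_="m-statement__quote"):
        a = quote.find("a")
        if a and a.get("href"):
            links.append("https://www.politifact.com" + a["href"])
    return links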
Code
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures

# Function to fetch article details
def fetch_article_details(link):
    response = requests.get(link)
    if response.status_code == 200:
        article = {}
        soup = BeautifulSoup(response.content, "html.parser")
        # Extract relevant details such as title, content, images, author, etc.
        article["title"] = soup.find("div", class_="m-statement__quote").text.strip()
        article["content"] = soup.find("article", class_="m-textblock").text.strip()
        article["images"] = [img["src"] for img in soup.find("article", class_="m-textblock").find_all("img")]
        # Extract author details
        authors = []
        for author in soup.find_all("div", class_="m-author"):
            a = {}
            if author.find("img") is not None:
                a["avatar"] = author.find("img")["src"]
            a["name"] = author.find("div", class_="m-author__content copy-xs u-color--chateau").find("a").text.strip()
            a["profile"] = "https://www.politifact.com" + author.find("div", class_="m-author__content copy-xs u-color--chateau").find("a")["href"]
            authors.append(a)
        article["authors"] = authors
        return article
    else:
        print("Failed to retrieve the page. Status code:", response.status_code)

# Function to fetch article links
def fetch_article_links(page_number):
    url = f"https://www.politifact.com/article/list/?page={page_number}"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("div", class_="m-teaser")
        return ["https://www.politifact.com" + article.find("h3", class_="m-teaser__title").find("a")["href"] for article in articles]
    else:
        print(f"Failed to retrieve the page {url}. Status code:", response.status_code)
        return []

# Fetch article links in parallel
with concurrent.futures.ThreadPoolExecutor() as executor:
    article_links = []
    results = executor.map(fetch_article_links, range(1, 10))  # Adjust range as needed
    for result in results:
        article_links.extend(result)

articles = []
# Fetch article details concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(fetch_article_details, article_links)
    for result in results:
        if result:
            articles.append(result)

# Save the results to a JSON file
with open('politifact_articles.json', 'w') as json_file:
    json.dump(articles, json_file, indent=4)
print("Data saved to politifact_articles.json")
Dynamic Scraping with Selenium: Altnews Case Study
Altnews presents a unique challenge with dynamically loaded content. To overcome this hurdle, we turn to Selenium, a powerful automation tool for web browsing.
Challenges Faced
- Dynamically Loaded Content: The Altnews website loads content dynamically as the user scrolls down the page, making it challenging to extract all the relevant data.
- Scrolling and Page Interaction: Selenium needs to simulate user interactions such as scrolling to trigger the loading of additional content, which requires careful scripting.
- Parsing Dynamic HTML: Extracting data from dynamically loaded HTML elements can be challenging, as the structure may change dynamically.
- Efficiency and Speed: Efficiently handling dynamic content while ensuring the scraping process is fast and reliable is crucial.
Approach
- Dynamic Content Handling: Selenium allows us to mimic human interaction by scrolling through the webpage to load additional content.
- Extracting Links: Once the page is fully loaded, we use Selenium to extract links to individual articles.
- Scraping Articles: Leveraging BeautifulSoup, we scrape each article to retrieve pertinent information such as title, author, content, images, and videos.
- Parallelism for Efficiency: Employing ThreadPoolExecutor, we fetch and process multiple articles simultaneously, enhancing efficiency.
Code
# Import necessary libraries
import json
import re
import pprint
import requests
import concurrent.futures
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.chrome.options import Options

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://www.altnews.in/'

# Open the website
driver.get(url)

try:
    num_scrolls = 10
    # Scroll the page multiple times to load more content
    for _ in range(num_scrolls):
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for some time for the content to load
        time.sleep(2)

    links = driver.find_elements(By.CSS_SELECTOR, 'h4 a')
    # Filter links that match the specified pattern
    filtered_links = [link.get_attribute("href") for link in links if re.match(r'^https://www\.altnews\.in/.+$', link.get_attribute("href"))]

    # Function to fetch and parse a single article
    def fetch_post_data(link):
        response = requests.get(link)
        if response.status_code == 200:
            post = {}
            soup = BeautifulSoup(response.content, 'html.parser')
            author_detail = soup.find('span', class_='author vcard')
            if author_detail is not None:
                post["title"] = soup.find('h1', class_='headline entry-title e-t').get_text()
                element = soup.find('div', class_='entry-content e-c m-e-c guten-content')
                post["author_avatar"] = author_detail.find('img')['src']
                post["author_name"] = author_detail.find('a', class_="url fn n").get_text()
                post["author_profile"] = author_detail.find('a', class_="url fn n")['href']
                images = [img['src'] for img in element.find_all('img')]
                if images:
                    post["images"] = images
                videos = [vid['src'] for vid in element.find_all('video')]
                if videos:
                    post["videos"] = videos
                post["content"] = element.get_text()
                post["link"] = link
                pprint.pprint(post)
                return post

    # Concurrently fetch data from multiple links
    with concurrent.futures.ThreadPoolExecutor() as executor:
        posts = list(executor.map(fetch_post_data, filtered_links))
    # Drop links that failed to load or had no author block
    posts = [post for post in posts if post]

    # Write to JSON file
    with open('altnews.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)
finally:
    driver.quit()
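The script above scrolls a fixed ten times, which may under- or over-shoot depending on how much content Altnews serves. One common refinement is to keep scrolling until the page height stops growing and to use Selenium’s explicit waits instead of a blind sleep for the initial load. The sketch below is an optional variation, not part of the original script, and assumes the driver, By, and time names from the code above:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one headline link before scrolling
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h4 a'))
)

# Scroll until the page height stops growing, i.e. no more content is loaded
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give lazy-loaded content a moment to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height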
Tackling Mastodon’s Dynamic Interface
Mastodon presents a unique challenge due to its dynamic interface and continuous content updates. We again employ Selenium for this task, but with additional considerations.
Challenges Faced
- Infinite Scroll: Mastodon’s interface employs infinite scrolling, loading new posts dynamically as the user scrolls down, complicating the scraping process.
- Duplicate Content: Due to the dynamic nature of the page, repeated scrolling may lead to duplicate content being scraped, requiring careful handling.
- Parsing HTML Attributes: Identifying unique attributes for each post becomes necessary to avoid scraping the same content repeatedly.
- Data Extraction and Storage: Extracting various data elements such as images, videos, and post details while ensuring efficient storage and processing presents a challenge.
Approach
- Incremental Scrolling: To handle Mastodon’s dynamic interface, we break down scrolling into smaller increments, allowing us to capture new posts while avoiding redundancy.
- Unique Post Identification: Mastodon assigns a unique data-id attribute to each post, enabling us to differentiate between new and previously captured posts.
- Extracting Post Data: Using Selenium, we extract various details from each post, including author details, content, media files (images/videos), status cards, and engagement metrics.
- Data Export: We export the scraped data both in CSV and JSON formats for further analysis and visualization.
Code
# Import necessary libraries
import csv
import json
import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://mastodon.social/'

# Open the website
driver.get(url)

try:
    visited = []
    num_scrolls = 30
    # Find the <body> element
    body = driver.find_element(By.TAG_NAME, 'body')
    posts = []
    # Scroll using keyboard keys
    for _ in range(num_scrolls):
        body.send_keys(Keys.PAGE_DOWN)  # Scroll down one page
        time.sleep(1)  # Adjust sleep time as needed
        for article in driver.find_elements(By.CSS_SELECTOR, '.status.status-public'):
            if article.get_attribute("data-id") is not None and article.get_attribute("data-id") not in visited:
                curr_post = {}  # Start a fresh record for each new post
                # Extract post details
                curr_post["avatar"] = article.find_elements(By.CSS_SELECTOR, '.status__avatar img')[0].get_attribute("src")
                curr_post["name"] = article.find_elements(By.CSS_SELECTOR, '.display-name__html')[0].text
                curr_post["account"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.display-name__account')])
                curr_post["content"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.status__content__text.status__content__text--visible.translate')])
                # Extract media (images and videos)
                media = article.find_elements(By.CSS_SELECTOR, 'img')
                if media:
                    curr_post["pictures"] = [i.get_attribute("src") for i in media]
                media = article.find_elements(By.CSS_SELECTOR, 'video')
                if media:
                    curr_post["videos"] = [i.get_attribute("src") for i in media]
                # Extract hashtags, if available (class selector)
                hashtags = article.find_elements(By.CSS_SELECTOR, '.hashtag-bar')
                if hashtags:
                    curr_post["hashtags"] = [i.text for i in hashtags]
                # Extract status card content and link
                cards = article.find_elements(By.CSS_SELECTOR, '.status-card__content')
                if cards:
                    curr_post["status_card"] = "\n".join([i.text for i in cards])
                    expanded = article.find_elements(By.CSS_SELECTOR, '.status-card.expanded')
                    if expanded:
                        curr_post["status_card_link"] = expanded[0].get_attribute("href")
                # Extract engagement metrics (boosts, favorites, replies)
                counters = article.find_elements(By.CSS_SELECTOR, '.icon-button__counter')
                if len(counters) >= 3:
                    curr_post["boosts"] = counters[1].text
                    curr_post["favs"] = counters[2].text
                    curr_post["replies"] = counters[0].text
                pprint.pprint(curr_post)
                if len(visited) > 10:
                    visited.pop(0)
                visited.append(article.get_attribute("data-id"))
                posts.append(curr_post)

    # Write to CSV file
    with open('mastodon_posts.csv', mode='w', newline='', encoding='utf-8') as file:
        fieldnames = ['avatar', 'name', 'account', 'content', 'pictures', 'videos', 'hashtags', 'status_card', 'status_card_link', 'boosts', 'favs', 'replies']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for post in posts:
            writer.writerow(post)

    # Write to JSON file
    with open('mastodon_posts.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)
finally:
    driver.quit()
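A note on the deduplication above: visited is a plain list trimmed to roughly its last ten entries, so once older posts fall out of that window their data-id values can be collected again if they are still present in the DOM. If memory is not a concern, a set of all seen IDs is a simpler and faster guard. The sketch below is a minimal variation of the inner loop, reusing the driver and By names from the script above:
seen_ids = set()

for article in driver.find_elements(By.CSS_SELECTOR, '.status.status-public'):
    data_id = article.get_attribute("data-id")
    if not data_id or data_id in seen_ids:
        continue  # skip posts we have already captured
    seen_ids.add(data_id)
    # ... extract the post fields exactly as in the loop above ...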
Conclusion
As our journey through web scraping draws to a close, we can reflect on the challenges met along the way. From the structured pages of Politifact to the dynamically loaded feeds of Altnews and Mastodon, each website posed its own obstacles. With the right mix of Python libraries, browser automation, and concurrency, we worked around dynamic content and came away with a wealth of useful data. May these experiences serve as a starting point for fellow adventurers in the ever-expanding universe of web scraping.