Python Web Scraping

buzonliao · Published in Python 101 · Oct 14, 2023

The article presents a Python-based solution that uses popular libraries such as requests, BeautifulSoup, and pprint. It outlines the following essential steps:

  1. Data Extraction: The post demonstrates how to use Python and libraries like requests and BeautifulSoup for web scraping. It fetches information such as story titles, links, and votes from the Hacker News website.
  2. Pagination Handling: The tutorial also covers how to handle pagination by making requests to multiple pages (e.g., the first and second pages of Hacker News); a generalized, loop-based sketch of this appears right after this list.
  3. Data Organization: The scraped data is collected and organized into lists for easy processing.
  4. Sorting by Votes: The highlight of the post is sorting the collected data by the number of votes each story has received, ensuring that the most popular stories appear at the top.
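
Since Hacker News exposes additional pages through a p query parameter (news?p=2, news?p=3, and so on), the two-page approach in the code below can be generalized with a simple loop. The following is a minimal sketch of that idea; the fetch_pages helper and its default page count are illustrative assumptions, not part of the original post.

import requests
from bs4 import BeautifulSoup

def fetch_pages(num_pages=2):
    # Illustrative helper: collect title links and metadata rows
    # from the first num_pages pages of Hacker News.
    all_links, all_subtext = [], []
    for page in range(1, num_pages + 1):
        res = requests.get('https://news.ycombinator.com/news', params={'p': page})
        soup = BeautifulSoup(res.text, 'html.parser')
        all_links += soup.select('.titleline > a')
        all_subtext += soup.select('.subtext')
    return all_links, all_subtext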

The code examples are a practical and hands-on way to learn web scraping techniques using Python. By following the steps outlined in the post, readers can gain valuable experience in web scraping, data extraction, and data manipulation.

Reference links to official documentation and resources are included for further learning and understanding of the tools and concepts used in the tutorial. This post is a valuable resource for those interested in web scraping and data extraction from websites like Hacker News.

Hacker News Web Scraping

  • Get data from Hacker News website
  • Fetch each story's title, link, and vote count, then store them in a list
  • Sort the data by votes
import requests
from bs4 import BeautifulSoup
import pprint

# Fetch the first two pages of Hacker News.
res = requests.get('https://news.ycombinator.com/news')
res2 = requests.get('https://news.ycombinator.com/news?p=2')
soup = BeautifulSoup(res.text, 'html.parser')
soup2 = BeautifulSoup(res2.text, 'html.parser')

# Story title links and the metadata rows (score, author, comments) on each page.
links = soup.select('.titleline > a')
links2 = soup2.select('.titleline > a')
subtext = soup.select('.subtext')
subtext2 = soup2.select('.subtext')

# Combine both pages into single lists.
mega_links = links + links2
mega_subtext = subtext + subtext2

def sort_stories_by_votes(hnlist):
    # Highest-voted stories first.
    return sorted(hnlist, key=lambda item: item['votes'], reverse=True)

def create_custom_hn(links, subtext):
    hn = []
    for i, item in enumerate(links):
        title = item.get_text()
        href = item.get('href', None)

        if i < len(subtext):
            vote = subtext[i].select('.score')
            if vote:
                # The score text looks like "123 points"; take the leading number
                # (this also handles the singular "1 point").
                point = int(vote[0].get_text().split()[0])
                # Keep only stories with at least 100 votes.
                if point >= 100:
                    hn.append({'title': title, 'link': href, 'votes': point})

    return sort_stories_by_votes(hn)

pprint.pprint(create_custom_hn(mega_links, mega_subtext))
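
To see the sorting step in isolation, the same sorted(..., key=..., reverse=True) pattern from sort_stories_by_votes can be tried on a small made-up list of dictionaries; the sample data below is purely illustrative.

sample = [
    {'title': 'Story A', 'votes': 120},
    {'title': 'Story B', 'votes': 450},
    {'title': 'Story C', 'votes': 101},
]
# The highest-voted story comes first, mirroring sort_stories_by_votes above.
print(sorted(sample, key=lambda item: item['votes'], reverse=True))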

Reference:
