Python Web Scraping

buzonliao · Published in Python 101 · Oct 14, 2023

The article presents a Python-based solution that uses popular libraries such as requests, BeautifulSoup, and pprint. It outlines the following essential steps:

  1. Data Extraction: The post demonstrates how to use Python and libraries like requests and BeautifulSoup for web scraping. It fetches information such as story titles, links, and votes from the Hacker News website.
  2. Pagination Handling: The tutorial also covers how to handle pagination by making requests to multiple pages (e.g., the first and second pages of Hacker News); a generalized, loop-based sketch of this appears right after this list.
  3. Data Organization: The scraped data is collected and organized into lists for easy processing.
  4. Sorting by Votes: The highlight of the post is sorting the collected data by the number of votes each story has received, ensuring that the most popular stories appear at the top.
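
Since Hacker News exposes additional pages through a p query parameter (news?p=2, news?p=3, and so on), the two-page approach in the code below can be generalized with a simple loop. The following is a minimal sketch of that idea; the fetch_pages helper and its default page count are illustrative assumptions, not part of the original post.

import requests
from bs4 import BeautifulSoup

def fetch_pages(num_pages=2):
    # Illustrative helper: collect title links and metadata rows
    # from the first num_pages pages of Hacker News.
    all_links, all_subtext = [], []
    for page in range(1, num_pages + 1):
        res = requests.get('https://news.ycombinator.com/news', params={'p': page})
        soup = BeautifulSoup(res.text, 'html.parser')
        all_links += soup.select('.titleline > a')
        all_subtext += soup.select('.subtext')
    return all_links, all_subtext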

The code examples are a practical and hands-on way to learn web scraping techniques using Python. By following the steps outlined in the post, readers can gain valuable experience in web scraping, data extraction, and data manipulation.

Reference links to official documentation and resources are included for further learning and understanding of the tools and concepts used in the tutorial. This post is a valuable resource for those interested in web scraping and data extraction from websites like Hacker News.

Hacker News Web Scraping

  • Get data from Hacker News website
  • Fetch each story's title, link, and vote count, then store them in a list
  • Sort the data by votes
import requests
from bs4 import BeautifulSoup
import pprint

# Fetch the first two pages of Hacker News.
res = requests.get('https://news.ycombinator.com/news')
res2 = requests.get('https://news.ycombinator.com/news?p=2')
soup = BeautifulSoup(res.text, 'html.parser')
soup2 = BeautifulSoup(res2.text, 'html.parser')

# Story title links and the metadata rows (score, author, comments) on each page.
links = soup.select('.titleline > a')
links2 = soup2.select('.titleline > a')
subtext = soup.select('.subtext')
subtext2 = soup2.select('.subtext')

# Combine both pages into single lists.
mega_links = links + links2
mega_subtext = subtext + subtext2

def sort_stories_by_votes(hnlist):
    # Highest-voted stories first.
    return sorted(hnlist, key=lambda item: item['votes'], reverse=True)

def create_custom_hn(links, subtext):
    hn = []
    for i, item in enumerate(links):
        title = item.get_text()
        href = item.get('href', None)

        if i < len(subtext):
            vote = subtext[i].select('.score')
            if vote:
                # The score text looks like "123 points"; take the leading number
                # (this also handles the singular "1 point").
                point = int(vote[0].get_text().split()[0])
                # Keep only stories with at least 100 votes.
                if point >= 100:
                    hn.append({'title': title, 'link': href, 'votes': point})

    return sort_stories_by_votes(hn)

pprint.pprint(create_custom_hn(mega_links, mega_subtext))
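
To see the sorting step in isolation, the same sorted(..., key=..., reverse=True) pattern from sort_stories_by_votes can be tried on a small made-up list of dictionaries; the sample data below is purely illustrative.

sample = [
    {'title': 'Story A', 'votes': 120},
    {'title': 'Story B', 'votes': 450},
    {'title': 'Story C', 'votes': 101},
]
# The highest-voted story comes first, mirroring sort_stories_by_votes above.
print(sorted(sample, key=lambda item: item['votes'], reverse=True))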

Reference:
