Scraping and Analyzing Trending GitHub Repositories with Python

Aniyom Ebenezer
3 min read · Sep 28, 2023


Introduction

A thought

HTML was built to be displayed, and it does that job very well. But when you want to write a script that collects actionable data, you are left with raw markup like this:

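As a rough illustration (a simplified sketch, not GitHub's exact markup), a single trending repository arrives buried in something like this:

```html
<article class="Box-row">
  <h2><a href="/developer/repository">developer / repository</a></h2>
  <p>A short description of the project…</p>
  <a href="/developer/repository/stargazers">12,345</a>
</article>
```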

GitHub is a treasure trove of open-source projects and the latest trends in the world of software development.


Developers often keep an eye on trending repositories to stay updated with the most exciting projects.

In this blog post, we'll walk you through a Python project that scrapes and analyzes trending GitHub repositories. We'll use the popular Requests and BeautifulSoup libraries to do it.

Project Overview

The project consists of the following steps:

Step 1: Request — Fetching the GitHub Trending Page

We'll start by creating a function, request_github_trending(url), that returns the HTML content of the GitHub trending page for a given URL. It uses the requests library to make an HTTP GET request.

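A minimal sketch of that function (the raise_for_status error check is my addition, not necessarily part of the original code):

```python
import requests

def request_github_trending(url):
    """Fetch the given URL and return its HTML content as a string."""
    response = requests.get(url)
    response.raise_for_status()  # stop early on a non-200 response
    return response.text
```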

Step 2: Extract — Parsing HTML with BeautifulSoup

The extract function uses BeautifulSoup to parse the HTML content of the GitHub trending page. It searches for all <article> elements, which contain the repository information.

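A sketch of the extract step, assuming Python's built-in html.parser backend:

```python
from bs4 import BeautifulSoup

def extract(html):
    """Parse the page and return all <article> elements,
    each of which wraps one repository's information."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("article")
```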

Step 3: Transform — Structuring Repository Data

Our transform function processes the extracted <article> elements and pulls out the relevant information: the repository name, the developer name (when available), and the number of stars. It returns a list of dictionaries, where each dictionary represents one repository.

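Here is one way the transform step could look. The selectors below (the <h2> title link and the link ending in /stargazers) are assumptions about the trending page's markup at the time of writing, so treat this as a sketch rather than the exact original code:

```python
def transform(articles):
    """Turn each <article> element into a dictionary of repository data."""
    repositories = []
    for article in articles:
        # Assumption: the <h2> link text looks like "developer / repository".
        title = article.h2.a.get_text(strip=True)
        parts = [part.strip() for part in title.split("/")]
        developer = parts[0] if len(parts) > 1 else ""
        name = parts[-1]

        # Assumption: the star count is in the link to the stargazers page.
        stars_link = article.find(
            "a", href=lambda h: h and h.endswith("/stargazers")
        )
        stars = stars_link.get_text(strip=True) if stars_link else "0"

        repositories.append({"developer": developer, "name": name, "stars": stars})
    return repositories
```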

Step 4: Format — Converting to CSV

Our format function organizes the extracted data into a structured, CSV-like string that is easier to analyze: columns are separated by commas and rows by newlines.

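A minimal version of that step; the column order is my assumption, and I've named the function format_csv here to avoid shadowing Python's built-in format:

```python
def format_csv(repositories):
    """Assemble the repository dictionaries into a CSV-like string:
    a header row, then one comma-separated row per repository."""
    rows = ["developer,name,stars"]
    for repo in repositories:
        rows.append(f"{repo['developer']},{repo['name']},{repo['stars']}")
    return "\n".join(rows)
```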

How to Use the Project

1. Libraries installation: before running the project, make sure you have the required libraries installed. You can do this with pip:
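```
pip install requests beautifulsoup4
```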

2. Running the project: uncomment the _run() line at the end of the code and execute the script. It will scrape the GitHub trending page, process the data, and print it in CSV format, roughly as sketched below.
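For reference, the pipeline wires together roughly like this (the exact composition is an assumption; the trending URL is https://github.com/trending):

```python
def _run():
    """Glue the four steps together and print the result."""
    html = request_github_trending("https://github.com/trending")
    articles = extract(html)
    repositories = transform(articles)
    print(format_csv(repositories))

# _run()  # uncomment this line to execute the full pipeline
```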

Conclusion

In this project, we’ve demonstrated how to use Python, Requests, and BeautifulSoup to scrape data from web pages and organize it for analysis. We’ve provided functions for requesting web pages, extracting data, transforming it into a structured format, and formatting it into a CSV-like string. This project can serve as a foundation for further analysis and automation of tasks related to trending GitHub repositories. Understanding web scraping and data extraction opens up a world of possibilities for accessing and utilizing web data for various purposes.

Link to the complete code >> Github trending stories codes
