Leveraging the GitHub API with Python: A Comprehensive Guide

Feb 13, 2024

In this article, we’ll explore how to use the GitHub API to fetch and analyze repositories based on user locations, focusing on Northern Italian cities like Milan and Turin. The same approach is useful for developers who want to automate tasks, manage repositories, or integrate GitHub features into their applications using Python.

https://github.com/mazzasaverio/etl-github-projects

Setting Up Your Environment

First, ensure you have Python installed on your system. You’ll also need requests for making API requests and pandas for data manipulation. Install them using pip if you haven't already:

pip install requests pandas

Authenticating with the GitHub API

To perform authenticated requests and raise your rate limit from 60 to 5,000 requests per hour, generate a Personal Access Token (PAT) from your GitHub account. Follow GitHub’s official guide on creating an access token, and select only the scopes your tasks need. Note that the Search API used in the next section has its own, lower limit: 30 requests per minute when authenticated.
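As a minimal sketch of how the token might be wired in, the snippet below reads it from an environment variable and builds the request headers. The variable name GITHUB_TOKEN and both helper names are assumptions made for illustration; they are not part of the original project. The second helper reads the X-RateLimit-Remaining and X-RateLimit-Reset response headers that GitHub returns, and estimates how long to wait once the quota is used up.

```python
import os
import time


def build_headers(access_token=None):
    """Return request headers, adding Authorization only if a token exists."""
    # GITHUB_TOKEN is an assumed variable name; use whatever your setup defines.
    token = access_token or os.environ.get("GITHUB_TOKEN", "")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return headers


def seconds_until_reset(response_headers, now=None):
    """Return how many seconds to wait before retrying, or 0 if quota remains."""
    remaining = int(response_headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0
    # X-RateLimit-Reset is a Unix timestamp (seconds since the epoch).
    reset_at = int(response_headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(0, reset_at - now)
```

The result of build_headers() can be passed as the headers argument of requests.get, and seconds_until_reset(response.headers) can feed a time.sleep call before the next request. Keeping the token in the environment also keeps it out of source control.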

Fetching Users by Location

Let’s start by writing a function to fetch GitHub users by their location. This function makes a GET request to the GitHub Search API, retrieves users based on the specified location, and handles pagination to fetch more than the initial results:

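import requests

def fetch_users_by_location(location, max_users=100, access_token=''):
    users = []
    page = 1
    # The Search API returns at most 1,000 results per query.
    max_users = min(max_users, 1000)
    # Send the Authorization header only when a token is supplied;
    # an empty token would be rejected with 401 Unauthorized.
    headers = {"Authorization": f"token {access_token}"} if access_token else {}
    while len(users) < max_users:
        # Passing params lets requests URL-encode the query string.
        params = {"q": f"location:{location}", "per_page": 100, "page": page}
        response = requests.get("https://api.github.com/search/users",
                                headers=headers, params=params)
        response.raise_for_status()  # surface auth and rate-limit errors
        batch = response.json().get("items", [])
        if not batch:
            break
        users.extend(batch[:max_users - len(users)])
        page += 1
    return [user["login"] for user in users]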
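The location qualifier in the search URL works as written for single-word cities such as Milan or Turin. For multi-word locations, GitHub search syntax expects the value in quotes. The sketch below shows one way to build the query parameters; build_location_query and search_params are hypothetical helper names introduced here, not functions from the original project.

```python
def build_location_query(location):
    """Build the q parameter for the user search endpoint."""
    location = location.strip()
    # Multi-word values must be quoted; otherwise only the first word is
    # treated as the location and the rest as free-text search terms.
    if " " in location:
        location = f'"{location}"'
    return f"location:{location}"


def search_params(location, page=1, per_page=100):
    """Return the query-string parameters for one page of results."""
    return {
        "q": build_location_query(location),
        "per_page": per_page,
        "page": page,
    }
```

The dictionary returned by search_params can be handed to requests.get through its params argument, which takes care of URL-encoding the quotes and spaces.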

Fetching Repository Details

Next, we’ll write a function to fetch repository details for each user. This function also handles pagination and extracts relevant information such as repository name, stars, forks, and topics:

def fetch_repo_details(username, max_repos=100, access_token=''):
    repos = []
    page = 1
    # Omit the Authorization header when no token is given.
    headers = {"Authorization": f"token {access_token}"} if access_token else {}
    while len(repos) < max_repos:
        url = f"https://api.github.com/users/{username}/repos"
        response = requests.get(url, headers=headers,
                                params={"per_page": 100, "page": page})
        # Fail loudly on errors (unknown user, rate limit) instead of
        # trying to slice the error payload, which is a dict.
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break
        repos.extend(batch[:max_repos - len(repos)])
        page += 1
    return [{
        "repo_name": repo.get("name", ""),
        "username": username,
        "creation_date": repo.get("created_at", ""),
        "stars": repo.get("stargazers_count", 0),
        "forks": repo.get("forks_count", 0),
        "last_update": repo.get("updated_at", ""),
        # The API returns null for repositories without a description.
        "description": repo.get("description") or "",
        "topics": repo.get("topics", []),
    } for repo in repos]
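To tie the two functions together, the following sketch loops over a list of locations, collects repository records for every user found, and loads them into a pandas DataFrame. The helper names collect_repos and to_dataframe, and the extra location column, are assumptions added for illustration. The two fetchers are passed in as arguments so the loop can be tried with stand-in functions before spending any API quota.

```python
def collect_repos(locations, fetch_users, fetch_repos):
    """Gather repository records for every user found in each location."""
    records = []
    for location in locations:
        for username in fetch_users(location):
            for repo in fetch_repos(username):
                # Tag each record with the location it was found under.
                records.append({**repo, "location": location})
    return records


def to_dataframe(records):
    """Load the collected records into a pandas DataFrame."""
    # Imported here so collect_repos stays usable without pandas installed.
    import pandas as pd
    return pd.DataFrame(records)
```

A call such as to_dataframe(collect_repos(["Milan", "Turin"], fetch_users_by_location, fetch_repo_details)) would then produce the DataFrame. To supply a token, the fetchers can be wrapped with functools.partial, for example partial(fetch_repo_details, access_token=token).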

Next Steps

Data Analysis and Visualization: Once the repository records are loaded into a pandas DataFrame (called df here), you can analyze them further. Libraries such as Matplotlib or Seaborn can chart repository trends, popularity, and activity by location.
Expand the Scope: Extend the script to cover more locations or additional repository details, and experiment with other GitHub API parameters to customize the data you collect.
Automate and Schedule: To keep your dataset up to date, run the script at regular intervals with a scheduler such as cron on Linux/macOS or Task Scheduler on Windows.
Contribute to Open Source: If your project can benefit others, consider making it open-source. This way, you can collaborate with others, get feedback, and improve the project further.
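
As a starting point for the analysis step, here is a small sketch that summarizes repository records using only the standard library. It assumes the records have the shape returned by fetch_repo_details (with username, stars, and topics keys); top_topics and total_stars_by_user are hypothetical helper names.

```python
from collections import Counter


def top_topics(records, n=5):
    """Return the n most common topics across all repository records."""
    counts = Counter(
        topic for repo in records for topic in repo.get("topics", [])
    )
    return counts.most_common(n)


def total_stars_by_user(records):
    """Sum the star counts of each user's repositories."""
    totals = Counter()
    for repo in records:
        totals[repo["username"]] += repo.get("stars", 0)
    return dict(totals)
```

The same summaries could be computed with pandas groupby operations once the data is in a DataFrame; the plain-Python version is shown because it has no dependencies and is easy to check by hand.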

Conclusion

This guide provides a step-by-step approach to using the GitHub API with Python to fetch and analyze GitHub repositories based on user locations. By leveraging Python libraries such as requests and pandas, you can automate tasks, manage repositories, or integrate GitHub features into your applications efficiently. Remember to handle API rate limits and authenticate your requests to access more detailed data and perform a higher number of requests.