Leveraging GitHub API with Python: A Comprehensive Guide
In this article, we’ll explore how to use the GitHub API to fetch and analyze repositories based on user locations, specifically focusing on Northern Italian cities like Milan and Turin. This approach is highly beneficial for developers looking to automate tasks, manage repositories, or integrate GitHub features into their applications using Python.
https://github.com/mazzasaverio/etl-github-projects
Setting Up Your Environment
First, ensure you have Python installed on your system. You’ll also need requests
for making API requests and pandas
for data manipulation. Install them using pip if you haven't already:
pip install requests pandas
Authenticating with GitHub API
To perform authenticated requests and increase your rate limit from 60 to 5,000 requests per hour, generate a Personal Access Token (PAT) from your GitHub account. Follow GitHub’s official guide on creating an access token, ensuring to check the necessary scopes for your tasks.
Fetching Users by Location
Let’s start by writing a function to fetch GitHub users by their location. This function makes a GET request to the GitHub Search API, retrieves users based on the specified location, and handles pagination to fetch more than the initial results:
import requests
def fetch_users_by_location(location, max_users=100, access_token=''):
users = []
page = 1
headers = {"Authorization": f"token {access_token}"}
while len(users) < max_users:
url = f"https://api.github.com/search/users?q=location:{location}&per_page=100&page={page}"
response = requests.get(url, headers=headers).json()
batch = response.get("items", [])
if not batch:
break
users.extend(batch[:max_users - len(users)])
page += 1
return [user["login"] for user in users]
Fetching Repository Details
Next, we’ll write a function to fetch repository details for each user. This function also handles pagination and extracts relevant information such as repository name, stars, forks, and topics:
def fetch_repo_details(username, max_repos=100, access_token=''):
repos = []
page = 1
headers = {"Authorization": f"token {access_token}"}
while len(repos) < max_repos:
url = f"https://api.github.com/users/{username}/repos?per_page=100&page={page}"
response = requests.get(url, headers=headers).json()
if not response:
break
repos.extend(response[:max_repos - len(repos)])
page += 1
return [{
"repo_name": repo.get("name", ""),
"username": username,
"creation_date": repo.get("created_at", ""),
"stars": repo.get("stargazers_count", 0),
"forks": repo.get("forks_count", 0),
"last_update": repo.get("updated_at", ""),
"description": repo.get("description", ""),
"topics": repo.get("topics", []),
} for repo in repos]
Next Steps
- Data Analysis and Visualization: With the DataFrame
df
containing GitHub repositories information, you can now analyze the data further. Consider visualizing the data using libraries such as Matplotlib or Seaborn to gain insights into repository trends, popularity, and activity based on locations. - Expand the Scope: Consider extending the functionality to include more locations or additional repository details. Experiment with different parameters of the GitHub API to customize the data you’re collecting.
- Automate and Schedule: To keep your dataset up-to-date, consider automating the script to run at regular intervals using scheduling tools like cron for Linux/MacOS or Task Scheduler for Windows.
- Contribute to Open Source: If your project can benefit others, consider making it open-source. This way, you can collaborate with others, get feedback, and improve the project further.
Conclusion
This guide provides a step-by-step approach to using the GitHub API with Python to fetch and analyze GitHub repositories based on user locations. By leveraging Python libraries such as requests
and pandas
, you can automate tasks, manage repositories, or integrate GitHub features into your applications efficiently. Remember to handle API rate limits and authenticate your requests to access more detailed data and perform a higher number of requests.