Consuming the GitHub API v3 and loading the data into a SQLite database

Jose Luis Bracamonte Amavizca
MCD-UNISON
Sep 22, 2021 · 9 min read
GitHub repo data to a SQLite database using Python in an ETL workflow.

GitHub is the most popular internet hosting service for Git repositories, and the preferred place for the open source community to share open source code with the public through an easy collaboration workflow.

Collaboration in a GitHub repository is recorded in the form of commits, branches and pull requests as the principal activities. In this post, I will show you how to extract this information by consuming the GitHub API v3 with Python, using the PyGithub library to connect to the API, and then transform and load the data into a SQLite database. Once we have all the repository data synchronized into the database, we can run SQL queries to get some insights and build charts to present our findings.

Consuming Github API v3 with PyGithub

PyGitHub is a Python library to access the GitHub REST API.
More detailed information can be found on the PyGitHub documentation site.

Generate Github access token

The PyGithub library requires a GitHub personal access token. To generate the token, you must have an account on GitHub, then follow these steps:

  1. Go to the Settings section in the user menu.
  2. Click on Developer settings in the left "Account settings" menu.
  3. Click on the Personal access tokens option in the left menu.
  4. Click the Generate new token button.
  5. Provide a note text.
  6. Click the Generate token button and leave all of the remaining checkbox options as default.
Steps to generate Github access token

After you click the Generate token button, GitHub will show you the token. It is important to save this token in a safe place, because you will not be able to consult it again; the only options are to delete the generated token or create a new one.
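A common way to keep the token safe and out of your source code (my suggestion, not a GitHub requirement) is to store it in an environment variable and read it from Python. A minimal sketch, assuming you exported it under the name GITHUB_TOKEN:

import os

# read the token from an environment variable (GITHUB_TOKEN is a name we choose)
access_token = os.environ.get("GITHUB_TOKEN")
if access_token is None:
    raise RuntimeError("export GITHUB_TOKEN=<your token> before running")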

Connect to Github API with PyGithub Python library

We have to install the PyGithub library before using it to connect to the GitHub API.

$ pip install PyGithub

We are going to need the GitHub access token we previously generated to connect to the GitHub API through the PyGithub library. In the next demo, we connect using our token and print the list of repositories.

from github import Github

g = Github("access_token")
for repo in g.get_user().get_repos():
    print(repo.name)

Now we are connected to the GitHub API, and Python classes are available for repositories, branches, commits, NamedUser (author, committer), and so on.
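For example, here is a minimal sketch of how these objects chain together (using the public PyGithub/PyGithub repository purely for illustration):

from github import Github

g = Github("access_token")
# fetch a public repository by its full name
repo = g.get_repo("PyGithub/PyGithub")
# list the repository branches with the sha of their head commit
for branch in repo.get_branches():
    print(branch.name, branch.commit.sha)
# inspect the most recent commit and its author
commit = repo.get_commits()[0]
print(commit.commit.author.name, commit.commit.message)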

API endpoints

The GitHub API is a REST API that returns JSON responses from its different endpoints. Each JSON response contains the related data for the resource we are requesting. There are three principal endpoints we are going to use to retrieve the desired data from a particular repository.

The PyGithub library internally consumes these endpoints to retrieve the corresponding information and populate the data into its built-in Python objects. It is important to understand the JSON responses of these endpoints in order to know what information is available on each object returned by the library when we call functions to retrieve repos, branches, commits, etc.

The GitHub API documentation provides examples of JSON responses. As we are only focused on the main information of a repository, I will leave the links to them.

Each Repository instance (the items in the list returned by the get_repos() function on the Github instance) has an attribute that exposes the endpoint's raw JSON response in a dictionary format. We can use it to quickly show the available data returned by the API endpoint. This information will be useful to design the database structure and schema to store our data.

# store the repos into a list
repos = list(g.get_user().get_repos())
# assign the first element in repos into a repo variable
repo = repos[0]
# you can also specify a public repo
# by providing the repo full name like "luisjba/pygithub-etl"
repo = g.get_repo("luisjba/pygithub-etl")
# display the raw API response of the repo
print(repo.raw_data)

Repository raw data print output:

Repository raw data print in iPython console

We can print the raw data for a branch and a commit in the same way.

# get the default branch ("master" here) and print the raw data
branch = repo.get_branch("master")
print(branch.raw_data)
# get the first commit item from the commits
commit = None
for c in repo.get_commits():
    commit = c
    break  # we break here to only keep the first iteration
# print the commit raw data
print(commit.raw_data)

Branch and Commit raw data print output:

Branch and commit raw data print output

With the previous output, we have enough information to select the desired data and design our database tables.

SQLite database design and connection

A SQLite database is a small, portable database file. We use the C-language SQLite library, which implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Most popular languages have wrappers for this C library; Python includes one in its standard library, so we don't have to install anything to use it.

With the following code, we can connect to the my_db.db SQLite database. If the database does not exist, the library will create a new one with the provided file name, but without any tables inside it.

import sqlite3

db_conn = sqlite3.connect("my_db.db")

Database table design with DDL

A DDL (Data Definition Language) script is required to define our database structure and run it against our SQLite database. The official SQLite documentation provides more details on table creation, useful for checking the available data types.

To store all the related data of the desired GitHub repository, we have to create four tables (repo, branch, commits and commit_file). I have created the following DDL for the tables, but if you want to store more information, you can modify it and make your own implementation.

# assigning the ddl string into the ddl variable
ddl = """
CREATE TABLE IF NOT EXISTS repo(
    id integer PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    owner VARCHAR(255) NOT NULL,
    fullname VARCHAR(511) NOT NULL,
    description TEXT NOT NULL,
    url VARCHAR(511) NOT NULL,
    pushed_date INTEGER NOT NULL,
    created_date INTEGER NOT NULL,
    updated_date INTEGER NOT NULL,
    size INTEGER NOT NULL,
    stars INTEGER NOT NULL,
    forks INTEGER NOT NULL,
    watchers INTEGER NOT NULL,
    language VARCHAR(255) NOT NULL,
    topics TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS branch(
    id integer PRIMARY KEY,
    repo_id INTEGER NOT NULL,
    name VARCHAR(255) NOT NULL,
    commit_sha VARCHAR(255) NOT NULL,
    protected INTEGER NOT NULL,
    FOREIGN KEY (repo_id) REFERENCES repo (id)
);
CREATE TABLE IF NOT EXISTS commits(
    id integer PRIMARY KEY,
    repo_id INTEGER NOT NULL,
    commit_sha VARCHAR(255) NOT NULL,
    commit_message TEXT NOT NULL,
    commit_author_name VARCHAR(255) NOT NULL,
    commit_author_email VARCHAR(255) NOT NULL,
    commit_author_date INTEGER NOT NULL,
    commit_committer_name VARCHAR(255) NOT NULL,
    commit_committer_email VARCHAR(255) NOT NULL,
    commit_committer_date INTEGER NOT NULL,
    author_login VARCHAR(255) NULL,
    author_id INTEGER NULL,
    author_avatar_url VARCHAR(255) NULL,
    author_type VARCHAR(255) NULL,
    committer_login VARCHAR(255) NULL,
    committer_id INTEGER NULL,
    committer_avatar_url VARCHAR(255) NULL,
    committer_type VARCHAR(255) NULL,
    stats_addtions INTEGER NOT NULL,
    stats_deletions INTEGER NOT NULL,
    stats_total INTEGER NOT NULL,
    FOREIGN KEY (repo_id) REFERENCES repo (id)
);
CREATE TABLE IF NOT EXISTS commit_file(
    id integer PRIMARY KEY,
    commit_id INTEGER NOT NULL,
    repo_id INTEGER NOT NULL,
    file_name VARCHAR(255) NOT NULL,
    addtions INTEGER NOT NULL,
    deletions INTEGER NOT NULL,
    changes INTEGER NOT NULL,
    status VARCHAR(255) NOT NULL,
    FOREIGN KEY (commit_id) REFERENCES commits (id),
    FOREIGN KEY (repo_id) REFERENCES repo (id)
);
"""

Once we have our DDL in the ddl variable, we have to run the script against the SQLite database using the connection we created earlier. Note that the script contains several statements, so we use executescript() instead of execute(), which only runs a single statement.

cursor = db_conn.cursor()
cursor.executescript(ddl)

Now, we can use any SQLite client to verify that the tables were created. DB Browser for SQLite is available for macOS, Windows and Linux. Once it is installed, open the client application and open the database (browse for it in your directory).

Verify the table creation in the database using the SQLite client.
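If you prefer to verify from Python instead of a GUI client, a quick alternative sketch is to query SQLite's built-in sqlite_master catalog:

# list the tables that the DDL script created, straight from SQLite's catalog
cursor = db_conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
for (table_name,) in cursor.fetchall():
    print(table_name)  # expect: branch, commit_file, commits, repo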

Save the Repository information into the SQLite database

We have to create a dictionary and dump the repository data into it. Because SQLite does not support a date data type, we use integer timestamps converted from the dates that the Repository instance provides, like the pushed_at, created_at and updated_at dates.

repo_data = {
    "name": repo.name,
    "owner": repo.owner.login,
    "fullname": repo.full_name,
    "description": repo.description,
    "url": repo.url,
    "pushed_date": int(repo.pushed_at.timestamp()),
    "created_date": int(repo.created_at.timestamp()),
    "updated_date": int(repo.updated_at.timestamp()),
    "size": repo.size,
    "stars": repo.stargazers_count,
    "forks": repo.forks_count,
    "watchers": repo.watchers_count,
    "language": repo.language,
    "topics": ",".join(repo.get_topics()),
}

Now we have the data in the repo_data dictionary. Before saving it into the database, we have to check whether the repo already exists, executing a query that tries to retrieve the repo by 'fullname'.

from sqlite3 import Row

cursor = db_conn.cursor()
cursor.row_factory = Row
cursor.execute("SELECT * FROM repo WHERE fullname=?", [repo.full_name])
result = cursor.fetchall()
db_repo = result[0] if len(result) > 0 else None
if db_repo is None:
    columns = list(repo_data.keys())
    query = "INSERT INTO repo({columns}) VALUES ({values})".format(
        columns=",".join(columns), values=",".join(["?"] * len(columns)))
    values = [repo_data[k] for k in columns]
    db_conn.cursor().execute(query, values)
    db_conn.commit()

Now, we can execute queries against the database to fetch the results.
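For instance, a minimal sketch that fetches back the row we just inserted:

from sqlite3 import Row

cursor = db_conn.cursor()
cursor.row_factory = Row  # lets us access columns by name
cursor.execute("SELECT * FROM repo WHERE fullname=?", [repo.full_name])
row = cursor.fetchone()
if row is not None:
    print(row["fullname"], row["stars"], row["language"])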

At this point, we know how to:

  • Connect to the Github API.
  • Connect to a SQLite database.
  • Design and create the structure for our database using DDL.
  • Insert data into the database.
  • Fetch data from the database.

We only covered the ETL (Extract, Transform and Load) example for the repository data, but to have the complete picture, or at least the most important data related to a repository, we must repeat the ETL workflow for branches, commits and commit files. It seems to be a tedious process with a lot of code to write, doesn't it?
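To give a flavor of that repetition, here is a hedged sketch of the branch step alone, reusing the connection and repo objects from above (the commits and commit files steps follow the same extract, transform and load pattern):

# look up the database id of the repo row we inserted earlier
cursor = db_conn.cursor()
cursor.execute("SELECT id FROM repo WHERE fullname=?", [repo.full_name])
repo_id = cursor.fetchone()[0]
# extract each branch, transform it to the branch table columns, and load it
for branch in repo.get_branches():
    cursor.execute("SELECT id FROM branch WHERE repo_id=? AND name=?",
                   [repo_id, branch.name])
    if cursor.fetchone() is None:  # insert only branches not yet synchronized
        cursor.execute(
            "INSERT INTO branch(repo_id, name, commit_sha, protected) "
            "VALUES (?, ?, ?, ?)",
            [repo_id, branch.name, branch.commit.sha, int(branch.protected)])
db_conn.commit()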

Fortunately, I have created the GitHubETL library, which implements the whole ETL process for the repository, branches, commits and commit files, developed following OOP (Object Oriented Programming) and multi-layer standards. It is an easy-to-use library that supports both command line usage and usage as a Python module.

Using the GitHubETL Library

The GitHubETL library that I have developed reflects all of the knowledge we have covered in this post, but in a much easier-to-use way: with only a few lines of code we can load all the related repository data into the desired SQLite database, and run it any time to synchronize all the related data.

Installation

Open your terminal of choice (or Git Bash on Windows) and move to the directory where you want to download the library. Clone the repository, navigate to the downloaded directory, then install the dependencies.

$ git clone https://github.com/luisjba/pygithub-etl.git
$ cd pygithub-etl
$ pip install -r requirements.txt

Usage

We only need the GitHub access token and the full name of our desired repository. In the console, execute the sync command: $ python main.py sync TOKEN REPO. In our example, the XXXXXXXXXXXXXX argument after the sync command represents the token (you must provide your own), and after the token we provide our repository full name (luisjba/pygithub-etl).

$ python main.py sync XXXXXXXXXXXXXX luisjba/pygithub-etl

After executing the command, by default the database is created at data/data.db and will contain all of the information. For advanced usage, go to the README file to check the available options.

GithubETL Library usage and advanced options

Having fun with the data in pandas

Pandas is a popular library among data scientists, and we can use it to connect to our SQLite database and load query results into a pandas DataFrame.

import pandas as pd
import sqlite3

# Read sqlite query results into a pandas DataFrame
sqlite_conn = sqlite3.connect("data/data.db")
df = pd.read_sql_query("SELECT * FROM commit_file", sqlite_conn)
df.head()
Pandas dataframe head output

We can perform more advanced SQL queries against the database to generate more interesting DataFrames in pandas. For example, a join between the commits and commit_file tables gives insights about how many authors and committers the repository has, the total number of insertions, deletions and changes, and anything else you can think of.
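As a sketch of such a join (a query I wrote against the tables defined above, not something the library ships with):

import pandas as pd
import sqlite3

sqlite_conn = sqlite3.connect("data/data.db")
# aggregate per-author activity by joining commits with their changed files;
# note: the addtions column spelling matches the DDL above
query = """
SELECT c.commit_author_name AS author,
       COUNT(DISTINCT c.id) AS commits,
       SUM(cf.addtions) AS additions,
       SUM(cf.deletions) AS deletions,
       SUM(cf.changes) AS changes
FROM commits c
JOIN commit_file cf ON cf.commit_id = c.id
GROUP BY c.commit_author_name
ORDER BY commits DESC
"""
df = pd.read_sql_query(query, sqlite_conn)
df.head()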

I hope you enjoyed the post and found it really useful.


Jose Luis Bracamonte Amavizca
MCD-UNISON

Computer Science Engineer and student of the Master's Degree in Data Science at the University of Sonora