How to detect trending GitHub repositories using GH Archive, Heroku, MongoDB and the GitHub API?

The open-source tool is available right here: https://maxday.github.io/trending/#JavaScript

Before diving into the process of building my own tool, I would like to focus on why I did this side-project.

IT is changing fast, very fast, and I need to make sure that my skills stay up to date with new technologies, methodologies and processes. As a consequence, I often browse trending projects on GitHub, but I wanted something more customized, updated very often and totally free. That’s why I started this project.

Spoiler alert: this article is more about the architecture of this tool and less about the data mining. The trend-detection algorithm I came up with is very (very) straightforward. You may be disappointed :)

The API is just a single JSON file describing the trending repositories for each language. I also wrote a (very) basic UI to consume it, shown in Figure 1.

Figure 1: Basic UI

As you can see, the UI could not be simpler, but it does the job I needed!

Which tools and why?

  • GH Archive. This website provides one compressed JSON file for each hour. https://www.gharchive.org/
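As an illustration of the "one file per hour" layout, here is a minimal Python sketch that builds the download URL for a given hour. It assumes the file naming documented by GH Archive (for example 2015-01-01-15.json.gz, in UTC, with an hour that is not zero-padded); the function names are hypothetical and are not taken from the project.

```python
from datetime import datetime, timedelta, timezone

# Base URL of the hourly files, as documented on gharchive.org.
GH_ARCHIVE_BASE = "https://data.gharchive.org"


def archive_url(moment: datetime) -> str:
    """Return the archive URL for the hour containing `moment`.

    `moment` must be timezone-aware; it is converted to UTC first.
    """
    utc = moment.astimezone(timezone.utc)
    # The hour is written without zero-padding: ...-01-01-5.json.gz, not -05.
    return f"{GH_ARCHIVE_BASE}/{utc:%Y-%m-%d}-{utc.hour}.json.gz"


def previous_hour_url(now: datetime) -> str:
    """Return the URL of the archive for the hour before `now`."""
    return archive_url(now - timedelta(hours=1))
```

Calling previous_hour_url(datetime.now(timezone.utc)) gives the most recent complete hour, although the file may only be published a few minutes after the hour ends.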

How does it work?

1. A Heroku scheduler triggers a set of scripts each hour.

2. The GH Archive file for the last hour is fetched.

3. This archive is imported into MongoDB, the trending repositories are computed, and an output JSON file is generated.

4. MongoDB is cleaned up (to make sure we don’t exceed the free quota).

5. The JSON file is added, committed and pushed to GitHub.

6. A simple UI hosted on GitHub Pages uses this file and displays the result.
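To make steps 2 and 3 more concrete: each hourly archive is a gzip-compressed file containing one JSON event per line. The sketch below shows how such a file could be read in Python. It only illustrates the file format; the project itself loads the archive with mongoimport, and the function name is hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_events(compressed: bytes) -> Iterator[dict]:
    """Yield one event dict per non-empty line of a gzipped archive."""
    raw = gzip.decompress(compressed)
    # Split on "\n" only, since each line is one self-contained JSON document.
    for line in raw.split(b"\n"):
        if line.strip():
            yield json.loads(line)
```

Each yielded event has a "type" field (for example "WatchEvent" when a repository is starred) and a "repo" object holding the repository name.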

Here is the global architecture of the project:

Figure 2: Global architecture of the project

How did I detect trending repositories?

As I said, this is a very simple approach, and fine-tuning this algorithm would be a nice next step.

1. Load the archive into MongoDB using mongoimport
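As a rough sketch of this step, the import could be driven from Python as shown below. It assumes the archive has already been decompressed to a plain JSON file and that mongoimport is on the PATH. The connection string, collection name and file path are placeholders, and the exact command used by the project is not shown in this article.

```python
import subprocess
from typing import List


def mongoimport_command(uri: str, collection: str, json_path: str) -> List[str]:
    """Build a mongoimport call for a file with one JSON document per line."""
    return [
        "mongoimport",
        "--uri", uri,                # connection string, including the database
        "--collection", collection,  # target collection for the events
        "--file", json_path,         # decompressed hourly archive
    ]


def run_import(uri: str, collection: str, json_path: str) -> None:
    """Run mongoimport and raise if it exits with an error."""
    subprocess.run(mongoimport_command(uri, collection, json_path), check=True)
```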

2. Perform an aggregation query

This query will return all repositories which have been starred three times (over all the records of the archive, so over one hour).
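The exact query is not reproduced here, so the following is only a guess at what such an aggregation could look like, written as Python dicts, together with a pure-Python equivalent. It assumes that a star shows up in GH Archive as an event of type "WatchEvent", that the repository name is stored under repo.name, and that "three times" means "at least three times". The threshold and all names are assumptions.

```python
from collections import Counter
from typing import Iterable, List

STAR_THRESHOLD = 3  # assumption: "starred three times" read as "at least three"

# Pipeline in the form expected by MongoDB's aggregate command.
STARRED_PIPELINE = [
    {"$match": {"type": "WatchEvent"}},                       # keep star events
    {"$group": {"_id": "$repo.name", "stars": {"$sum": 1}}},  # count per repo
    {"$match": {"stars": {"$gte": STAR_THRESHOLD}}},          # keep busy repos
]


def starred_repos(events: Iterable[dict], threshold: int = STAR_THRESHOLD) -> List[str]:
    """Pure-Python equivalent of the pipeline, handy for small tests."""
    counts = Counter(
        event["repo"]["name"]
        for event in events
        if event.get("type") == "WatchEvent"
    )
    return sorted(name for name, stars in counts.items() if stars >= threshold)
```

With a driver such as pymongo, the pipeline would typically be passed to collection.aggregate(STARRED_PIPELINE).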

3. Use the GitHub API to detect the main language of each of these repositories

Indeed, by fetching this URL

https://api.github.com/repos/:user/:repo?access_token=XXX

you will receive a lot of information, including the main language(s), as shown in Figure 3.

Figure 3: Example of a GitHub API response
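A minimal sketch of this lookup in Python, using only the standard library. Note one deliberate difference from the URL above: the token is sent in an Authorization header rather than in the query string, because to my knowledge GitHub no longer accepts tokens passed as query parameters. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def repo_request(owner: str, repo: str, token: str) -> urllib.request.Request:
    """Build the request for the repository endpoint of the GitHub API."""
    url = f"https://api.github.com/repos/{owner}/{repo}"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"token {token}",  # header-based authentication
    }
    return urllib.request.Request(url, headers=headers)


def main_language(payload: dict) -> Optional[str]:
    """Extract the "language" field; it is null when GitHub detects none."""
    return payload.get("language")


def fetch_main_language(owner: str, repo: str, token: str) -> Optional[str]:
    """Call the API and return the main language of the repository."""
    with urllib.request.urlopen(repo_request(owner, repo, token)) as response:
        return main_language(json.load(response))
```

Keeping the request building and the parsing separate from the network call makes both easy to test without hitting the API or its rate limits.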

4. A final JSON object is built, listing the trending repositories for each language.
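The real layout of this file is not reproduced here, so the sketch below uses an assumed shape: one key per language, each holding a sorted list of repository names. The "Unknown" bucket for repositories without a detected language is also an assumption.

```python
import json
from collections import defaultdict
from typing import Dict, List, Optional


def group_by_language(repo_languages: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Group repository names by main language (hypothetical output shape)."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, language in repo_languages.items():
        # Repositories with no detected language go into an "Unknown" bucket.
        grouped[language or "Unknown"].append(name)
    return {language: sorted(names) for language, names in sorted(grouped.items())}


def to_json(repo_languages: Dict[str, Optional[str]]) -> str:
    """Serialize the grouped result as the JSON text written to the file."""
    return json.dumps(group_by_language(repo_languages), indent=2)
```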

5. This file is then committed to GitHub and directly consumed by the UI.

Full source code:

1. A first Heroku buildpack, to handle the SSH connection to GitHub: https://github.com/maxday/heroku-buildpack-github-ssh

2. A second Heroku buildpack, to use mongoimport: https://github.com/maxday/heroku-buildpack-mongoimport

This is my very first technical post, so any feedback is more than welcome :) Feel free to clone, hack and share!

Maxime David.
