How to detect GitHub trending repositories using GH Archive, Heroku, MongoDB and the GitHub API
Before diving into how I built my own tool, I would like to explain why I started this side project.
IT is changing fast, very fast, and I need to make sure that my skills stay up to date with new technologies, methodologies and processes. As a consequence, I often browse trending projects on GitHub, but I wanted something more customized, updated very often and totally free. That’s why I started this project.
Spoiler alert: this article is more about the architecture of this tool and less about the data mining part. The trend detection algorithm I came up with is very (very) straightforward. You may be disappointed :)
The API is just a single JSON file describing the trending repositories for each language. I also wrote a (very) basic UI to consume it, which looks like this:
As you can see, the UI could not be simpler, but it does the job I needed!
Which tools, and why?
- GH Archive. This website provides one compressed JSON file for each hour of public GitHub activity. https://www.gharchive.org/
- Heroku. https://www.heroku.com/
- Free Heroku add-on: Heroku Scheduler. An easy way to run scripts periodically.
- Free Heroku add-on: mLab MongoDB. I chose a MongoDB database and did the aggregation with Node.js to take advantage of the JSON files provided by GH Archive. Plus, having a full JS stack is pretty convenient.
- GitHub API. The only way to detect the main languages of a given repository. https://developer.github.com/v3/
- Custom Heroku buildpacks. I had to write two of them, one for handling SSH to GitHub and the other for using mongoimport. Both are open source; see the links at the end of this article. https://devcenter.heroku.com/articles/buildpacks
- No configuration is hardcoded in the scripts; every setting is handled via Heroku environment variables.
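To illustrate the last point, here is a minimal sketch of how a Node.js script could read its settings from the environment. The variable names (`MONGODB_URI`, `GITHUB_TOKEN`, `MIN_STARS`) are assumptions made for this example and may not match the ones used in the actual scripts.

```javascript
// Hypothetical setting names; the original scripts may use different ones.
const REQUIRED_SETTINGS = ['MONGODB_URI', 'GITHUB_TOKEN'];

// Build a config object from an environment map (process.env on Heroku),
// failing fast when a required variable is missing.
function loadConfig(env) {
  const missing = REQUIRED_SETTINGS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    mongoUri: env.MONGODB_URI,
    githubToken: env.GITHUB_TOKEN,
    // Optional setting with a default value (also an assumption).
    minStars: Number(env.MIN_STARS || 3),
  };
}
```

A script would then call `loadConfig(process.env)` once at startup, so that a missing setting stops the run before any data is fetched.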
How does it work?
1. A Heroku Scheduler job triggers a set of scripts every hour.
2. The GH Archive file corresponding to the last hour is fetched.
3. This archive is imported into MongoDB, the trending repositories are computed, and an output JSON file is generated.
4. MongoDB is cleaned up (to make sure we don’t exceed the free quota).
5. The JSON file is added, committed and pushed to GitHub.
6. A simple UI hosted on GitHub Pages reads this file and displays the result.
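As an illustration of the fetch step, the sketch below computes the URL of the archive for the previous hour. It relies on the file naming documented by GH Archive (`YYYY-MM-DD-H.json.gz` in UTC, with the hour not zero-padded); the function name is made up for this example and is not taken from the project.

```javascript
// Build the GH Archive URL for the hour before `now`, in UTC.
// GH Archive names its files YYYY-MM-DD-H.json.gz (hour without leading zero).
function previousHourArchiveUrl(now = new Date()) {
  const t = new Date(now.getTime() - 60 * 60 * 1000);
  const pad = (n) => String(n).padStart(2, '0');
  const day = `${t.getUTCFullYear()}-${pad(t.getUTCMonth() + 1)}-${pad(t.getUTCDate())}`;
  return `https://data.gharchive.org/${day}-${t.getUTCHours()}.json.gz`;
}
```

Subtracting one hour from the timestamp, rather than from the hour number, keeps the date correct when the job runs just after midnight.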
Here is the global architecture of the project:
How did I detect trending repositories?
As I said, this is a very simple approach, and fine-tuning this algorithm would be a nice next step.
1. Load the archive into MongoDB using mongoimport
2. Perform an aggregation query
This query returns all the repositories that were starred three times (across all the records of the archive, i.e. over one hour).
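The exact query is not reproduced here, so the following is only a sketch of what such an aggregation could look like. It assumes that stars appear in GH Archive as `WatchEvent` records, that repositories are identified by `repo.name`, and that the threshold means "at least three"; the function names are hypothetical.

```javascript
// Assumed pipeline: count star events (recorded as WatchEvent) per repository
// and keep the repositories with at least `minStars` of them, most starred first.
function buildTrendingPipeline(minStars = 3) {
  return [
    { $match: { type: 'WatchEvent' } },
    { $group: { _id: '$repo.name', stars: { $sum: 1 } } },
    { $match: { stars: { $gte: minStars } } },
    { $sort: { stars: -1 } },
  ];
}

// Plain JavaScript equivalent, handy for checking the logic without a database.
function countTrending(events, minStars = 3) {
  const counts = new Map();
  for (const event of events) {
    if (event.type !== 'WatchEvent') continue;
    counts.set(event.repo.name, (counts.get(event.repo.name) || 0) + 1);
  }
  return [...counts]
    .filter(([, stars]) => stars >= minStars)
    .map(([name, stars]) => ({ _id: name, stars }))
    .sort((a, b) => b.stars - a.stars);
}
```

With the official Node.js driver, the pipeline would be passed to `collection.aggregate(buildTrendingPipeline()).toArray()`, where `collection` is whichever collection the archive was imported into.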
3. I use the GitHub API to detect the main language of each of these repositories
Indeed, by fetching this URL
you receive a lot of information, including the main language(s), as shown in the image below
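As a rough sketch of this step, the GitHub REST API exposes `GET /repos/{owner}/{repo}`, whose response contains a `language` field holding the repository's primary language. Whether this is the exact call used in the project is an assumption, and the helper names, the `Unknown` fallback and the `User-Agent` value are invented for the example.

```javascript
// Build the REST URL for a repository from its "owner/name" full name.
function repoApiUrl(fullName) {
  return `https://api.github.com/repos/${fullName}`;
}

// Extract the primary language from a repository response.
// The API returns null when no language was detected.
function mainLanguage(repoResponse) {
  return repoResponse.language || 'Unknown';
}

// Fetch the primary language; needs a runtime with a global fetch (Node 18+).
async function fetchMainLanguage(fullName, token) {
  const headers = {
    Accept: 'application/vnd.github+json',
    'User-Agent': 'trending-sketch', // hypothetical client name
  };
  if (token) headers.Authorization = `Bearer ${token}`;
  const response = await fetch(repoApiUrl(fullName), { headers });
  if (!response.ok) throw new Error(`GitHub API returned ${response.status}`);
  return mainLanguage(await response.json());
}
```

Passing a token is optional, but unauthenticated requests have a much lower rate limit, which matters when many repositories are looked up every hour.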
4. A final JSON object is built and looks like this:
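The real output format is not reproduced here. As a purely illustrative sketch, the trending repositories could be grouped by language as follows; the field names (`name`, `stars`, `language`) and the resulting shape are hypothetical, not the project's actual schema.

```javascript
// Group trending repositories by language.
// The output shape is hypothetical, not the author's actual format.
function groupByLanguage(repos) {
  const byLanguage = {};
  for (const { name, stars, language } of repos) {
    if (!byLanguage[language]) byLanguage[language] = [];
    byLanguage[language].push({ name, stars });
  }
  return byLanguage;
}
```

The result can then be serialized with `JSON.stringify(groupByLanguage(repos), null, 2)` and written to the file that gets pushed.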
5. This file is then committed to GitHub and directly consumed by the UI.
Full source code:
- Heroku app: https://github.com/maxday/github-trending-heroku
- Two custom buildpacks:
1. The first one handles the SSH connection to GitHub: https://github.com/maxday/heroku-buildpack-github-ssh
2. The second one provides mongoimport: https://github.com/maxday/heroku-buildpack-mongoimport
This is my very first technical post, so any feedback is more than welcome :) Feel free to clone, hack and share!