Cataloging GitHub

Kushagra Singh
New Impetus

--

Dammm.. This is gonna take some time.

GitHub’s REST API neatly exposes a GET endpoint (`GET /repositories`) that pages through complete details of every public repository ever created on this temple of geekdom.

So what do you do on a late Friday night? You write up a sketchy Node.js server that makes recursive calls to this API endpoint, dumps all the data into a glorified text file, and then publishes it on GitHub. [Get the code here]

Bravo on your drunk escapade. Now what ?

Now. You WAIT.

According to my humble estimates, there are ~32 million repos on GitHub. My server has been running non-stop on my weary MBP for nearly 36 hrs now. A grand total of 6 million repos have been fetched, which, according to the highly advanced mathematics I performed on my fingers, means it will probably be a few days before the process reaches its conclusion. (I am pretty sure this process could be significantly sped up, but I am actually content with the way it is.)
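A quick back-of-the-envelope check on that finger math, plugging in the numbers above:

```javascript
// Back-of-the-envelope ETA from the figures quoted above.
const total = 32e6; // estimated public repos on GitHub
const done = 6e6;   // repos fetched so far
const hours = 36;   // wall-clock time so far

const rate = done / hours;               // repos fetched per hour
const etaDays = (total - done) / rate / 24; // days left at the current pace

console.log(`${rate.toFixed(0)} repos/hr, ~${etaDays.toFixed(1)} days to go`);
// ≈ 166667 repos/hr, ≈ 6.5 days left
```

So "a few days" is roughly six and a half, assuming the MBP doesn't melt first.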

All this for what ?

Running grep queries on this mammoth chunk of data should be interesting. Fancy graphs could also be deployed. Humanity could be saved. Pepperoni pizzas could become free. Twitter might stop sinking. Who knows ?

BTW, what about a Vim vs. Emacs showdown on the number of GitHub projects? Mmmmm tasty!!

Do tune in later for more on this.
