Sergey Abakumoff
Aug 9, 2016

In his great book “How Google Works”, Eric Schmidt explains what makes the Internet Century so astonishing:

Three powerful technology trends have converged to fundamentally shift the playing field in most industries. First, the Internet has made information free, copious, and ubiquitous— practically everything is online. Second, mobile devices and networks have made global reach and continuous connectivity widely available. And third, cloud computing has put practically infinite computing power and storage and a host of sophisticated tools and applications at everyone’s disposal, on an inexpensive, pay-as-you-go basis.

My story shows an example of using abundant, publicly available data and free cloud computing to gain interesting insights into the world of open-source software.

The people are on Facebook, and the open-source developers are on GitHub, where over 12 million geeks contribute to over 30 million projects every day. A large part of this community uses JavaScript, the most popular programming language on Earth, to write code, and the Node Package Manager (npm) to distribute it. The npm ecosystem is convenient in the sense that it instantly provides developers with a bonanza of tools and libraries that let them get off the ground quickly, or not so much ;); in any case, a new JavaScript project nowadays starts with a couple of commands: npm init and npm install <list of modules>. The specified modules are downloaded from the npm repository to the local machine and can then be used in the code of the new project.

The repository hosts over 250 thousand “packages of reusable code”, and the npm web site exposes “most depended-upon” statistics based on the number of downloads in the last day/week/month. However, these numbers seem to be a vanity metric, because a package is downloaded every time someone installs or updates a project's dependencies. Is there any other way to rank the npm repositories? That is what this story is about!

In the summer of 2016, GitHub and Google made the open-source data available to everyone in BigQuery; here are the mind-boggling numbers:

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

And it is possible to query this data with Google's computing power for free, as long as you stay within the free monthly quota (which I managed to exceed). This is exactly what Eric Schmidt talked about in his book. Now let's see how we can rank npm repositories by using the content of open-source GitHub files!

In order to perform the steps shown below, I signed in to the Google Cloud Console, created a new project called “GithubDataQueries”, then opened the GitHub public dataset, switched to my project, and created the “NpmStat” dataset. At this point the relevant UI looks like this:

First steps

The projects that leverage npm keep their package information, including the list of dependencies, in a file called “package.json”, so let's select these files from the github_repos.files table and save the results to a new table for further use:

Selecting package.json files
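The query itself is in the screenshot above; for readers following along in text, here is a sketch of what it might look like in BigQuery's legacy SQL. The public dataset location and column names are per the github_repos schema; the destination table NpmStat.package_json_files is set via the query's “Destination Table” option:

```sql
-- Sketch (legacy SQL): find every package.json in the public GitHub dataset.
-- Results are saved to NpmStat.package_json_files via the "Destination Table" option.
SELECT id, repo_name, path
FROM [bigquery-public-data:github_repos.files]
WHERE path = 'package.json' OR REGEXP_MATCH(path, r'/package\.json$')
```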

How many files are we talking about here? Let's find out by counting the rows of the newly created NpmStat.package_json_files table:

Counting package.json files
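The counting query is presumably a one-liner along these lines (legacy SQL, against the table saved in the previous step):

```sql
-- Sketch: count the package.json files collected above.
SELECT COUNT(*) FROM [githubdataqueries:NpmStat.package_json_files]
```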

So, there are over 8 million package.json files in the open-source data. Note how fast these queries execute! That is only feasible because we are using some of the most powerful computing infrastructure in the world. Now let's select the contents of the package.json files and save them to a new table:

selecting content of package.json files
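The file contents live in the separate github_repos.contents table, keyed by file id, so the query in the screenshot presumably joins it against the table of package.json files saved earlier. A hedged sketch in legacy SQL (the destination table name NpmStat.package_json_content matches the one referenced later in the article):

```sql
-- Sketch: fetch the content of each package.json found earlier,
-- joining the public contents table on the file id.
SELECT c.content AS content
FROM [bigquery-public-data:github_repos.contents] c
JOIN EACH [githubdataqueries:NpmStat.package_json_files] f
ON c.id = f.id
```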

A package.json file usually contains two types of dependencies, which I call runtime and design-time: the latter are typically required to run the tests or build the documentation, while the former are needed to actually use the package in your code. Here is an example from one of my projects:

runtime and design time dependencies
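In case the screenshot does not render, here is a minimal, hypothetical package.json illustrating the two kinds (the package names are just examples): runtime dependencies live under "dependencies", while design-time ones live under "devDependencies":

```json
{
  "name": "example-package",
  "version": "1.0.0",
  "dependencies": {
    "lodash": "^4.13.1"
  },
  "devDependencies": {
    "mocha": "^2.5.3",
    "jsdoc": "^3.4.0"
  }
}
```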

The next query extracts the runtime dependency lists from the contents of the package.json files using a simple regular expression, then splits them on the “,” separator, and finally strips the version, producing a flat list of the npm repository names found in the open-source projects on GitHub:

SELECT Substr(dep, 1, Instr(dep, ":") - 1) FROM(
  SELECT Split(deps, ",") dep FROM(
    SELECT Regexp_extract(content, r'\"dependencies\"\s*\:\s*\{([^}]*)\}') deps FROM [githubdataqueries:NpmStat.package_json_content]))
Selecting the flat list of the repositories

That's it: now we can rank the repositories by how frequently their names appear in package.json files:

Frequency of repository name usage
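Assuming the flat list from the previous query was saved to a table (say, NpmStat.dependency_names with a single name column; that table and column name are assumptions, since the original query is only in the screenshot), the ranking query could look like this in legacy SQL:

```sql
-- Sketch: rank npm repositories by how often they appear
-- as a runtime dependency across open-source package.json files.
SELECT name, COUNT(*) AS usage_count
FROM [githubdataqueries:NpmStat.dependency_names]
GROUP EACH BY name
ORDER BY usage_count DESC
LIMIT 10
```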

Here are the top 10 repositories:

Top 10 npm repositories

Let's compare this to the npm statistics (the order is left-to-right, top-to-bottom):

The results are close, which is expected, but they still differ: for example, “underscore” is 4th in npm's list, but it is not even in the top 10 of our list (in fact, it is 12th)! Using the very same technology, one could calculate the PageRank of npm repositories and build “the Google of npm”. How does that sound?

Though the calculations and results shown here might not seem very useful, they showcase an entertaining (hopefully) example of taking advantage of public data and cheap cloud services to obtain quantitative metrics about various aspects of digital life. I will continue to explore the world of open-source JavaScript projects and will share a new story as soon as I find something interesting!