GitHub Cancer

Sergey Abakumoff
Aug 16, 2016

Earlier today I sent the following tweet:

I stumbled upon the “isArray” thing while exploring the public GitHub data available on the Google BigQuery platform. This story presents some interesting new findings from that exploration.

The previous stories (1, 2) analyzed the contents of package.json files, the descriptors used for dependency management of Node.js modules. Recently, I noticed some odd numbers in the GitHub data:

  1. The number of files named “package.json” is over 8 million.
  2. The number of distinct contents of the files named “package.json” is less than 1.5 million.

Huh? What is going on?

The answer is pretty simple: there is a huge number of duplicate package.json contents all over the open source code hosted on GitHub. Google BigQuery keeps only the unique contents, based on their hash, I guess. The contents table actually has a “copies” column that indicates the number of duplicates, so let’s sum the copies in the package_json_content table that was described earlier:

Number of package.json files obtained from contents table

Cool, the total number of package.json files is now closer to the one from the package_json_files table. The two are still not equal, but let’s ignore that for now and move on to the next query: what is the average number of exact copies of a package.json file?

Average number of duplicates
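The arithmetic behind these two queries is simple enough to sketch outside BigQuery. The snippet below is an illustration only: the rows are invented sample data, and `totalCopies` and `averageCopies` are hypothetical helper names. The only thing taken from the text is that each unique content row carries a `copies` count.

```javascript
// Each row stands for one unique package.json content, as in the contents table.
// The ids and counts are invented sample values, not real query results.
const rows = [
  { id: 'a1', copies: 1 },
  { id: 'b2', copies: 4 },
  { id: 'c3', copies: 10 },
];

// Total number of files = sum of the copies of every unique content.
function totalCopies(rows) {
  return rows.reduce((sum, row) => sum + row.copies, 0);
}

// Average number of exact copies per unique content.
function averageCopies(rows) {
  return rows.length === 0 ? 0 : totalCopies(rows) / rows.length;
}

console.log(totalCopies(rows));   // 15
console.log(averageCopies(rows)); // 5
```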

Let’s take a closer look at this phenomenon. I composed and ran the following query to extract, for each package.json file, its NPM name, its URL on GitHub (one URL per duplicate) and its number of copies. For the sake of simplicity, the query selects only the records with more than 100 copies.

I saved the results in a new table called package_json_duplicates, then ran yet another query that groups the data by file id and sorts it by number of copies in descending order:
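The same filter, group and sort steps can be sketched in plain JavaScript. This is a rough illustration and not the BigQuery query itself: the sample rows, the field names and the helper `topDuplicates` are assumptions, and the "more than 100 copies" filter from the first query is folded into the same function for brevity.

```javascript
// Invented sample rows: one row per (file id, path) pair.
const duplicates = [
  { id: 'x', name: 'foo', copies: 250, path: 'repo1/node_modules/foo/package.json' },
  { id: 'x', name: 'foo', copies: 250, path: 'repo2/node_modules/foo/package.json' },
  { id: 'y', name: 'bar', copies: 900, path: 'repo3/node_modules/bar/package.json' },
  { id: 'z', name: 'baz', copies: 40, path: 'repo4/package.json' },
];

function topDuplicates(rows, minCopies = 100) {
  const byId = new Map();
  for (const row of rows) {
    if (row.copies <= minCopies) continue; // keep only heavily duplicated files
    if (!byId.has(row.id)) {
      // Collapse all paths of the same file id into a single entry.
      byId.set(row.id, { id: row.id, name: row.name, copies: row.copies });
    }
  }
  // Sort by number of copies, largest first.
  return [...byId.values()].sort((a, b) => b.copies - a.copies);
}

console.log(topDuplicates(duplicates).map((r) => r.name)); // [ 'bar', 'foo' ]
```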

Here are the top 20 results:

JSON_ERROR in the repo_name column means that the package.json file can’t be parsed. There are multiple reasons for that:

  1. The file is empty or invalid; even the best of us sometimes forget to clean up before pushing a commit.
  2. The package.json is built dynamically at compile time; here is an example:
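Both failure modes can be reproduced with a guarded parse. The sketch below is an assumption about how such a marker might be produced, not the code used for the analysis: `npmNameOf` is a hypothetical helper, and the template string is an invented example of a package.json that is only filled in at build time.

```javascript
// Returns the "name" field of a package.json, or the JSON_ERROR marker
// when the content cannot be parsed (hypothetical helper, for illustration).
function npmNameOf(content) {
  let pkg;
  try {
    pkg = JSON.parse(content);
  } catch (err) {
    return 'JSON_ERROR';
  }
  // Valid JSON that is not an object with a string "name" has no usable name.
  const hasName = pkg !== null && typeof pkg === 'object' && typeof pkg.name === 'string';
  return hasName ? pkg.name : null;
}

console.log(npmNameOf('{"name": "isarray"}')); // isarray
console.log(npmNameOf(''));                    // JSON_ERROR
// A build-time template is not valid JSON until its placeholders are filled in.
console.log(npmNameOf('{ "name": "<%= name %>", "version": <%= version %> }')); // JSON_ERROR
```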

But now let’s take a look at the 5th row of the results. It says that there are 47343 instances of the package.json file of the isArray repository... and actually there are more than that! The total number is 99692. WOW!

Think about it: a module that exposes very basic code was used in thousands of projects all over GitHub, either as a direct dependency or as an “inherited” dependency (project B depends on project A, which depends on “isarray”). Moreover, it seems that people don’t care about keeping their source code clean and carelessly commit the node_modules folder of their projects; here is a snapshot of some paths to isarray’s package.json:

Is it just me, or does it resemble cancer? A ton of duplicate, sometimes sick, code has spread all over the GitHub/npm ecosystem and inflated the size of projects’ source code, like a tumor. This is why GitHub hosts over 8 million package.json files while only ~1.5 million of them are unique.
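To put the size of that dependency in perspective, an array check fits in a few lines. The following is a minimal sketch, assuming the logic amounts to the built-in `Array.isArray` with a fallback for engines that lack it; it is an illustration and not the module's verbatim source.

```javascript
// Prefer the built-in check; fall back to inspecting the internal class tag.
const isArray = Array.isArray || function (value) {
  return Object.prototype.toString.call(value) === '[object Array]';
};

console.log(isArray([1, 2, 3]));     // true
console.log(isArray('abc'));         // false
console.log(isArray({ length: 0 })); // false
```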

Okay, I get it: developers are pretty busy and don’t have time to write a line of code that checks whether the input value is an Array. But look, even if you are going to use that module, what on Earth makes you keep the node_modules folder in the source code? Is there any valid reason for that? Please check the .gitignore file of your project and make sure that it excludes node_modules from the tracked files; don’t contribute to the development of GitHub Cancer.
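That check is easy to automate. The sketch below is a rough approximation under stated assumptions: `ignoresNodeModules` is a hypothetical helper that looks only for the most common spellings of the rule, and it does not implement full gitignore semantics (negated patterns and nested .gitignore files are ignored).

```javascript
// Rough check: does a .gitignore text contain a line that ignores node_modules?
function ignoresNodeModules(gitignoreText) {
  return gitignoreText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#')) // skip blanks and comments
    // Accept node_modules, node_modules/, /node_modules and **/node_modules.
    .some((line) => /^(\*\*\/|\/)?node_modules\/?$/.test(line));
}

console.log(ignoresNodeModules('# deps\nnode_modules/\n')); // true
console.log(ignoresNodeModules('dist/\n*.log\n'));          // false
```

In a real project the text could come from `fs.readFileSync('.gitignore', 'utf8')`; the string literals above just keep the example self-contained.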
