400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

Tabs or spaces. We are going to parse a billion files among 14 programming languages to decide which one is on top.

Felipe Hoffa
Aug 30, 2016 · 3 min read
Image for post
Image for post

The rules:


Image for post
Image for post


I used the already existing [bigquery-public-data:github_repos.sample_files] table, that lists the files of the top 400,000 repositories. From there I extracted all the contents for the files with the top languages extensions:

SELECT a.id id, size, content, binary, copies, sample_repo_name , sample_path
SELECT id, FIRST(path) sample_path, FIRST(repo_name) sample_repo_name
FROM [bigquery-public-data:github_repos.sample_files]
WHERE REGEXP_EXTRACT(path, r'\.([^\.]*)$') IN ('java','h','js','c','php','html','cs','json','py','cpp','xml','rb','cc','go')
) a
JOIN [bigquery-public-data:github_repos.contents] b
ON a.id=b.id
864.6s elapsed, 1.60 TB processed

That query took a relative long time since it involved joining a 190 million rows table with a 70 million rows one, and over 1.6 terabytes of contents. But don’t worry about having to run it, since I left the result publicly available at [fh-bigquery:github_extracts.contents_top_repos_top_langs].

In the [contents] table we have each unique file represented only once. To see the total number of files and size represented:

SELECT SUM(copies) total_files, SUM(copies*size) total_size
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
Image for post
Image for post
1 billion files, 14 terabytes of code

Then it was time to run the ranking according to the previously established rules:

SELECT ext, tabs, spaces, countext, LOG((spaces+1)/(tabs+1)) lratio
SELECT REGEXP_EXTRACT(sample_path, r'\.([^\.]*)$') ext,
SUM(best='tab') tabs, SUM(best='space') spaces,
COUNT(*) countext
SELECT sample_path, sample_repo_name, IF(SUM(line=' ')>SUM(line='\t'), 'space', 'tab') WITHIN RECORD best,
SELECT LEFT(SPLIT(content, '\n'), 1) line, sample_path, sample_repo_name
FROM [fh-bigquery:github_extracts.contents_top_repos_top_langs]
HAVING REGEXP_MATCH(line, r'[ \t]')
HAVING c>10 # at least 10 lines that start with space or tab
ORDER BY countext DESC
16.0s elapsed, 133 GB processed

Analyzing each line of 133 GBs of code in 16 seconds? That’s why I love BigQuery.

Update: Press and social media

Reddit, Hacker News, Slashdot

Gizmodo, Business Insider, ADTMag, i-programmer

Gizmodo .es, Business Insider .pl, Le blog du Modérateur, Tecnoblog (pt), Biglobe (jp), My Drivers (cn), Genk .vn.


Want more stories? Check my medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store