All the open source code in GitHub now shared within BigQuery: Analyze all the code!


All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Official sources:

In-depth analysis:

I’m waiting for your contributions — I will add them here:

A series of posts by Robert Kozikowski:


  • Don’t analyze the main [bigquery-public-data:github_repos.contents] table: at 1.5 TB, a single full scan will use up your entire monthly free terabyte. Instead, use the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
  • How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
  • I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
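
To make the first tip concrete, here is a rough sketch of the quota arithmetic, using only the sizes quoted above. It assumes decimal units (1 TB = 1000 GB) and that a query reads every column of the table; BigQuery bills by the columns a query actually touches, so real queries may scan less. The function names are made up for this illustration.

```python
# Sizes quoted in the post, in GB. Decimal units are an assumption
# made for simplicity; this is an estimate, not a billing calculation.
FREE_TIER_GB = 1000  # the "monthly free terabyte"

TABLE_SIZES_GB = {
    "bigquery-public-data:github_repos.contents": 1500,       # 1.5 TB
    "bigquery-public-data:github_repos.sample_contents": 23,  # ~23 GB
}


def quota_share(table_gb, free_gb=FREE_TIER_GB):
    """Fraction of the monthly free quota used by one full scan."""
    return table_gb / free_gb


def full_scans_per_month(table_gb, free_gb=FREE_TIER_GB):
    """Number of complete scans of the table that fit in the free quota."""
    return free_gb // table_gb


for name, size_gb in TABLE_SIZES_GB.items():
    print(
        f"{name}: {quota_share(size_gb):.1%} of quota per full scan, "
        f"{full_scans_per_month(size_gb)} full scans per month"
    )
```

Under these assumptions, one full scan of the main table already exceeds the free quota, while the sample extract leaves room for a few dozen full scans per month.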