gitbase: exploring git repos with SQL

Francesc Campoy
Oct 30, 2018 · 4 min read

Note: a video version of this content is available at the bottom.

Git has become the de-facto standard for code versioning, but its popularity didn’t remove the complexity of performing deep analyses of the history and contents of source code repositories.

SQL, on the other hand, is a battle-tested language for querying large datasets, as its adoption by projects like Spark and BigQuery shows.

So it is only logical that at source{d} we chose these two technologies to create gitbase: the Code as Data solution for large-scale analysis of git repositories with SQL.

Gitbase is a fully open source project that stands on the shoulders of a series of giants which made its development possible. This blog post aims to point out the main ones.

Parsing SQL with vitess

Gitbase’s user interface is SQL: this means we need to be able to parse and understand the SQL requests that arrive through the network following the MySQL protocol. Fortunately for us, this was already implemented by our friends at YouTube and their vitess project.

We simply grabbed the pieces of code that mattered to us and made them into an open source project that allows anyone to write a MySQL server in minutes (as I showed in my justforfunc episode CSVQL — serving CSV with SQL).

Reading git repositories with go-git

Once we’ve parsed a request we still need to figure out how to answer it by reading the git repositories in our dataset. For this we integrated source{d}’s most successful repository, go-git.

This allowed us to easily analyze repositories stored on disk as siva files (again an Open Source project by source{d}) or simply cloned with git clone.

Detecting languages with enry and parsing files with babelfish

Gitbase does not stop its analytic power at the git history. It also integrates language detection with our (obviously) Open Source project enry (src-d/enry) and program parsing with babelfish (bblfsh/bblfshd).

These two features are exposed in gitbase as the user functions LANGUAGE and UAST. Together they make requests like “find the name of the function that was most often modified during the last month” possible.
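To make the idea behind a function like LANGUAGE more concrete, here is a toy sketch of language detection from a file path and its content. It only looks at two simple signals, the file extension and the shebang line. The lookup tables and the detectLanguage function are invented for this example; they are not enry's API or data, and enry combines more signals than these two.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// byExtension and byInterpreter are tiny illustrative tables,
// not the data enry actually ships with.
var byExtension = map[string]string{
	".go": "Go",
	".py": "Python",
	".rb": "Ruby",
	".sh": "Shell",
}

var byInterpreter = map[string]string{
	"python":  "Python",
	"python3": "Python",
	"ruby":    "Ruby",
	"sh":      "Shell",
	"bash":    "Shell",
}

// detectLanguage guesses a language from the file name first and falls
// back to the shebang line. It returns "" when it cannot decide.
func detectLanguage(name string, content []byte) string {
	if lang, ok := byExtension[strings.ToLower(path.Ext(name))]; ok {
		return lang
	}

	firstLine, _, _ := strings.Cut(string(content), "\n")
	if !strings.HasPrefix(firstLine, "#!") {
		return ""
	}
	fields := strings.Fields(strings.TrimPrefix(firstLine, "#!"))
	if len(fields) == 0 {
		return ""
	}
	interpreter := path.Base(fields[0])
	// "#!/usr/bin/env python3" names the real interpreter as an argument.
	if interpreter == "env" && len(fields) > 1 {
		interpreter = fields[1]
	}
	return byInterpreter[interpreter]
}

func main() {
	fmt.Println(detectLanguage("cmd/gitbase/main.go", nil))
	fmt.Println(detectLanguage("scripts/deploy", []byte("#!/usr/bin/env python3\nprint('hi')\n")))
}
```

The signature mirrors what the text describes at a high level: the function receives a path and, optionally, the content of the file, and the content is only consulted when the name is not enough.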

Making it go fast

Gitbase analyzes really large datasets, e.g. Public Git Archive, with 3TB of source code from GitHub (announcement), and in order to do so every CPU cycle counts.

This is why we integrated two more projects into the mix: Rubex and Pilosa.

Speeding up regular expressions with Rubex and Oniguruma

Rubex (moovweb/rubex) is a quasi drop-in replacement for Go’s regexp standard library package. I say quasi because it does not implement the LiteralPrefix method on the regexp.Regexp type, but I also had never heard about that method until right now.

Rubex gets its performance from the highly optimized C library Oniguruma (kkos/oniguruma) which it calls using cgo.

Speeding up queries with Pilosa indexes

Indexes are a well-known feature of basically every relational database, but Vitess does not implement them since it doesn’t really need to.

But again Open Source came to the rescue with Pilosa (pilosa/pilosa), a distributed bitmap index implemented in Go which made gitbase usable on massive datasets.
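If you have never seen a bitmap index, the following in-memory sketch shows the core idea: for every column value we keep one bit per row, so a filter with several conditions becomes a bitwise AND. All the types and names here (bitmap, index, sampleIndex) are invented for this example. This is not Pilosa's API, and it leaves out everything that makes Pilosa interesting at scale, such as distribution and compression.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitmap is a fixed-size set of row IDs, one bit per row.
type bitmap []uint64

func newBitmap(rows int) bitmap { return make(bitmap, (rows+63)/64) }

func (b bitmap) set(row int) { b[row/64] |= 1 << (uint(row) % 64) }

// and returns the intersection of two bitmaps of equal length.
func (b bitmap) and(other bitmap) bitmap {
	out := make(bitmap, len(b))
	for i := range b {
		out[i] = b[i] & other[i]
	}
	return out
}

// rows lists the row IDs present in the bitmap, in ascending order.
func (b bitmap) rows() []int {
	var ids []int
	for i, word := range b {
		for word != 0 {
			ids = append(ids, i*64+bits.TrailingZeros64(word))
			word &= word - 1 // clear the lowest set bit
		}
	}
	return ids
}

// index keeps one bitmap per "column=value" pair.
type index struct {
	size    int
	bitmaps map[string]bitmap
}

func newIndex(size int) *index {
	return &index{size: size, bitmaps: make(map[string]bitmap)}
}

func (ix *index) add(row int, column, value string) {
	key := column + "=" + value
	bm, ok := ix.bitmaps[key]
	if !ok {
		bm = newBitmap(ix.size)
		ix.bitmaps[key] = bm
	}
	bm.set(row)
}

// lookup returns an empty bitmap for values that were never indexed.
func (ix *index) lookup(column, value string) bitmap {
	if bm, ok := ix.bitmaps[column+"="+value]; ok {
		return bm
	}
	return newBitmap(ix.size)
}

// sampleIndex indexes four made-up file rows by language and author.
func sampleIndex() *index {
	data := []struct{ lang, author string }{
		{"Go", "alice"}, {"Python", "alice"}, {"Go", "bob"}, {"Go", "alice"},
	}
	ix := newIndex(len(data))
	for row, d := range data {
		ix.add(row, "lang", d.lang)
		ix.add(row, "author", d.author)
	}
	return ix
}

func main() {
	ix := sampleIndex()
	// Equivalent to: WHERE lang = 'Go' AND author = 'alice'
	hits := ix.lookup("lang", "Go").and(ix.lookup("author", "alice"))
	fmt.Println(hits.rows())
}
```

The point of the structure is that answering the filter never touches the rows themselves: only the matching row IDs are produced, and only those rows need to be read afterwards.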


I’d like to use this blog post to personally thank the Open Source community that made it possible for us to create gitbase in a much shorter period than anyone would have expected. At source{d} we are firm believers in Open Source and every single line of code under (including our OKRs and investor board) is a testament to that.

Would you like to give gitbase a try? The fastest and easiest way is with source{d} Engine. Download it from and get gitbase running with a single command!

Want to know more?

Check out the recording of my talk at the Go SF meetup.


