gitbase: exploring git repos with SQL

Note: a video version of this content is available at the bottom.

Git has become the de facto standard for code versioning, but its popularity hasn’t removed the complexity of performing deep analyses of the history and contents of source code repositories.

SQL, on the other hand, is a battle-tested language for querying large datasets, as its adoption by projects like Spark and BigQuery shows.

So it is only logical that at source{d} we chose these two technologies to create gitbase: the Code as Data solution for large-scale analysis of git repositories with SQL.

Gitbase is a fully open source project (github.com/src-d/gitbase) that stands on the shoulders of a series of giants which made its development possible. This blog post aims to point out the main ones.

The gitbase playground (github.com/src-d/gitbase-web) provides a visual way to use gitbase.

Parsing SQL with vitess

Gitbase’s user interface is SQL: this means we need to be able to parse and understand the SQL requests that arrive through the network following the MySQL protocol. Fortunately for us, this was already implemented by our friends at YouTube and their vitess project.

We simply grabbed the pieces of code that mattered to us and turned them into an open source project, github.com/src-d/go-mysql-server, that allows anyone to write a MySQL server in minutes (as I showed in my justforfunc episode CSVQL — serving CSV with SQL).

Vitess is a database clustering system for horizontal scaling of MySQL.

Reading git repositories with go-git

Once we’ve parsed a request, we still need to figure out how to answer it by reading the git repositories in our dataset. For this we integrated source{d}’s most successful repository, go-git.

This allowed us to easily analyze repositories stored on disk as siva files (again an Open Source project by source{d}) or simply cloned with git clone.

A highly extensible Git implementation in pure Go.

Detecting languages with enry and parsing files with babelfish

Gitbase does not stop its analytic power at the git history: it also integrates language detection with our (obviously) Open Source project enry (src-d/enry) and program parsing with babelfish (bblfsh/bblfshd).

These two features are exposed in gitbase as the user functions LANGUAGE and UAST. Together they make requests like “find the name of the function that was most often modified during the last month” possible.

babelfish is a self-hosted server for universal source code parsing, turning code files into Universal Abstract Syntax Trees (UASTs).

Making it go fast

Gitbase analyzes really large datasets, such as Public Git Archive with its 3TB of source code from GitHub (announcement), and in order to do so every CPU cycle counts.

This is why we integrated two more projects into the mix: Rubex and Pilosa.

Speeding up regular expressions with Rubex and Oniguruma

Rubex (moovweb/rubex) is a quasi drop-in replacement for Go’s regexp standard library package. I say quasi because it does not implement the LiteralPrefix method on the regexp.Regexp type, but I had also never heard of that method until right now.

Rubex gets its performance from the highly optimized C library Oniguruma (kkos/oniguruma) which it calls using cgo.
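In case LiteralPrefix is new to you too, here is a minimal sketch of what it does in the standard regexp package: it reports the literal string that every match must start with, and whether that literal is the entire expression. The helper name literalPrefix and the sample patterns are my own, chosen only for illustration. Nothing here touches Rubex.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPrefix is a hypothetical helper for this post. It compiles pattern
// with the standard library and returns the literal string that must begin
// any match, plus whether that literal is the whole expression.
func literalPrefix(pattern string) (string, bool) {
	re := regexp.MustCompile(pattern)
	return re.LiteralPrefix()
}

func main() {
	// Sample patterns picked for illustration only.
	patterns := []string{
		`func main`,   // a pure literal
		`func [a-z]+`, // a literal followed by a character class
		`[a-z]+\.go`,  // no literal at the start
	}
	for _, p := range patterns {
		prefix, complete := literalPrefix(p)
		fmt.Printf("%-14q prefix=%q complete=%v\n", p, prefix, complete)
	}
}
```

A caller can use the prefix to reject candidates with a cheap string comparison before running the full expression, which is presumably why the method exists at all.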

Speeding up queries with Pilosa indexes

Indexes are a well-known feature of basically every relational database, but Vitess does not implement them since it doesn’t really need to.

But again Open Source came to the rescue with Pilosa (pilosa/pilosa), a distributed bitmap index implemented in Go which made gitbase usable on massive datasets.

Pilosa is an open source, distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
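To give an idea of why a bitmap index helps, here is a toy sketch of the general technique using only the Go standard library. It is not Pilosa’s API and not how gitbase talks to Pilosa: the types bitmap and index, the "column=value" key format, and the sample rows are all assumptions made up for this illustration. Each value gets a bitmap with one bit per row, and a query with several conditions becomes a bitwise AND.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitmap holds one bit per row: bit i is set when row i matches.
// A single uint64 limits this toy to 64 rows; rows beyond that are
// silently ignored. Real bitmap indexes use much larger, compressed bitmaps.
type bitmap uint64

// index maps a "column=value" key to the bitmap of rows with that value.
type index map[string]bitmap

// add records that the given row holds value in column.
func (ix index) add(column, value string, row uint) {
	ix[column+"="+value] |= 1 << row
}

// lookup returns, in ascending order, the rows that match every key.
func (ix index) lookup(keys ...string) []uint {
	if len(keys) == 0 {
		return nil
	}
	result := ix[keys[0]]
	for _, k := range keys[1:] {
		result &= ix[k] // intersect the row sets with a single AND
	}
	var rows []uint
	for result != 0 {
		row := uint(bits.TrailingZeros64(uint64(result)))
		rows = append(rows, row)
		result &^= 1 << row // clear the bit we just reported
	}
	return rows
}

// sampleIndex indexes four hypothetical files by language and author.
func sampleIndex() index {
	ix := index{}
	files := []struct{ lang, author string }{
		{"Go", "alice"},   // row 0
		{"Python", "bob"}, // row 1
		{"Go", "bob"},     // row 2
		{"Go", "alice"},   // row 3
	}
	for i, f := range files {
		ix.add("lang", f.lang, uint(i))
		ix.add("author", f.author, uint(i))
	}
	return ix
}

func main() {
	ix := sampleIndex()
	fmt.Println(ix.lookup("lang=Go", "author=alice")) // prints [0 3]
}
```

The point of the design is that answering the query never scans the rows themselves: the cost is one AND per condition, whatever the rows contain.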

Conclusion

I’d like to use this blog post to personally thank the Open Source community that made it possible for us to create gitbase in a much shorter period than anyone would have expected. At source{d} we are firm believers in Open Source and every single line of code under github.com/src-d (including our OKRs and investor board) is a testament to that.

Would you like to give gitbase a try? The fastest and easiest way is with source{d} Engine. Download it from sourced.tech/engine and get gitbase running with a single command!

Want to know more?

Check out the recording of my talk at the Go SF meetup.

Recording from my talk with the same title at the Go SF Meetup.