Why we picked Clojure

A common question we’ve been getting is why we picked Clojure as our primary server language at Metabase. Now, the choice of a programming language for a project usually comes down to a mix of what a team is most comfortable with, beer-fueled debates on dynamic vs static typing, benchmarks of marginal relevance and that elusive quest for Dev-Hipster points. In our case, we arrived at our primary backend language by a rather circuitous path.

The beginning

Back in the dark ages, Metabase started out life as a built-in analytics system for one of Expa’s first products which happened to be written in Python using the Django web framework. After seeing a need for a similar system for every other company started at Expa, we factored it out of that project into its own standalone analytics system and extended it. In the end it comprised a BI server, an event collection + enrichment pipeline, 3rd party data importers and data warehouse management. To avoid rewriting it any more than necessary, it stayed a Django project. From early 2014 onwards it was used by Expa and its portfolio companies.

The BI server (which eventually became Metabase) at this point was a big ball of Python, with a Django backend and an Angular front end. Aside from a lot of CRUD operations on application data and our semantic model of a data warehouse, and serving up the frontend Angular application, there were two aspects of the application that were especially tricky. First, the main workhorse of our application was an API endpoint that accepted a query and executed it on a remote data warehouse. This call had a very wide band of latencies: some queries returned almost instantly while others ran for minutes (or worse, for some especially complicated queries). Second, all such queries were expressed in our Query Language and serialized as a JSON dictionary. When a query was executed, it would be compiled down to a SQL or MongoDB query.
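To make the "JSON dictionary compiled down to SQL" idea concrete, here is a minimal Python sketch. The query shape (source_table, aggregation, breakout, filter, limit) and every function name are assumptions made up for illustration; this is not the actual Metabase Query Language or its compiler. It only shows why the job amounts to walking a small tree.

```python
import json

# Hypothetical query shape, invented for this sketch.
EXAMPLE_QUERY = json.loads("""
{
  "source_table": "orders",
  "aggregation": ["count"],
  "breakout": ["status"],
  "filter": ["AND", ["=", "country", "US"], [">", "total", 100]],
  "limit": 10
}
""")

COMPARISONS = ("=", "!=", "<", ">", "<=", ">=")


def quote_ident(name):
    """Quote a table or column name as an ANSI SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def compile_filter(clause, params):
    """Recursively turn a nested filter clause into a SQL fragment.

    Values are appended to `params` and replaced by `?` placeholders.
    """
    op, *args = clause
    if op in ("AND", "OR"):
        parts = [compile_filter(sub, params) for sub in args]
        return "(" + f" {op} ".join(parts) + ")"
    if op in COMPARISONS:
        field, value = args
        params.append(value)
        return f"{quote_ident(field)} {op} ?"
    raise ValueError(f"unsupported filter operator: {op!r}")


def compile_query(query):
    """Compile the query dictionary into a (sql, params) pair."""
    params = []
    breakout = [quote_ident(f) for f in query.get("breakout", [])]
    columns = list(breakout)
    aggregation = query.get("aggregation")
    if aggregation == ["count"]:
        columns.append("COUNT(*)")
    elif aggregation is not None:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")
    if not columns:
        columns = ["*"]

    sql = f"SELECT {', '.join(columns)} FROM {quote_ident(query['source_table'])}"
    if "filter" in query:
        sql += " WHERE " + compile_filter(query["filter"], params)
    if breakout:
        sql += " GROUP BY " + ", ".join(breakout)
    if "limit" in query:
        sql += f" LIMIT {int(query['limit'])}"
    return sql, params


if __name__ == "__main__":
    sql, params = compile_query(EXAMPLE_QUERY)
    print(sql)
    print(params)
```

Because the whole query is plain data, the compiler is a handful of recursive functions over nested lists and dictionaries, and a second backend (MongoDB, say) could walk the same tree and emit a different output.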

While we were running it internally, there was little motivation to change much about the architecture. Once we decided to open source the project, questions around stability and ease of deployment came to the fore. There were a couple of obvious pain points we had with our Python codebase:

WSGI’s request model didn’t match our main API call, executing a query on a data warehouse. While we were able to work around this by using async work queues heavily, it made both frontend and backend code much more complicated.

Python’s MySQL and Postgres database drivers required compilation. This made both developer machine setup and deployment more complicated than they really needed to be.

There were lots of moving parts between Docker + nginx + uWSGI + Django + compiled drivers + search indexing (whoosh) processes + Celeryd + Celery workers + Redis. This led to a fair amount of pain in setting up new developer machines. For spinning up new production installations, we had a set of fairly complicated Chef recipes that would be very difficult to idiot-proof.

We also had a number of recurring jobs that we ran using Celery. It worked, but wasn’t especially easy to configure or debug. Overall, keeping the system running carried a significant support load.

Every couple of weeks, something about our developer machine setup, which added Vagrant and VirtualBox to the above list of moving parts, would go wrong and eat up a couple of hours.

Criteria for a new language

As we started the process of considering a port, we nailed down our criteria for selecting the new language. These were:

Solid model for async web requests

Fast + Easy to deploy by others

Productive to develop in

Wide variety of mature database drivers

Strong functional programming primitives to make compiling our Query Language simple

The contenders

Java

Scala

Clojure

JavaScript/Node

Go

Python (Twisted)

Python (Tornado)

Given the hassles involved with dev machine setup with Python, we crossed out Twisted and Tornado early on.

Go felt promising but too low level for us to express our query language productively. In addition, it had relatively immature drivers for anything other than MySQL and PostgreSQL.

JavaScript had a fair bit of support on the team but we felt like the database drivers weren’t mature enough overall.

The shortlist really came down to “something on the JVM” due to its mature and predictable threading model, solid database connectors and the ability to ship a single jar with embedded web server (Jetty) + an embedded database (H2). However, we had a strong aversion to using Java itself, so the real short list was Scala or Clojure.

A false start

After some deliberation over hiring in our network (lots of Scala folks in SF), a desire for type safety (not one of our criteria, but a strong additional consideration) and good interop with Spark and the JVM database drivers as a whole, we settled on Scala. The team all burned through some Scala tutorials and started prototyping things on Play and the various database DSLs (Slick, Squeryl and Anorm primarily).

After about a week of experimentation, we hit a sticking point. While we had a large number of “normal” CRUD operations, all of the queries we ran on behalf of the end user were generated dynamically. Most of the Scala database ecosystem is geared towards extending Scala’s type system to include database query type checking, which was a poor fit for queries whose shape we only knew at runtime. Overall, our enthusiasm for Scala didn’t survive the first attempt to port our query language system.

Clojure for the win

It was at this juncture that we re-evaluated Clojure, and specifically Korma. We did a quick prototype of the application in Clojure, and the most complicated aspect of our application turned out to be very easy to express. We made more headway in the first day than we had in Scala over a week, and in general it felt like a much more natural fit. Clojure it was!

The move from an async work queue to lightweight threads significantly reduced the complexity of our codebase. This simplicity reduced the number of possible error states and generally made for a simpler and better user experience. In addition, between threads and the overall better latency profile of Clojure, user-perceived speed improved dramatically.
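The shape of that simplification can be sketched in a few lines. The example below is written in Python purely for illustration (it is not our Clojure code), and run_query, handle_request and serve are hypothetical names. Each request gets its own thread, blocks on the query and returns the result directly, so there is no job ID to hand back and no polling endpoint for the front end to call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(name, seconds):
    """Stand-in for a warehouse query; sleeps to simulate latency (hypothetical)."""
    time.sleep(seconds)
    return f"{name}: done"


def handle_request(name, seconds):
    """Block on the query and return the result in the same request.

    There is no job ID, no polling endpoint and no separate worker process.
    """
    return {"status": 200, "body": run_query(name, seconds)}


def serve(requests, max_threads=8):
    """Run each request on its own pool thread.

    A slow query ties up only its own thread, so fast queries submitted
    alongside it are not stuck waiting behind it.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(handle_request, n, s) for n, s in requests]
        # Collect responses in submission order.
        return [f.result() for f in futures]


if __name__ == "__main__":
    for response in serve([("slow", 0.2), ("fast", 0.01)]):
        print(response)
```

With a work queue, the same flow needs an enqueue step, a result store and a status endpoint, and each of those is another place for things to fail.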

The improvement in developer productivity has been dramatic. The JDK is mostly trivial to install, and with Leiningen, getting a developer machine set up has been uniformly painless. Day to day, far fewer issues come up, and getting new developers up to speed on the language has been much easier than expected.

As a result of switching to Clojure, a number of unanticipated benefits emerged. The stability of the JVM, the simplicity of having a single jar as a deployment artifact, and a pretty disciplined testing and code review process have allowed us to manage a relatively large number of instances (~15 at last count) with no real effort on our part. The combination of Elastic Beanstalk and an uberjar has made our overall support footprint negligible. We haven’t had a significant server crash that required manual intervention in about a year of running in production.

When it came time to build a Mac OS X application, the simplicity of deployment made bundling the JRE + our uberjar extremely simple. Trying to do the same with our old combination of Django + Postgres + Celery + Redis would have been significantly riskier to build and more fragile when running.

So, in summary, we chose Clojure for JVM threads + database drivers, the ability to ship an uberjar and the ease of expressing tree manipulations, which were the core piece of complexity in our backend.

Overall verdict: 10/10 would port again.

If you’ve made it this far, you can get more information about Metabase, an Open Source Business Intelligence server that sets up in 5 minutes, at www.metabase.com. We also opine and rant on Twitter @metabase.