Slicing and dicing data with Druid

Conrad Lee
Engineers @ Optimizely
2 min read · Apr 13, 2015

We’ve been experimenting with an intriguing new technology called Druid, which sets out to be a scalable, reliable, open-source implementation of an OLAP cube. An OLAP cube lets you slice and dice a multi-dimensional dataset quickly enough for interactive data exploration.
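To make "slicing and dicing" concrete, here's a minimal sketch of a Druid native topN query posted to a broker node: it slices a dataset by one dimension while filtering (dicing) on another. The broker address, datasource, and field names below are hypothetical placeholders, not our actual setup.

```python
import requests

# Hypothetical query: top 10 countries by clicks, per day,
# restricted to one browser. All names here are illustrative.
query = {
    "queryType": "topN",
    "dataSource": "events",            # hypothetical datasource
    "dimension": "country",            # slice by this dimension...
    "threshold": 10,
    "metric": "clicks",
    "granularity": "day",
    "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
    ],
    "filter": {                        # ...while dicing on another
        "type": "selector",
        "dimension": "browser",
        "value": "chrome"
    },
    "intervals": ["2015-04-01/2015-04-13"]
}

# Druid brokers accept native queries as JSON over HTTP;
# the host and port below are placeholders for your own broker.
response = requests.post("http://broker.example.com:8082/druid/v2/", json=query)
print(response.json())
```

An interactive front end issues queries like this one on every brush or click, which is what makes sub-second response times essential.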

If you’re unfamiliar with OLAP cubes, take a look at the crossfilter demo and try dragging selections across multiple charts. Crossfilter is essentially an OLAP cube implemented in JavaScript that runs locally in your browser. While crossfilter works nicely for datasets of a few hundred thousand rows, Druid scales to datasets of hundreds of billions of rows, i.e., roughly one million times larger.

We evaluated a few different technologies that provide this type of OLAP functionality, and Druid came out on top. The fundamental architecture choices are sound: a Druid cluster consists of stateless nodes, making it robust against the failure of any single node. Furthermore, the community is helpful, open, and actively developing improvements. We’ve confirmed that Druid can create extremely compact index files, which allows you to keep much (or all) of your index in memory. In fact, our evaluation showed that in some cases the index files created by Druid are orders of magnitude smaller than those produced by Solr and Elasticsearch.

In order to achieve scalability and reliability, Druid requires a rather steep operations investment (take a look at the production configuration to get a sense of the complexity involved). A Druid cluster contains between four and six different node types, each with its own configuration quirks, hardware requirements, and scaling behavior, as sketched below.
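As a rough illustration of that heterogeneity, here is a sketch of the node roles in a Druid cluster as of this writing; the host counts are hypothetical and not a sizing recommendation.

```python
# Illustrative sketch of Druid node roles. Host counts are hypothetical;
# real sizing depends on data volume and query load. A production cluster
# also needs external dependencies: ZooKeeper, a metadata store, and deep storage.
cluster = {
    "broker":      {"hosts": 2, "role": "receives queries, fans them out, merges results"},
    "historical":  {"hosts": 8, "role": "serves immutable, memory-mapped segments"},
    "realtime":    {"hosts": 2, "role": "ingests recent events and serves them until handoff"},
    "coordinator": {"hosts": 1, "role": "assigns segments to historical nodes"},
    "overlord":    {"hosts": 1, "role": "manages batch indexing tasks"},
}

for node_type, spec in cluster.items():
    print(f"{spec['hosts']} x {node_type}: {spec['role']}")
```

Each of these node types scales independently, which is part of what makes hand-managing a cluster tedious.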

To manage anything beyond a simple test cluster, you’ll probably want to use a configuration-management framework such as Chef, SaltStack, or Ansible. At Optimizely, we already use Chef to manage some of our DevOps needs, so we were delighted to find an existing community cookbook, chef-druid, that can configure and deploy a Druid cluster. However, that community cookbook hasn’t been updated in nearly a year and does not support the latest version of Druid. We have therefore forked our own version, optimizely/chef-druid, which supports the latest version of Druid. We’ve also made a couple of other changes in this fork: it now pulls and builds the Druid source code directly from GitHub, and it enables metrics by default. We will maintain this fork for the foreseeable future and welcome contributions.

If you enjoyed this post, take a look at our careers page and come work with us.
