How to radically streamline complex analytics

Mark Cooper
Brigade Engineering
Feb 13, 2016

At Brigade we are building a platform to revolutionize civic and social engagement. For us, data is king. We’ve used a number of off-the-shelf analytics products in our short existence as a company: Google Analytics, Kissmetrics, Branch.io and Tune, to name a few.

Off-the-shelf analytics tools are amazing: they're cheap, require little engineering work to integrate, and can do a lot of powerful things. But sometimes you want to do something they can't.

We hit a point where we wanted to query our data in a lot of complicated ways and to join datasets unavailable in the standard analytics tools. We felt it was worth building our own custom solution. Our first step into the world of custom analytics: start collecting activity data with as much contextual information as possible. Our initial architecture was pretty straightforward:

First we built out end-to-end data collection: our apps and website write to a simple collection service, which we built using Finagle. The service forwards everything to Kafka. On the other end, a Camus job pulls the events out of various Kafka topics and stores them in HDFS. To analyze the data, we coded up Spark jobs that ingest the events from HDFS. We quickly discovered that, as beautiful and succinct as Spark is, writing a job for every new pivot of our data was just too time-consuming. Moreover, the Camus layer didn't allow real-time access to events as they flowed in, and our own custom data-compaction mechanisms added complexity and brittleness to the system.
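For a flavor of what those one-off jobs looked like, here's a minimal, illustrative sketch (the HDFS path and field names are invented): a driver that reads JSON-encoded events out of HDFS and pivots them by event type.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("event-counts"))
    val sqlContext = new SQLContext(sc)

    // Camus lands events in dated HDFS directories (path is a placeholder).
    val events = sqlContext.read.json("hdfs:///data/events/2016/02/*")
    events.groupBy("event_type").count().show()

Every new question meant another job like this: write, build, deploy, wait.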

It’s a real-time world

We decided to rework our design a bit: instead of writing to HDFS, we now write the event data to Cassandra.

[Diagram: Cassandra added to our architecture]
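Purely as an illustration of the connector's ergonomics, persisting a batch of events from Spark is nearly a one-liner (the keyspace, table, and schema below are placeholders, not our actual data model):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder event row; the real schema lives in our Cassandra data model.
    case class Event(userId: String, eventType: String, createdAt: Long)

    val conf = new SparkConf()
      .setAppName("event-writer")
      .set("spark.cassandra.connection.host", "cassandra-host")
    val sc = new SparkContext(conf)

    // Write a toy in-memory batch to analytics.event_log; the connector maps
    // camelCase case-class fields to snake_case columns.
    val events = sc.parallelize(Seq(Event("u1", "signup", 1455321600000L)))
    events.saveToCassandra("analytics", "event_log",
      SomeColumns("user_id", "event_type", "created_at"))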

Presto: real-time access to our data, and no need to worry about pesky data-compaction issues. But writing Spark jobs was still a hassle. Wouldn't it be great if we could do the bulk of the analysis work in SQL?

SQL — saving time since 1992

Now that we have our event data in Cassandra, why not use the excellent Spark SQL bindings the fine people over at Datastax have implemented:
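A minimal sketch of the setup, using the Spark 1.x-era connector's CassandraSQLContext (the host, keyspace, and table names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.cassandra.CassandraSQLContext

    val conf = new SparkConf()
      .setAppName("analytics")
      .set("spark.cassandra.connection.host", "cassandra-host")
    val sc = new SparkContext(conf)

    // A SQLContext whose table names resolve directly against Cassandra.
    val csc = new CassandraSQLContext(sc)
    csc.sql("SELECT event_type, count(*) FROM analytics.event_log GROUP BY event_type").show()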

Now, if only we had a simple UI for writing and running our queries. Enter the Zeppelin project.

Unfortunately, Zeppelin didn't entirely work out of the box for us. It handles Spark SQL queries fine. It handles Cassandra queries fine. But Spark + Cassandra together was a no-go. Undeterred, we leveraged Zeppelin's extensibility to whip up a beautifully simple custom interpreter. Behind the scenes, our interpreter simply hands queries off to a globally available CassandraSQLContext.
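In skeletal form it's something like the following: a simplified sketch against the Zeppelin interpreter API of that era, not our production code, with SharedContexts.csc standing in for however the global context gets wired up.

    import java.util.Properties
    import org.apache.zeppelin.interpreter.{Interpreter, InterpreterContext, InterpreterResult}

    class CassandraSqlInterpreter(props: Properties) extends Interpreter(props) {
      override def open(): Unit = ()
      override def close(): Unit = ()
      override def cancel(ctx: InterpreterContext): Unit = ()
      override def getProgress(ctx: InterpreterContext): Int = 0
      override def getFormType: Interpreter.FormType = Interpreter.FormType.SIMPLE
      override def completion(buf: String, cursor: Int): java.util.List[String] =
        java.util.Collections.emptyList()

      // Hand the paragraph body to the shared CassandraSQLContext and render
      // the result in Zeppelin's tab/newline-delimited %table format.
      override def interpret(sql: String, ctx: InterpreterContext): InterpreterResult =
        try {
          val df = SharedContexts.csc.sql(sql)
          val rows = df.columns.mkString("\t") +: df.take(1000).map(_.mkString("\t"))
          new InterpreterResult(InterpreterResult.Code.SUCCESS, "%table " + rows.mkString("\n"))
        } catch {
          case e: Exception =>
            new InterpreterResult(InterpreterResult.Code.ERROR, e.getMessage)
        }
    }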

We chose to bind our interpreter to a new "%csql" tag, so now we can execute queries like so:
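A representative query (the table and column names here are illustrative):

    %csql
    SELECT client.platform, count(*) AS events
    FROM event_log
    GROUP BY client.platform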

The Datastax Spark Connector even understands how to unpack Cassandra UDTs and maps (as seen in the "client.platform" reference above).

We also went nuts and hacked in an “INTO” clause which lets us easily chain operations. Here’s an example of a pair of chained queries:
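Something along these lines (the names are illustrative):

    %csql
    SELECT user_id, count(*) AS signups
    FROM event_log
    WHERE event_type = 'signup'
    GROUP BY user_id
    INTO signups_by_user
    :
    SELECT count(*) AS repeat_users
    FROM signups_by_user
    WHERE signups > 1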

Each statement is separated by a colon; the last statement in the chain is the one rendered in the output table.

Not just SQL

Another awesome feature of Spark SQL: it's insanely easy to write user-defined functions (UDFs). Here's an example UDF definition followed by a usage of it (our interpreter allows the user to group UDFs with queries). In this case our UDF is implemented in Scala.
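In spirit, the standard Spark SQL mechanism is a one-line registration of a plain Scala function against the shared context (the function and column names below are illustrative):

    // Register a Scala function: bucket epoch-millisecond timestamps by day.
    csc.udf.register("day_bucket", (ts: Long) => ts / 86400000L)

which is then callable straight from a %csql paragraph:

    %csql
    SELECT day_bucket(created_at) AS day, count(*) AS events
    FROM event_log
    GROUP BY day_bucket(created_at)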

Datastore Polygamy

Another extremely powerful feature of Spark SQL: cross-datastore joins. While we write event-log data to Cassandra, the bulk of our stateful user data lives in MySQL. Exposing it in Zeppelin is simple:
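In spirit, it's Spark SQL's built-in JDBC data source (the connection details below are placeholders):

    // Register a MySQL-backed table with Spark SQL via the JDBC data source.
    val users = csc.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://db-host:3306/app",
      "dbtable" -> "users",
      "driver"  -> "com.mysql.jdbc.Driver"
    )).load()
    users.registerTempTable("analytics_users")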

Now we have a Spark SQL-accessible table ("analytics_users"), backed by the actual MySQL table, that we can join against our Cassandra event-log tables:
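For example (the join keys and columns here are illustrative):

    %csql
    SELECT u.signup_source, count(*) AS events
    FROM event_log e
    JOIN analytics_users u ON e.user_id = u.id
    GROUP BY u.signup_source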

In Summary

Hurray: powerful, fast, SQL-driven analytics. Our analysts and product managers are going to love us.

Users of our new system can now get real-time access to all of the event data we collect by writing straightforward SQL queries. Gone are the disparate data silos and the need to involve the data team in writing cumbersome Spark jobs.

In the coming months we plan to give back to the community by open-sourcing our Zeppelin modifications. In our next blog post we'll cover a few more Zeppelin features we've added, as well as how we've decided to organize and query our data in Cassandra.
