Druid and SuperSet for real-time monitoring at scale

Brachi Packter
Aug 20, 2019 · 3 min read
Photo by John Schnobrich on Unsplash

Your company generates a lot of data, several terabytes per day, and you want a tool that can take advantage of it to provide insights that increase your understanding of the business and, of course, improve profitability.

To do this, you need numerous dashboards that let you analyze different metrics in real time.

Over the last few months, I evaluated how to achieve this.

Let’s sum up my requirements:

  1. High throughput for queries.
  2. Support for HyperLogLog (HLL), to be able to perform distinct-count calculations on different dimensions.
  3. Support for ingesting HLL sketches.
  4. Integration with dashboard tools.
  5. Available as a managed service.
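To make requirements 2 and 3 concrete, here is a toy HyperLogLog in Python. It is only a sketch for illustration, not the implementation Druid uses (Druid relies on the DataSketches library), and the class name, register count and user IDs are all made up. The point it shows is why sketches are worth ingesting: two sketches built separately can be merged into one, and the merged sketch estimates the distinct count of the union.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Merging is a register-wise max, so sketches can be combined later.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical data: two platforms with overlapping users (15,000 distinct in total).
ios = HyperLogLog()
android = HyperLogLog()
for i in range(10_000):
    ios.add(f"user-{i}")
for i in range(5_000, 15_000):
    android.add(f"user-{i}")

print(round(ios.merge(android).count()))  # an estimate close to 15,000
```

Each sketch here is a fixed 4,096 small integers regardless of how many users it has seen, which is what makes it practical to store one per row and combine them along any dimension at query time.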

Why not an SQL database?

I could not find a managed SQL solution with support for HLL sketches.

Also, the real-time performance was lacking. I found myself dealing with read replicas and a load balancer between them.

So I had to keep searching.

Apache Druid as a solution:

Druid performs well on aggregation operations, as my use case requires. It can ingest high volumes of highly dimensional metrics data and query them responsively, in real time.

Plus, Druid is highly available and scales out easily by adding more nodes to the cluster. You can also scale each component of the Druid cluster independently. For example, you can scale only the query nodes, or just the data nodes, depending on where your bottleneck lies.

Druid supports roll-ups, meaning it doesn’t store the raw data, only results aggregated at the granularity you require.
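As a rough illustration of what a roll-up does (this is plain Python, not Druid code, and the event fields `timestamp`, `country` and `bytes` are hypothetical), raw events that share the same minute and the same dimension values collapse into a single aggregated row:

```python
from collections import defaultdict
from datetime import datetime


def roll_up_by_minute(events):
    """Collapse raw events into one row per (minute, country)."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the chosen granularity (one minute here).
        bucket = ts.replace(second=0, microsecond=0).isoformat()
        key = (bucket, event["country"])
        totals[key]["count"] += 1
        totals[key]["bytes"] += event["bytes"]
    return [
        {"timestamp": bucket, "country": country, **agg}
        for (bucket, country), agg in sorted(totals.items())
    ]


raw_events = [
    {"timestamp": "2019-08-20T10:00:05", "country": "IL", "bytes": 100},
    {"timestamp": "2019-08-20T10:00:40", "country": "IL", "bytes": 250},
    {"timestamp": "2019-08-20T10:01:10", "country": "IL", "bytes": 50},
]

for row in roll_up_by_minute(raw_events):
    print(row)
```

The first two events fall in the same minute and become one row, so three raw events are stored as two rows. The trade-off is that the individual events can no longer be recovered.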

In my case, I perform the aggregations externally, in an Apache Beam process, then stream the aggregated data into Kinesis (AWS), and from there ingest it into Druid via a native loader.
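For readers who want a picture of the last hop, below is a sketch of what a Kinesis ingestion spec might look like, built as a Python dict. It assumes the loader is Druid's Kinesis indexing service, with the druid-kinesis-indexing-service and druid-datasketches extensions loaded. The datasource, stream, endpoint and column names are hypothetical, and the layout follows the `inputFormat` style of recent Druid releases (older releases used a `parser` block instead), so check it against the documentation for your version.

```python
import json


def build_kinesis_supervisor_spec(datasource, stream, endpoint, dimensions, sketch_columns):
    """Build a Kinesis supervisor spec (illustrative; all names passed in are hypothetical)."""
    return {
        "type": "kinesis",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                # Pre-built HLL sketches from the upstream job are merged on ingestion.
                "metricsSpec": [
                    {"type": "HLLSketchMerge", "name": col, "fieldName": col}
                    for col in sketch_columns
                ],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,  # aggregation already happened upstream
                },
            },
            "ioConfig": {
                "type": "kinesis",
                "stream": stream,
                "endpoint": endpoint,
                "inputFormat": {"type": "json"},
            },
            "tuningConfig": {"type": "kinesis"},
        },
    }


spec = build_kinesis_supervisor_spec(
    datasource="metrics",                          # hypothetical
    stream="aggregated-metrics",                   # hypothetical
    endpoint="kinesis.us-east-1.amazonaws.com",
    dimensions=["country", "platform"],            # hypothetical
    sketch_columns=["users_sketch"],               # hypothetical
)
print(json.dumps(spec, indent=2))
```

A spec like this would typically be submitted to the Overlord's supervisor endpoint, or created through a data loader UI rather than written by hand.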

I still do aggregations in Druid, but at query time. No roll-up.
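A query-time aggregation over ingested sketches could look roughly like the following. This is a sketch under assumptions: it uses Druid SQL with the DataSketches extension's `APPROX_COUNT_DISTINCT_DS_HLL` function, and the datasource and column names are hypothetical. The code only builds the request body; sending it to the SQL endpoint of your cluster is left out.

```python
import json

# Druid SQL template; identifiers are filled in by the helper below.
SQL_TEMPLATE = """
SELECT
  TIME_FLOOR(__time, 'PT1M') AS "minute",
  "{dimension}",
  APPROX_COUNT_DISTINCT_DS_HLL("{sketch_column}") AS "distinct_users"
FROM "{datasource}"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '{hours}' HOUR
GROUP BY 1, 2
ORDER BY 1
"""


def build_distinct_users_query(datasource, dimension, sketch_column, hours=1):
    """Return a JSON-serializable body for Druid's SQL API (names are hypothetical)."""
    sql = SQL_TEMPLATE.format(
        datasource=datasource,
        dimension=dimension,
        sketch_column=sketch_column,
        hours=int(hours),
    )
    return {"query": sql.strip()}


payload = build_distinct_users_query("metrics", "country", "users_sketch")
print(json.dumps(payload, indent=2))
```

Because the sketches are merged at query time, the same stored data can answer distinct counts per country, per platform, or per any other dimension without re-ingesting anything.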

SuperSet for dashboards:

SuperSet comes with great dashboard capabilities, such as comparing results to the previous day, annotations, and more.


It is also designed to be highly available. It is “cloud-native”: it was designed to scale out in large, distributed environments, and it works well inside containers.

Druid as a managed service:

Imply simplifies Druid installation and troubleshooting and provides management control over the cluster.

Imply also offers its own tool for dashboards and charts, called Pivot.


In conclusion, a solution that combines Druid, Imply and SuperSet gives us a complete perspective on real-time data. We can monitor it, analyze it, slice and dice it by any combination of dimensions and, without debate, make decisions and take action immediately, based on the data.

Thanks to Rick Bilodeau (from Imply) for his help with this post!

The Startup
