Your company generates a lot of data, several terabytes per day, and you want a tool that can take advantage of it: one that provides insights to deepen your understanding of the business and, of course, improve profitability.
To do this, you need numerous dashboards for analyzing different metrics in real time.
Over the last few months, I evaluated how to achieve this.
Let me summarize my requirements:
- High throughput for record inserts.
- High throughput for queries.
- Support for HyperLogLog (HLL), to perform approximate distinct counts across different dimensions.
- Support for ingesting HLL sketches.
- Has integration with dashboard tools.
- Available as a managed service.
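To make the HLL requirement concrete, here is a toy HyperLogLog sketch in plain stdlib Python (not the DataSketches implementation Druid actually uses). It shows why sketches matter for this workload: they estimate distinct counts in fixed memory, and two sketches can be merged, so partial results from different dimensions or time buckets combine without re-reading raw data:

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p registers (illustrative, not production code)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, value):
        # 64-bit hash; the first p bits pick a register, the rest measure "rarity"
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """The union of two sketches is just the register-wise max."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return int(est)


sketch = HyperLogLog()
for i in range(100_000):
    sketch.add(f"user-{i}")
print(sketch.count())  # close to 100000 (typical error around 1% at p=14)
```

The merge property is what makes sketch ingestion (the next requirement) useful: upstream jobs can pre-build sketches per partition and the database only has to union them.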
Why not an SQL database?
Postgres is limited to roughly 1,000 concurrent connections, which was well below my requirements.
Moreover, I could not find a managed SQL solution with support for HLL sketches.
The real-time performance was also lacking: I found myself dealing with read replicas and a load balancer in front of them.
So I had to continue searching…
Apache Druid as a solution:
Eventually, I found Druid, which was a much better fit. Druid is a columnar database; its columns are divided into a timestamp column, dimension columns, and metric columns.
Druid performs well on aggregation operations, as my use case required: it can ingest and responsively query high volumes of highly dimensional metrics data in real time.
Plus, Druid is highly available and scales up easily by adding more nodes to the cluster. You can also scale each component of the cluster independently; for example, you can scale only the query nodes, or just the data nodes, depending on where your bottleneck lies.
Druid supports roll-up, meaning it won't store the raw data, only aggregated results at the granularity you require.
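As a sketch of what roll-up does (hypothetical click events, plain Python rather than Druid itself): raw rows are collapsed into one stored row per truncated-timestamp and dimension combination, with the metrics summed:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw click events: (ISO timestamp, country, clicks)
raw_events = [
    ("2024-05-01T10:03:00", "US", 1),
    ("2024-05-01T10:17:00", "US", 1),
    ("2024-05-01T10:42:00", "DE", 1),
    ("2024-05-01T11:05:00", "US", 1),
]


def rollup(events, granularity="%Y-%m-%dT%H:00:00"):
    """Collapse raw rows into one row per (truncated time, dimension) bucket."""
    buckets = defaultdict(int)
    for ts, country, clicks in events:
        bucket = datetime.fromisoformat(ts).strftime(granularity)
        buckets[(bucket, country)] += clicks
    return dict(buckets)


stored = rollup(raw_events)
print(len(stored))  # 3: four raw rows collapse into three stored hourly rows
```

The trade-off is that once rolled up to hourly granularity, you can never query at minute granularity again, which is why the choice depends on your query patterns.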
In my case, I perform the aggregations externally in an Apache Beam pipeline, stream the aggregated data into AWS Kinesis, and then ingest it into Druid via the native Kinesis loader.
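For illustration, here is a trimmed sketch of the kind of Kinesis supervisor spec the native loader works from; the datasource, stream, and column names are placeholders, and roll-up is disabled because the data arrives pre-aggregated:

```json
{
  "type": "kinesis",
  "spec": {
    "dataSchema": {
      "dataSource": "aggregated_events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "device"] },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "HOUR",
        "queryGranularity": "NONE",
        "rollup": false
      }
    },
    "ioConfig": {
      "stream": "aggregated-events-stream",
      "endpoint": "kinesis.us-east-1.amazonaws.com",
      "inputFormat": { "type": "json" }
    }
  }
}
```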
I still run aggregations in Druid, but at query time, with no roll-up.
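A query-time aggregation might look like the following Druid SQL (hypothetical datasource and column names; `APPROX_COUNT_DISTINCT_DS_HLL` requires Druid's DataSketches extension and operates directly on ingested HLL sketch columns):

```sql
-- Slice hourly clicks and approximate unique users per country
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "hour",
  country,
  SUM(clicks) AS clicks,
  APPROX_COUNT_DISTINCT_DS_HLL(user_sketch) AS unique_users
FROM aggregated_events
GROUP BY 1, 2
ORDER BY 1
```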
Superset for Dashboards
For the analytics interface, I chose Superset, an open-source project originally designed for use with Druid. Following the Superset installation guide, I installed it on an AWS EC2 instance and built dashboards and charts from there.
It comes with great dashboard capabilities, such as comparing results to the previous day, annotations, and more.
It is also designed to be highly available: it is cloud-native, built to scale out in large, distributed environments, and works well inside containers.
Druid as a managed service:
Druid is a complex cluster to manage and requires technical expertise. That's why we chose to go with Imply.
Imply simplifies Druid installation and troubleshooting and provides management control over the cluster.
Imply also offers its own tool for dashboards and charts, called Pivot.
In conclusion, the solution combining Druid, Imply, and Superset gives us a complete perspective on our real-time data. We can monitor it, analyze it, slice and dice it by any combination of dimensions, and make decisions and take action based on the data immediately.
Thanks to Rick Bilodeau (from Imply) for his help with this post!