How we built a Streaming Analytics Solution using Apache Kafka & Druid


Recently, we developed a Streaming Analytics Solution using Apache Kafka & Druid for an e-commerce website. A number of shops and individual sellers sell their products via the website. Additionally, there are users who promote sellers' products by posting images of them (these images are back-linked to the purchase page on the website, and users earn for each referral and conversion).

Our requirement was to gather all events occurring on the website and build an application to run analytics on top of that data. The application was to be used by sellers and promoters on the platform. Data was to be consumed in real time for a few reports and also used to show cumulative earnings.


What we explored

RabbitMQ + MongoDB — Push data to RabbitMQ message queues for the different types of messages (clicks, view details, shares, etc.). Different consumer processes would then pick up this data and push it to MongoDB.

Kinesis + MongoDB — Use AWS Kinesis to create a continuous stream of data and then push it to MongoDB.

Kafka + Druid — Use Apache Kafka to capture all events and Tranquility Kafka to push all relevant information to Druid. This was the tech stack we finally integrated.


About Apache Kafka

Apache Kafka is a distributed streaming platform. It lets you publish and subscribe to streams of records, store streams of records in a fault-tolerant way, and process streams of records as they occur. Here’s a brief overview of Apache Kafka:

[Overview of Apache Kafka — By Ch.ko123 — Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=59871096]
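As a minimal, illustrative sketch of this publish/subscribe model (not our production code), here is how an event might be produced and consumed with the node-rdkafka package; the broker address, topic name, and event shape are assumptions:

```javascript
// Minimal publish/subscribe sketch with node-rdkafka.
// Broker address, topic name, and event fields are illustrative only.
const Kafka = require('node-rdkafka');

// Producer: publish a JSON event to a topic.
const producer = new Kafka.Producer({
  'metadata.broker.list': 'localhost:9092',
  'dr_cb': true, // ask librdkafka for delivery reports
});

producer.connect();
producer.on('ready', () => {
  const event = { type: 'click', productId: 'p-123', ts: Date.now() };
  // produce(topic, partition, message, key, timestamp)
  producer.produce('events', null, Buffer.from(JSON.stringify(event)), 'p-123', Date.now());
});
producer.on('event.error', (err) => console.error('producer error:', err));

// Consumer: subscribe to the same topic and process records as they arrive.
const consumer = new Kafka.KafkaConsumer(
  { 'group.id': 'analytics-demo', 'metadata.broker.list': 'localhost:9092' },
  { 'auto.offset.reset': 'earliest' }
);

consumer.connect();
consumer.on('ready', () => {
  consumer.subscribe(['events']);
  consumer.consume();
});
consumer.on('data', (msg) => {
  console.log('received:', msg.value.toString());
});
```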

About Druid

Druid is a high-performance, column-oriented, distributed data store. It lets us ingest data as streams or batches and explore events immediately after they occur. Here’s a brief overview of the Druid architecture:

Druid Architecture
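To give a feel for exploring data once it lands in Druid, here is a sketch of a native timeseries query posted to the broker's HTTP endpoint. The broker address, datasource name, and metric are assumptions, and it relies on the built-in fetch available in Node 18+:

```javascript
// Sketch: query the Druid broker over HTTP with a native timeseries query.
// The broker address (default port 8082), datasource name, interval, and
// aggregation are illustrative only. Requires Node 18+ for built-in fetch.
const query = {
  queryType: 'timeseries',
  dataSource: 'events',                     // hypothetical datasource
  granularity: 'hour',
  intervals: ['2018-01-01/2018-01-02'],
  aggregations: [
    { type: 'count', name: 'event_count' }  // rows ingested per hour
  ]
};

fetch('http://localhost:8082/druid/v2/?pretty', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(query)
})
  .then((res) => res.json())
  .then((rows) => console.log(rows))
  .catch((err) => console.error(err));
```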

About the solution

We wrote a private Node.js module on top of rdkafka to cater to all of our business use cases. Events in three main categories were pushed to three Kafka topics. Data was streamed to Druid in real time via Tranquility Kafka; it was also processed by two different consumers and passed on to Payments and other services, which then wrote relevant data to Druid in batches.

Streaming Analytics Solution — Overview
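The private module itself isn't public, but the sketch below shows the general shape of such a wrapper: one produce helper that routes each event category to its own topic and hands serialization to node-rdkafka. The topic names, event categories, and helper name are hypothetical:

```javascript
// Hypothetical sketch of a thin wrapper around node-rdkafka, in the spirit of
// our private module. Topic names and event categories are illustrative only.
const Kafka = require('node-rdkafka');

// One topic per event category (names are assumptions, not our actual topics).
const TOPICS = {
  click: 'events-clicks',
  view: 'events-views',
  share: 'events-shares',
};

const producer = new Kafka.Producer({
  'metadata.broker.list': process.env.KAFKA_BROKERS || 'localhost:9092',
});

const ready = new Promise((resolve) => {
  producer.connect();
  producer.on('ready', resolve);
});

// publishEvent(category, payload): serialize the event and push it to the
// topic for its category. Tranquility Kafka (or a batch consumer) picks it
// up from there.
async function publishEvent(category, payload) {
  const topic = TOPICS[category];
  if (!topic) throw new Error(`Unknown event category: ${category}`);
  await ready;
  const message = Buffer.from(JSON.stringify({ ...payload, ts: Date.now() }));
  producer.produce(topic, null, message, payload.sellerId || null);
}

module.exports = { publishEvent };
```

A caller in the web tier could then record a referral click with something like publishEvent('click', { productId: 'p-123', sellerId: 's-42' }), leaving Tranquility Kafka and the batch consumers to take it from there.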

We also wrote a React.js web app for sellers to view reports, and integrated Superset (for internal use) for all data visualizations.



References, Credits & Further Reading

  1. About text for Kafka & Druid taken from https://kafka.apache.org/ and http://druid.io/; images from Wikipedia.
  2. Understanding Druid Real-time ingestion
  3. Apache Kafka Security
  4. Using MySQL as metadata store for Druid
  5. Production Cluster Configuration for Druid

About the Author:

Arpit is a seasoned technologist with vast experience in leading different technology teams. He also consults clients on competitive market analysis, defining MVPs, product ideation, product monetization, and go-live strategies.

Arpit is also interested in early-stage investments in startups in the design & fashion, finance, renewable energy, space, real estate, and manufacturing domains.

Arpit believes we should all contribute back to society. He has set his goals for social work in five broad areas; you can read more in his blog post “Do Good, Together” on Tumblr. Arpit is interested in working with people who want to contribute towards the same goals.

You can follow Arpit on LinkedIn or on Twitter.