Optimized Real-time Analytics using Spark Streaming and Apache Druid

Jatinder Assi
GumGum Tech Blog
Published in
5 min readJun 2, 2020

Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide real-time analytics to the business stakeholders for analyzing and measuring advertising business performance in near real-time.

Our biggest dataset is RTB (real-time bidding) auction logs which amounts to ~350,000 msg/sec during peak hours every day. It becomes crucial for the data team to leverage distributed computing systems like Apache Kafka, Spark Streaming and Apache Druid to process huge volumes of data, perform business logic transformations, apply aggregations and store data that can power real-time analytics.

Real-time Data Pipeline Architecture with Kafka, Spark and Druid

Real-time Analytics Pipeline

Data Collection

Every ad-server instance produces advertising inventory and ad performance logs to a particular topic in the Kafka cluster. We currently have 2 Kafka clusters along with centralized Avro Schema Registry with over 30 topics with producer rate ranging from 5k msg/sec to 350k msg/sec.

Data format in Kafka is stored in Apache Avro for our biggest datasets to have compatibility for schema evolution over time and more compact than in-efficient JSON format (Read more on why use Avro with Kafka)

Process and Transform

In order to process huge volume of advertising data in near real-time, we leverage Spark Streaming on Databricks, by running spark application per Kafka topic. We then apply data transformations and build Druid Tranquility connection on every single spark worker to send transformed data in real-time to druid in a distributed fashion.

Druid Tranquility, ingests data via HTTP push in realtime to druid and providing abstraction for connecting to druid, handles partitioning, replication and seamless druid realtime node service discovery and data schema changes. Due to our nature of our advertising data and business logic we clean, validate and explode on various columns which are nested and list data-type, thus resulting in 10x more data that we send from tranquility to druid real-time nodes.

Data Aggregation and Storage