Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide business stakeholders with near-real-time analytics for analyzing and measuring advertising performance.

Our biggest dataset is RTB (real-time bidding) auction logs, which peak at roughly 350,000 messages/sec every day. At that volume it is crucial for the data team to leverage distributed systems such as Apache Kafka, Spark Streaming, and Apache Druid to ingest the data, apply business-logic transformations and aggregations, and store the results to power real-time analytics. A sketch of the streaming stage follows the architecture diagram below.

Real-time Data Pipeline Architecture with Kafka, Spark and Druid

[Figure: Real-time Analytics Pipeline]
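
This excerpt doesn't include the pipeline code itself, so here is a minimal sketch of the streaming stage using Spark Structured Streaming. The topic names, broker addresses, and JSON event schema are illustrative assumptions, not GumGum's actual ones; a common pattern, shown here, is to publish windowed aggregates to an output topic that Druid's Kafka indexing service then consumes.

```python
# Minimal sketch: consume auction logs from Kafka, compute 1-minute
# aggregates, and publish them to an output topic for Druid to ingest.
# Topic names, broker addresses, and the JSON schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("rtb-auction-aggregates").getOrCreate()

# Hypothetical auction-log schema; the real logs are Avro-encoded via the
# Schema Registry and would need an Avro deserializer instead of from_json.
schema = StructType([
    StructField("auctionId", StringType()),
    StructField("adUnit", StringType()),
    StructField("bidPrice", DoubleType()),
    StructField("eventTime", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")  # assumed address
       .option("subscribe", "rtb-auction-logs")                 # assumed topic
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# One-minute tumbling-window aggregates per ad unit, tolerating
# five minutes of late-arriving events.
aggregates = (events
              .withWatermark("eventTime", "5 minutes")
              .groupBy(F.window("eventTime", "1 minute"), "adUnit")
              .agg(F.count("*").alias("auctions"),
                   F.avg("bidPrice").alias("avgBid")))

# Publish JSON aggregates to a topic that Druid's Kafka indexing
# service could consume (one common ingestion pattern).
query = (aggregates
         .select(F.to_json(F.struct("window", "adUnit", "auctions", "avgBid"))
                  .alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka-broker:9092")
         .option("topic", "rtb-auction-aggregates")             # assumed topic
         .option("checkpointLocation", "/tmp/checkpoints/rtb-agg")
         .outputMode("update")
         .start())

query.awaitTermination()
```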

Data Collection

Every ad-server instance produces advertising inventory and ad-performance logs to a dedicated topic in the Kafka cluster. We currently run two Kafka clusters, backed by a centralized Avro Schema Registry, with over 30 topics whose producer rates range from 5k msg/sec to 350k msg/sec. …
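
As a concrete sketch of that producer path, the snippet below uses the confluent-kafka Python client together with the Schema Registry. The endpoints, topic name, and schema fields are illustrative assumptions; the actual ad servers may use a different client or language.

```python
# Sketch of an ad-server producer registering Avro records against the
# Schema Registry. Endpoints, topic name, and fields are assumptions.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

SCHEMA_STR = """
{
  "type": "record",
  "name": "AuctionLog",
  "fields": [
    {"name": "auctionId", "type": "string"},
    {"name": "adUnit", "type": "string"},
    {"name": "bidPrice", "type": "double"},
    {"name": "eventTime", "type": "long"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})  # assumed URL
serializer = AvroSerializer(registry, SCHEMA_STR)
producer = Producer({"bootstrap.servers": "kafka-broker:9092"})          # assumed address

event = {"auctionId": "a-123", "adUnit": "leaderboard",
         "bidPrice": 1.25, "eventTime": 1600000000000}

# Serialize against the registered schema and send to the (assumed) topic.
producer.produce(
    topic="rtb-auction-logs",
    key=event["auctionId"],
    value=serializer(event, SerializationContext("rtb-auction-logs",
                                                 MessageField.VALUE)),
)
producer.flush()
```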


Forecasting is a common data science task at many organizations, supporting sales forecasts, inventory forecasts, anomaly detection, and many other applications. For GumGum's advertising division, it is critical that our sales team can forecast available ad inventory in order to set up successful ad campaigns.

Our Data Engineering team already uses time-series forecasting for directly sold ad inventory (400+ million impressions/day). As the advertising industry evolves, programmatic ad buying has grown exponentially, and as a result GumGum now sees 30+ billion programmatic inventory impressions/day. Producing high-quality forecasts is not an easy problem at GumGum's programmatic advertising scale. …
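
The excerpt doesn't describe GumGum's forecasting method, so purely as an illustration of time-series forecasting on daily inventory counts, here is a Holt-Winters sketch using statsmodels. The synthetic data and weekly seasonality are assumptions made for the example.

```python
# Illustrative only: Holt-Winters forecast of daily impression counts.
# This is not GumGum's method; the synthetic data and weekly seasonality
# are assumptions for the sake of the example.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(42)
days = 90
# Synthetic daily impressions (in billions) with a weekly cycle plus noise.
history = 30 + 3 * np.sin(2 * np.pi * np.arange(days) / 7) \
          + rng.normal(0, 0.5, days)

model = ExponentialSmoothing(
    history,
    trend="add",            # additive trend component
    seasonal="add",         # additive weekly seasonality
    seasonal_periods=7,
).fit()

# Forecast the next two weeks of available inventory.
forecast = model.forecast(14)
print(forecast.round(2))
```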

Jatinder Assi
