Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide business stakeholders with near-real-time analytics for analyzing and measuring advertising business performance.
Our biggest dataset is RTB (real-time bidding) auction logs, which peak at ~350,000 messages/sec every day. At this volume it is crucial for the data team to leverage distributed computing systems like Apache Kafka, Spark Streaming, and Apache Druid to ingest huge volumes of data, perform business-logic transformations, apply aggregations, and store the results in a form that can power real-time analytics.
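As a rough illustration of what such a streaming aggregation job looks like, here is a minimal sketch that consumes a Kafka topic and counts records in one-minute windows. The post does not show the actual job code; this sketch uses Spark's Structured Streaming API for brevity, and the broker address, topic name, and window size are placeholder assumptions (it expects the spark-sql-kafka package on the classpath).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("rtb-auction-aggregator")  # hypothetical job name
    .getOrCreate()
)

# Subscribe to the auction-log topic; broker address and topic name
# are illustrative placeholders, not GumGum's actual configuration.
auctions = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "rtb-auction-logs")
    .load()
)

# Aggregate message counts into 1-minute windows keyed on the Kafka
# record timestamp; a real job would first decode the Avro payload
# and aggregate business metrics rather than raw record counts.
per_minute = (
    auctions
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window(F.col("timestamp"), "1 minute"))
    .count()
)

query = (
    per_minute.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination()
```

In a production pipeline the sink would be Druid (or a Kafka topic that Druid ingests) rather than the console used here for demonstration.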
Every ad-server instance produces advertising inventory and ad performance logs to a particular topic in the Kafka cluster. We currently run 2 Kafka clusters, along with a centralized Avro Schema Registry, hosting over 30 topics with producer rates ranging from 5k messages/sec to 350k messages/sec. …
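To make the producer side concrete, here is a minimal sketch of publishing an Avro-encoded log record to a Kafka topic through a Schema Registry, using the confluent-kafka Python client. The registry URL, broker address, topic name, and schema fields are all illustrative assumptions; the post does not show the ad servers' actual producer code.

```python
import time

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Hypothetical Avro schema for an ad-serving log event; the real
# schemas would be managed centrally in the Schema Registry.
SCHEMA_STR = """
{
  "type": "record",
  "name": "AdEvent",
  "fields": [
    {"name": "ad_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp_ms", "type": "long"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})  # placeholder URL
serializer = AvroSerializer(registry, SCHEMA_STR)

producer = SerializingProducer({
    "bootstrap.servers": "kafka-broker:9092",  # placeholder broker
    "value.serializer": serializer,
})

# Each ad-server instance would emit events like this to its topic;
# the serializer registers/validates the schema and Avro-encodes the value.
producer.produce(
    topic="ad-events",  # placeholder topic name
    value={
        "ad_id": "ad-123",
        "event_type": "impression",
        "timestamp_ms": int(time.time() * 1000),
    },
)
producer.flush()
```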
Forecasting is a common data science task at many organizations, supporting sales forecasts, inventory forecasts, anomaly detection, and many other applications. For GumGum’s advertising division, it is critical that our sales team can forecast available ad inventory in order to set up successful ad campaigns.
Our Data Engineering team already uses time series forecasting for directly sold ad inventory (400+ million impressions/day). As the advertising industry evolves, programmatic ad buying has grown exponentially, and as a result GumGum now sees 30+ billion programmatic inventory impressions/day. Producing high-quality forecasts is not an easy problem at GumGum’s programmatic advertising scale. …
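The post does not name a forecasting library, but to illustrate the kind of daily time-series forecast involved, here is a minimal sketch using Prophet on synthetic daily impression counts. The seasonality settings, forecast horizon, and data are all assumptions, not GumGum's actual model.

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Synthetic daily impression counts with a weekly cycle; a real job
# would load historical inventory aggregates from the data warehouse.
days = pd.date_range("2021-01-01", periods=180, freq="D")
rng = np.random.default_rng(42)
y = 1e6 + 2e5 * np.sin(2 * np.pi * days.dayofweek / 7) + rng.normal(0, 5e4, len(days))
history = pd.DataFrame({"ds": days, "y": y})  # Prophet requires 'ds'/'y' columns

model = Prophet(weekly_seasonality=True, yearly_seasonality=False)
model.fit(history)

# Forecast 30 days of future inventory with uncertainty intervals.
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```

At 30+ billion impressions/day the interesting engineering problem is less the model itself than fitting many such models across inventory dimensions, which is where distributed compute comes back into play.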