Forecasting is a common data science task at many organizations to help with sales forecast, inventory forecast, anomaly detection and many more applications. For GumGum’s advertising division, it is critical for our sales team to forecast available ad inventory in order to setup successful ad campaign.
Our Data Engineering team already uses time series forecasting for directly sold ad inventory (400+ million/day). As advertising industry is evolving, there has been exponential growth in programmatic ad buying and as a result GumGum have 30+ billion programmatic inventory impressions/day. Producing high quality forecasts is not an easy problem at GumGum’s programmatic advertising scale. It becomes imperative for us to efficiently utilize distributed compute and storage to produce inventory forecasts with a short latency(< 30 secs).
In this part 1 of two-part blog post I will focus on data architecture and distributed data sampling in Apache Spark. Let’s start by discussing what is ad inventory and why we forecast inventory at GumGum.
What is an ad inventory?
An ad inventory is real estate to show potential ads in a publisher page. Ad inventory can be of various size and exist across multiple platforms (Figure 1 below)
Why Forecast Inventory?
Our sales team is trying to setup an ad campaign with certain targeting rules and would like to know GumGum’s publisher network has enough inventory to fulfill it.
Here are some scenarios for inventory forecasting:
Forecast the inventory available in US in cities Los Angeles and San Diego from premium websites for the next 30 days for ad product 1
Forecast the inventory available in US and Canada for pages related to Sports and Entertainment targeting males of age 25 to 40 for the next 30 days for bidder ABC and ad product 2
Business user looking to predict available inventory in near future will enter inventory targeting and restriction criteria like mentioned above, then the forecast application will search for matching inventory in past 365 days and run forecast model on matched inventory to predict future numbers.
In order to cost-effectively search for matching impressions and run forecast model on 365 days, we decided to build daily data sampling job in spark to reduce daily inventory size from 25 TB to 1.5 GB inventory samples per day.
Daily sampling spark pipeline:
This pipeline consists of 3 main modules to clean/prep inventory data and produce enriched sampled data (Figure 2 below).
- Data Transform: Read past day (24 hr) of inventory data in nested avro format(provided by upstream ad-server team), apply business logic to clean, validate and transform nested avro to flatten dataset and write back inventory data (~25 TB) to S3 in parquet format.
- Data Sampling: Read transformed inventory data in parquet format, run sampling algorithm per hour in parallel and combine the results to get daily sampled data, finally write back sampled data(~1.5GB) to S3 in parquet format
- Data Enrichment: Read sampled data and enrich inventory data by adding more dimensions from our metadata stores (DynamoDB and MySQL) and write back enriched sample data to S3 in Delta format.
Data Sampling in depth
As mentioned above it would be waste of compute and storage resources to process 25 TB/day for 365 days. Goal of data sampling is to keep small but representative portion of the data and query that instead. Once we query sample to get approximate results, we will use Estimator (Scale-up factor) that will let us relate results about the sample to the original dataset (Figure 3 below).
Lets look at different data sampling methods:
This is most basic form of sampling where there is an equal probability of selecting any particular item. This is easy to implement and very fast but has several downsides. For instance there might be many impressions from same user in a single day and advertisers want to cap inventory impressions per user (frequency cap) as they want to reach wider user audience. Uniform sampling will be biased towards commonly occurring items and will not be a fit our use case.
Distinct Item Sampling
As an alternative, we use a sampling scheme that samples uniformly from the distinct items in our set of inventory impression logs. The name of the algorithm is AMIND(M): Augmented min-hash distinct-item sampling, with at most M distinct items. This sampling scheme is described in the following research paper: Sampling Algorithms for Evolving Datasets, Rainer Gemulla.
Distributed AMIND (M) in Spark
We use hash value of user ip address as key to identify distinct-items, then we run AMIND(M) per hour in distributed fashion on multiple spark executors and separate hourly AMIND(M) samples are combined for every day.
- Create samples for all hours in parallel with M distinct hash values of user ip per hour
- Combines samples, group by hash values of user ip
- Sort, take items with M smallest hash values to generate daily sample and multiplier (scale up factor)
Once we have sampled inventory data in delta format, our business users will go to GumGum’s internal dashboard, enter specific campaign’s targeting criteria(browser inclusion/exclusion, city inclusion/exclusion, etc..) which will then submit a spark job to search samples and return matching impressions for past 365 days. Once we get matched impressions, we will run modified auto-arima forecasting model (written in R) on spark driver to get forecasted impressions back to the user (Figure 4 below) with 30 secs average response time.
Checkout our presentation about the same topic in DataCon LA 2019 below: