Decision Trees Optimization for Programmatic Media Buying

Bikash Singh
MiQ Tech and Analytics
6 min read · Mar 6, 2019

The invention of ad exchanges has completely revolutionized the digital marketplace. Advertisers and publishers can now buy and sell advertising space through real-time auctions. These auctions happen programmatically as a web page loads, triggering an ad at the same time with the help of highly optimized algorithms. Advertisers and agencies typically partner with technologies like demand-side platforms (DSPs) to buy ad space and participate in such auctions. They can, however, also build their own bidding technology, although doing so is difficult and involves many challenges.

AppNexus, a customizable DSP, has been able to solve these problems by empowering clients to insert algorithmic models into an already scaled system. AppNexus uses these models to make real-time decisions at the time of an auction, helping clients show the right ads to users. Models can be uploaded through an interface called the AppNexus Programmable Bidder (APB).

Bonsai language

APB uses decision-tree logic written in the Bonsai language. Bonsai was developed to enable advertisers to express custom predictive models as decision-tree logic and upload them to AppNexus through the APB interface for use in advertising campaigns.

A tree written in Bonsai can be represented graphically as below:

The first decision node is called the root node. A decision path in a tree is the route taken from the root node to a leaf. The leaf denotes the final outcome: the value to be used as the bid price or the bid multiplier, depending on the type of tree.
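To make this structure concrete, here is a minimal sketch in Scala of how such a tree could be modeled and evaluated. This is not Bonsai syntax; the feature names, values, and bid numbers are purely illustrative.

```scala
// Minimal sketch of the decision-tree shape described above (not Bonsai syntax).
// Internal nodes test one feature of the bid request; leaves carry a bid value.
sealed trait Node
case class Decision(feature: String, branches: Map[String, Node], default: Node) extends Node
case class Leaf(bidValue: Double) extends Node

// Walk the decision path from the root node down to a leaf for a given bid request.
def evaluate(node: Node, request: Map[String, String]): Double = node match {
  case Leaf(bid) => bid
  case Decision(feature, branches, default) =>
    val branch = request.get(feature).flatMap(branches.get).getOrElse(default)
    evaluate(branch, request)
}

// Toy tree: bid higher for Firefox users on ebay.com, otherwise fall back to lower bids.
val tree: Node = Decision("site", Map(
  "ebay.com" -> Decision("browser", Map(
    "firefox" -> Leaf(1.50)
  ), default = Leaf(1.00))
), default = Leaf(0.50))

evaluate(tree, Map("site" -> "ebay.com", "browser" -> "firefox")) // returns 1.5
```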

How do you optimize the values of a Node and the values of a Leaf?

First, let's look at the problem statements:

  1. What should the value of a node in a decision tree be? Decision trees are written in the Bonsai language (as shown in the diagram above), and these trees tend to grow very long if the right set of inputs is not identified correctly. At MiQ, we decide which features can be inserted as values of the nodes. Previously, the set of features considered for the decision trees was limited, and the datasets used for finding their values carried no information about correlations between features. The problem lay in inferring the right features and feature combinations from a larger set of data sources at minimal cost.
  2. What should the bid value at the leaf nodes of the decision tree be? Bid values are set based on the performance of a feature combination: if a combination is performing well in a campaign, its corresponding bid value can be increased.

For example, if we want to target all users from London visiting ebay.com from a Firefox browser, the feature combinations depicted above can be used while programming the decision tree. To target the right audience, the values of the feature combinations (website + country + browser) have to be derived by processing huge datasets (~5 TB). Processing datasets of this size was challenging, and our existing pipelines failed.

How did we address the above problem statements?

Both of the above problem statements can be addressed using decision-tree optimization. Let's delve into the multiple iterations we went through.

The first iteration — Hive + HDFS as a Platform

APB allows advertisers to bid on almost 40 different features, whose values can be derived from various data sources. Since APB lets you bid not only on a single feature but also on combinations of features, these data sources have to undergo heavy processing to find the right set of feature-value combinations to use while programming the decision tree.

The number of feature combinations can be derived using combinatorics: nCr, where n is the total number of features and r is the length of a combination. The number of feature-value combinations can be huge, depending on how many features we want to target on APB (as represented in the diagram above).
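To illustrate how quickly this count grows, here is a small Scala sketch; the feature names are illustrative, not the actual APB feature list.

```scala
// n choose r: the number of feature combinations of length r drawn from n features.
def nCr(n: Int, r: Int): BigInt =
  (BigInt(n - r + 1) to BigInt(n)).product / (BigInt(1) to BigInt(r)).product

println(nCr(40, 3)) // 9880 possible 3-feature combinations out of ~40 APB features

// Enumerating the actual combinations for a small, illustrative feature set.
val features = Seq("website", "country", "browser", "creative_size", "os")
features.combinations(3).foreach(println) // e.g. List(website, country, browser)
```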

To find the values of these feature combinations, a set of Hive queries was run against a ~5 TB dataset. This process was repeated for every feature-value combination to find the relevant audience to target. As the number of features increased, the number of feature-value combinations grew, and so did the time to process the 5 TB dataset, causing issues such as MapReduce failures and processing times of over 18 hours. This made the workflows unstable and costly in terms of cluster usage.

The second iteration — Hadoop-MR + HBase as an alternative approach

Instead of using Hive queries for processing, we tried an alternative approach: a MapReduce program with HBase as the primary key-value store. The key difference was that all feature-value combinations were processed in a single pass over the 5 TB dataset, unlike the first iteration, where the dataset was scanned once per feature combination. This saved the time spent scanning the data multiple times.

The smaller dataset was loaded into HBase, and the larger dataset was passed to MapReduce, which internally did a lookup into HBase to produce the desired output.
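A rough sketch of that lookup pattern, written here in Scala against the standard HBase client API for consistency with the later Spark code; the table name, column family, and qualifier are hypothetical, and connection handling is simplified.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical table and column names, for illustration only.
val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("feature_lookup"))

// Called from the MR map step: look up the pre-loaded smaller dataset by key.
def lookupFeatureValue(key: String): Option[String] = {
  val result = table.get(new Get(Bytes.toBytes(key)))
  if (result.isEmpty) None
  else Option(result.getValue(Bytes.toBytes("f"), Bytes.toBytes("value")))
    .map(bytes => Bytes.toString(bytes))
}
```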

Problems:

  • HBase internally uses Apache ZooKeeper for coordination in distributed processing. ZooKeeper couldn't handle the large number of client requests generated when the MR job communicated with HBase. This led to slow MR jobs and thread-blocking issues.
  • Heavy garbage collection, caused by the async I/O used in the MR code. In this case, async programming proved to be more of a curse than a blessing.
  • HDFS spills + disk I/O: most of the time (roughly 6 hours) was spent writing data to disk and reading it back. CPU cycles were wasted because processing finished quickly and then had to wait for the disk I/O to complete.

Final iteration — Optimization using Spark + Databricks

The Databricks platform provides cloud-based big data processing using Spark. The following strategies were adopted while optimizing the decision-tree pipeline.

Steps:

  • We rewrote the entire map-reduce logic for the data explosion in Scala on Spark, using high-performance Spark APIs. This reduced the processing time from 1.5 hours to ~20 minutes.
  • Spark broadcast joins: these joins are preferred when one of the two datasets being joined is much smaller than the other. One of our datasets was ~60 GB of text, compared to the other at ~5 TB (gzip compressed), so using broadcast joins remarkably improved join performance.
  • spark.sql.shuffle.partitions: a Spark parameter that lets you choose the number of partitions used during shuffles; an optimized value distributes the load evenly across the cluster (a combined sketch of this setting and the broadcast join follows this list).
  • Databricks also supports smooth autoscaling of clusters, which reduces the cost of running these workflows daily.
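Here is a hedged sketch of how the broadcast join and the shuffle-partition setting fit together in a Spark job; the input paths, formats, join key, and partition count are illustrative rather than our production values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("FeatureCombinationExplosion")
  // Illustrative value; in practice tuned to the cluster size and data volume.
  .config("spark.sql.shuffle.partitions", "2000")
  .getOrCreate()

// Hypothetical input paths and join key, for illustration only.
val auctionLogs   = spark.read.parquet("s3://bucket/auction_logs/")   // the ~5 TB dataset
val featureLookup = spark.read.parquet("s3://bucket/feature_lookup/") // the smaller dataset

// Broadcast the smaller side so the large dataset is never shuffled for the join.
val joined = auctionLogs.join(broadcast(featureLookup), Seq("auction_id"))
```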

Of all the iterations, the optimization using Spark and Databricks proved the most efficient, cutting the runtime from 6 hours on Hadoop to 45 minutes, along with a significant reduction in cost.

How can these findings be applied to the node and the leaf of a decision tree?

Through the optimized pipeline described above, we can process datasets much faster to find relevant features and feature combinations. The generated data then goes through model training to identify the top-performing features and feature combinations, which are finally converted into the Bonsai language (representing a decision tree) using Python-based code.

For example, based on the data generated from the optimized pipeline, for the feature-value combination (website + browser + creative size) it is advisable to place a higher bid on ebay.com, as its CVR (conversion rate) is higher.
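As a purely hypothetical illustration, per-combination statistics could be turned into leaf bid multipliers along these lines; the CVR definition, column names, numbers, and thresholds are made up for the example and are not MiQ's actual logic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().appName("LeafBidValues").getOrCreate()
import spark.implicits._

// Hypothetical aggregated stats per feature-value combination; numbers are made up.
val combinationStats = Seq(
  ("ebay.com",    "firefox", "300x250", 120000L, 3100L),
  ("example.com", "chrome",  "728x90",   90000L,  400L)
).toDF("website", "browser", "creative_size", "impressions", "conversions")

// Map each combination's CVR to a bid multiplier for the corresponding leaf.
val leafBids = combinationStats
  .withColumn("cvr", col("conversions") / col("impressions"))
  .withColumn("bid_multiplier",
    when(col("cvr") >= 0.02, 1.5)      // strong performers: raise the bid
      .when(col("cvr") >= 0.005, 1.0)  // average performers: keep the base bid
      .otherwise(0.5))                 // weak performers: bid down
```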

Summary:

A feature (the value of a node) in an APB tree represents a characteristic of an ad. Earlier, both the features and the datasets used to find them were limited (feature combinations of length 3–4 over ~400 GB of data). With our current optimizations, we can process feature combinations of length 7–8 over ~5 TB datasets. This has enhanced our ability to write high-performing decision trees.

Overall, there was a significant improvement in terms of both scalability and cost.
