Ad Viewability and Feature Selection for Big Data

Xandr Engineering · Published in Xandr-Tech · Jan 4, 2016

In today’s post, I will unite the theory behind a machine learning technique with a real-world industry problem. I’ll describe one problem we needed to solve, how we solved it by building a new tool, and how you can also leverage it!

Hopefully, after reading this, you'll have a better idea of how we use data at AppNexus. Or maybe you can sympathize with me after encountering yet another memory issue, and this tool will help alleviate your pain!

Ad Viewability Prediction: THE PROBLEM

At AppNexus, we now have the technology to measure whether an ad impression was viewable — that is, whether the ad was in the viewable area of the screen for a given period of time, typically 1 second. See below for a nifty diagram:

Measuring viewability helps to make digital advertising more accountable — buyers can better understand which impressions have an impact, reducing waste and even ad fraud.

Now, while reporting on the result of an impression is great, what our clients really want is to use viewability to make smarter bidding decisions ahead of time. Using the vast measurement data that we collect, we can let clients bid with respect to an ad slot's predicted viewability rate. For example, if we predict that an ad slot will be viewable 60% of the time, we can scale what the advertiser pays for that ad slot by that 60%. We will also use the predicted viewability rate as the foundation for a viewable-only marketplace — which we actually announced in November!

OK — That’s Nice. How did you do that?

For all of you data scientists out there, you'll recognize this as a prediction problem: we need to predict the likelihood of an ad being viewed at the time of auction. To do this, we decided to use logistic regression with L1 regularization — we needed a flexible model that can easily incorporate new signals coming through our data pipeline, and logistic regression is well known and easy to understand. Furthermore, its output is always between 0 and 1 — perfect for a probability prediction.

Another important reason for choosing this technique is that L1 regularization conveniently gives us a way to perform feature selection. What does this mean? Our data has many different attributes (e.g., site domain, creative size, country) that contribute varying degrees of additional information to our model. We can use this technique to keep only the top contributing attributes. If we didn't, we could waste a lot of resources ingesting unimportant data or end up overfitting. I'll explain this more in depth later in this post.

One key theme that will persist throughout this post is the sheer amount of data that we have. At AppNexus, we process over 50,000 rows of data per second. With more data, our models can be more accurate. However, we need a way to ingest and build at scale. Therefore, the main restriction in choosing a tool to solve this problem is scalability.

More Math Deets, Pls

Let’s talk a little more about what logistic regression is doing. Its formulation follows a linear model, so let’s start with a more familiar topic: linear regression.

For linear regression, the model predicts a value by finding the coefficients for each feature that minimize the least squares error. Let's say that there are only two features, x1 and x2. Then,

y = βᵀx = β0 + β1x1 + β2x2

where β and x are vectors. In the non-condensed form, you can think of β0 as the intercept, and β1, β2 as the coefficients of the first and second features, x1 and x2, respectively.

With such a format, it’s easy to re-train this type of model when a new feature is included. Logistic regression looks similar in fitting coefficients to a linear model, but the dependent variable is categorical. This blog post concentrates on the binomial case — where there are 2 possible outcomes. In the case of viewability, y would be a success — “ad was viewed” or a failure — “ad was not viewed.” It utilizes the logistic function, which can take in any value but returns an output between 0 and 1.

This is accomplished by assuming that the log of the odds is a linear function of x:

log(pi / (1 - pi)) = βᵀxi

where β and xi are vectors.

You can rewrite this statement as:

pi = 1 / (1 + e^(-βᵀxi))

where pi will be your prediction per observation i.

This logistic function approaches 1 as the input approaches infinity and approaches 0 as the input approaches negative infinity, satisfying the requirement to return a value between 0 and 1.
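As a quick illustration of this limiting behavior, here is a minimal Python sketch of the logistic function (purely illustrative, not part of our production code):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large positive inputs approach 1; large negative inputs approach 0.
print(logistic(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```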

There are many packages and solvers available — such as those in scikit-learn, R, and MLlib. It's also not too difficult to implement using gradient descent or another solving technique — most approaches just require finding the roots of the first derivative of the log likelihood with respect to the coefficients.
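To make that "not too difficult" claim concrete, here is a minimal gradient-ascent sketch for plain (unregularized) logistic regression. The toy data and hyperparameters are made up for illustration; this is not our production solver:

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iter=500):
    """Fit logistic regression by gradient ascent on the log likelihood."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # predicted probabilities
        beta += lr * X.T @ (y - p) / len(y)    # gradient of the average log likelihood
    return beta

# Toy data: true intercept 1.0, true slope 2.0 on the first feature only.
rng = np.random.RandomState(0)
X = rng.randn(500, 2)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 + 2.0 * X[:, 0]))))
print(fit_logistic_gd(X, y))  # should roughly recover [1.0, 2.0, 0.0]
```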

L1 Regularization

Now let me go a step further to explain why we chose to add in L1 regularization. As stated earlier, feature selection is important for machine learning, especially when you have a large number of features. Many of these attributes may not contribute to the model and may even lead to overfitting. L1 penalty, or regularization, is one technique to detect the most predictive features.

The formula to solve this regression problem looks like this, where ℓ is the log likelihood:

maximize ℓ(β)   subject to   Σj |βj| ≤ C

If the problem is highly constrained, C is tiny, and the β coefficients are forced to be very close to 0. As C grows toward a large number (let's say infinity), you are left with the unconstrained problem (back to the original problem)!

Visually, as the C constraint is relaxed (as you continue to increase it), the regularization path can be traced out.

At a high level: think of starting the constraint C at the smallest value possible. At this point, all of the non-intercept βs (the feature coefficients) are zero. On the graph, since the x-axis is the C constraint (the sum of the absolute βs), this starts at (0, 0). As C increases, the next highest contributing β will "pop out," corresponding to the next most prominent feature. Reading from 0 to the largest C, you can see the most important features enter the model in order.
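One way to trace a path like this yourself (a rough sketch using scikit-learn on made-up data, rather than our internal tool) is to refit the model over a grid of increasing C values and watch which coefficients become non-zero:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative): only 3 of 20 features are truly predictive.
rng = np.random.RandomState(0)
X = rng.randn(1000, 20)
true_beta = np.zeros(20)
true_beta[:3] = [2.0, -1.5, 1.0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

# As C grows (the constraint relaxes), coefficients "pop out" roughly
# in order of importance.
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    print(f"C={C:<6} non-zero features: {np.flatnonzero(model.coef_[0]).tolist()}")
```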

Since we were already planning to use logistic regression, taking advantage of this built-in feature selection was a logical extension.

Massive Data

Adding in L1 makes solving this problem more difficult because of the absolute value in the constraint, which is non-differentiable. Again, there are many solvers and packages out there that handle L1 regularization. However, we need something that can cope with the amount of data we have at AppNexus. We tackle this in two ways: 1) by reducing the size of the data and 2) by using distributed computing.

Reducing the data with weights

Many solvers are implemented by formulating the log likelihood as follows:

ℓ(β) = Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

There are n observations; i denotes the index of an observation, pi is the probability of a success (a view), and yi is 1 in the case of a success event.

Note that each observation is therefore a single trial. In the viewability case, each row of data is a single impression, either resulting in a view or not. However, our data size is too large, so if possible, it is better to weight the “same observations” and collapse them into a single row.

For example, the original data set has one row per impression: the feature values (site domain, creative size, country, and so on) alongside a 0/1 outcome indicating whether that impression was viewed.

But to make the number of rows smaller, we can group all trials with the same feature values into a single row that carries a count of impressions (trials) and a count of views, as in the sketch below.
In doing this, for a particular data set, we were able to shrink the number of rows to process from 5 billion to 300 million.
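Here is a hedged PySpark sketch of that aggregation step; the column names and values are illustrative, not our actual schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("viewability-aggregation").getOrCreate()

# Hypothetical impression-level log: one row per impression with a 0/1 outcome.
impressions = spark.createDataFrame(
    [
        ("example.com", "300x250", "US", 1),
        ("example.com", "300x250", "US", 0),
        ("example.com", "300x250", "US", 1),
        ("news.example", "728x90", "DE", 0),
    ],
    ["site_domain", "creative_size", "country", "viewed"],
)

# Collapse identical feature combinations into one weighted row:
# trials = m_i (number of impressions), views = y_i (number of views).
aggregated = impressions.groupBy("site_domain", "creative_size", "country").agg(
    F.count("*").alias("trials"),
    F.sum("viewed").alias("views"),
)
aggregated.show()
```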

Mathematically, this means changing the log likelihood equation to the following:

ℓ(β) = Σi [ yi log(pi) + (mi - yi) log(1 - pi) ]

Here, yi denotes the number of views and mi the number of trials (impressions) for the ith observation.
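A minimal NumPy sketch of this weighted log likelihood (variable names are my own, not the tool's; the constant binomial-coefficient term is dropped since it doesn't affect the maximizer):

```python
import numpy as np

def weighted_log_likelihood(beta, X, views, trials):
    """Binomial log likelihood over aggregated rows.

    X:      feature matrix (intercept column included), one row per unique
            feature combination
    views:  y_i, the number of viewed impressions for row i
    trials: m_i, the total number of impressions for row i
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # predicted view probability per row
    return np.sum(views * np.log(p) + (trials - views) * np.log(1.0 - p))
```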

Using Multiple Machines

Furthermore, because of our data size, we needed a tool that utilizes distributed systems. We can’t hold all of our data in memory. There are some tools out there like MLlib on Spark that can handle L1, but we needed to be able to use a pre-aggregated row format.

OUR SOLUTION

To tackle our data size, we created a new tool that solves for the optimal β coefficients by maximizing this new form of the log likelihood function. We solve for the βs using coordinate descent, a well-known optimization algorithm for objective functions with a non-differentiable penalty (Friedman et al., 2010).
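Purely as an illustration of the idea (not the open-sourced implementation, and with made-up variable names), here is a simplified single-machine sketch of coordinate descent with soft-thresholding for the L1-penalized, weighted problem:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_logistic_cd(X, views, trials, lam, n_sweeps=200):
    """Simplified coordinate descent for L1-penalized logistic regression on
    aggregated rows. Column 0 of X is assumed to be the intercept and is not
    penalized."""
    beta = np.zeros(X.shape[1])
    # Per-coordinate curvature bound, using p(1 - p) <= 1/4.
    lipschitz = 0.25 * (trials @ (X ** 2))
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # current predictions
            grad_j = X[:, j] @ (trials * p - views)  # d(-loglik)/d(beta_j)
            step = beta[j] - grad_j / lipschitz[j]
            # Intercept is not penalized; other coordinates are soft-thresholded.
            beta[j] = step if j == 0 else soft_threshold(step, lam / lipschitz[j])
    return beta
```

(A production solver would use a convergence check and smarter updates rather than a fixed number of full sweeps, along the lines described by Friedman et al.)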

This tool allows us to process across multiple machines by leveraging PySpark. Computationally, the bottleneck in the regression is the summations. Since sums are commutative and associative, they map naturally onto PySpark's map-reduce framework.

Furthermore, gains are made by manipulating the RDD at the partition level. Instead of keeping each partition as a list of lists, the tool forces each partition to hold a single 2-dimensional NumPy array. Since this NumPy array is contiguous in memory (and has the benefit of being implemented in C), it's a lot faster than the typical RDD structure!
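To illustrate the idea (this is a toy sketch with made-up rows, not the open-sourced code), each partition can be materialized as one NumPy block, and the per-partition partial sums combined with a reduce:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-gradient").getOrCreate()
sc = spark.sparkContext

# Illustrative aggregated rows: [intercept, feature_1, feature_2, views, trials].
rows = [
    [1.0, 0.3, 0.7, 40.0, 100.0],
    [1.0, 0.9, 0.1, 10.0, 50.0],
    [1.0, 0.5, 0.5, 70.0, 200.0],
]
rdd = sc.parallelize(rows, numSlices=2)

# Turn each partition's list of rows into one contiguous 2-D NumPy array.
blocks = rdd.mapPartitions(lambda it: [np.array(list(it))]).cache()

beta = np.zeros(3)  # current coefficients

def partial_gradient(block, beta=beta):  # beta bound at definition time
    """Per-partition contribution to the gradient X^T (m * p - y)."""
    X, views, trials = block[:, :3], block[:, 3], block[:, 4]
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (trials * p - views)

# The gradient is a sum over rows, so partial results combine with "+".
gradient = blocks.map(partial_gradient).reduce(lambda a, b: a + b)
print(gradient)
```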

By combining Python, Spark, and NumPy, we were able to achieve the flexibility of a scripting language as well as much of the performance of Scala and C, the underlying languages of Spark and NumPy, respectively.

How Can I Use This?

To recap, this tool was built for the scale of our use case, namely by supporting weighted (aggregated) trials and by parallelizing across many machines.

The great thing about this tool is that it can be extended to many other problems. At AppNexus, we work with a lot of probability predictions, making logistic regression a good candidate. We can therefore leverage the feature selection piece of this tool to cut down a lot of the attributes we use in our models, including video completion optimization and conversion predictions.

We have open sourced the code for you to use on our GitHub repo *.

Check it out!

* Big shout-out to fellow contributors on this project: Abe Greenstein, Sandesh Devaraju, Umar Aftab, and Moussa Taifi!

References

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33.1 (2010): 1.

Written by Stephanie Tzeng
