Mobile Ad Click Prediction : Machine Learning Models behind the billion dollar Ad industry (Part-2)

An end to end case study on Avazu Ad click prediction data

Rohan_Thoma
16 min read · Jan 28, 2022

In part 1 of this blog we performed exploratory data analysis on the Avazu ad click data and engineered 5 new features that are very useful for this problem. Now, in part 2, we will see how to harness this data by applying various machine learning models to tell apart ads that will be clicked from ads that won't.

Let us look at our performance metrics, the machine learning objective and the constraints before we apply our models.

Mapping the business problem to a machine learning problem:

It is a binary classification problem, as we are predicting the probability that an ad will be clicked or not.

Performance metrics

  • Binary log-loss: Since we are dealing with probability values, log-loss is the best metric here, as it not only penalizes the model for predicting the wrong class but also penalizes it in proportion to how far the predicted probability is from the correct label.
  • Confusion matrix, precision and recall matrices: If we really want to measure how much value our model adds in the real world, log-loss alone cannot give us the full picture. We need to know how the model performs on both non-clicks and clicks, and that is where these metrics come into play; they are extremely useful when the data is highly imbalanced, as it is in this scenario.

Machine Learning Objective

We need the test binary log-loss of our model to be as low as possible.

Constraints

  • Low latency: We only have seconds to grab a viewer's attention, and on mobile devices people tend to browse quickly and skip things of no interest. In the actual business application we need to pick the ads with the highest predicted probability of being clicked and display them in ranked order, so low latency is extremely important.
  • Interpretability: We need to understand what kind of features make an ad more clickable, so that we get better insight into how to maximize ad revenue.

Data preparation

Since we have timestamps for the data, we can perform a time-based split. The data is already ordered by the timestamps, so we can split it directly without shuffling.

Let us check the distribution of the classes in the train and the test data.
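Here is a minimal sketch of both steps, the time-based split and the class-distribution check (the file name and the exact 80/20 ratio are placeholders; the full code is in the repo):

```python
import pandas as pd

# Hypothetical file name for the 5M-row sample; the rows are assumed to be
# already sorted by the 'hour' timestamp.
data = pd.read_csv('train_5M_sample.csv')

# Time-based split: since the data is in chronological order, a positional
# 80/20 split (4M train, 1M validation) works without shuffling.
split_idx = int(0.8 * len(data))
train_df, test_df = data.iloc[:split_idx], data.iloc[split_idx:]

# Check that the click / non-click ratio is similar in both splits.
print(train_df['click'].value_counts(normalize=True))
print(test_df['click'].value_counts(normalize=True))
```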

The distributions are similar, so nothing has changed much between the splits and we can proceed further.

Feature Encoding

Since all the features here are categorical with a high number of unique values, commonly used encoding techniques like one-hot encoding would result in a very large feature set and a high-dimensionality problem. So we use a label-style encoding where the CTR per category of a feature serves as a nice proxy for the encoded value; this gives us only one column per feature after encoding, which keeps the dimensionality the same.

Here we define 2 functions: one fits on the train data and the other transforms any given data.

  • We compute the CTR (Click-Through Rate) for each value a categorical feature takes; for values that appear only in the test data, we substitute the mean CTR of that feature. The method used to compute the CTR is the same one we followed during the exploratory data analysis.
  • The ‘response_fit()’ function takes the feature name and the data-frame as parameters, computes the CTR per category, stores it in a dictionary named “vocab”, also computes the mean CTR, and returns both.
  • The ‘response_transform()’ function takes the vocab dictionary, the mean CTR and a dataset, and encodes the data by looking up each value in the vocabulary built from the train data.

Now that we have defined the functions, we encode the values along the lines of the sketch below.
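A minimal sketch of how these functions might look (the exact signatures in the repo may differ; categorical_features, train_df and test_df are assumed from the earlier steps):

```python
import pandas as pd

def response_fit(df, feature):
    """Compute the CTR for every category of `feature` on the train data."""
    vocab = df.groupby(feature)['click'].mean().to_dict()   # clicks / impressions per category
    mean_ctr = sum(vocab.values()) / len(vocab)              # fallback for unseen categories
    return vocab, mean_ctr

def response_transform(df, feature, vocab, mean_ctr):
    """Replace each category with its train CTR; unseen values get the mean CTR."""
    return df[feature].map(vocab).fillna(mean_ctr)

# Encode every categorical feature; `categorical_features` is the assumed
# list of column names built in part 1.
for col in categorical_features:
    vocab, mean_ctr = response_fit(train_df, col)
    train_df[col] = response_transform(train_df, col, vocab, mean_ctr)
    test_df[col] = response_transform(test_df, col, vocab, mean_ctr)
```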

After transforming all the features, we can proceed to modelling. But our performance metric is log-loss, and log-loss has no upper bound, so we first design a random model that predicts random outputs and measure its log-loss. That log-loss becomes our upper bound and a reference point for judging how well our actual machine learning models perform, because at the very least a model should not do worse than randomly predicted output.

We also have some utility functions to plot the confusion matrix along with the precision and recall matrices, which you can see in the GitHub repo.

Machine Learning models

1. Random model

Now we will measure the performance of the random model as shown below.

  • We generate 2 random numbers per data point, representing the probability values of the two classes; dividing each number by the sum of the two normalizes them so they add up to 1.
  • We then take the values of one of the columns as the predicted probability of class 1 and compute the log-loss against the actual class labels of the dataset.
  • Now we plot the confusion, precision and recall matrices below.
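A rough sketch of this baseline (y_test is the array of true labels from the time-based split; the plotting helpers live in the repo):

```python
import numpy as np
from sklearn.metrics import log_loss

np.random.seed(42)

# Two random numbers per test point, normalized so they sum to 1;
# these play the role of predicted probabilities for class 0 and class 1.
rand = np.random.rand(len(y_test), 2)
rand_probs = rand / rand.sum(axis=1, keepdims=True)

# Log-loss of the random model, using the second column as P(click = 1).
print('Random model test log-loss:', log_loss(y_test, rand_probs[:, 1]))
```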

The confusion matrix values are totally random, which is what we would expect from a truly random model.

2. Logistic Regression with hyper-parameter tuning

Since we need the model to be fast, logistic regression is a good choice for a first model; it will also give us an idea of whether the data is linearly separable or not.

Here we plot the test log-loss for every value of alpha, the regularization parameter.
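A minimal sketch of the tuning loop, assuming an SGDClassifier with logistic loss since alpha is its regularization parameter (the repo may use a different estimator; X_train, y_train, X_test, y_test are the response-coded splits):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

alphas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
test_losses = []
for alpha in alphas:
    clf = SGDClassifier(loss='log_loss',   # use loss='log' on older scikit-learn
                        penalty='l2', alpha=alpha, random_state=42)
    clf.fit(X_train, y_train)
    test_losses.append(log_loss(y_test, clf.predict_proba(X_test)[:, 1]))

plt.plot(np.log10(alphas), test_losses, marker='o')
plt.xlabel('log10(alpha)')
plt.ylabel('test log-loss')
plt.show()
```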

  • We got the lowest test log-loss of 0.40037 for alpha = 0.00001.
  • Of the 5 million total data points, 4 million are used as train data and 1 million as validation data.
  • The test log-loss of 0.40037 for logistic regression with hyper-parameter tuning is much better than the random model's log-loss of 0.8857, which means our features are working.
  • Looking at the precision and recall matrices, we can see that the model predicts a lot of class-1 data points as class 0, but this is expected given the class imbalance in this problem.
  • Let us see if we can improve the log-loss with the help of Gradient Boosted Decision Trees.

3. Gradient Boosted Decision Trees ( XGBoost )

Yes, I know I am still using XGBoost at a time when the current trend is LightGBM and CatBoost, but XGBoost beats them in sheer performance when we correctly tune the parameters and dedicate the resources and time to do so.

Here we tune the parameters with the help of random search. Random search expects a scoring metric, where a higher score means a better model. Log-loss, however, is a loss metric and not a score, so we need to pass the negative of the log-loss, which can be done by building a custom scorer with scikit-learn's make_scorer() function.
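A hedged sketch of what such a random search could look like (the parameter ranges are illustrative, not the exact grid from the repo):

```python
from scipy.stats import randint, uniform
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Random search maximizes the score, so we negate log-loss and score on
# predicted probabilities (newer scikit-learn: response_method='predict_proba').
neg_log_loss = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

param_dist = {                        # illustrative ranges, not the exact grid
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
}

search = RandomizedSearchCV(XGBClassifier(eval_metric='logloss'), param_dist,
                            n_iter=10, scoring=neg_log_loss, cv=3, verbose=1)
search.fit(X_train, y_train)
print(search.best_params_)
```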

After getting the best parameters, we retrain the model with them to get the best model and compute the log-loss.

  • The test log-loss of 0.39475 for Gradient Boosted Decision Trees with hyper-parameter tuning is a very good improvement over the 0.40037 we got for logistic regression.
  • The precision for class 1 also improved, from 0.587 to 0.609. Although this is a marginal improvement over the linear model, it shows that non-linear models work better than linear models on this data.

4. Field Aware Factorization Machines ( FFMs )

Most datasets in CTR prediction contain features which are sparse in nature, so we need to extract meaningful interactions from the data; this is where FFMs come into play.

You can read the original research paper here. Factorization Machines are non-linear models which can be used to extract hidden features from the dataset. They learn an implicit vector representation for each feature, where each hidden vector has ‘k’ dimensions, so that the effect of a feature combination is modeled as the inner product of two hidden vectors. Field-aware Factorization Machines (FFMs) are a slight evolution of FMs: they also take into account the field of each feature, which is likewise represented with hidden vectors. The idea of FFMs comes from PITF (Pairwise Interaction Tensor Factorization), which is used in recommendation systems with personalized tags and is explained in this paper. For CTR prediction, features can generally be grouped into fields, and FFMs make full use of this group information. A nice explanation is given in the blog below.
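For reference, the pairwise interaction term from the FFM paper, where f_j is the field of feature j and w_{j,f} is the latent vector that feature j uses against field f:

```latex
\phi_{\mathrm{FFM}}(\mathbf{w}, \mathbf{x})
  = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n}
    \langle \mathbf{w}_{j_1, f_{j_2}}, \mathbf{w}_{j_2, f_{j_1}} \rangle \, x_{j_1} x_{j_2}
```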

The original implementation by the authors of the paper comes with its own set of problems and compatibility issues because it is written in C++. There is an alternative pure-Python implementation named pyffm, available here, which is very easy to use and closely follows the original paper.

Now we will use this model and see if we can get better results. The y-label needs to be included in the dataset under the column name ‘click’, so we first add it and then proceed with training and hyper-parameter tuning as sketched below.
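A minimal sketch using pyffm (the training_params keys are assumptions based on the package README; check its documentation for the exact names):

```python
from pyffm import PyFFM   # pip install pyffm

# pyffm expects the label as a column of the training DataFrame,
# named 'click' by default.
train_ffm = X_train.copy()
train_ffm['click'] = y_train.values

# The training_params keys below are assumptions; verify them against the docs.
ffm = PyFFM(model='ffm', training_params={'epochs': 5, 'reg_lambda': 1e-5})
ffm.train(train_ffm)

test_probs = ffm.predict(X_test)   # predicted P(click = 1) for the test rows
```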

Here we plot the value of log-loss for each value of the regularization parameter tried.

  • Here we got a lowest test log-loss of 0.406, which is no better than logistic regression despite FFM being a non-linear model.
  • This may be because we need to pass sparse encoded features to the model, as mentioned in the corresponding paper.
  • The precision for class-1 data points has also taken a hit, dropping to 0.487, which is worse than what we obtained with logistic regression. This again shows the importance of tracking precision and recall as metrics, since they give us an idea of real-world performance beyond a single value like log-loss.

Feature Engineering

GBDT features:

The GBDT features are proposed in this paper, which describes the winning solution of the RecSys 2015 e-commerce item recommendation challenge. Among the additional features proposed there are GBDT features, where the position of an instance in each of the trees represents its relation to the cost function at various stages of the gradient boosting process. This can be a very useful source of information, and we use it in the following way: the id of each tree is treated as a different feature, and the index of the leaf node that an instance falls into is treated as the value that feature takes. We can choose the number of trees, and that will be equal to the number of additional features we generate.

First we will train a Gradient Boosting Classifier on the original features and then get the leaf node index in each tree for each data point using the apply() method, as sketched below.
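A minimal sketch of generating the GBDT leaf features (max_depth and the variable names are assumptions; X_train, y_train, X_test are the response-coded frames from earlier):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# 100 trees means 100 additional leaf-index features per data point
# (max_depth is an assumption for illustration).
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3)
gbdt.fit(X_train, y_train)

# apply() returns the leaf index each sample falls into, per tree:
# shape (n_samples, n_estimators, 1) for binary classification.
train_leaves = gbdt.apply(X_train).reshape(len(X_train), -1)
test_leaves = gbdt.apply(X_test).reshape(len(X_test), -1)

# Join the 100 leaf-index columns onto the existing feature frames.
leaf_cols = ['gbdt_leaf_' + str(i) for i in range(train_leaves.shape[1])]
X_train_gbdt = X_train.join(pd.DataFrame(train_leaves, columns=leaf_cols, index=X_train.index))
X_test_gbdt = X_test.join(pd.DataFrame(test_leaves, columns=leaf_cols, index=X_test.index))
```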

  • We get an array of shape (4000000, 100, 1): each data point has been converted into a 100-dimensional vector holding the index of the leaf node it falls into in each of the 100 base estimators of the gradient boosted tree.
  • We need to reshape these arrays to 2D, join them to the train data-frame and then proceed with modelling.
  • We now have 100 new features along with the 19 original features and the 5 previously engineered features, for a total of 124 features.

To check the effectiveness of the newly added features, we will try simple logistic regression again and see whether it improves over the previous log-loss score.

5. Logistic regression with hyper-parameter tuning with GBDT features

The code is the same as what we have seen above.

  • Here we got a lowest test log-loss of 0.3966 for alpha = 0.01, as we can see in the graph.
  • This is certainly a very good improvement over the previous loss of 0.40037 without the GBDT features, which shows the effectiveness of the new features.
  • We can also see a slight improvement in the class-1 precision, from 0.59 to 0.603; although small, it is still an improvement that benefits us.
  • Since these features look promising, giving good results even with logistic regression, we can expect even better results from non-linear models like GBDTs.

6. XG-Boost with Hyper parameter tuning of 2 parameters

  • The dimensionality of our data has increased considerably with the addition of the GBDT features. With 124 dense features, we will run into memory issues if we proceed to train the model with the previous method.
  • We need to train the model in batches, but scikit-learn's random search has no support for batch training, even though XGBoost itself officially supports it. So we need to write our own tuning code, as sketched below.
  • We first tune 2 important parameters, the number of base estimators and the depth of the trees, in a grid-search fashion over 12 possible combinations. We train with a batch size of 0.5 million data points, since we have 4 million train data points in total.
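A rough sketch of the batch-wise tuning loop (the grid values are illustrative; the key mechanism is passing xgb_model to continue training across batches):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss

BATCH = 500_000   # 0.5M-row batches over the 4M-row train set

def train_in_batches(params, X, y, rounds_per_batch):
    """Train an XGBoost booster batch by batch, continuing from the previous booster."""
    booster = None
    for start in range(0, len(X), BATCH):
        dtrain = xgb.DMatrix(X[start:start + BATCH], label=y[start:start + BATCH])
        # xgb_model=booster continues training, adding rounds_per_batch more
        # trees fitted on the current batch.
        booster = xgb.train(params, dtrain, num_boost_round=rounds_per_batch,
                            xgb_model=booster)
    return booster

best_loss, best_cfg = np.inf, None
for rounds_per_batch in [10, 20, 30]:          # illustrative 3 x 4 = 12 combinations
    for max_depth in [3, 6, 9, 12]:
        params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
                  'max_depth': max_depth}
        booster = train_in_batches(params, X_train_gbdt.values, y_train.values,
                                   rounds_per_batch)
        loss = log_loss(y_test, booster.predict(xgb.DMatrix(X_test_gbdt.values)))
        if loss < best_loss:
            best_loss, best_cfg = loss, (rounds_per_batch, max_depth)
print('best params:', best_cfg, 'test log-loss:', best_loss)
```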

Here we are able to reduce the log-loss to 0.3936 by tuning just 2 parameters, which is the best score so far. So we will try tuning more parameters and check whether we can further improve the results.

7. XG-Boost with hyper-parameter tuning of 4 parameters with GBDT features

  • Here we tune 4 parameters (number of base estimators, depth of the trees, column sample by tree and subsample) with 144 possible combinations in a grid-search style. Yes, it sounds crazy, but we need to try out all the parameter combinations to get the results, and we have to leave the system running overnight.
  • The code is very similar to the above, just with more loops. You can always have a look at the full code in my repo here.
  • After getting the best parameters, we retrain the model with them to get the best model and compute the metrics on the test data.
  • With the GBDT features and XG-Boost we got a log-loss of 0.39065, the best so far, which is better than the 0.3936 we got when tuning only 2 parameters.
  • We are able to reduce the log-loss by about 0.003 by tuning just 4 parameters, which is a significant improvement. This change is also reflected in the confusion matrix, where the precision of class 1 increases to 0.607 from 0.587. The recall of class 1 improves only slightly, from 0.072 to 0.075, mainly due to the highly imbalanced nature of the problem.

Feature Importance

Let us plot the feature importance so that we know which features contributed the most to our problem, which we can get with a snippet like the one below.
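A short snippet that could produce such a chart, assuming final_booster is the tuned model from the previous step:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# final_booster is assumed to be the tuned XGBoost model trained above.
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(final_booster, max_num_features=25, ax=ax)
plt.show()
```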

  • Looking at the feature importance chart, we can see that the GBDT features definitely play a very important role in classifying the data points. Alongside them, some of our engineered features, like day of the week and device id counts, also play a significant role, together with many of the anonymized categorical features such as C17.
  • Apart from the GBDT features, ‘app_id’ and ‘site_domain’ seem to have contributed the most among the originally given features.

Data pipeline

All the steps are written inside the function pipeline(), whose code can be found in the GitHub repo. This function takes raw data points as input and returns the predictions, i.e. the probabilities of each ad being clicked.

Since low latency is one of our business constraints, we need to keep the prediction time as low as possible. So we will first measure the time it takes to predict the labels for 4.5 million test data points and then measure the time taken to predict the label for a single data point.

1. Prediction time for 4.5 million data points

This is actually the test data given by Kaggle, without the test labels. To create a submission file we can use the pipeline() function and, along the way, measure the time taken for prediction.
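A sketch of how the bulk prediction and the submission file could be timed (pipeline() is the function from the repo; the file name is a placeholder):

```python
import time
import pandas as pd

kaggle_test = pd.read_csv('test.csv')     # ~4.5M rows, no 'click' labels

start = time.time()
preds = pipeline(kaggle_test)             # pipeline() is defined in the repo
print('Predicted', len(kaggle_test), 'rows in', round(time.time() - start, 1), 's')

# Kaggle submission file: id plus the predicted click probability.
pd.DataFrame({'id': kaggle_test['id'], 'click': preds}).to_csv('submission.csv',
                                                               index=False)
```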

2. Prediction time for 1 data point

Note that we need to pass the data point along with its column names, so we have to get the data point by slicing the original data frame.
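A sketch of timing a single prediction, reusing the Kaggle test frame loaded above:

```python
import time

# Slice with iloc[[0]] (not iloc[0]) so the single row keeps its column names.
single_point = kaggle_test.iloc[[0]]

start = time.time()
prob = pipeline(single_point)
print('P(click) =', prob[0], '| prediction time:', round(time.time() - start, 3), 's')
```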

The models used in this case study were chosen keeping in mind the low-latency requirement of the industry. The measured time for the final prediction with the tuned GBDT model on a new data point is 0.083 seconds, or about 83 milliseconds, which is very practical for deciding in this fast-paced setting whether a given ad will be clicked or not.

Final Conclusions

“Here we have got a test log-loss of 0.39065, which is a very good value, equivalent to roughly the 160th to 170th position on the Kaggle leaderboard.”

The actual dataset given by Kaggle has 50 million data points, but we took a 5-million-point sample and trained our model on it due to resource constraints. The test data given by Kaggle is itself close to 5 million data points, so we will not get equally good predictions on it with the current model, because there will be a lot of unseen data points that it will not be able to predict correctly.

  • Using the current model on the Kaggle test data generated a log-loss of 0.41, which is not bad for a model trained on just 5 million data points when the actual dataset is much larger; it means we trained on only about 10% of the given data.
  • If we could train the model on the whole 50 million data points, the score would be much better. But keeping the real-world application of the solution in mind, that reduction in log-loss would contribute very little value from a business standpoint, as we have not been able to see any considerable improvement in the confusion matrix values.
  • So we can conclude that by engineering 6 different types of new features, we are able to get one of the lowest log-losses, 0.39065, on the test data, even though we are not able to get a significant improvement in the confusion matrix values, which is what matters most in the actual business use of the model.
  • The results that we have got from all the models are tabulated as shown below.
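For quick reference, here are the test log-loss values reported above:

Model                                              Test log-loss
Random model                                       0.8857
Logistic Regression (tuned)                        0.40037
XGBoost (tuned)                                    0.39475
Field-Aware Factorization Machines                 0.406
Logistic Regression + GBDT features                0.3966
XGBoost + GBDT features (2 parameters tuned)       0.3936
XGBoost + GBDT features (4 parameters tuned)       0.39065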
