ElasticNet4j: A Performant Elastic Net Logistic Regression Solver in Java
But does it scale? It’s probably the most destructive question in Machine Learning history. It does not matter how accurate a model is if it cannot be deployed.
Scale is particularly important for our digital advertising products at Xandr. We conduct well over a hundred billion auctions for online ads daily on our real time platform. The Xandr Invest bidding engines allow our buy-side clients to participate in these auctions which typically requires real-time computation of bids that use an assortment of automated optimization algorithms. These algorithms utilize machine learning, control theory, optimization theory, and Bayesian statistics to ensure that ad campaigns spend their budgets on the specified schedule while appropriately valuing impressions and optimally shading bids. They also use models trained on billions of ad impressions, so model training needs to be as efficient as possible. Since we maintain some of the lowest margins in the industry, simply throwing more machines at the problem to achieve scale is not an acceptable solution. As a result, the first thing anyone at Xandr asks when discussing a new algorithm is, “Does it scale?” Because if it does not, we cannot use it.
One of the primary engagement metrics used by digital advertisers is user clicks, so accurate real-time prediction of the click-through-rate (CTR) is often needed to determine when and how much to bid. One component of Xandr’s click prediction algorithm is a logistic regression model with an L1 penalty. It models the relationship between the probability of a click and several categorical predictors related to the online ad auction. There are many open source packages that train logistic regression models, but we couldn’t find one that scaled and satisfied our technical requirements. So, we wrote our own.
Product Requirements and Challenges
At Xandr, the combination of our industry, scale, and business model created many unique technical requirements that our logistic regression solver needed to satisfy. Some of these issues are detailed below.
We respect each advertiser’s data privacy. To enforce this, we create one model per Ad Campaign, each trained using its own historical data. We also perform cross validation to search for the best hyperparameter on a dynamically generated grid of regularization parameters. This entails iteratively training and evaluating hundreds of models for each ad campaign to obtain the best one to be used for bidding. In addition, concept drift and ever-changing market dynamics make frequent model updates preferable (currently each ad campaign gets a new model approximately every three hours). We also need to scale the training process up to accommodate tens of thousands of ad campaigns. As a result, when choosing a training algorithm, computational efficiency is a primary consideration.
The majority of our model training predictors are high-dimensional categorical type, that have undergone common feature engineering techniques such as one hot encoding and hashing. As a result, we end up with extremely large but very sparse design matrices. Our training algorithm needs to take advantage of this sparsity and handle matrix operations efficiently.
Model scoring happens in real-time and each ad campaign uses its own model, so our bidder needs to store and score a large number of models efficiently. As a result, we need sparse models, but ones that also do not compromise prediction accuracy. Sparse models are also easier to debug and interpret. To accommodate this, we use L1 regularization. However, L1 penalty is non-differential, so many existing regression packages do not cleanly support the L1 penalty.
Our training data has highly skewed feature distributions and extremely imbalanced target classes. During the research phase, we observed divergence issues in model training. As a result, we needed a training algorithm where most convergence issues could be detected and corrected.
Compatibility and Maintainability
Our model training pipeline must integrate into complex existing engineering systems that are written in Java. Therefore, the solution should be in the same language environment, have few dependencies, and not require a bunch of ‘glue code’ to integrate into our stack. As a result, many existing general purpose packages did not meet our needs.
Each ad campaign has different business strategies and can therefore have very different distributions of data. Freedom to modify our training algorithm is a big advantage. For example, we dynamically generate different regularization penalty factors for different features. This gives us significant ad campaign performance gains. The majority of the off-the-shelf packages lack that option.
When we started building an engineering solution to generate logistic regression models for predicting CTR, we experimented with different open source machine learning packages. We could not find one that satisfied our business and computational performance requirements. We also tried popular distributed frameworks that came with their own Machine Learning suite, but we could not meet scalable performance requirements with these tools.
After surveying different training algorithms and experimenting with them, we found that the Path-wise Coordinate Descent algorithm by Friedman, Hastie and Tibshirani was a good fit for our specific problem. The coordinate decent aspect makes it well suited for L1 regularization, scalable with wide datasets, and makes convergence issues easy to detect. Moreover, it is a second order method, but it does not require expensive and potentially unstable Hessian inversions or first order approximations of the inverse Hessian (which can lead to slow convergence when the objective function has low curvature). The algorithm can handle large datasets, takes advantage of feature sparsity and calculates model weights along the full regularization path. When solving for the full regularization path, the algorithm uses current parameter estimates as a warm start for the next parameter, which provides significant speed gains compared to other variations of Gradient Descent or quasi-Newton methods.
Based on these considerations we built our in-house Machine Learning library ElasticNet4j that both fits our business needs while achieving desired computational performance. The library implements the GLMNET logistic regression algorithm in Java to operate on sparsely featured data. When training click prediction models, each ad campaign's dataset is fit into the memory of a single machine for data-locality and the algorithm operates on the dataset on a single core for cache coherence to train models quickly. The implementation has optimizations to minimize the number of calculations and caches results for better performance.
Our LR Library ElasticNet4j was built to be simple, efficient and performant. Our engineering solution uses multiple instances of the library simultaneously in a multi-threaded environment on a single bare-metal machine or, more recently, Docker containers on Kubernetes to generate CTR models for thousands of ad campaigns every hour. This library makes it easy to incorporate new optimizations into the current training algorithm or create new trainers that plug in to the library seamlessly.