Lessons Learned from Building Scalable Machine Learning Pipelines

Moussa Taifi PhD
Published in
11 min readApr 23, 2018


Click Prediction ML Software Pipelines from the Trenches

(Joint work with Tian Yu and Yana Volkovich)

This work was presented at PAPIs Europe 2018.
View the video recording here.

The Data Science team at AppNexus is committed to developing powerful tools and solutions for our customers. We have been steadily investing in machine learning (ML) infrastructure to achieve these goals. In 2016/early 2017 we built the AppNexus Click Prediction Machine Learning Pipeline and we want to report back to the community our lessons learned and challenges to be aware of when building such a large scale ML system. This post is intended to ease the path of the future ML practitioners.

From a business perspective, the opportunity is tremendous. The revenue of digital advertising has been reported to climb to 84$ billions for 2017, and by 2021 digital ad spend is expected to represent a majority of all U.S. ad spend. To be competitive in the industry, advertisers have to rely on accurate CTR (click–through-rate) prediction to spend their budget effectively and hit their KPIs (key-performance-indicators). AppNexus helps advertisers to achieve their goals by providing accurate CTR predictions based on machine learning models. As part of this, AppNexus needs to tackle a complex set of Data Science challenges such as evaluating machine learning models efficiently to avoid timeouts, and to optimize memory usage to control technical costs.

The Data Science Challenge

The challenge of managing machine learning pipelines is one of the most central issues, because Engineering/Infrastructure teams are still discovering how to do it optimally.

Central Challenge for ML pipelines

In one direction, Engineering/Infrastructure teams value high reliability and low technical debt, but in the other direction data science research requires high flexibility for fast paced algorithmic experimentation.

The aim of this post can be summed up into 4 lesson learned. We recommend that all those lessons are to be addressed when building a new large scale ML system from scratch.

Lessons Learned for a successful ML pipeline

1. Research-Production Parity;

2. Preventing Pipeline Jungles;

3. Experiments Versioning and Collaboration;

4. Model Freshness Trade-offs.

Research-Production Parity:
Every ML team should take the below aspects very seriously because the interaction with Engineering/Infrastructure teams is crucial for a successful deployment. The main aspects that should be taken into account are:

  • Data Dependencies: Availability of required data sets in terms of look-backs, data types, and granularity.
  • Data Semantics: Stability of meaning of the data fields especially when IDs are used to encode values.
  • Model Runtimes: The platforms where researchers experiment should be as close to the production runtimes as possible.
  • ML Packages Dependencies: Care must be taken to make sure that ML packages used at research time matches the production packages in terms of assumptions, defaults parameters, and internal behavior. For example, hashing functions have to be the same in the experiment pipelines, the production pipelines, and the model serving layer.
  • Model Serving Infrastructure: Research team should have a deep understanding of the requirements of the serving layer in terms of feature transformation costs and latency requirements at prediction time.

Preventing Pipeline Jungles
Usually pipeline jungles are symptoms that manifest themselves when research and engineering groups are separate entities in an organization. The main purpose of such pipelines is to transform raw data into intermediate forms that can be fed to ML libraries. The following diagram shows a natural evolution that ML teams go through:

ML feature store evolution

This evolution starts from individual custom scripts that have the highest level of flexibility in terms of feature engineering, model training, optimization methods, and ML packages available to the researchers. As the scale grows the data science team needs larger resources, therefore, shared data pre-processing jobs are set-up to amortize common feature transformations. As the next step a Unified Data Store has to be developed to amortize costs even further. Finally, the feedback loop needs to be closed for an end-to-end ML product. This materializes in storing the ML Feature data, the Models and the Metrics in a unified location for easier monitoring and introspection of production releases as well as enabling faster iterations for new experimental directions.

Experiments Versioning and Collaboration
The key question for a reproducible experiment is: What are the boundaries of an ML Experiment? As ML products start scaling up, so does the size of the ML teams responsible for experimenting with new methods for improving the predictive performance of the ML product. Coordination and reproducibility issues start appearing. The Data Science team, therefore, benefits from determining the boundaries of their experiments as early as possible.

ML Experiment boundaries

In the diagram above ML experiments are composed from multiple pieces: the source code, the generated models and the resulting metrics produced by this source code. However, this might not be enough in terms of reproducibility. The Data Science team members need to make sure that they include the experiment metadata, the data, and package dependencies as well as specifications for the runtime/scale requirements for each experiment. In addition, logging out any experiment metadata in a centralized/shared database helps manage parallel experiments, and building dashboard to monitor the active experiments’ health.

Model Freshness Trade-offs
Model freshness is a major concern for real-time ML applications. The dynamic nature of the data sources make it imperative to update the deployed models as frequently as possible. On the other hand, re-training models can be a costly operation that should be balanced against the benefits from new models performance improvements.

Mutual interactions between ML model update frequency, cost and predictive performance

Fine tuning the ML training costs require custom scheduling methods. The evolution of the scheduling mechanism can start with simple cron jobs with no guarantees. Next, reliability concerns can be addressed with simple retry logic with capped retry counts. Further, instead of solely relying on a time-based frequency for training jobs, teams can use the predictive performance of their models to guide the frequency of the jobs. Finally to control cost, in addition to the predictive performance, the technical cost of running training jobs can be added to the equation and used to decide on the frequency of the training and model updates. The following graph shows the evolution of a model scheduler:

Evolution of ML Job Scheduling methods

Another observation is that using interpretable model types (e.g. logistic regression) allows the monitoring of the changes in the parameters of the models. Using this information, the model training jobs’ frequencies can be reduced if the models generated do not change dramatically. This can help the team save on training costs. In a way, this method can be used as a proxy for “concept drift” monitoring, and help guide the frequency of the update of deployed models to balance cost with predictive performance.

Now, as we learned 4 main lessons above, we are ready to walk through some of the details of our own AppNexus Click Prediction Machine Learning Pipeline.

Case Study

As a short case study we present an ML pipeline design for an advertising use case: Click prediction in an RTB (Real Time Bidding) setting. The goal of this ML pipeline is to gather data on inventory, users, and advertiser information, and to train ML models to predict the likelihood of someone clicking on an ad, at the time of auction. This is one of the core optimization parts of the Ad Tech business.

Life of an Ad Call from 50000 feet

RTB, in a nutshell, provides a way to determine the value of an ad call in real time. In the diagram above we give a 50,000 feet view of the life of an ad call. When an ad call happens, the publisher sends AppNexus a request for a bid for the tag/ad-slot. The impression bus (i.e. Impbus) carries these ad calls to multiple bidders, which in our case, use a machine learning model to determine the value of this ad call. This bid is transferred back to the Impbus that runs an auction to decide which creative/ad to serve along side the publisher content.

This pipeline faces many challenges and requirements:

  • Scalability to handle 400 Billion impressions seen per day and 10 Billion transactions per day at peak;
  • Cost Effectiveness to lead to a successful ML product;
  • Fast Iteration to enable research teams to quickly explore new model types to improve predictive performance;
  • Dynamic Data Sources that are coming from world-wide streams of impressions/bid requests;
  • Tight Model Serving Constraints due to bid requests’ volume and timeouts’ requirements (<100ms response time required per prediction).

ML System architecture

To perform this large scale machine learning task, many components work together to provide a reliable and efficient way for handling these ML product requirements.

ML System Architecture

A typical ML pipeline involves assembling train/test data sets from multiple data sources, performing feature engineering/transformations, training the ML models, evaluating and tuning the models as well as deploying, serving, and monitoring the models’ performance. The above system architecture shows the 4 main steps in the production ML pipeline.

We briefly highlight some of the components in terms of lessons learned described above.

ML Feature Data Warehouse vs. Just-in-time Feature Transforms

The above diagram shows the spectrum of options for managing ML feature data. On the far right hand side, we see the just-in-time feature transform method. This method provides the highest level of flexibility for researchers in terms of feature engineering. This can produce a high number of custom scripts to pre-process data sets and lead to pipeline jungles in some cases.

The method that we opted for was a unified ML feature data warehouse. By keeping the right level of granularity at the observation level in the ML feature data warehouse, research teams are able to use custom look-backs for both back and live testing without having to pre-process data sets in redundant ways. At the same time production jobs can amortize feature transformations using jobs with customizable frequency. This has a net effect of reducing ML pipelines jungles and new experiments can be performed with minimal data pre-processing effort.

Model Training and Hyper-Parameters Tuning

In terms of model training, logistic regression models produce sparse models which help with reducing prediction time and improving interpretability. In addition, we provided both Python and Java/Scala libraries for the model training component to enforce research-prod parity.

Another component of the model training is hyper-parameters tuning. This tuning is technically expensive if left without monitoring and prior-guidance. To reduce our initial search space we pre-constrain the space by running offline experiments to determine adequate ranges for the hyper-parameters. This helps tune the models freshness because reducing the training time allows for more frequent model updates if needed.

Model Serving

ML research does not traditionally take into account the constraints that come with model evaluation in real-time in the model serving infrastructure. In our RTB use case there exist multiple constraints that are introduced by the model serving layer. Bid responses have very tight response windows allowed (i.e. <100ms). Therefore, the model serving engine must at all costs reduce both the amount of time it spends on converting bid-request data to supported features and on calculating the score from the currently attached predictive models.

The above two constraints impact the design of the model candidates both in terms of model types allowed to be deployed and in terms of the features types/values selected. For example, requesting candidate feature transformations needs to be based on expected model predictive performance as well as runtime predictive performance. Adding different levels of data sizes generates different outcomes:

  • Adding a single integer id as feature: 😄
  • Full text of the ad context: 😅
  • Full video of the ad context: ❌
Feature Inclusion Loop

An example loop for feature inclusion is described above. The main message is that both predictive and runtime performances have to be taken into account when adding a new feature or feature transformation to the serving layer. This requires back testing new features and then evaluating them on live traffic before employing them in production.

Live Model Performance Monitoring, Alerting, and Safe Guards

Monitoring of ML outputs requires sophisticated methods and is usually ignored in traditional ML research. However, in real business situations
KPIs and model health are the first aspects that need to be tracked efficiently, and at the right level of aggregation/abstraction, while allowing deeper investigation for transient issues that are common in large scale distributed systems like ours.

The main lesson learned here is that different channels are needed for different stakeholders:

  • For Business metrics such as CTR, delivery rates and other business level KPIs, the product and engineering managers need high level aggregated dashboards that show trends and overall behavior of the product results.
  • For Model-specific metrics such as prediction bias, train/test/validation log-loss, model parameter ranges and other metrics related to the fine-grained health of the ML pipeline operation, the data science team members and prod engineering owners need tools and dashboards that allow them to dig into the details of the pipeline. This can be achieved with a mix of low level data analysis tools that allow for product unit level analysis.


This post is a brief overview of the details of an ML pipeline that we contributed to at AppNexus. The full details are under a second-round review and should be published in a PMLR paper later in 2018. The paper contains other details and solutions related to this pipeline that can help future ML practitioners design their ML pipelines, including some of the lessons learned that we found useful for our work.

This work was presented at PAPIs Europe 2018.
View the video recording here here.


We would like to thank Abraham Greenstein, Anna Gunther, Catherine Williams, Chinmay Nerurkar, Lei Hu, Megan Arend, Paul Khuong, Ron Lissack, Sundar Nathikudi, and Noah Stebbins for being active participants though the process of discussing, designing, developing, and implementing the AppNexus Click Prediction Machine Learning Pipeline.

Related work:
Hidden Technical Debt in Machine Learning Systems D.Sculley, Google
Scaling Machine Learning as a Service Li Erran Li, Uber

Product reference:
AppNexus Programmable Platform

We are hiring! Please checkout our open roles: https://xandr.att.jobs/job/new-york/data-science-platform-engineer/25348/12859712



Moussa Taifi PhD

Senior Data Science Platform Engineer — CS PhD— Cloudamize-Appnexus-Xandr-AT&T-Microsoft — Books: www.moussataifi.com/books