Facebook offers a wide variety of products and services, most of which leverage machine learning. Each machine learning model first requires a training phase and then, once deployed, an inference phase. During training, a model can ingest hundreds of terabytes of data. During inference, depending on the product, a model may be run tens of trillions of times per day, generally in real time. System design is therefore key to meeting these requirements and guaranteeing good performance.
This post analyzes two papers published by Facebook (He et al. 2016; Hazelwood et al. 2017) to highlight the importance of system design in machine learning, illustrating three lessons that will be useful for any machine learning engineer.
Group multiple models into a cascading classifier and store them on separate servers
Online inference must be optimized for latency: the models need to produce results very quickly without sacrificing prediction quality. A common approach is to chain multiple models into a funnel architecture, or cascading classifier. Initial stages use simple models that make fast passes over large quantities of candidates, while later stages apply more complex models with sparse embeddings to the much smaller set that survives.
When designing the architecture of a cascading classifier, Hazelwood et al. argue that simple and complex models should run on separate servers. Because of their sparse embeddings and large number of parameters, complex models are memory-intensive, so the later passes should run on different servers from the initial ones. Some models can even run directly on users' mobile devices to reduce latency and increase decoupling.
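The funnel idea can be sketched in a few lines. This is a hypothetical two-stage cascade (the function and scorer names are illustrative, not from the papers): a cheap model scores every candidate, and only the top-k survivors reach the expensive model, which in production would live on its own memory-heavy server.

```python
def cascade_rank(candidates, cheap_score, expensive_score, keep=100):
    # Stage 1: fast pass over the full candidate set with the simple model.
    survivors = sorted(candidates, key=cheap_score, reverse=True)[:keep]
    # Stage 2: slower, more accurate pass over the small survivor set.
    return sorted(survivors, key=expensive_score, reverse=True)

# Toy usage: rank 10,000 integer "ads" with stand-in scoring functions.
ads = list(range(10_000))
ranked = cascade_rank(
    ads,
    cheap_score=lambda a: a % 997,          # crude, fast proxy score
    expensive_score=lambda a: -abs(a - 5000),  # pretend heavyweight model
    keep=100,
)
```

The expensive scorer is only ever called on `keep` items, which is what makes the latency budget workable even when the full candidate set is huge.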
Combine online and offline training to build a complex but fast hybrid model
Cascading classifiers have many applications, such as predicting clicks in online advertising systems. At Facebook, the online advertisement system filters large quantities of ads to guarantee that only relevant, personalized ads are shown to users. These systems have two important requirements: very low latency and freshly generated data.
The data freshness requirement comes from the fact that the data distribution changes over time, so prediction accuracy degrades as the training data gets older. He et al. illustrate how a delay of even a few days worsens the model's normalized entropy, their accuracy metric.
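Normalized entropy, as He et al. define it, is the average log loss per impression divided by the entropy of the background click-through rate, so that a model which only predicts the base rate scores exactly 1 and better models score below 1. A minimal sketch:

```python
import numpy as np

def normalized_entropy(y, p):
    """Average log loss of predictions p against labels y, normalized by
    the entropy of the average click-through rate. Lower is better."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    eps = 1e-12  # guard against log(0)
    log_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    ctr = y.mean()  # background click-through rate
    base = -(ctr * np.log(ctr + eps) + (1 - ctr) * np.log(1 - ctr + eps))
    return log_loss / base

labels = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
ne_base = normalized_entropy(labels, np.full(10, labels.mean()))  # ≈ 1.0
ne_good = normalized_entropy(labels, np.where(labels == 1, 0.9, 0.1))
```

The normalization makes the metric insensitive to the base rate, which is what lets it be compared across datasets with different click-through rates.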
An additional requirement on the model is to generate feature transformations that improve its accuracy. He et al. argue that boosted decision trees (BDTs) can implement non-linear and tuple (feature-crossing) transformations: the trees learn the most useful transformations by interacting with the data and yield good embedding-like representations.
The problem with these two practices (using fresh data and using BDTs) is that they are difficult to optimize jointly: BDTs need to be trained offline, which works against the data freshness requirement.
He et al. propose an interesting solution: a hybrid model that combines a BDT with a linear classifier. The joint model uses the output of each individual tree (the index of the leaf a sample lands in) as a categorical input feature to the linear classifier.
The training process can then be broken into two parts: the boosted decision trees are trained offline (on a roughly two-day cadence), while the linear classifier is trained in near real-time. At inference time, the two models are joined.
This model would be used in the last stage of a cascade classifier for click prediction, and it tries to combine the best of both worlds: a complex model with many parameters, trained offline, and a simple, sparse model that is highly personalized and trained on fresh data from user interactions.
Design an online system to train the model
He et al. introduce an experimental system that generates the real-time data used to train the linear classifier (shown in the previous section) via online learning. The system is referred to as an "online joiner" because it joins the outcome of a user interaction (the label) with the features of the impression. The joined examples become the input for training the model, creating a training loop.
He et al. discuss how to correctly label impressions and how to contend with the waiting-window problem: too long a waiting window increases latency, while too short a window mislabels some impressions, since clicks that arrive after the window closes are counted as negatives.
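The joiner's core logic can be sketched as follows. This is a toy, in-memory version (class and method names are hypothetical; the real system is distributed and stream-based): each impression is held for at most `window` seconds, a click inside the window emits a positive example, and expiry emits a negative one.

```python
class OnlineJoiner:
    """Toy sketch: join click labels to buffered impressions within a
    waiting window; impressions that expire unclicked become negatives."""

    def __init__(self, window=30.0):
        self.window = window
        self.pending = {}  # impression_id -> (timestamp, features)

    def impression(self, imp_id, features, now):
        # Buffer the impression until a click arrives or the window expires.
        self.pending[imp_id] = (now, features)

    def click(self, imp_id, now):
        # A click inside the window yields a positive training example.
        if imp_id in self.pending:
            _, features = self.pending.pop(imp_id)
            return (features, 1)
        return None  # too late: the impression was already labeled negative

    def flush(self, now):
        # Expire old impressions as negative training examples.
        expired = [k for k, (ts, _) in self.pending.items()
                   if now - ts >= self.window]
        return [(self.pending.pop(k)[1], 0) for k in expired]


joiner = OnlineJoiner(window=30.0)
joiner.impression("a", {"ad": 1}, now=0.0)
joiner.impression("b", {"ad": 2}, now=0.0)
positive = joiner.click("a", now=5.0)   # ({"ad": 1}, 1)
negatives = joiner.flush(now=40.0)      # [({"ad": 2}, 0)]
```

The window trade-off from the text is visible here: a click on "b" after `flush` returns `None`, i.e., a slow click is lost and the example stays mislabeled as negative.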
Finally, because the process runs as a continuous loop, it is important to monitor for system failures: a failure such as a stalled click stream can introduce a significant amount of noise into the training process.
Both He et al. and Hazelwood et al. show creative ways of dealing with the challenges of training and inference in machine learning models, and propose innovative architectures that provide elegant solutions to common problems. If you are going to design such a system, or are interested in machine learning more broadly, Facebook's research lab offers good lessons to start with.
He et al. (2016). Practical Lessons from Predicting Clicks on Ads at Facebook. https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

Hazelwood et al. (2017). Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. https://research.fb.com/wp-content/uploads/2017/12/hpca-2018-facebook.pdf