DeepR — Training TensorFlow Models for Production

Guillaume Genthial
Criteo Tech Blog · 8 min read · Aug 6, 2020

Authors: Guillaume Genthial, Romain Beaumont, Denis Kuzin, Amine Benhalloum

Links: Github, Docs, Quickstart, MovieLens Example

Introduction

At Criteo we build recommendation engines. We serve millions of recommendations every second, with millisecond latencies, drawn from catalogs of billions of products.

While scalability and speed are hard requirements, we also need to optimize for multiple, often changing, criteria: maximizing the number of sales, clicks, etc. Concretely, this translates into dozens of models trained every day, quickly evolving with new losses, architectures, and approaches.

DeepR is a Python library to build complex pipelines as easily as possible on top of TensorFlow. It combines ease-of-use with performance and production-oriented capabilities.

At Criteo, we use DeepR to define multi-step training pipelines, preprocess data, design custom models and losses, submit jobs to remote clusters, integrate with logging platforms such as MLflow and Graphite, and prepare graphs for Ahead-Of-Time XLA compilation and inference.

DeepR and TensorFlow

TensorFlow provides great production-oriented capabilities, thanks in part to the ability to translate graphs defined in Python into self-contained protobufs that can be reloaded by other services with more efficient backends. More importantly, this alleviates the need to rely on Python for serving and simplifies model deployment.

The switch to eager execution with TF2 opens new possibilities in terms of control flow. However, the static graph approach taken by TF1 and the Estimator API has some advantages in terms of serving capabilities and distributed training support. For legacy and maturity reasons, we chose to work with TF1 for now, waiting for TF2 to stabilize.

While pure deep learning libraries like TensorFlow, PyTorch, and JAX typically focus on defining computational graphs and provide (relatively) low-level tools, DeepR is a collection of higher-level functionality that helps with both model definition and everything that goes around it.

The main pain points are usually job scheduling, configuration, logging, and code flexibility and reuse, rather than implementing new layers. As we keep adding more models, we want them to coexist in one consistent codebase, limiting backwards-compatibility issues.

What are DeepR's strengths?

DeepR comes with a config system built for Machine Learning

One of the biggest problems DeepR addresses is configuration, a major challenge in machine learning codebases, as any function parameter may need to be exposed as a hyper-parameter.

When a parameter sits far down the call stack, exposing it can be particularly tricky and cause all sorts of issues; defaults, for example, might be difficult to change.

More importantly, it impacts how jobs are launched: it is not uncommon to end up with long commands carrying hundreds of parameters, half of them not even used but kept for backward compatibility.

The config system, most similar to Thinc and Gin-config, makes it possible to configure arbitrarily nested trees of objects, and it interfaces nicely with JSON.

The DeepR config system relies on dictionaries. Using the import string in the “type” key, the config system resolves the class to instantiate.

The config system, combined with Python Fire, results in a flexible and powerful command line interface at no additional cost. This lets us submit jobs to a remote cluster, as sending a config file along with a command is usually the easiest way to interact with a job scheduler.
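To make the mechanism concrete, here is a minimal sketch of how Python Fire turns an ordinary function into a CLI. This is not DeepR's actual entry point; the function and its arguments are hypothetical.

```python
import json

import fire


def run(config: str, macros: str = None):
    """Hypothetical entry point: Fire maps the function's arguments to CLI
    arguments, so `python cli.py run config.json macros.json` works without
    any argparse boilerplate."""
    with open(config) as file:
        job_config = json.load(file)
    # ... parse the config, instantiate the configured job, and run it


if __name__ == "__main__":
    fire.Fire({"run": run})
```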

Parse and instantiate a config into an object using a config and macros
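A minimal sketch of that flow, with helper names modeled on the DeepR quickstart; treat the exact class path and macro syntax as illustrative.

```python
import deepr as dpr

# A config is a plain dictionary; "type" holds the import string of the
# class to instantiate, and "$params:learning_rate" references a macro value.
config = {
    "type": "deepr.optimizers.TensorflowOptimizer",
    "optimizer": "Adam",
    "learning_rate": "$params:learning_rate",
}
macros = {"params": {"learning_rate": 0.001}}

# Resolve macro references, then instantiate the nested tree of objects.
parsed = dpr.parse_config(config, macros)
optimizer = dpr.from_config(parsed)
```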

In other words, we allow the code to stay the same by moving “breaking” changes to the configs. Also, by defining an object’s dependencies at the config level, we do not need “assemblers” or other dependency-injection mechanisms in the code, simplifying the overall design and maintainability. Finally, because parsing .json files is straightforward, we can reload and update a config programmatically, which is something we need for hyper-parameter search or scheduling.

DeepR encourages you to write jobs

A client submits jobs to Yarn. tf-yarn distributes training across workers, while Spark can be used to speed up inference and validation.

Another problem that DeepR addresses is pipelining.

Training a model is just another ETL job: the input is usually a dataset, and the output a protobuf containing the model’s graph and weights.

Calling model.train() is merely one of multiple steps. In general, we need to preprocess the data, initialize checkpoints, select the best model, compute predictions, export the model in different formats, etc.

DeepR adopts an approach similar to Spark’s with its Job abstraction, which, while flexible, encourages modular logic and code reuse.

As a bonus, jobs can run on different machines with different hardware requirements: preprocessing probably needs a machine with good I/O, while training would be faster on a GPU.

And it’s worth the effort: we were able to reduce our memory footprint by a factor of 4 and speed up training by a factor of 2, not only saving infrastructure costs but also lifting limitations that memory previously imposed.

A pipeline made of a build and a training job
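In code, such a pipeline might look like the following sketch. BuildDataset is a hypothetical job written for illustration; the Pipeline and Job classes follow DeepR's documented pattern, though exact signatures may differ across versions.

```python
import dataclasses

import deepr as dpr


@dataclasses.dataclass
class BuildDataset(dpr.jobs.Job):
    """Hypothetical preprocessing job: reads raw ratings, writes tfrecords."""

    path_ratings: str
    path_tfrecords: str

    def run(self):
        ...  # read path_ratings, serialize timelines to path_tfrecords


pipeline = dpr.jobs.Pipeline(
    [
        BuildDataset(path_ratings="ratings.csv", path_tfrecords="train.tfrecord"),
        # A training job (e.g. a Trainer configured with model and losses)
        # would typically follow.
    ]
)
pipeline.run()
```

Because each job is a self-contained object with a run() method, the build step can be scheduled on an I/O-heavy machine and the training step on a GPU machine.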

DeepR works with Hadoop (HDFS, Yarn), MLflow and Graphite

One of DeepR’s strengths is its tight integration with Hadoop, especially HDFS and Yarn, thanks in part to

  • tf-yarn (a library to train Estimators on yarn, also created at Criteo)
  • pex (a library to generate Python executables)
  • pyarrow (Python bindings for Apache Arrow)

In practice, this means that training a model locally or on a Yarn cluster requires no additional development time.

Package the Python environment as a pex on HDFS, upload the config to MLflow, and run the job on Yarn.

DeepR also provides a suite of tools to use MLflow for logging metrics and parameters, with support for distributed training and job scheduling on remote clusters.

Add the ability to save config files as artifacts, and we get full reproducibility, an easy way to track and compare experiments, as well as a centralized place for all config files, ready for deployment!
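Under the hood this builds on the standard MLflow tracking API; a minimal sketch of the kind of logging involved (parameter and metric names are illustrative):

```python
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.001)   # hyper-parameters
    mlflow.log_metric("loss", 0.42, step=100)  # training/eval metrics
    mlflow.log_artifact("config.json")         # the config itself, for reproducibility
```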

DeepR can also help with model definition for Estimators

DeepR also adopts a functional approach to model and layer definition, similar to Trax, Thinc, or the Keras functional API.

While TensorFlow and PyTorch provide low-level approaches to graph definition, the Estimator API around which DeepR is built works better with functional programming (especially with TF1 variable management), and we found it easier to manipulate higher-level logic blocks (layers) as functions, chaining them into directed acyclic graphs.

A model made of one embedding layer and a Transformer.
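To make the functional style concrete, here is a sketch in plain TF1 of layers as functions over a dictionary of tensors, chained into a model. It mimics the spirit of DeepR's Layer abstraction rather than its exact API, with simpler layers than the figure's Transformer for brevity.

```python
import tensorflow as tf


def embedding(tensors, vocab_size=10_000, dim=64):
    """Looks up an embedding for each input id."""
    table = tf.get_variable("embeddings", [vocab_size, dim])
    tensors["inputEmbeddings"] = tf.nn.embedding_lookup(table, tensors["inputIds"])
    return tensors


def average(tensors):
    """Averages the sequence of embeddings into one user vector."""
    tensors["userEmbeddings"] = tf.reduce_mean(tensors["inputEmbeddings"], axis=1)
    return tensors


def model(tensors):
    # Layers are just functions: chaining them defines the DAG that a
    # model_fn hands to the Estimator.
    for layer in (embedding, average):
        tensors = layer(tensors)
    return tensors
```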

In this way, the Layer abstraction provided by DeepR can be seen as a simple way to define graphs for the Estimator API. Note, however, that it provides this capability as a bonus: the rest of the codebase makes no assumptions about how graphs are created.

DeepR comes with a suite of tools to manipulate TF objects

Finally, DeepR comes with custom hooks, readers, predictors, jobs, preprocessors, etc. that bundle TensorFlow code. In spirit, it is similar to the legacy tf.contrib module: a collection of higher-level tools, missing from the core API, for manipulating native types.

Serialize a Dataset to tfrecords with ToExample and TFRecordWriter
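The same serialization expressed with raw TensorFlow, to show what a ToExample-style preprocessor produces under the hood (feature names are illustrative):

```python
import tensorflow as tf


def to_example(input_ids, target_ids):
    """Encode one timeline as a tf.train.Example protobuf."""
    feature = {
        "inputIds": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "targetIds": tf.train.Feature(int64_list=tf.train.Int64List(value=target_ids)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Write serialized examples to a tfrecord file.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(to_example([1, 2, 3], [4, 5]).SerializeToString())
```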

Get started

You can use DeepR as a simple Python library, reusing only a subset of its concepts, or build your own extension as a standalone Python package that depends on deepr. DeepR includes a few pre-made layers and preprocessors, as well as jobs to train models on Yarn.

For a short introduction to DeepR, have a look at the quickstart (on Colab).

The examples submodule of deepr illustrates what packages built on top of DeepR would look like. It defines custom jobs, layers, preprocessors, and macros, as well as configs. Once your custom components are packaged in a library, you can easily run any pipeline, locally or on Yarn, with

deepr run config.json macros.json

A word about recommender systems at Criteo

Some of our recommendation systems adopt the following approach. Given a timeline of items represented as vectors, a model predicts another vector meant to capture the user’s interests in the same space. Recommendations are the nearest neighbors of that user’s embedding.

At training time, the model contains a lot of parameters: one embedding for each item, plus the model weights themselves.

At inference time, we cannot reasonably imagine scoring a user embedding against all possible products. Using fast approximate nearest-neighbor search algorithms like HNSW, we delegate the ranking step to another service; the model only predicts the user embedding.

At inference time, the model sends a user embedding to an HNSW index instead of computing a Softmax over all possible movies.
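A sketch of that serving pattern using the hnswlib package; the index parameters are illustrative, and the per-item biases are ignored here for simplicity.

```python
import hnswlib
import numpy as np

dim = 64
# Item embeddings exported from the trained model (random stand-ins here).
item_embeddings = np.random.rand(100_000, dim).astype(np.float32)

# Build the index once from the exported item embeddings.
index = hnswlib.Index(space="ip", dim=dim)  # inner product, matching the model's score
index.init_index(max_elements=item_embeddings.shape[0], ef_construction=200, M=16)
index.add_items(item_embeddings)

# At serving time the model only outputs a user embedding...
user_embedding = np.random.rand(dim).astype(np.float32)
# ...and the index returns the top-k items instead of a full softmax.
labels, distances = index.knn_query(user_embedding, k=50)
```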

This use-case illustrates how much more complex real-life machine learning pipelines are than a simple train-and-evaluate loop. Not only do we need to define a graph and train the model, we also have to support different behavior at inference time, export some of the variables to other services (in this example, the item embeddings), etc.

Such a formulation has the advantage of transparency, as you can easily retrieve similar products with a nearest-neighbor search. In this example, you can see that picking a very specific product from a Criteo partner (board games) returns very similar products from a completely different partner, compared to more standard approaches (Best Ofs, etc.).

Comparison of different recommendations

Using DeepR on the MovieLens dataset

MovieLens is a standard dataset for recommendation tasks. It consists of movie ratings, anonymously aggregated by users. For a given user with some viewing history, the goal is to make the best movie recommendations.

We implement a simple baseline. Each movie is associated with an embedding and a bias. Given a user, a representation is computed as the average of the embeddings of movies seen in the past. The score of any recommendable movie is the inner product of the user embedding with the movie’s embedding, plus the movie’s bias.
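In numpy terms, the scoring logic of such an average model looks like this (a sketch, not the actual AverageModel code):

```python
import numpy as np


def score_movies(history_ids, embeddings, biases):
    """user = mean of watched-movie embeddings;
    score(movie i) = <user, embedding_i> + bias_i."""
    user = embeddings[history_ids].mean(axis=0)
    return embeddings @ user + biases
```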

During training, we train the embeddings and the biases, optimizing a BPR loss that encourages “good” recommendations to get better scores than “bad” recommendations.
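The BPR objective pushes each positive's score above a sampled negative's; in TF1 it boils down to a few lines (a sketch, not the library's BPRLoss):

```python
import tensorflow as tf


def bpr_loss(positive_scores, negative_scores):
    """loss = -log sigmoid(s_pos - s_neg), averaged over sampled pairs."""
    return tf.reduce_mean(-tf.log_sigmoid(positive_scores - negative_scores))
```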

It is also possible to use fancier models, like a Transformer, but we found it to be relatively unstable to train and not necessarily worth the effort from a production perspective.

You can have a look at the AverageModel as well as the corresponding BPRLoss implementations on Github, or train your model using either the config files or the Notebook on Google Colab.

The pipeline is made of 4 steps:

  • Step 1: Given the MovieLens ratings.csv file, create tfrecords for the training, evaluation, and test sets.
  • Step 2: Train an AverageModel (optionally, use tf-yarn to distribute training and evaluation on a cluster) and export the embeddings as well as the graph.
  • Step 3: Write predictions, i.e. for all the test timelines, compute the user representation.
  • Step 4: Evaluate predictions, i.e. look at how similar the recommendations provided by the model are to the movies the user actually watched in the “future”.

Parameters and metrics (screenshot from MLflow): recall@50 is 37%

We run the pipeline on Yarn, monitoring progress with MLflow, and then reload the embeddings to visualize their properties.

Five most similar movies

Conclusion

Go ahead, start playing with the notebooks, and please report any feedback via GitHub issues!

Links: Github, Docs, Quickstart, MovieLens Example
