Expedia Group Technology — Data

Speeding Up Inference Pipelines with Model Libraries at Expedia Group

Enabling machine learning model inference for time-critical applications.

Karl Lessard
Expedia Group Technology

--

Photo by Jeremy Bishop on Unsplash

At Expedia Group™, we usually deploy machine learning models as distinct services that other services and applications invoke remotely to retrieve AI predictions. This setup is useful because it entirely decouples the machine learning lifecycle from the rest of the software running in the company, giving machine learning science teams more flexibility to deploy and experiment with their models in production.

But in some cases, the additional cost of making a remote call to invoke a machine learning model cannot easily be afforded. For example, some of our systems process millions of transactions per second; introducing AI predictions into such a high-throughput workflow might require scaling up hundreds of new instances of a model server on expensive hardware. Another example is models that could process a large amount of data to provide better accuracy, but where the latency induced by transferring that data over the network forces client services to limit the size of their inference requests.

To accommodate such use cases, the Expedia Group ML Platform team has introduced a new feature that minimizes the latency of adding machine learning models to a production workflow: Model Libraries.

What is a Model Library?

In addition to deploying machine learning models as services, the ML Platform enables the distribution of these models as libraries that can easily be attached to any service running at Expedia Group. How does that work?

As in most large enterprises, JVM-based languages (Java, Kotlin, Scala…) are frequently used to develop the applications running at Expedia Group. For that reason, the ML Platform team has developed and internally distributed a Java framework specialized in executing machine learning inference workflows in production systems.

After training a model, its resulting artifacts are bundled into a single archive that is published to our model repository. Developers can use the inference framework to download these archives from the repository, embed their model in the application process and run inference to get predictions with only a few lines of code. Soon, developers will also be able to attach a specific version of a model library to their application at compile time.
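
The framework's actual API is internal and not shown in this post, but a short sketch gives an idea of what those few lines of code could look like. Every name below (ModelLibrary, load, predict, the "price-ranker" model and its version) is hypothetical and only illustrates the flow: download the archive from the repository, load the model in-process, run a prediction.

```java
// Hypothetical sketch only: ModelLibrary, load() and predict() are invented names,
// not the framework's actual public API.
import java.util.Map;

public class LocalPredictionExample {
    public static void main(String[] args) throws Exception {
        // Downloads the model archive from the model repository and loads it
        // into the current JVM process.
        try (ModelLibrary model = ModelLibrary.load("price-ranker", "1.4.2")) {
            Map<String, Object> features = Map.of("destination", "CDG", "lengthOfStay", 3);
            // Inference runs locally; no remote call is made.
            float score = model.predict(features);
            System.out.println("score = " + score);
        }
    }
}
```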

The key here is that model inference happens locally, within the same JVM process in which the service is running.

Local inference overview: the application process invokes the inference framework to download models and run inference

Model Inference on a JVM

How can Java run model inference? Most machine learning runtimes already provide a Java API to invoke their native libraries for inference, the same libraries used by their Python APIs or their model servers (e.g. TensorFlow Serving).

Not only is the cost of calling a native library from a Java process negligible, but doing so also leverages all the advantages of running on a JVM. For example, Java has strong support for multi-threading, so it is easy to run model inference concurrently, which is harder in Python because of its GIL. And nothing prevents machine learning scientists from continuing to build and train their models in Python, as long as the models are saved in a language-agnostic format, which is the case with many popular machine learning runtimes.
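
As a concrete illustration of concurrent local inference (not the internal framework itself), here is a minimal sketch using the TensorFlow Java API: a SavedModel is loaded once and its session, which is safe to call from multiple threads, serves predictions from a thread pool. The model path, the operand names (serving_default_features, StatefulPartitionedCall) and the output shape are assumptions that depend on how the model was exported.

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;
import org.tensorflow.ndarray.StdArrays;
import org.tensorflow.types.TFloat32;

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentInferenceExample {

    public static void main(String[] args) throws Exception {
        // Load the SavedModel once; the same session is shared by all threads.
        try (SavedModelBundle model = SavedModelBundle.load("/path/to/saved_model", "serve")) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Callable<Float>> tasks = List.of(
                () -> predict(model, new float[] {0.2f, 0.7f, 0.1f}),
                () -> predict(model, new float[] {0.9f, 0.3f, 0.5f}));
            for (Future<Float> score : pool.invokeAll(tasks)) {
                System.out.println("prediction = " + score.get());
            }
            pool.shutdown();
        }
    }

    private static float predict(SavedModelBundle model, float[] features) {
        // Operand names depend on the model's serving signature; adjust accordingly.
        try (TFloat32 input = TFloat32.tensorOf(StdArrays.ndCopyOf(new float[][] {features}));
             Tensor output = model.session().runner()
                     .feed("serving_default_features", input)
                     .fetch("StatefulPartitionedCall")
                     .run()
                     .get(0)) {
            return ((TFloat32) output).getFloat(0, 0);  // assumes a single score per example
        }
    }
}
```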

As illustrated in the previous diagram, our inference framework follows a modular architecture where computing capabilities are provided by automatically registering all the modules present in the application classpath. Each machine learning runtime supported by the platform is distributed as a distinct module wrapping its Java bindings and native library. Developers just need to include a dependency on the framework and select the desired runtime modules to start embedding machine learning models into their applications and producing predictions locally.
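
The article doesn't detail how this registration is implemented, but a common way to achieve classpath-based discovery on the JVM is java.util.ServiceLoader. The sketch below uses hypothetical interface names (RuntimeModule, RuntimeRegistry) purely to show the idea: each runtime module JAR declares an implementation, and the framework picks up whatever is present on the classpath at startup.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical SPI that each runtime module (TensorFlow, ONNX Runtime, ...) would implement. */
interface RuntimeModule {
    String name();                            // e.g. "tensorflow" or "onnx"
    AutoCloseable load(Path modelArtifacts);  // returns a runtime-specific model handle
}

/** Collects every runtime module found on the application classpath. */
final class RuntimeRegistry {
    static Map<String, RuntimeModule> discover() {
        Map<String, RuntimeModule> modules = new HashMap<>();
        // ServiceLoader finds any implementation declared in a
        // META-INF/services file named after the fully qualified interface.
        for (RuntimeModule module : ServiceLoader.load(RuntimeModule.class)) {
            modules.put(module.name(), module);
        }
        return modules;
    }
}
```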

Technical Challenges

The following sections describe a few challenges we encountered when using model libraries to enable local inference, and how we solved them.

Large or computation-intensive models

By letting application developers select only the machine learning runtime modules they need, we avoid bloating their applications with unnecessary native libraries, which are often quite large. For example, the TensorFlow JAR is about 100 MB.

But as we all know, the size of machine learning models can vary from relatively small to extremely large, often depending on their number of parameters. Not only does a model library need to be downloaded and stored on disk, but its model should also fit into memory. In addition, running inference can take a lot of CPU (or GPU), depending on the complexity of the model. The host application therefore needs to allocate and scale enough of these resources to accommodate the models it plans to run.

The application might also load more than one model, whether to produce different types of predictions or to conduct online evaluation of multiple variants of a model. Again, this needs to be considered beforehand and resources should be adjusted accordingly.

Models trained on runtimes not supporting Java

While only a few machine learning runtimes do not provide a Java API on top of their native libraries (e.g. scikit-learn, LightGBM, …), it might still be desirable to build a library for a model trained with one of them. In such a case, the model artifacts should be converted to another portable format that is compatible with Java.

ONNX is a great choice: not only are there already plenty of Python utility libraries that can take care of converting your model to it (e.g. onnxmltools), but it is also known to be one of the fastest machine learning runtimes for model inference (we have observed some models running 6x faster just by converting them to ONNX).

An important note is that some discrepancies might be observed between the results returned by the ONNX model and its original version. This is most likely because most computations in ONNX use 32-bit floating point while the original runtime used a higher precision. These differences are normally unnoticeable in the overall accuracy of your predictions, but you might still want to run a few additional validations before going to production.
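
Once converted, the ONNX artifact can be executed from the JVM with the ONNX Runtime Java API. Here is a minimal sketch; the model path, the input name ("input") and the shape of the output are assumptions that depend on how the model was converted.

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtSession;

import java.util.Collections;

public class OnnxInferenceExample {
    public static void main(String[] args) throws Exception {
        OrtEnvironment env = OrtEnvironment.getEnvironment();
        try (OrtSession session = env.createSession("/path/to/model.onnx", new OrtSession.SessionOptions());
             OnnxTensor input = OnnxTensor.createTensor(env, new float[][] {{0.2f, 0.7f, 0.1f}});
             OrtSession.Result result = session.run(Collections.singletonMap("input", input))) {
            float[][] scores = (float[][]) result.get(0).getValue();
            // Compare these scores with the original model's predictions to confirm
            // the 32-bit discrepancies stay within an acceptable tolerance.
            System.out.println("score = " + scores[0][0]);
        }
    }
}
```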

Data pre/post-processing code written in Python

Frequently, input data needs to be processed before being fed to the model, and the prediction might also need additional adjustments before being returned to the user. Machine learning scientists tend to prefer writing arbitrary Python code for this rather than altering an already trained model. This tightly couples the execution of the model to Python, which is inherently less performant than languages like Java or C++.

One solution is to convert this Python code to Java and distribute it as a library. You can then invoke the Java code from Python during the training process using a Python-to-Java bridge like JPype, and import the same code into your host application for inference. One drawback of this method is that machine learning scientists now need to write and maintain Java code, a language they might not be familiar with. Also, even if Java routines can be very fast, they lack convenient support for vectorized operations.

A better solution is to rewrite this logic with a machine learning runtime that can produce models for processing the data natively. For instance, NumPy code can be converted to a TensorFlow model using the TensorFlow NumPy API. Does that mean you'll need to execute up to three models to get a prediction (i.e. pre-processing, the trained model and post-processing)? Not necessarily: depending on the runtime you choose, you might be able to merge all of these models into a single one. For instance, Keras allows you to combine multiple TensorFlow models into one, and ONNX also exposes interesting composition capabilities. The resulting artifact can then be embedded and executed natively, delivering the best performance.

Conversion of multiple models into a single ONNX model: models from various runtimes, like TensorFlow and LightGBM, are converted to ONNX and merged into one

Conclusion

Embedding your models into applications that are closer to the customer can dramatically improve your users' experience while maximizing the benefits of adding machine learning to your workflows. But this approach also comes with challenges that need to be considered carefully before opting for it. Our recommendation is to use model libraries only when high performance and low latency are critical to unblock use cases that would not be possible, or would be too expensive, with standard model server deployments.


Karl Lessard
Expedia Group Technology

Currently working at Expedia Group as a Principal ML Engineer, and leading TensorFlow SIG JVM, the group that authors TensorFlow Java.