EXPEDIA GROUP TECHNOLOGY — DATA

Accelerate Machine Learning with the Optimal Deployment Pattern

Maximize business results with real-time, streaming and batch inferencing

Eric Landry
Expedia Group Technology

--

Photo showing a small yacht and a large ship entering the Miraflores Locks, Panama Canal
Miraflores Locks, Panama Canal. Photo by the author.

Efficiently and rapidly integrating with end user applications maximizes business value more than finely tuning model training techniques, squeezing marginal gains from an algorithm, or carefully choosing a data processing technique. Three years of integrating machine learning models with end user applications have taught us techniques and best practices for maximizing business results. This knowledge evolved from both failures and successes, and three essential integration and deployment patterns have emerged from these lessons.

Decorative separator

The primary model deployment patterns are real-time, streaming and batch inferencing. The right approach depends largely on the use case. Are the inference results required in real time, or can they be calculated offline? What are the operational requirements, such as latency and throughput, and what is the cardinality of the key space? These are the basic questions that need answers before making decisions about model deployment, integrations and architecture.

Machine learning model deployments in practice

A containerized approach is favored in part because machine learning (ML) models are often integrated into systems with a different runtime. Containerization also solves the scaling and system requirements mismatch, as the ML model container can be scaled independently of the integrating system.

Machine learning models are very different from traditional back-end systems. The code for ML models is usually written in Python, Scala, or R, while most of the backend systems at Expedia Group™ use various JVM languages, Node.js and others. This necessitates a model “handoff” from one runtime to the next and makes apples-to-apples testing extremely difficult.

Compute and memory requirements are also significantly higher for ML than for traditional backend systems. This means that adding machine learning directly into a backend system will likely change the requirements for the whole system, even though the additional capacity is only needed for the ML model.

Monitoring machine learning systems is more challenging as well. All the usual monitoring for uptime and latency applies, but it’s also important to monitor the predictions themselves to detect bias, train-test skew, and model drift.
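To make this concrete, here is a minimal sketch of one way to flag prediction drift, assuming a baseline of scores captured at training time and using a two-sample Kolmogorov-Smirnov test. The threshold, names and alerting hook are hypothetical, not our platform’s actual interface.

    # Hypothetical drift check: compare recent prediction scores against a
    # training-time baseline distribution. Threshold and data are illustrative.
    import numpy as np
    from scipy.stats import ks_2samp

    DRIFT_PVALUE_THRESHOLD = 0.01  # assumed alerting threshold

    def check_prediction_drift(baseline_scores: np.ndarray, recent_scores: np.ndarray) -> bool:
        """Return True if recent predictions look drifted from the baseline."""
        statistic, p_value = ks_2samp(baseline_scores, recent_scores)
        return p_value < DRIFT_PVALUE_THRESHOLD

    # Example: baseline captured at training time, recent scores pulled from monitoring logs.
    baseline = np.random.beta(2, 5, size=10_000)  # stand-in for training-time score distribution
    recent = np.random.beta(2, 3, size=1_000)     # stand-in for last hour of production scores
    if check_prediction_drift(baseline, recent):
        print("ALERT: prediction distribution has drifted; investigate model or inputs")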

Finally, the skill sets of data scientists and backend developers are also very different. Managing software in production requires a completely different skill set than building a powerful model. With the model integrated directly into the backend, data scientists need to know a lot about that system, and backend developers must understand the computational and memory requirements of the model.

Decorative separator

All of these differences slow down iteration and experimentation. Redeploying the entire backend stack for a critical system is slow and risky, even with appropriate application infrastructure. Translating the model from one runtime to another takes time, and testing is very tedious. The single most important thing a data scientist can do to increase a model’s economic impact is to test it online as much as possible. Decoupling model deployment from the integrating system, so the ML model can be iterated, validated and tested independently, is key to accelerating learning and realizing the full business impact rapidly.

Deployment patterns

Let’s look at the high-level deployment patterns that have evolved from experimentation and many model integrations. The three patterns mentioned previously were:

  • real-time synchronous inferencing,
  • offline batch inferencing,
  • and offline stream inferencing.

We have developed a platform to support these deployment patterns. The platform was built in stages, usually driven by data scientist and software engineering integration requirements. The result is a series of loosely coupled components that were developed based on real-world use cases. The modularity, feature support and self-service concepts have contributed to rapid development, integration and deployment of ML solutions.

Real-time synchronous

Diagram showing the caller interacting directly with a model service

This is a straightforward request/response pattern. The caller sends the payload containing the model features, then the service invokes the model in-memory and performs the inference, sending the result back to the caller. This pattern is very simple and follows a RESTful way of calling the prediction logic of models. The platform developed for deployment is entirely self-service and requires only a small amount of code on the data scientist’s part for instrumentation and packaging. With a micro-service architecture it can support integration with any Python, Java, or Scala machine learning framework.
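To make the pattern concrete, here is a minimal sketch of such a model service, assuming a scikit-learn binary classifier serialized with joblib and served with Flask. The endpoint, payload shape and model artifact are illustrative, not our platform’s actual interface.

    # Minimal request/response model service: the caller POSTs features,
    # the service runs in-memory inference and returns the score.
    from flask import Flask, jsonify, request
    import joblib
    import numpy as np

    app = Flask(__name__)
    model = joblib.load("model.joblib")  # assumed pre-trained model artifact

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()                          # e.g. {"features": [0.1, 3.2, ...]}
        features = np.asarray(payload["features"]).reshape(1, -1)
        score = float(model.predict_proba(features)[0, 1])    # assumes a binary classifier
        return jsonify({"score": score})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)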

An example use case for real-time synchronous inferencing is a natural language processing intent model in support of a virtual assistant. Real-time inferencing is necessary because it is impossible to precompute predictions for every possible user utterance.

Risks and challenges

The disadvantage of this pattern becomes clear when the model requires a significant amount of computational power, or when the required latency is low. Since the model is invoked per request, latency can be quite high depending on the model; it is primarily determined by the complexity of the model and/or the features required. In cases where real-time inferencing is required, tradeoffs may be needed between model accuracy and latency requirements. Model accuracy is measured by ML metrics such as AUC, F-score or MSE, but a model that is more accurate by these measures does not always achieve a better business outcome, so there should always be testing to confirm that the improvement carries through to the business result.

We also need to monitor the model not only for uptime, latency and failures but also for model drift. For various reasons, the model results may drift from an acceptable range or historic distribution, so the platform supports alerting and monitoring for these cases.

Offline batch inferencing

Diagram showing a batch job that operates on a data store populated by clients.

With some use cases, the model scores are not required at the time of the request and so can be precomputed. If the inference scores can be precomputed in an offline environment, they can be loaded into a key-value store as long as there’s a suitable key. Examples of common keys at Expedia Group are property ID, traveler ID and destination ID. Callers then only need to know the key to retrieve the scores. This transforms what was a complicated per-request inference into a simple lookup, reducing latency by at least an order of magnitude compared to the real-time pattern. Offline batch inferencing is typically achieved with an Airflow pipeline or other scheduled compute pipeline that inserts the inferencing results into an online key-value store.
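As an illustration, here is a simplified sketch of the scoring task such a pipeline might run, assuming a pandas feature table keyed by property ID and Redis as the key-value store. The file path, key format and choice of store are assumptions for the example, not a prescribed setup.

    # Offline batch scoring: compute predictions for the whole key space and
    # load them into a key-value store so callers can do a simple lookup.
    import json
    import joblib
    import pandas as pd
    import redis

    model = joblib.load("model.joblib")                                  # assumed model artifact
    features = pd.read_parquet("s3://bucket/monthly_features.parquet")   # hypothetical feature table

    scores = model.predict(features.drop(columns=["property_id"]))

    kv = redis.Redis(host="localhost", port=6379)
    for property_id, score in zip(features["property_id"], scores):
        # Callers later do a single GET by key instead of invoking the model.
        kv.set(f"income_prediction:{property_id}", json.dumps({"score": float(score)}))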

An example of this type of use case would be monthly income predictions for vacation rental properties. In this example, predictions only need to be calculated once a month with data from the prior twelve months. This also reduces cloud costs, since no micro-service needs to be kept online at all times as would be necessary for real-time inferencing.

Risks and challenges

This approach has drawbacks too. If the key space has a very high cardinality, then pre-computation and storage become impractical, necessitating a move to the aforementioned synchronous inferencing. Another drawback is the frequency of updates to the key-value store. In many cases a simple daily or hourly update will suffice; however, some integrations require the precomputed values to be updated continuously as new data arrives in the system. In these cases, the age of the predictions can be a determining factor; updating traveler “profiles” as they search on the site is an example of such a use case. Since the inferencing computation happens offline, care needs to be taken to validate both the source data and the output for sudden or unexpected changes, and the source data might simply become unavailable. Alerts need to cover input availability and quality as well as output values.
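To illustrate the kind of guardrails involved, here is a small sketch of input and output checks that could run before and after the offline scoring step. The specific checks and thresholds are placeholders rather than a specific alerting framework.

    # Illustrative validation hooks for an offline batch pipeline:
    # fail loudly (and trigger alerts) on missing or low-quality inputs
    # and on out-of-range outputs.
    import pandas as pd

    def validate_inputs(df: pd.DataFrame) -> None:
        if df.empty:
            raise RuntimeError("Source data unavailable: feature table is empty")
        null_rate = df.isna().mean().max()
        if null_rate > 0.05:  # assumed tolerance for missing values
            raise RuntimeError(f"Source data quality issue: null rate {null_rate:.1%}")

    def validate_outputs(scores: pd.Series) -> None:
        if not scores.between(0.0, 1.0).all():  # assumed valid score range
            raise RuntimeError("Output validation failed: scores outside expected range")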

Stream inferencing

A stream feeds a processor which feeds a model service. The processor also interacts with a data store and thence clients.

By combining an asynchronous stream consumer with the key-value store, we are able to listen to a stream of values (via Apache Kafka) and perform the predictions as the inputs arrive, rather than when the predictions are needed. The assumption, of course, is that the predictions won’t be needed immediately, but sometime after the features become available on the stream. We sometimes call this pattern stream2cache.

When the model input (including the lookup key) arrives on the stream, a stream processor service issues a request to the model to calculate the score. The stream processor then inserts the score into the key-value store, where it will be available at low latency going forward without recomputing the prediction. This results in continual, high-frequency updates to the key-value store. Should the volume of input events spike for a short period, the model service can be scaled to handle the traffic, or, if a delay is acceptable, the backlog can be processed once the volume of stream events returns to normal levels.
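Here is a rough sketch of what such a stream processor might look like, assuming kafka-python for consumption, an HTTP model service as in the real-time pattern, and Redis as the key-value store. The topic name, endpoint and key format are illustrative assumptions.

    # stream2cache sketch: consume feature events, call the model service,
    # and cache the prediction under the event's lookup key.
    import json
    import redis
    import requests
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "traveler-search-events",                        # hypothetical topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    kv = redis.Redis(host="localhost", port=6379)

    for event in consumer:
        payload = event.value                            # contains the lookup key and features
        response = requests.post(
            "http://model-service:8080/predict",         # hypothetical model service endpoint
            json={"features": payload["features"]},
        )
        score = response.json()["score"]
        # Cache the prediction so later caller requests are a low-latency lookup.
        kv.set(f"traveler_profile:{payload['traveler_id']}", json.dumps({"score": score}))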

Risks and challenges

The key-space challenge described for batch inferencing applies here as well. Unlike batch, the frequency of the inferencing schedule is not a concern, thanks to the continuous nature of the stream. However, if the application request and the inferencing are both triggered by the same event, the inference result might not arrive in the key-value store before the caller’s request.

Summary

The combination of straightforward self-service deployment, standardized infrastructure toolkits, flexible ML framework support, and separation from the integrating system has significantly improved the iteration speed (and therefore the effectiveness) of data scientists at Expedia Group. Since the development and adoption of the ML platform, we have rapidly grown the number of use cases powered by ML algorithms, with ongoing improvements and updates to those algorithms.

Decorative separator

Special thanks for contributions from Tim Renner and Robert F. Dickerson.

Learn more about technology at Expedia Group


Eric Landry
Expedia Group Technology

Director, Machine Learning Engineering at Expedia Group