Expedia Group Technology — Platform

Enabling Core Machine Learning Platform Capabilities

The journey of building the core components of the Machine Learning Platform

Anna Kelecsényi
Expedia Group Technology

--


In our previous article on the journey to creating a Unified Machine Learning (ML) Platform, we identified a number of missing or duplicated capabilities. This article describes how we enabled those capabilities.

We will discuss how we built a centralized model repository to become the single source of truth in Expedia Group™️, with the aim of enabling discovery and reuse of models within the business. We will also look at different deployment patterns and how we leveraged an open-source control plane project to bootstrap live inference deployments.

Why did we start to build the core components?

After observing the existing ML technology stack at Expedia Group, we identified two processes that are common and essential to every ML use case but had no centralized, unified solution across the business:

· Model Accountability
· Model Deployment

There was no solution for keeping track of models, their versions, and their metadata in a centralized way. Often, teams would maintain Excel sheets with the model details, which were hard to keep up to date, became outdated fast, and did not provide a way of discovering or sharing models across Expedia Group.

There were multiple solutions for deploying models, each working differently. There was no way of knowing how many machine learning model deployments existed in production, what model versions they used, or who owned them. It often became a challenge to reproduce a model deployment, since no common pattern was enforced.

These challenges were a clear indicator that we needed to implement a Model Repository Service and a Model Deployment Service that provide a common interface for Machine Learning Scientists and Engineers, solving the above-mentioned issues in a generic way that suits various use cases. These two services are the main core components of the ML Platform at Expedia Group.

High-level architecture of the core ML Platform components

Model Repository Service

The Model Repository Service (MRS) is designed to be a single source of truth for models developed in Expedia Group, providing customers the ability to:

· Register models and model versions,
· Store model artifacts and metadata,
· Discover and reuse existing models.

Model and model version registration

Models evolve over time. To keep track of these changes, models are versioned in our Model Repository Service. Each model may have multiple model versions, and each model version has a complete copy of all the artifacts needed to deploy the model. Models and model versions are immutable by design; only their metadata can be updated after registration. These two characteristics — a complete copy of all artifacts and model immutability — ensure that model deployments using models registered via MRS are reproducible.

Model versions can either be exploratory or automated versions. Exploratory model versions are intended for model development and allow quick iteration and exploration. Automated model versions are created via an automated CI/CD pipeline and can be deployed all the way to production.
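The registration rules above can be sketched as a minimal in-memory registry. All names here are hypothetical illustrations; the real MRS is a remote service with its own API, not a local library.

```python
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    stage: str        # "exploratory" or "automated" (illustrative labels)
    artifacts: tuple  # frozen snapshot of artifact URIs, for reproducibility
    metadata: dict = field(default_factory=dict)  # only metadata stays mutable

class ModelRepository:
    """Toy stand-in for the Model Repository Service."""

    def __init__(self):
        self._models = {}  # model name -> list of registered versions

    def register_version(self, name, artifacts, stage="exploratory", **metadata):
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(
            version=len(versions) + 1,
            stage=stage,
            artifacts=tuple(artifacts),  # immutable by design
            metadata=dict(metadata),
        )
        versions.append(mv)
        return mv

    def get_version(self, name, version):
        return self._models[name][version - 1]

repo = ModelRepository()
v1 = repo.register_version("price-estimator", ["s3://models/price/v1/model.bin"])
v2 = repo.register_version(
    "price-estimator",
    ["s3://models/price/v2/model.bin"],
    stage="automated",
    owner="ml-team",
)
```

Freezing the artifact list at registration time is what makes a later deployment of version 2 reproducible, even if newer versions are registered afterwards.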

Model artifacts storage

MRS offers an API that can be used to upload model artifacts to an AWS S3 bucket for permanent storage. These artifacts can then be retrieved via the MRS API or downloaded directly from the S3 bucket for model deployment purposes.

Besides storing model artifacts, additional metadata that relates to the model can be stored in MRS.
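A minimal sketch of that storage contract, assuming an S3-style key layout of `<model>/<version>/<filename>`. The class, bucket name, and layout are illustrative assumptions, not the actual MRS schema.

```python
class ArtifactStore:
    """Toy stand-in for the MRS artifact-storage API backed by S3."""

    def __init__(self):
        self._objects = {}  # key -> bytes; stand-in for an S3 bucket

    def upload(self, model: str, version: int, filename: str, data: bytes) -> str:
        key = f"{model}/{version}/{filename}"
        self._objects[key] = data
        # The caller gets back a URI it can later hand to the deployment service.
        return f"s3://model-artifacts/{key}"

    def download(self, model: str, version: int, filename: str) -> bytes:
        return self._objects[f"{model}/{version}/{filename}"]

store = ArtifactStore()
uri = store.upload("price-estimator", 2, "model.bin", b"\x00\x01")
```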

Model discovery and compliance

MRS can be used to discover models and model versions. During model and model version registration, some mandatory metadata must be provided; it is persisted in an underlying database and can later be queried.

To further improve model discovery, accountability, and transparency, as well as to enable audit compliance, a catalog entry with metadata describing the model can be created for every model. This entity is called a model card. Model cards contain many details about the specific model, such as:

· A detailed description of the model’s purpose;
· An example request that other Machine Learning Scientists and Engineers can use to call the model;
· An example response that matches the example request and is the output of the model;
· A link to the GitHub repository that stores the source code of the model;
· Whether the model uses sensitive data.

Model cards are mandated for models deployed in production, ensuring that Expedia Group ML Platform clients always know what models are available for consumption.
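A model card of this shape might look like the following. The field names and the validation helper are hypothetical; the real MRS defines its own schema.

```python
# Illustrative model-card payload covering the fields listed above.
model_card = {
    "model": "price-estimator",
    "description": "Predicts the nightly price for a property listing.",
    "example_request": {"property_id": 123, "date": "2024-06-01"},
    "example_response": {"price_usd": 142.50},
    "repository": "https://github.com/example-org/price-estimator",
    "uses_sensitive_data": False,
}

REQUIRED_FIELDS = {
    "description",
    "example_request",
    "example_response",
    "repository",
    "uses_sensitive_data",
}

def validate_card(card: dict) -> bool:
    """Production deployments require a complete model card."""
    return REQUIRED_FIELDS.issubset(card)
```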

Model Deployment Service

The Model Deployment Service (MDS) provides an API for deploying models as web services for online inferencing purposes, as well as managing and monitoring the existing deployments.

ML Platform users are not required to have a deep understanding of how models are deployed or how the low-level details work. MDS aims to abstract the deployment details away from Machine Learning Scientists and Engineers, allowing them to provide only the minimum configuration required to deploy a model. The service standardizes the deployment mechanism and serving runtimes.

Some features should be enabled for every model deployment. Because MDS has centralized control over these cross-cutting concerns, it can apply improvements globally to all deployments.

Deployment event flow

When an Expedia Group ML Platform user wants to deploy a model, the user first needs to register the model with the Model Repository Service. MDS retrieves the models from MRS, making the registration process a mandatory precondition for deploying models.

Once the model is registered, the platform user provides a configuration that will be used to trigger the deployment. This configuration is validated and processed by MDS before the model is deployed. After the validation passes, the configuration is stored in an underlying database for future discovery of existing deployments, and MDS initiates the model deployment creation process.

Under the hood, MDS utilizes an open-source model inference platform to deploy the different model flavors supported by the Expedia Group ML Platform. In addition, Large Language Models (LLMs) and other ML models are supported as custom model deployments.

MDS also offers an endpoint that can be used to check the status of a model deployment.
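The event flow above can be sketched as follows: registration in MRS is a precondition, the configuration is validated and persisted, and a status endpoint reports progress. Every class, field, and status value here is an illustrative assumption.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"

class DeploymentService:
    """Toy stand-in for the Model Deployment Service."""

    def __init__(self, registered_models):
        self._registry = registered_models  # stand-in for an MRS lookup
        self._deployments = {}              # persisted deployment configs

    def deploy(self, config: dict) -> str:
        model = config.get("model")
        # Registration is a mandatory precondition for deployment.
        if model not in self._registry:
            raise ValueError("model must be registered in MRS first")
        # Validate the configuration before doing anything else.
        if "replicas" not in config:
            raise ValueError("invalid config: 'replicas' is required")
        # Persist the config so existing deployments remain discoverable.
        deployment_id = f"{model}-deployment"
        self._deployments[deployment_id] = {**config, "status": Status.PENDING}
        return deployment_id

    def status(self, deployment_id: str) -> Status:
        return self._deployments[deployment_id]["status"]

mds = DeploymentService(registered_models={"price-estimator"})
dep_id = mds.deploy({"model": "price-estimator", "replicas": 2})
```

Rejecting unregistered models at deploy time is what makes MRS the single source of truth: nothing can reach production that the repository does not know about.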

Model deployment event flow

How do we utilize the core components when it comes to traceability?

As pointed out at the beginning of the article, an important aspect of the core components’ design was the ability to keep track of existing models and model versions in a centralized way. We’ve enabled custom metrics for both MRS and MDS. These metrics are used to display service adoption details for the platform’s developers and users. By aggregating the metrics gathered from both MRS and MDS, we can answer many questions, such as:

· How many models and model versions are registered?
· How many models are deployed in production?
· Which models are deployed in production and when were they deployed?
· How long does it take to get a model to production after it was registered?
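For example, the last question can be answered by joining MRS registration timestamps with MDS deployment timestamps for each model version. The event shapes below are illustrative, not the actual metric schema.

```python
from datetime import datetime

# Hypothetical event streams keyed by (model, version):
# when a version was registered in MRS, and when it reached production via MDS.
registrations = {("price-estimator", 2): datetime(2024, 5, 1, 9, 0)}
deployments = {("price-estimator", 2): datetime(2024, 5, 3, 15, 30)}

def time_to_production(model: str, version: int):
    """Elapsed time between registration and production deployment."""
    key = (model, version)
    return deployments[key] - registrations[key]

delta = time_to_production("price-estimator", 2)
```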

In this article, we’ve shared the motivation behind the creation of the core ML Platform components.

We covered the key capabilities of the Model Repository Service:
· Model and model version registration,
· Model artifacts storage, and
· Model discovery and compliance.

We looked at how models can be deployed with the Model Deployment Service.

Finally, we mentioned how the core components are used when it comes to traceability.
