EXPEDIA GROUP TECHNOLOGY — PLATFORM

Unified Machine Learning Platform at Expedia Group

Unifying the machine learning journey across Expedia Group

Hisham Mohamed
Expedia Group Technology

--

Photo by Ryan Christodoulou on Unsplash

AI and Machine Learning (ML) are core capabilities that power many products and services at Expedia Group™️ (EG). These products allow us to personalize traveler and partner experiences, provide competitive pricing, detect fraud, predict market changes, and much more!

To enable Expedia’s ML community, we are unifying our ML systems into a single, robust ML platform. This allows our ML community to quickly train, deploy and optimize ML models, bringing our intelligence capabilities to market faster.

Why did we start building a unified ML Platform?

In 2020, the pandemic hammered the travel industry, which had a big impact on EG. We used this time to reinvent ourselves across business, operations, and our technology stack, and ML was one of the key areas we decided to rebuild. That year, we took a close look at the ML technology stack at EG and found the following:

  • There were 9 legacy ML systems that were fragmented and isolated;
  • None of the 9 legacy systems supported the end-to-end model development life cycle: some focused only on live inferencing, while others focused on batch inferencing, etc.;
  • Given the number of legacy systems, there was no standardized way to deploy ML models to production across teams, which prevented us from reproducing models or quickly transferring ownership between teams;
  • There was no clear accountability for how many models existed, what purpose they served, or which versions were in production.

Based on these findings, we started taking incremental steps to shape and structure our ML tooling across EG.

Machine learning playbook

Before building a new tech stack, we first defined a deployment pattern using the technologies we already had in EG, which we called the Machine Learning Playbook. The ML Playbook helped us to:

  • Define and reinforce engineering and ML best practices;
  • Improve code reproducibility;
  • Reduce failure risks;
  • Define the vision, mission and strategy for building our ML Platform.

In the playbook, we covered the following three areas: available tools, end-to-end ML flow, and missing capabilities.

Available tools

We took a closer look at the tools available at EG to define and cover the end-to-end ML journey. We built a simple table, shown below, where under each area of the journey we listed the tools available at EG. This helped us form a holistic overview of all the capabilities in the company and define common patterns from the tools we already had.

Mapping the ML journey to available tools

End-to-end ML flow

As mentioned before, the tools were fragmented and isolated, so even though we had tools that covered all of the above-mentioned steps, we still did not have a common flow; each team followed its own pattern. As an example, we had an internally built Feature Store, but most teams were not using it to store features, only to serve features for live inferencing models. It was therefore essential to define the flow that teams should follow.

Below is an example of how a flow would look in our ML playbook (technology will be explained in future articles).

Common flow to build ML models
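To make this flow a bit more concrete, here is a minimal sketch of the “train → register → deploy” steps. It uses scikit-learn for training and the open-source MLflow model registry purely as a stand-in for central model storage; the model name, local SQLite backend, and tooling choices are illustrative assumptions, not the actual EG stack.

```python
# A minimal sketch of the "train -> register -> deploy" flow.
# scikit-learn stands in for any training framework; MLflow's model
# registry is used only as an example of central model storage.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Local stand-in for a central tracking/registry server (assumption).
mlflow.set_tracking_uri("sqlite:///mlflow_demo.db")

# 1. Pull training data (here: synthetic data standing in for offline features).
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2. Train and evaluate a model.
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# 3. Log the model and its metadata to a central registry so any team
#    can later discover, reproduce, or promote it.
with mlflow.start_run():
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="traveler-propensity-demo",  # hypothetical name
    )

# 4. Deployment and live feature serving would then go through the shared
#    CI/CD pipeline and feature store described later in this article.
```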

Missing capabilities

Going through the tools and the flow allowed us to identify the capabilities we were missing at EG. For example, in early 2020 we found that there was no central storage for all our models; models and their metadata were scattered across different places, and in some situations teams would rebuild the same models instead of reusing existing ones. We also found that there was no single tool to monitor model performance (drift, etc.) in production.

How are we constructing our ML Platform?

In building our ML platform, we follow a three-phased approach:

Phases of ML Platform construction (source: author)

1. Building

In 2022, we made great progress and delivered the first version of our platform that covered the core capabilities for the ML journey. Here is how we did it:

Connecting the dots and providing the missing capabilities: as mentioned before, we found that we were missing some tools that are critical to the ML lifecycle, and that others were redundant. After identifying the internal tools we would keep, we started to build connectivity between them to facilitate the end-to-end journey for users. In parallel, for the tools that did not yet exist at EG, we followed a “Buy vs. Build” strategy, which allowed us to use our resources in the best way. “Buy vs. Build” sounds easy in theory but is hard in practice; we always approach it from three angles:

  • Features/functionality: understanding the needs of the EG ML community and providing the needed functionality through a “unified access pattern”. The unified access pattern is key to ensuring easy maintenance and a consistent experience across the company (a rough sketch of what such a pattern can look like follows this list).
  • Alignment with the EG tech stack: it was imperative that we run our ML within Expedia’s well-established technology stack, using our existing cloud infrastructure, CI/CD tools, monitoring, etc. We wanted to abide by the principle of reusability as much as possible to reduce cost and effort while focusing on the components that are unique to ML. Running ML in a tech stack separate from the rest of the company, without a unified access pattern, would not bring value to EG in the long term.
  • Cost: buy vs. build! There are costs in both cases, whether resource costs or licensing costs, and we had to strike the right balance to bring the right value to EG.
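To make the “unified access pattern” idea more concrete, below is a rough sketch of a thin facade over bought and built components: users code against one interface, and the buy-vs.-build decision is hidden behind a factory. All class and function names here (ModelRegistry, InHouseRegistry, VendorRegistry, get_model_registry) are illustrative assumptions, not the actual EG SDK.

```python
# A minimal sketch of a "unified access pattern": callers code against one
# interface, while the backing implementation may be bought or built.
from abc import ABC, abstractmethod
from typing import Any


class ModelRegistry(ABC):
    """The single interface the ML community codes against."""

    @abstractmethod
    def register(self, name: str, model: Any, metadata: dict) -> str: ...

    @abstractmethod
    def load(self, name: str, version: str) -> Any: ...


class InHouseRegistry(ModelRegistry):
    """Built: a registry backed by internal storage (in-memory here)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], Any] = {}

    def register(self, name: str, model: Any, metadata: dict) -> str:
        version = str(len([k for k in self._store if k[0] == name]) + 1)
        self._store[(name, version)] = (model, metadata)
        return version

    def load(self, name: str, version: str) -> Any:
        return self._store[(name, version)][0]


class VendorRegistry(ModelRegistry):
    """Bought: the same interface delegating to a vendor's SDK."""

    def register(self, name: str, model: Any, metadata: dict) -> str:
        raise NotImplementedError("would call the vendor SDK here")

    def load(self, name: str, version: str) -> Any:
        raise NotImplementedError("would call the vendor SDK here")


def get_model_registry(backend: str = "in_house") -> ModelRegistry:
    """Factory that hides the buy-vs.-build decision from users."""
    return InHouseRegistry() if backend == "in_house" else VendorRegistry()


if __name__ == "__main__":
    registry = get_model_registry()
    version = registry.register("demo-model", model=object(), metadata={"owner": "team-a"})
    print(f"registered demo-model version {version}")
```

Because every team calls get_model_registry() rather than a specific vendor or in-house API, the backend can be swapped later without touching user code, which is what makes maintenance and a consistent experience feasible.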

Advertising and educating people: we set up training and onboarding sessions so that teams become familiar with the tools we have. This is a challenging task with a learning curve that should stabilize over time. We also built an ML community at the company where people can share their projects, the technology they used, and the lessons they learned.

Decommissioning: we set decommissioning timelines for the legacy systems we no longer need and asked teams to use the new pattern/tools instead. So far we have decommissioned six of the nine legacy systems, with a target of retiring the last three by Q3 2023.

2. Optimizing

After delivering the first version, we started thinking about how to extend the existing platform with more capabilities that will help the EG ML community deliver value to the market faster. Some of the features we are looking at include low/no-code AI solutions, low-latency embedded models, an AI workbench, and much more.

3. Enhancing

After finishing the optimization phase, we will look at how to externalize our AI-powered services and insights, as well as our platform capabilities, to travelers and partners outside of EG. In doing so, we will follow the principles set for our open API technology platform.

Expedia Group ML Platform

The diagram below shows, at a high level, the main components we are working on for our ML Platform. These components are either built internally, open source, or purchased. There are four concepts to highlight from this diagram:

High-level view on the EG ML Platform

Exploration/development

  • As shown in the diagram, there is a clear separation between the development environment, which happens in notebooks, and test/production deployment. Test/production deployment goes through a common code template and CI/CD pipeline, ensuring code reproducibility and easy maintenance.
  • From the notebook environment, users can read production data/features/models so that they can build real models in the exploratory/development environment. However, writing to production is not possible from the exploratory environment, which ensures high-quality data/features/models in production (a simplified sketch of such a read-only guard follows this list).
  • An AI workbench is also available to bring together all the ML services under a single unified UI, enabling the EG ML community to explore the ML models deployed in test and production.
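As a deliberately simplified illustration of the “read from production, never write to it” rule, the sketch below shows a client wrapper that always allows reads but rejects writes unless the code runs in the test/production deployment path. The environment variable and class names (DEPLOYMENT_ENV, FeatureStoreClient) are hypothetical.

```python
# A simplified sketch of enforcing "read prod, never write prod" from the
# exploratory/notebook environment. Names are illustrative only.
import os


class ReadOnlyEnvironmentError(RuntimeError):
    """Raised when a write to production is attempted from a notebook."""


class FeatureStoreClient:
    def __init__(self) -> None:
        # Notebooks would not set this; the shared CI/CD pipeline would.
        self._env = os.getenv("DEPLOYMENT_ENV", "exploration")

    def read_features(self, feature_group: str) -> list[dict]:
        # Reads are always allowed so that real models can be built
        # against real production data during exploration.
        print(f"reading features from '{feature_group}'")
        return []

    def write_features(self, feature_group: str, rows: list[dict]) -> None:
        if self._env not in ("test", "production"):
            raise ReadOnlyEnvironmentError(
                "writes to production feature groups are only allowed from "
                "the test/production deployment pipeline"
            )
        print(f"writing {len(rows)} rows to '{feature_group}'")


if __name__ == "__main__":
    client = FeatureStoreClient()
    client.read_features("hotel_price_features")  # fine from a notebook
    try:
        client.write_features("hotel_price_features", [{"id": 1}])
    except ReadOnlyEnvironmentError as err:
        print(f"blocked: {err}")
```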

Test/prod deployment

  • The deployment is defined through a common CI/CD pipeline so that teams do not need to build their own.

End-to-end integration

  • We provide a set of SDKs and tools that allow the ML community to link all the available services, streamline the development life cycle and avoid duplicated effort.
  • The SDKs also facilitate integration with services that we do not own but that are available across Expedia, such as the A/B testing framework, system monitoring dashboards, etc. (a hypothetical sketch of this kind of glue follows this list).
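As a purely hypothetical sketch of the kind of glue such SDKs provide, the snippet below chains a model deployment with the A/B testing framework and the monitoring dashboards behind simple function calls. None of these functions exist as-is; they are stand-ins for the internal services the SDKs would integrate.

```python
# A hypothetical sketch of SDK "glue": one short script that deploys a
# registered model, attaches it to an A/B test, and wires up monitoring.
from dataclasses import dataclass


@dataclass
class Deployment:
    model_name: str
    model_version: str
    endpoint: str


def deploy_model(model_name: str, model_version: str) -> Deployment:
    # Would call the shared CI/CD-backed deployment service.
    return Deployment(model_name, model_version,
                      endpoint=f"https://ml.example.internal/{model_name}")


def attach_ab_test(deployment: Deployment, experiment_id: str) -> None:
    # Would register the endpoint with the company-wide A/B testing framework.
    print(f"routing experiment {experiment_id} traffic to {deployment.endpoint}")


def register_dashboards(deployment: Deployment) -> None:
    # Would hook the endpoint into the shared system-monitoring dashboards.
    print(f"monitoring enabled for {deployment.model_name} v{deployment.model_version}")


if __name__ == "__main__":
    deployment = deploy_model("traveler-propensity-demo", "1")
    attach_ab_test(deployment, experiment_id="EXP-1234")  # hypothetical id
    register_dashboards(deployment)
```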

Monitoring & impact

  • We built an integration between the ML monitoring tool and the deployment pipelines, ensuring that the models in production stay fresh and performant (a simplified sketch of such a drift gate follows this list).
  • We are also working on a way to measure the impact of models in production and how much business value they bring.
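To make the monitoring-to-deployment integration more tangible, here is a minimal sketch of a drift gate the deployment pipeline could call before promoting or retraining a model. The drift metric (Population Stability Index), the threshold, and the function names are illustrative assumptions rather than the actual EG monitoring tool.

```python
# A minimal sketch of a drift gate between monitoring and deployment.
# The Population Stability Index (PSI) and the 0.2 threshold are
# illustrative assumptions.
import numpy as np


def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare the live score distribution against the training-time one."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log(0) on empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


def should_retrain(training_scores: np.ndarray, live_scores: np.ndarray, threshold: float = 0.2) -> bool:
    """The deployment pipeline could call this before promoting a model."""
    return population_stability_index(training_scores, live_scores) > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(7)
    training_scores = rng.normal(0.5, 0.1, 10_000)
    live_scores = rng.normal(0.65, 0.1, 10_000)  # simulated shift in production
    print("retrain?", should_retrain(training_scores, live_scores))
```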

In upcoming articles, we will dive deeper into this diagram and show the technologies and tools we use and why we chose them.

Summary

AI and ML are key capabilities at EG that help us deliver the best experience to our travelers and partners. Unifying our ML tech stack and tooling allows our ML community to deliver value to the market and quickly bring new capabilities to our travelers and partners. To achieve that, we took the following steps:

  1. Understood the status of the tools we have
  2. Connected the tools we need and simplified the end-to-end journey
  3. Chose between buy and build to ensure we are utilizing our resources in the best way
  4. Educated people about the vision, strategy and mission, and got them onboarded
  5. Decommissioned the legacy systems

In the next set of articles, we will cover the technologies and tools we are offering in our ML platform.

Learn more about technology at Expedia Group
