How Agoda Streamlines ML Development with Low-Code Model Pipelines

Agoda Engineering & Design · 8 min read · Sep 4, 2024

by Chen-Ting Huang

Introduction

At Agoda, data scientists and machine learning (ML) engineers work together to build ML pipelines that predict various metrics such as cancellation probability (CXL), margin, and customer value for business analysis. These batch-based pipelines often process large volumes of historical data and produce well-defined model features to ensure data explainability.

While scientists focus on analyzing data to improve model accuracy and on collaborating with business stakeholders to discover business opportunities, engineers concentrate on implementing and maintaining ML pipelines and on optimizing their resource consumption and execution time.

The pace of ML model research and release is crucial for a fast-moving company. Requirements change frequently for research and experimental purposes, so the ability to modify pipelines for experiments and accuracy improvements is essential. Having the right tools to support research work and the productionization of model pipelines can significantly accelerate model release iterations and improve the overall velocity of model enhancements.

The approach we cover in this article aims to speed up model release iterations by minimizing the effort required to develop ML and data pipelines, thus better supporting research activities and model pipeline launches.

Overview of Technology: Automating ML Workflows

ML pipelines play an important role in automating model training and prediction workflows. To make these pipelines scalable, efficient, and fail-safe while minimizing resource costs, data scientists usually collaborate with ML engineers to build them during each model release iteration.

A model release iteration moves through the following stages:

  • Research: Scientists change and reuse existing sub-pipelines to explore ideas for improving the model.
  • Development: Engineers build the production-ready pipeline, adapting it to the scientists’ requirements while meeting the SLA and performance targets for execution time and resource usage.
  • Verification: Engineers deploy the new pipeline to the pre-live environment and confirm that the model’s behavior matches the expectations of scientists and stakeholders.
  • Release: Engineers launch the pipeline to the production environment and enable monitoring and fail-safe mechanisms.

Both scientists and engineers participate in the development cycle; therefore, efficient collaboration between all code contributors is critical to speeding up the release iteration.

Since training models on larger datasets usually improves accuracy, using Spark pipelines to process big data for model training and prediction is common practice at Agoda. However, these Spark pipelines often require long execution times, which can hinder model development and the data preparation needed for research. Consequently, we began exploring approaches to reduce the development effort associated with Spark pipelines.

Common Challenges in ML Pipeline Development

In the development cycle, development speed decreases over time as the pipeline codebase expands to support the various data processing tasks required to meet business needs:

  • Tech debt accumulates when teams move fast to create pipelines for many ML experiments, forcing code contributors to spend more effort keeping the codebase scalable.
  • Onboarding time for new joiners grows as the codebase does. For example, newly joined data scientists might want to understand the data processing used to generate models before starting their research; they can spend significant time reading the codebase and asking the team about the details.
  • Debugging time increases as more transformations are added to a Spark pipeline. For example, investigating the root cause of a data issue in a pipeline usually requires implementing an ad hoc integration test to inspect the data change, or modifying part of the code and resubmitting the pipeline to reproduce the data. This takes even longer for pipelines that process large-scale data with complicated business logic.

Solutions and Best Practices

To address the slowdown in development speed caused by technical debt and increased onboarding times, we began creating a mechanism to decouple pipeline definitions from the existing codebase through the following steps:

  1. Modularizing reusable ETL components.
  2. Using configuration as pipeline definition to compose ETL components.
  3. Converting the pipeline configuration to a runnable Spark pipeline.

This decoupling mechanism makes it possible to build and adjust pipelines with fewer code changes: developers can adapt a pipeline to the dynamic requirements of ML experiments by modifying its configuration instead of its code.

Furthermore, it reduces the effort of maintaining pipeline-definition code, so developers can focus on building scalable, reusable components and keeping the codebase small.

To support this mechanism, we built a Spark library, the Pipeline Converter, that converts a configuration (i.e., a pipeline definition) into a Spark pipeline.

ETL Modules: The ETL modules follow the Reader, Transformer, and Writer interfaces to modularize reusable components out of the existing codebase. To integrate with Spark MLlib functionality, we reuse the “org.apache.spark.ml.Transformer” interface for Transformer modules.
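Agoda’s actual module interfaces aren’t shown in the article; as a rough illustration, a minimal PySpark sketch of the three module shapes might look like this (class names and parameters are hypothetical):

```python
# Illustrative Reader / Transformer / Writer module shapes
# (hypothetical names; not Agoda's actual interfaces).
from pyspark.ml import Transformer
from pyspark.sql import DataFrame, SparkSession


class HiveReader:
    """Reader module: loads a source table into a DataFrame."""
    def __init__(self, table: str):
        self.table = table

    def read(self, spark: SparkSession) -> DataFrame:
        return spark.table(self.table)


class DropNullRows(Transformer):
    """Transformer module: built on Spark ML's Transformer interface,
    so it also composes with Spark MLlib pipeline stages."""
    def __init__(self, columns):
        super().__init__()
        self.columns = columns

    def _transform(self, dataset: DataFrame) -> DataFrame:
        return dataset.dropna(subset=self.columns)


class ParquetWriter:
    """Writer module: persists the pipeline output."""
    def __init__(self, path: str):
        self.path = path

    def write(self, df: DataFrame) -> None:
        df.write.mode("overwrite").parquet(self.path)
```

Because the Transformer modules implement the same interface as Spark MLlib stages, existing MLlib transformers can be dropped into a pipeline alongside custom ones.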

Pipeline Definition: A HOCON configuration that sequentially composes module references, along with their associated parameters, to define the pipeline.
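The article does not show the exact configuration schema, but a hypothetical HOCON definition in that spirit, reusing the illustrative modules above, might look like this:

```
pipeline {
  name = "cxl-training-features"
  stages = [
    { module = "HiveReader",    params { table = "dw.bookings" } }
    { module = "DropNullRows",  params { columns = ["booking_id", "checkin_date"] } }
    { module = "ParquetWriter", params { path = "/models/cxl/features" } }
  ]
}
```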

The Pipeline Converter translates a pipeline definition like the one above into a Spark script like the one below and executes it.
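For the hypothetical definition above, the generated script might look roughly like this (a sketch, not the converter’s real output):

```python
# Sketch of the Spark script generated from the hypothetical definition
# above; the module classes are the illustrative ones sketched earlier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cxl-training-features").getOrCreate()

# Stage 1: reader module
df = HiveReader(table="dw.bookings").read(spark)

# Stage 2: transformer module (Spark ML Transformer interface)
df = DropNullRows(columns=["booking_id", "checkin_date"]).transform(df)

# Stage 3: writer module
ParquetWriter(path="/models/cxl/features").write(df)
```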

After adopting this approach, pipeline development coding tasks have decreased, and technical debt growth is better controlled, because creating and modifying pipelines has become easier, reducing both coding effort and maintenance workload. For example, a module can be removed simply by moving its configs out of the pipeline configuration when the corresponding features are no longer required, which also lets developers focus on deprecating or improving low-reusability components to keep tech debt in check.

Moreover, since the configuration provides a clear overview of the pipeline definition, new joiners can easily understand the pipeline flow and identify the module they plan to change, which reduces onboarding time and simplifies knowledge transfer.

To further minimize the onboarding time for new joiners, we aim to address the additional time spent due to differences in preferred programming languages among developers. For instance, new joiners might be more comfortable with PySpark, which supports the Pandas API, rather than Scala. At the same time, they may wish to leverage existing modules from Scala common libraries to reduce development effort.

To address this, we developed the pipeline converter using PySpark, which supports composing and invoking both Python and Scala modules within the same pipeline. By leveraging PySpark’s ability to initialize and invoke Scala modules through the Java Gateway from the Spark session, the converter wraps Scala module invocations into Python functions. This allows for seamless execution of both Python and Scala modules as a unified PySpark pipeline.
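As a rough sketch of this wrapping (the Scala class name and signatures are hypothetical assumptions, not Agoda’s actual converter code):

```python
# Sketch: expose a Scala transformer as a Python function by going
# through the JVM gateway of the active Spark session.
from pyspark.sql import DataFrame, SparkSession


def wrap_scala_transformer(spark: SparkSession, scala_class: str, *args):
    """Instantiate a Scala module by its fully qualified name and
    return a plain Python function over DataFrames."""
    # Walk the JVM package tree to resolve the class, then construct it.
    j_cls = spark.sparkContext._jvm
    for part in scala_class.split("."):
        j_cls = getattr(j_cls, part)
    j_transformer = j_cls(*args)

    def apply(df: DataFrame) -> DataFrame:
        # Hand the underlying Java DataFrame to Scala, then wrap the
        # Java result back into a Python DataFrame so downstream
        # Python modules can consume it transparently.
        return DataFrame(j_transformer.transform(df._jdf), spark)

    return apply


# Usage (hypothetical Scala module name):
# dedupe = wrap_scala_transformer(spark, "com.agoda.etl.DropDuplicates", "booking_id")
# df = dedupe(df)
```

Passing the SparkSession as the second argument to `DataFrame` assumes Spark 3.3+; older versions expect a `SQLContext` instead.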

With this wrapping, data scientists can build modules in their preferred language, Python, while engineers can create modules in Scala to leverage the power of functional programming. Both groups compose these modules within the same pipeline configuration, reducing duplicated coding effort for similar functionality and improving collaboration among code contributors.

To address the challenge of increased debugging time, we started by using a notebook that mirrors data from the production environment and includes the related transformation code. However, this approach requires manually copying the code and dependencies for a specific sub-pipeline (such as a subset of transformations) into the notebook, a step needed to recreate the staging data and initiate the bug-tracking process.

To streamline this process, we have extended the functionality of the Pipeline Converter to support converting the pipeline configuration into a pipeline notebook (Spark Notebook). This decoupling approach allows us to inspect data “transformation by transformation” since the pipeline is composed of a sequence of transformations in the configuration.

The converter maps each module configuration (i.e., readers, transformers, and writers) to a single notebook cell. This enables developers to run the notebook “cell by cell,” reproducing the data of any sub-pipeline for debugging and data issue inspection.
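A minimal sketch of the one-cell-per-module idea, assuming a Jupyter-compatible output via nbformat (the actual converter targets a Spark Notebook, and the stage list reuses the hypothetical definition above):

```python
# Sketch: emit one notebook cell per module from a parsed pipeline
# definition (illustrative; not the actual Pipeline Converter code).
import nbformat
from nbformat.v4 import new_code_cell, new_notebook

# Stage snippets as they might be rendered from the hypothetical HOCON
# definition shown earlier; intermediate results are cached so that
# downstream cells can be re-run cheaply.
stages = [
    ("HiveReader", 'df = HiveReader(table="dw.bookings").read(spark)'),
    ("DropNullRows",
     'df = DropNullRows(columns=["booking_id", "checkin_date"]).transform(df)\n'
     "df = df.cache()  # keep staging data for cheap cell re-runs"),
    ("ParquetWriter", 'ParquetWriter(path="/models/cxl/features").write(df)'),
]

nb = new_notebook()
for module, code in stages:
    # One cell per module, so any sub-pipeline can be reproduced by
    # running the notebook cell by cell up to the module of interest.
    nb.cells.append(new_code_cell(f"# module: {module}\n{code}"))

nbformat.write(nb, "cxl_training_features.ipynb")
```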

This pipeline notebook serves as a data playground and significantly improves the efficiency of data research, because data scientists can analyze data by easily reusing any sub-pipeline. For example, scientists can change and run specific cells for ML experiments without re-submitting the entire pipeline, which saves time on duplicated Spark initialization and on reproducing staging data (by using the cache).

Moreover, developers can use the notebook as a debugging tool to easily reproduce buggy data and identify the root cause of data issues, which shortens production-issue investigation and verification during development. After launching the auto-generated notebook feature for Merge Requests (MRs) in our Spark pipeline project, development speed increased significantly: the number of pipeline triggers per MR decreased by 64% compared to the quarterly average, saving 15.7 hours of execution time for related ML pipelines per MR.

Future Directions

There are still challenges to address, such as the lack of syntax validation for pipeline definitions and the duplicated configurations between pipelines. The advantage of building in-house tools is that we can efficiently improve them by incorporating feedback from our daily use cases. More importantly, we can easily extend their functionalities by integrating other tools (e.g., monitoring and alert systems) from our current data ecosystem.

Conclusion

Accelerating data/ML research iterations is crucial for enabling companies to make fast, data-driven decisions. In this article, we proposed an approach that streamlines ML pipeline development by reducing the time required for developing and debugging Spark pipelines and for onboarding new team members onto them. The Pipeline Converter lets us build “low-code model pipelines” that reduce development effort by modularizing reusable pipeline components and using configuration as a proxy for composing the pipeline.
