How a shared ML infrastructure can simplify an entire data science organization

Omer Amar · Yotpo Engineering · Sep 29, 2022

An ML project’s lifecycle starts with extensive research on data, which demands a great deal of attention, skill, and experience from data scientists. Often, these data scientists find themselves writing their own ad-hoc infrastructure to support the research and development of their models, even though they are not necessarily software engineers skilled in building production-grade systems.

In this scenario, moving research results into production becomes a highly non-trivial task that ends in either reimplementing large segments of code or blindly integrating these ad-hoc infrastructures into production. Both scenarios can be avoided by developing shared ML infrastructure that is used in both research and production, but doing so requires hands-on experience with engineering systems and methodologies.

At Yotpo, we see ML engineers not only as complementary to data scientists across the ML lifecycle, but also as key players involved at every stage of an end-to-end model pipeline. As an ML engineer, I see great value in enabling data scientists to focus entirely on research and model development by giving them strong infrastructure designed around their needs.

In this post, I’ll share our process of building Yotpo’s ML infrastructure and its main capabilities, which support our data scientists in research and simplify the transition from research to production.

The ML lifecycle — ML Engineers are involved in every stage

Yotpo’s ML infrastructure

When we designed our ML infrastructure, we aimed to separate the logic from the infrastructure code. We envisioned a package that is easily installed and used, without any need to look under the hood or dive into the source code. We wanted to make it as “plug-and-play” as possible and simply let our researchers research.

We also noticed that the engineering time and resources spent on taking a trained model to production were higher than expected, every single time. It usually meant reimplementing the research code to fit our production systems, which is error-prone and caused plenty of headaches just from trying to understand someone else’s code. Our goal was to simplify this process and shorten the time from successful research to a working ML service in production.

Our first step was to define a set of interfaces, one for each logical component of an ML pipeline: data cleaning, feature extraction, model training, and other transformations. We quickly realized that all of the logical components share the same interface: fit() and transform(). Even basic transformations such as data normalization are usually set based on the training set (fit) and then applied to other datasets (transform). Moreover, we noticed that all of the logical components require the same set of utilities, such as serialization and versioning. That led us to the “everything is a model” approach.

“Everything is a model”

We consider a model an abstraction that follows the “fit -> transform -> predict” pattern of classic machine learning models. We use this abstraction for any process, ML-related or not, that manipulates data to produce results. This pattern repeats throughout many of our ETL pipelines, and fitting them into a single flow that follows the same steps simplifies development and collaboration in our data science team.

Let’s go over a few examples in the ML pipeline, other than obvious ML models, where we apply this concept:

  • Data cleaning/feature extraction
    Say there is a set of rules for cleaning data in the data preparation phase. Here, the “training” phase is the aggregations and statistics we calculate over the raw data, and the “prediction” phase is applying the resulting rules with those learned parameters (see the sketch after this list).
  • Model monitoring
    Here, we take the training and prediction datasets, and “train” the monitors by calculating metrics, such as data drift and model performance, and learn a set of thresholds for each metric. The “prediction” phase is the alerting layer, where we take new prediction data and the trained thresholds, and alert upon unusual behavior.
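
To make the data-cleaning example concrete, here is a minimal sketch of what such a “model” could look like, assuming a pandas-based flow; the class name, parameters, and cleaning rule are hypothetical and only illustrate the fit/transform split described above:

```python
# Hypothetical sketch: data cleaning expressed as a fit/transform "model".
# The class and its parameters are illustrative, not Yotpo's actual code.
import pandas as pd


class QuantileClippingCleaner:
    """Learns per-column clipping bounds on the training set (fit)
    and applies them to any other dataset (transform)."""

    def __init__(self, lower: float = 0.01, upper: float = 0.99):
        self.lower = lower
        self.upper = upper
        self.bounds_ = {}

    def fit(self, df: pd.DataFrame) -> "QuantileClippingCleaner":
        # "Training" = aggregations and statistics over the raw data.
        for col in df.select_dtypes("number").columns:
            self.bounds_[col] = (df[col].quantile(self.lower),
                                 df[col].quantile(self.upper))
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # "Prediction" = applying the learned rules to new data.
        out = df.copy()
        for col, (low, high) in self.bounds_.items():
            out[col] = out[col].clip(low, high)
        return out
```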

To implement the “everything is a model” principle, we defined a single interface for the “model” entity that is used across all of our projects. The model interface includes functionality for the main stages of the ML lifecycle and is applicable to both research and production.

Following is a code snippet of our model interface, with its basic functionality.
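
The version shown here is a simplified sketch: the exact method names and signatures are illustrative assumptions rather than the package’s actual code, but they follow the functionality described in this post (fit, transform/predict, batch prediction at scale, and serialization).

```python
# Simplified, illustrative sketch of the shared model interface
# (signatures are assumptions, not Yotml's actual code).
from abc import ABC, abstractmethod

import pandas as pd


class Model(ABC):
    """Base abstraction every 'model' follows: ML models, data cleaners,
    feature extractors, monitors, and so on."""

    @abstractmethod
    def fit(self, df: pd.DataFrame) -> "Model":
        """Learn parameters (statistics, weights, thresholds) from data."""

    @abstractmethod
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply the learned parameters to a single record or small batch."""

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # For most components, transforming is simply predicting.
        return self.predict(df)

    def batch_transform(self, spark_df):
        """Run predict() at scale on a Spark dataframe (see 'Model serving')."""
        raise NotImplementedError

    def save(self, path: str) -> None:
        """Serialize and version the trained model."""
        raise NotImplementedError

    @classmethod
    def load(cls, path: str) -> "Model":
        """Load a previously saved model version."""
        raise NotImplementedError
```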

Model serving

So far, we’ve defined an interface that helps us apply a logical flow of a model or a process. But what should we do with the results of each stage, and how should we expose our models as a service?

Here, I’ll dive into the prediction stage of our ML pipeline, and show how we deal with its complexity through our interface.

Model predictions are consumed in multiple ways, from a single prediction to streaming predictions, so we had to design our interface to support every possible serving scenario:

  1. Real-time:
    A real-time prediction is an ad-hoc call for a prediction on a single record or a small batch of records that requires fast results, roughly within tens to hundreds of milliseconds. As part of our model interface, we expose a predict() function, commonly behind a REST API or a Python SDK, for low-latency predictions. For example, triggering a recommendation system upon request.
  2. Batch:
    Most data science Python packages work with Pandas dataframes, which live in memory on a single machine and don’t scale. This makes it very difficult for data scientists to run predictions over a large batch of data: for example, a large test set for model evaluation, or a scheduled production process (e.g. via Airflow) that predicts for hundreds of millions of records and doesn’t require real-time results. So, we expose a function, called batch_transform(), that enables predictions at scale.
    Here’s how it’s done (a simplified sketch appears after this list):
    We broadcast the trained model to all executors. Next, we apply PySpark’s Pandas UDF, which hands each executor chunks of the prediction data, partition by partition. Inside the executors, the predict() function written by the data scientists runs locally. Finally, each executor returns its chunk with predictions, and the chunks are combined and returned as a single Spark dataframe.
  3. Streaming:
    In some cases, we need to trigger predictions upon data events. One very common use case is data enrichment, where we want to add learned data points to event-based data; for example, a customer placing a new order or a store adding a new product are events that happen constantly.
    Luckily, we support these scenarios by enabling the deployment of ML streaming applications in both Kafka-to-Kafka and Batch-to-Kafka modes. We use Spark Structured Streaming, which splits the data into micro-batches, together with our model infrastructure (i.e. the batch prediction mode) to enable fast, online predictions while keeping the lag between consumer and producer to a minimum.
    We show this architecture in the following diagram:
Kafka-to-Kafka Streaming Predictions
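
To illustrate the batch mode described in item 2 above, here is a simplified sketch of the broadcast-plus-Pandas-UDF pattern. It assumes an existing SparkSession spark, a Spark dataframe df, a trained model whose predict() accepts a pandas dataframe, and a hypothetical FEATURE_COLUMNS list; none of this is Yotml’s actual code.

```python
# Simplified sketch of scalable batch prediction with a Pandas UDF.
# Assumes: a SparkSession `spark`, a Spark dataframe `df`, a trained `model`
# whose predict() takes a pandas DataFrame, and a (hypothetical) list of
# feature column names FEATURE_COLUMNS.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Ship the trained model once to every executor.
broadcast_model = spark.sparkContext.broadcast(model)


@pandas_udf("double")
def predict_udf(features: pd.DataFrame) -> pd.Series:
    # Runs locally inside each executor, on one chunk of a partition at a time.
    return pd.Series(broadcast_model.value.predict(features))


# Each executor predicts its own chunks; Spark combines the results
# back into a single dataframe with a new "prediction" column.
predictions = df.withColumn(
    "prediction",
    predict_udf(F.struct(*[F.col(c) for c in FEATURE_COLUMNS])),
)
```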

So far, so good: we have a single model interface that supports model serving in various ways. Although this is good for a single model, what happens when one model is simply not enough?

Models by segments

At Yotpo, we work with many eCommerce stores, each with its own distinct characteristics and purchase patterns. In most cases, a single ML model for all stores cannot capture the unique features of each store, which can result in poor model performance. For example, identifying profanity in reviews doesn’t require a specific store’s data; the model is trained on text, whatever its source. Churn prediction models, on the other hand, require each store’s unique data, because stores differ in data patterns and customer behavior.

We wanted to allow our researchers to seamlessly apply a single model’s logic, in both the training and prediction phases, separately to multiple stores, using it the same way they use any other model: through single fit() and predict() calls. For that, we created a “Model of models” object that initializes a separate model for each store and encapsulates all of them in one model entity that follows our model interface. It receives a single ML model object and a Spark dataframe containing data from all stores, and enables training and prediction per store in a way that is invisible to the data scientists.
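
A heavily simplified sketch of such a wrapper appears below. The class name, the segment column, and the use of pandas (rather than Spark) dataframes are assumptions made for brevity; the real object works on Spark dataframes as described above.

```python
# Simplified, hypothetical sketch of a per-store "model of models" wrapper.
import copy

import pandas as pd


class ModelOfModels:
    """Trains one copy of a base model per store while exposing the same
    fit()/predict() interface as any single model."""

    def __init__(self, base_model, segment_col: str = "store_id"):
        self.base_model = base_model
        self.segment_col = segment_col
        self.models_ = {}

    def fit(self, df: pd.DataFrame) -> "ModelOfModels":
        # Train an independent copy of the base model for every store.
        for store_id, store_df in df.groupby(self.segment_col):
            self.models_[store_id] = copy.deepcopy(self.base_model).fit(store_df)
        return self

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # Route each store's rows to its own trained model.
        # (For brevity, this sketch assumes every store was seen during fit.)
        parts = [
            self.models_[store_id].predict(store_df)
            for store_id, store_df in df.groupby(self.segment_col)
        ]
        return pd.concat(parts)
```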

Today, this object is used in most of our projects and supports handling tens to hundreds of thousands of models per project.

The diagram below shows the processes of training and prediction in this type of model. The part in the middle, in both phases, remains untouched by the data scientists.

Wrapping up

To complete the picture, we created a managed Python package, called Yotml, which is installed in every ML project in our domain. We use the Poetry dependency manager to lock package versions, making sure that all developers and machines work with the same dependency versions.

We also expose an API for MLFlow, for experiment management and model versioning, and an API for a Feature Store database, for real-time predictions and offline training. With Yotml, our data scientists can integrate with these tools easily and intuitively; there is no need to drown in confusing documentation or spend hours figuring out how to work with them.
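
As an illustration of what this kind of integration can look like, here is a hypothetical, minimal experiment-tracking helper built on MLflow. The wrapper itself is invented for this example; only the underlying mlflow calls are real, and the actual Yotml API may look quite different.

```python
# Hypothetical sketch of a thin experiment-tracking helper around MLflow.
# The wrapper API is invented for this example; the mlflow calls are real.
from contextlib import contextmanager

import mlflow


@contextmanager
def experiment_run(experiment_name: str, params: dict):
    """Open an MLflow run under a named experiment and log its parameters."""
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        yield run


# Usage: data scientists track a training run without learning MLflow's API.
with experiment_run("churn-prediction", params={"n_estimators": 200}):
    # model.fit(train_df)           # train as usual
    mlflow.log_metric("auc", 0.87)  # placeholder value for illustration
```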

Yotml has helped our data science team see the bigger picture. We are constantly asking whether a new feature or use case can be generalized and turned into infrastructure. This mindset has pushed us to share knowledge and experience across projects by continuously looking for common ground.

Impact

As you can see, there is a huge gain in having a shared ML infrastructure. Using Yotml, our data scientists don’t need to worry about infrastructure code, scaling training and prediction, or model version control.

Our shared ML infrastructure has helped us:

  • Speed up research-to-production processes, from months to one week.
  • Improve the communication and collaboration between our data scientists and ML engineers.
  • Support multiple ML projects at the same time, in both research and production.

Moreover, it’s now very simple to take experimental code and make it production-ready, since both modes share common interfaces and code blocks.
