To production and beyond: How Metaflow helps us turn proof-of-concepts into products

Tech@ProSiebenSat.1
ProSiebenSat.1 Tech Blog
7 min read · Jun 7, 2023

by Samuel Patterson, Niclas Hönig, Atef Attia, Manuel Heller, Zuzanna Czechowska and Chris Lykourinou

As a team of machine learning (ML) engineers and data scientists working in a media company, we build and maintain a variety of innovative solutions, ranging from content generation, through recommendation systems, to program optimisation.

Frequently, we are tasked with testing out new ideas for use cases across the company and are responsible for turning our successful proofs-of-concepts into value-adding products for the business.

One particular product we develop analyses and optimises TV ad campaigns. It enables our customers to make data-driven decisions about their advertising strategies and optimise their TV campaigns for maximum impact.

Recently, we needed to evaluate and tune a new ML model within this product, then deploy it to production with assured quality and without a huge development or operations effort.

This is where Metaflow comes in handy for us. Metaflow is a Python library that simplifies the development, deployment, and scaling of ML workflows.

In this article, we will show you how we use Metaflow to evaluate and productionise a new ML model for batch prediction of website traffic in our campaign optimisation product. We will also cover some of the key challenges we faced and how we overcame them using Metaflow.

How we are using Metaflow

Our work on the new model was conducted in two main phases, a model evaluation and selection phase, and a model productionisation phase.

These phases can be seen in the high-level MLOps process we follow below. Here you can also see where we are using Metaflow for our three main pipelines: model evaluation, model training, and inference.

Figure 1: Our work on the new model comprised two main phases: model evaluation and selection, and model productionisation.

The model evaluation pipeline is where we tackled the main complexity of developing our new model. Addressing this complexity early in the project within a framework like Metaflow meant that the later work of developing the training and inference pipelines for production became quite straightforward, but more on that later.

The complexity in model evaluation comes from the requirement to compare many different configurations that the model can have, from feature selection, imputation, and algorithm choice, to hyperparameters, as well as cross-validation for different training and test splits and across different sets of independent customer data.

The aim of comparing all of these is to find the optimal configuration for the model to use in production. It also helps us understand trade-offs in these choices, like the minimum amount of training data needed for a new customer, and the diminishing improvement of model quality as we use more training data.

To do this, we used Metaflow’s foreach functionality to run and evaluate multiple experiments in parallel, which significantly sped up our development cycle time. The diagram below shows what the resulting pipeline looks like.

Figure 2: Diagram of the evaluation pipeline

In the evaluation configuration input to the pipeline, we can specify multiple settings for each option. For example, we could specify a training duration of one week, two weeks, four weeks, or six weeks.
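To give a concrete idea of the shape of this input, a minimal configuration could look something like the following sketch; the keys and values here are purely illustrative, not our actual schema:

```python
# Hypothetical evaluation configuration; keys and values are illustrative only.
evaluation_config = {
    "models": ["gradient_boosting", "linear"],  # candidate algorithms to compare
    "training_weeks": [1, 2, 4, 6],             # training durations to compare
    "n_time_splits": 10,                        # train-test time splits for cross-validation
}
```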

Then, in the cross-product step of the Metaflow pipeline, we combine all the settings for the model training options with each other to produce a collection of training experiments to execute. For example, if we wanted to test two different models, with four different training durations, across ten train-test time splits, then the cross-product step would result in 2 x 4 x 10 = 80 different experiments to run. Using Metaflow for this meant that the potential issues of scaling out experiments in this way were easy to resolve.
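As a rough sketch of how such a fan-out can be expressed with Metaflow’s foreach (the flow, step, and option names below are simplified placeholders rather than our actual pipeline):

```python
from itertools import product

from metaflow import FlowSpec, step


class EvaluationFlow(FlowSpec):
    """Simplified sketch of a cross-product evaluation flow."""

    @step
    def start(self):
        # In our pipeline these options come from the evaluation configuration;
        # they are hard-coded here to keep the sketch self-contained.
        models = ["model_a", "model_b"]
        training_weeks = [1, 2, 4, 6]
        splits = list(range(10))
        # 2 x 4 x 10 = 80 experiments, each run as its own Metaflow task.
        self.experiments = list(product(models, training_weeks, splits))
        self.next(self.run_experiment, foreach="experiments")

    @step
    def run_experiment(self):
        model, weeks, split = self.input
        # Train and evaluate one configuration here; we only record a stub result.
        self.result = {"model": model, "training_weeks": weeks, "split": split}
        self.next(self.join_results)

    @step
    def join_results(self, inputs):
        # Collect the per-experiment results for the scoring and aggregation step.
        self.results = [inp.result for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    EvaluationFlow()
```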

Finally, since not all experiments are directly comparable with one another in a meaningful way, we implemented a dedicated scoring and aggregation step that joins all of the experiments and calculates the metrics we are interested in. This step is also responsible for publishing the aggregated results in a format we can use for a manual review, from which we select the best options to use in production.
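The aggregation itself can be as simple as grouping the per-experiment metrics and averaging them over the cross-validation splits. A standalone sketch of that idea, with illustrative column and metric names rather than our actual schema:

```python
import pandas as pd


def aggregate_results(experiment_results: list) -> pd.DataFrame:
    """Combine per-experiment metric dicts and average over the time splits."""
    df = pd.DataFrame(experiment_results)
    return (
        df.groupby(["model", "training_weeks"], as_index=False)
          .agg(mean_mape=("mape", "mean"), std_mape=("mape", "std"))
          .sort_values("mean_mape")
    )
```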

Challenges we overcame using Metaflow

Moving from a proof-of-concept to a model in production

Productionising a proof-of-concept can be a difficult transition for any product. With ML products in particular, there are many things to consider, such as the scalability of the pipelines, the reliability of the model, as well as the infrastructure required to run it.

Metaflow makes this process easier by providing a flexible framework for building and deploying ML pipelines that can be tested locally, adapted easily, and re-deployed without manual effort and with minimal differences between environments.

By implementing our model evaluation logic in a Python package and using that across the evaluation pipeline, we were able to re-use the core functionality when developing the training and inference pipelines, reducing the implementation effort and avoiding the mistakes that can come from, say, copying code from a Jupyter Notebook into a production codebase.
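As an illustration of that re-use (the module and function names here are hypothetical), the flows themselves stay thin and simply import the shared logic, so the definition of how a model is trained and scored lives in exactly one place:

```python
# campaign_model/evaluation.py -- hypothetical shared package module
def fit_and_score(pipeline, X_train, y_train, X_test, y_test) -> dict:
    """Fit a configured scikit-learn pipeline and return its test metrics.

    The evaluation, training, and inference flows all call into functions
    like this one instead of re-implementing the training logic.
    """
    pipeline.fit(X_train, y_train)
    return {"r2_test": pipeline.score(X_test, y_test)}
```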

Developing locally and on AWS without overhead

Working with model and pipeline logic both locally and on AWS can be time-consuming and resource-intensive. The convenience of working locally is useful to begin with, but scaling up to run multiple experiments at the same time is difficult without the ability to hand execution over to a cloud environment.

Metaflow simplifies this process by providing a standard interface for running workflows on both local machines and AWS infrastructure. This allows us to develop new logic locally, test that it works both locally and on AWS, and then deploy it as a production Step Function in AWS.
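In practice this looks roughly like the standard Metaflow commands below (the flow file name is a placeholder, and the exact options depend on your Metaflow configuration and version):

```
# Run the flow locally during development
python evaluation_flow.py run

# Run the same flow with the steps executed on AWS Batch
python evaluation_flow.py run --with batch

# Deploy the flow as an AWS Step Functions state machine
python evaluation_flow.py step-functions create
```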

Standardization for repeatable experiments

To ensure that our evaluation experiments are repeatable and that the resulting model configuration will perform the same in the production pipelines, we standardise the way a model is trained and tested.

As mentioned above, we made use of Metaflow’s foreach functionality to run the same experiment logic on multiple model configurations and train-test datasets. Additionally, we leveraged the scikit-learn Pipeline framework, which allows us to easily plug in alternative models with different features, data normalisation, and imputation steps.
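A minimal sketch of that plug-and-play pattern follows; the estimators, preprocessing steps, and option names shown are illustrative, not our production configuration:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(config: dict) -> Pipeline:
    """Assemble the same pipeline skeleton for any experiment configuration."""
    estimators = {"ridge": Ridge(), "gbr": GradientBoostingRegressor()}
    return Pipeline([
        ("impute", SimpleImputer(strategy=config.get("imputation", "median"))),
        ("scale", StandardScaler()),
        ("model", estimators[config["model"]]),
    ])


# Swapping the model or imputation strategy is then just a configuration change:
# build_pipeline({"model": "gbr", "imputation": "mean"})
```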

Once we were happy with the model evaluation, having standardised logic for training and testing many experiments meant that the same logic could be transferred easily into the production training and inference pipelines.

Regular evaluation of models on new data

To ensure that our ML models remain accurate over time, we need to regularly evaluate them on new data.

Metaflow makes this process easy by providing a way to schedule workflows to run at regular intervals. Instead of keeping our evaluation in a Jupyter Notebook, a separate codebase, or a different infrastructure environment, our evaluation pipeline runs on the exact same unit-tested codebase as our production solution, and, via Metaflow, it runs in an equivalent deployment environment as well.
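In Metaflow this amounts to a single decorator on the flow; the weekly cadence below is illustrative rather than our actual schedule, and it only takes effect once the flow is deployed to a production scheduler such as AWS Step Functions:

```python
from metaflow import FlowSpec, schedule, step


@schedule(weekly=True)  # illustrative cadence; applied when deployed to Step Functions
class ScheduledEvaluationFlow(FlowSpec):

    @step
    def start(self):
        # The regular evaluation steps would run here.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ScheduledEvaluationFlow()
```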

This allows us to evaluate our models on new data regularly, ensuring that they are still accurate and identifying potential opportunities to improve our configuration, while being confident that any required change to the model will be just as easy to transfer to production as the first time.

Next steps

Monitoring training and inference

Now that the model is running stably in production, we are setting up monitoring to track the quality of the training and inference results and ensure they remain as expected. As with the other steps in moving to production, because the scoring code from the evaluation is packaged, this is proving to be a straightforward matter of re-using the logic, with the assurance that the quality metrics are equivalent to those used in the original evaluation.

Evaluating new options

Alongside regularly re-running the evaluation pipeline on new data to ensure our model configuration choices are still correct, we also plan to use the evaluation pipeline to explore new ideas for ML algorithms or features and to easily compare them against our current model as a baseline. If we can show that a new model set-up outperforms the current production model, then all we should need to do is switch the production model configuration over to the improved set-up.

Conclusion

Metaflow is a powerful tool for developing, running, and scaling ML workflows.

By using Metaflow in conjunction with a well-organised codebase for housing the underlying logic, we were able to overcome some common challenges with developing a new model in a short timeframe while still considering many different options for how to train it.

This solution then helped us move our accepted model configuration into our production training and inference pipelines with the assurance of model quality and without major effort or surprises.

Along the way, using Metaflow also helped speed up development by allowing us to work locally and on AWS without overhead, and by standardising our experiments for repeatable, comparable results.

Finally, the evaluation pipeline we used in the original model development can be used to regularly evaluate our models on new data, with a clear pathway to productionising new model configurations if we find an improvement.

Based on our experience of how it helped us get from idea to production, we certainly recommend Metaflow to other teams and are excited to continue benefitting from it in our product.

We hope you enjoyed this insight into the world of machine learning at ProSiebenSat.1 and gained some new knowledge. 💡

If you are interested in more data topics, check out our other blog articles on “Applying Machine Learning Models in a Media Company” or “Two weeks in the life of a Data Scientist”. 🙌
