A Guide to MLOps with Airflow and MLflow

Yesmine Rouis
TheFork Engineering Blog
Nov 6, 2023

Introduction

As more and more companies turn to Machine Learning models to solve their business problems, the need to implement and operate the overall ML workflow becomes crucial.

I. Literature review about MLOps practice

The widespread adoption of Machine Learning worldwide has created a need for a systematic approach to building efficient ML systems.

What is MLOps?

MLOps stands for Machine Learning Operations. It builds on the core fundamentals of DevOps, which were established to efficiently write, deploy and run enterprise applications.

MLOps presentation

By adopting a scalable approach, MLOps enhances collaboration between the operational, engineering and data science teams. This results in a quick and responsible path from proof of concept to production Machine Learning systems, while enabling reproducibility and, above all, continuous improvement of the company’s data assets.

MLOps hierarchy of needs

You can think of Machine Learning systems in terms of Maslow’s hierarchy of needs. Since an ML system is also a software system, it requires DevOps and data engineering best practices in order to work in an efficient and reliable way. Delivering the true potential of Machine Learning to an organization requires that the fundamentals of DevOps, data automation and platform automation be put in place first. Let’s dive into each step of the ML hierarchy and make sure we understand what implementing it involves.

  • DevOps: the methodology of automating and continuously integrating the processes between software and IT teams.
  • Data automation: leveraging algorithms, scripts and tools to automate extract, transform and load (ETL) data workflows.
  • Platform automation: once the data flow is automated, we can use high-level platforms to build Machine Learning solutions in a repeatable and scalable way.
ML engineering hierarchy of needs

MLOps key pillars

There is a multitude of MLOps tools that make it possible to efficiently track ML experiments, orchestrate workflows and pipelines, version data, and ensure structured model deployment, serving and monitoring in production.

Successful ML deployments generally take advantage of a few key MLOps principles, which are built on the following pillars:

MLOps key pillars

In this article, our focus will be on experiment tracking and workflow orchestration tools.

We will address the model deployment tools as well as best practices of model monitoring in production in upcoming articles.

What is an orchestrator?

Like most pipeline-based systems, MLOps systems need an orchestrator. According to ZenML:

The orchestrator is an essential component in any MLOps stack as it is responsible for running your machine learning pipelines. To do so, the orchestrator provides an environment which is set up to execute the steps of your pipeline. It also makes sure that the steps of your pipeline only get executed once all their inputs (which are outputs of previous steps of your pipeline) are available.

With that in mind, our focus will be on Airflow, a common tool for orchestrating data workflows.

Airflow

Airflow is a platform for building and running workflows. A workflow is represented as a DAG (Directed Acyclic Graph) and contains individual pieces of work called Tasks, arranged with their dependencies and data flows taken into account to determine how they should be executed.

Since every Data Science project needs its own configuration, essentially metadata and model versions, we can wrap everything inside a single DAG, then build and schedule it in the enterprise ETL.

Here is a basic example of a DAG:

Example of a DAG in Airflow

It defines multiple tasks and dictates the order in which they have to run and which tasks depend on which others. The DAG is only concerned with how to execute the tasks: the order to run them in, how frequently, how many times to retry them, whether they have timeouts, and so on.
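
As a minimal sketch (assuming a recent Airflow 2.x and placeholder task names that are not tied to any real project), such a DAG could be declared as follows:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A toy daily DAG with three placeholder tasks and explicit dependencies.
with DAG(
    dag_id="example_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    # Run extract first, then transform, then load.
    extract >> transform >> load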

You will find further explanations of how DAGs are orchestrated in the Airflow documentation.

MLflow

MLflow is an open source platform that manages the end-to-end Machine Learning lifecycle, including experimentation, reproducibility, logging, hyper-parameter tracking and model deployment in production.

By including MLflow in the ML project lifecycle, we ensure a transparent and standardized approach to training, tuning and deploying ML models.

It is composed of four main components: MLflow Tracking, MLflow Projects, MLflow Models and the MLflow Model Registry.

MLflow main components

You can refer to MLflow documentation for more details.

Kedro

Kedro is a development workflow framework that applies software engineering best practices to Data Science code in order to make it modular, reproducible and maintainable.

Kedro is a pipeline-based system: each pipeline is made up of data sources and nodes with inputs and outputs.
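
As a minimal sketch (with hypothetical function and dataset names, not TheFork’s actual code), a pipeline wires plain Python functions together through named inputs and outputs:

from kedro.pipeline import Pipeline, node


def preprocess(raw_bookings):
    # Hypothetical cleaning step: drop rows with missing values.
    return raw_bookings.dropna()


def train_model(clean_bookings, model_options):
    # Placeholder for the actual training logic (e.g. fitting a regressor).
    model = "fitted-model-placeholder"
    return model


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(preprocess, inputs="raw_bookings", outputs="clean_bookings"),
            node(
                train_model,
                inputs=["clean_bookings", "params:model_options"],
                outputs="trained_model",
            ),
        ]
    )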

Check below an example of a Kedro pipeline from one of TheFork’s Data Science projects:

Kedro pipeline

Kedro also has very thorough and well-explained documentation; be sure to check it out.

II. MLOps orchestration system at TheFork

The lifecycle of a Data Science project at TheFork can be summarized in the following figure.

A 360° view of TheFork Data Science stack

It relies on multiple steps, from conception to industrialization and operationalization, supported by other inherent components such as the processing engine, the data layer and visualization.

  • Conception: during this phase, we want to quickly experiment and test potential ML approaches in order to build Proofs of Concept (PoCs) related to the product, using SageMaker notebooks and MLflow for tracking.
  • Industrialization: once the PoC is validated, we industrialize the solution following engineering best practices. For this purpose, we use the Kedro framework to structure our code into production-ready packages. We also register and version our code in GitHub before managing the integration and deployment of the project in Jenkins, which facilitates collaboration between all team members.
  • Operationalization: at this step, we operate and run our project in production in order to integrate it into a global data ecosystem.

Several components also support this lifecycle:

  • The Data layer: provides access to the data (structured data in Snowflake, or unstructured data in S3).
  • The processing engine: the computation capabilities that support the execution of our code at scale (AWS Batch for Python jobs and AWS EMR for PySpark jobs).
  • Visualization: all the tooling that enables visualization: Streamlit for interactive apps, Tableau for production dashboards and Snowflake for ad-hoc (exploratory) dashboards.

III. Use case of how to use MLflow & Airflow to orchestrate a project workflow

The ability of Airflow to automate and orchestrate workflows, together with MLflow’s core concepts, gives the Data Science team an easy, standardized and shared process to iterate through ML experiments.

In Machine Learning, we distinguish between training and inference. While the training phase requires training and validation data to develop an ML model, the inference phase applies this model to a dataset to produce an output, or “prediction.”

Below is an example of how we could orchestrate an inference job workflow at TheFork. This orchestration is already implemented for one of our Data Science projects, where we set up these tools as part of a continuous PoC.

In this example, the Kedro inference pipeline runs in AWS Batch and is triggered by the daily Airflow DAG. The job fetches new source data from Snowflake, checks its quality using Great Expectations and loads the registered model from MLflow to compute new predictions.

As the Kedro inference pipeline is composed of several nodes, the Parquet dataset inputs and outputs of each node are versioned in AWS S3 buckets. Output metrics are also pushed to CloudWatch in order to monitor the model in production and trigger alerts when a new data point is detected as an anomaly.
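
To give a rough idea of the prediction step alone (hypothetical node and column names, not the actual project code), the final node of such a pipeline simply applies the pyfunc model loaded from MLflow to the freshly fetched data:

import pandas as pd


def predict(model, new_data: pd.DataFrame) -> pd.DataFrame:
    # Kedro node: apply the MLflow pyfunc model loaded from the registry.
    predictions = new_data.copy()
    predictions["prediction"] = model.predict(new_data)
    return predictions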

For the output of the Data Science job to be leveraged by the business teams, the Parquet files written to S3 should then be ingested. This ingestion can only start once the Data Science job has successfully completed, which is enforced by orchestrating an extra step in the Airflow DAG that takes the AWS Batch job as an input dependency.

All of the ingestion steps are managed by our Data Platform team and generally encompass collection, loading, validation and freshness checks.

Use case of project workflow orchestration with MLflow and Airflow

Let’s take a deeper dive into the Airflow and MLflow configuration for this use case.

Airflow configuration

All Data Science jobs at TheFork used to be launched using AWS Step Functions triggered by cron jobs. Gradually, we have been integrating them into the company’s ETL using Airflow. As a result, we no longer rely on arbitrary hours to trigger our jobs; instead, they are executed once their input data is available in the ETL.

Below, we can visualize some tasks related to one of our Data Science projects at TheFork: for instance, which tasks have failed and on which days this happened.

Example of a transformation DAG in Airflow

Set up of a batch job inference pipeline in Airflow

The orchestration of AWS Batch jobs in Airflow can be summarized as follows: first, we trigger an AWS Batch operator in Airflow (1), which is basically a custom Kubernetes operator that runs Python code to call the AWS API and submit batch jobs. We then get the job configuration from a YAML file (2), where the name, queue and Kedro command related to the Data Science job are defined. Finally, the operator triggers the job in AWS (3) and waits for the job status to become SUCCEEDED (4).

AWS Batch jobs orchestration in Airflow
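
The operator itself is internal to TheFork, but to illustrate the submit-and-wait pattern it implements, here is a minimal sketch using boto3 directly (a hypothetical helper, not our actual operator code):

import time

import boto3


def submit_and_wait(job_name, job_queue, job_definition, command):
    # Submit an AWS Batch job and poll until it finishes (illustrative sketch).
    client = boto3.client("batch", region_name="eu-west-1")

    # (3) Submit the job with the command built from the YAML configuration.
    response = client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={"command": command},
    )
    job_id = response["jobId"]

    # (4) Wait for the job status to become SUCCEEDED (or fail fast).
    while True:
        status = client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status == "SUCCEEDED":
            return job_id
        if status == "FAILED":
            raise RuntimeError(f"Batch job {job_id} failed")
        time.sleep(30)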

Below, you can find an example of a Kedro job configuration in the YAML file, to be triggered in Airflow:

ml_net_revenue_estimation.cancellation_rate_forecasted__aws_batch:
  name: net-revenue-estimation-d-regression
  definition: job-def-net-revenue-estimation-{env}-0-0-0
  queue: arn:aws:batch:eu-west-1:<aws_account_id>:job-queue/dataplatform-{env}-ds-queue
  command: kedro run --pipeline regression_inference --env {env_kedro} --params bucket:bucket-{env},database:{ENV}_MACHINE_LEARNING,n_days:1,date:{today_date}

where:

  • {env_kedro}: the Kedro environment passed to the command.

The name, definition and queue parameters must be specified in order to submit a new AWS Batch job.

These YAML instructions are custom to TheFork data platform and are only shown here for illustration purposes, to give the reader an idea of the type of instructions we use internally.

The following graph shows an example of an AWS Batch job orchestrated in TheFork daily transformation DAG.

Example of steps orchestration in a daily Airflow DAG

Next, the AWS Batch transformation step is also scheduled in the daily workflow transformations YAML configuration file, as follows:

- aws_batch_transformation: ml_net_revenue_estimation.cancellation_rate_forecasted
  inputs:
    - snowflake_transformed: <table_input>

where <table_input> represents the input dependency for the AWS Batch job, meaning the Kedro job won’t be launched until the input table is completely fresh and ready to be used in the Data Platform.

MLflow configuration in Kedro jobs

In this section, we detail the steps needed to use MLflow models in a Kedro project. You can follow them to set up a Kedro environment for your project.

1. Add the package dependencies to the requirements file
In your Kedro project, update the requirements.txt file to add these two dependencies before building the requirements. Please refer to the kedro-mlflow plugin documentation to check the latest versions.

kedro-mlflow
mlflow

2. Initialize your project locally

In order to initialize your project and add the plugin-specific configuration file, first run the init command, which updates the template of your Kedro project.

kedro mlflow init

You should see the following message:

‘conf/local/mlflow.yml’ successfully updated.

MLflow configuration file setup

You can then start setting up the newly created mlflow.yml file in your project folder as follows:

  • Configure the tracking server: the mlflow.yml file has the mlflow_tracking_uri key set to mlruns by default. This creates an mlruns folder at the root of your Kedro project and enables you to use the plugin without setting up an mlflow tracking server. In order to save the experiments to your enterprise mlflow server, provide its URI in the mlflow_tracking_uri variable.
  • Deactivate tracking under conditions: you may want to avoid tracking some runs (for instance while debugging, to avoid polluting your mlflow database). You can specify the name of the pipeline(s) for which tracking should be turned off.
  • Configure the mlflow experiment: mlflow enables you to create “experiments” to organize your work. The different experiments are visible in the left panel of the mlflow user interface. You can create an experiment through the mlflow.yml file with the experiment key.
  • Configure the run: when you launch a new kedro run, kedro-mlflow instantiates an underlying mlflow run through its hooks.

3. Launch MLflow server

To launch the UI, run the following command (if you didn’t specify an mlflow server, a localhost server will be opened by default):

kedro mlflow ui

You can then navigate to the local or enterprise mlflow server and see the parameters and models logged for your experiment.

Experiments in MLflow user interface

4. Track experiments in MLflow

First, you need to set the tracking URI and experiment name:

import mlflow

mlflow.set_tracking_uri('<tracking_uri>')
mlflow.set_experiment('<experiment_name>')

To track parameters, metrics and other objects, you can simply use the code below, which creates a new run. The tracked parameters will be recorded under the newly created run_id:

# Log model's tuning parameters, metrics and training artifacts
with mlflow.start_run(nested=True):
# Log parameters
mlflow.log_params(params)
# Log metrics
mlflow.log_metrics(metrics)
# Log model
mlflow.sklearn.log_model(gbm, artifact_path='model')
# Log training data in csv format (artifact)
mlflow.log_artifact(local_path = 'train_df.csv')
# Log encoder in pickle format (artifact)
mlflow.log_artifact(local_path = 'encoder.pickle')
# Log image as an artifact
mlflow.log_artifact("confusion-matrix.png", "confusion_matrix")

5. Register and deploy MLflow models

Model registry in MLflow
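
As a minimal sketch of what this step looks like in code (with a hypothetical model name and the run_id of a previous tracking run), a logged model can be registered and promoted through the MLflow client:

import mlflow
from mlflow.tracking import MlflowClient

run_id = "<run_id>"  # id of the run that logged the model

# Register the model logged under that run in the Model Registry.
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="<model_name>",
)

# Promote the newly created version to the Production stage.
client = MlflowClient()
client.transition_model_version_stage(
    name="<model_name>",
    version=result.version,
    stage="Production",
)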

6. Load registered models in your kedro project

For this purpose, we use the MlflowModelLoggerDataSet introduced by kedro-mlflow, which makes it possible to load from and save to the mlflow artifact store. It takes an optional run_id argument to load from and save to a given run_id, which must exist in the mlflow server you are logging to.

pipeline_inference_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  pyfunc_workflow: python_model
  run_id: ${run_id}

Summary

The use of MLflow and Airflow has greatly improved the management of our Data Science projects. Being able to track experiments easily and to ensure complete model governance in MLflow helps us take our projects to the next level. Not to mention that the recent integration with Airflow has resulted in more effective orchestration of our jobs, by taking into account input dependencies and freshness statuses.

That being said, we have also started implementing I/O quality checks and metrics monitoring in production as part of our MLOps framework; these will be discussed in future articles.

Keep in mind to always start with small wins ;) and please feel free to let me know if you have any comments or questions on this or other MLOps-related topics.

References

https://www.datacamp.com/blog/top-mlops-tools

https://www.zenml.io/home

Practical MLOps: Operationalizing Machine Learning Models, Noah Gift & Alfredo Deza, O’Reilly
