Deploying and Versioning Data Pipelines at Scale
Tom Goldenberg, Data Engineer, Musa Bilal, Machine Learning Engineer, QuantumBlack
QuantumBlack’s open-source release of Kedro in June has generated excitement in the data science and data engineering communities. Whether it is through the GitHub repository (currently 1,600+ stars), on HackerNews, or via social media, we get a lot of questions about the library and what it does and does not do. One trending question from our growing community of Kedro users has been, “How does Kedro integrate with experiment tracking tools?”
We encourage extending core Kedro functionality with experiment tracking tools. While we use an advanced internal experiment tracking tool called PerformanceAI, which you can read about here, we also work with open-source technology. This article answers that community question using MLflow, a model tracking tool open-sourced by Databricks, and suggests one possible workflow.
Kedro? MLflow? What are these tools?
We describe Kedro as: “A development workflow framework that implements best practices for data pipelines with an eye towards productionising machine learning models.” It acts as a starter template for data and ML pipeline projects written in Python with functionality to construct environment-agnostic pipelines with data abstraction, built-in coding standards for developing robust code, pipeline visualisation and much more. Kedro is actively maintained by QuantumBlack and has been used on more than 100 QuantumBlack and McKinsey projects across many industries.
MLflow is an open-source library created by Databricks. It is described as: “An open-source platform to manage the ML lifecycle.” It has three main APIs: Tracking, Model, and Project.
Do They Play Nice Together?
On the surface, it might appear that Kedro and MLflow have redundant and/or conflicting functionality.
For example:
- Both have commands to run projects (kedro run vs. mlflow run)
- Both have model or artifact versioning
- Both have deployment approaches (kedro package vs. mlflow deploy)
To resolve these doubts, we explored how Kedro and MLflow would work together in an AI architecture within Databricks. We found that while there is some overlap in functionality, the two tools complement each other.
A Combined Experience
Kedro and MLflow can be used together because they solve orthogonal problems. To use the analogy of a car assembly factory, Kedro provides the assembly line architecture and structure, while MLflow is the tracking system you use in the factory to record metrics and visualise them in order to fine-tune individual segments of the assembly line.
We were able to leverage the MLflow APIs (Tracking, Project, Model) within a Kedro project and found that both tools are enhanced:
- Kedro provides a seamless developer experience through the data abstraction and code organisation not offered in MLflow.
- MLflow provides model tracking (and visualisation of metrics) beyond that offered by Kedro.
Below is a table that compares the different features of the tools:
Getting Started
Let’s go over how you would incorporate MLflow in a Kedro project.
Setup
Start by creating a new Kedro project, following the Hello World example here. To follow along with these instructions, call your project Mlflow Test.
Next, install MLflow with pip install mlflow. Then create two files, MLproject and conda.yaml, and save them in the root of your Kedro project directory.
MLproject
name: My Project
conda_env: conda.yaml
entry_points:
  main:
    command: "kedro run"
conda.yaml
name: mlprojects
channels:
  - defaults
dependencies:
  - python=3.6
  - pip:
    - kedro==0.15.1
    - databricks-connect>=5.4
    - mlflow==1.0.0
    - scikit-learn
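Because both files must sit in the project root, a quick check before the first run can save a confusing error. The following is a minimal sketch using only the standard library; the function name `missing_mlflow_files` and the constant `REQUIRED_FILES` are hypothetical and not part of Kedro or MLflow.

```python
from pathlib import Path
from typing import List

# The two files this walkthrough places in the Kedro project root.
REQUIRED_FILES = ("MLproject", "conda.yaml")


def missing_mlflow_files(project_root: str) -> List[str]:
    """Return the names of the required files that are absent from project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the Kedro project root.
    missing = missing_mlflow_files(".")
    print("Missing files:", missing if missing else "none")
```

An empty list means both files are where this setup expects them.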
Adding MLflow code
In src/mlflow-test/nodes/example.py you will find the Python functions that make up this Kedro pipeline. You will need to add MLflow code to the train_model() and report_accuracy() nodes. The rest of the Kedro pipeline remains the same. The MLflow additions are the mlflow imports and the calls to mlflow.log_artifact(), sklearn.log_model(), mlflow.log_metric(), mlflow.log_param() and mlflow.set_tag():
src/mlflow-test/nodes/example.py
import logging
from typing import Any, Dict

import numpy as np
import pandas as pd
import os
import mlflow
from mlflow import sklearn
from datetime import datetime


def train_model(
    train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
) -> np.ndarray:
    num_iter = parameters["example_num_train_iter"]
    lr = parameters["example_learning_rate"]
    X = train_x.values
    Y = train_y.values

    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    mlflow.log_artifact(local_path=os.path.join("data", "01_raw", "iris.csv"))

    weights = []
    # Train one model for each class in Y
    for k in range(Y.shape[1]):
        # Initialise weights
        theta = np.zeros(X.shape[1])
        y = Y[:, k]
        for _ in range(num_iter):
            z = np.dot(X, theta)
            h = _sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            theta -= lr * gradient
        # Save the weights for each model
        weights.append(theta)

    # Return a joint multi-class model with weights for all classes
    model = np.vstack(weights).transpose()
    sklearn.log_model(sk_model=model, artifact_path="model")
    return model


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    # Get true class index
    target = np.argmax(test_y.values, axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: {0:.2f}%".format(accuracy * 100))
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_param("time of prediction", str(datetime.now()))
    mlflow.set_tag("Population", 2019)
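The nodes above call mlflow directly, which means they need an active tracking setup even in unit tests. One possible variation, sketched here under our own assumptions, is to pass the tracker in as an argument. The names `accuracy_score`, `log_accuracy` and `RecordingTracker` are hypothetical. The `tracker` argument is anything exposing `log_metric(key, value)` and `log_param(key, value)`, which the `mlflow` module itself does, so passing `mlflow` in production should behave like the original node.

```python
from datetime import datetime
from typing import Any, Dict, Sequence


def accuracy_score(predictions: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of predictions that match the true class index."""
    if len(predictions) != len(target) or len(target) == 0:
        raise ValueError("predictions and target must be non-empty and equal length")
    correct = sum(1 for p, t in zip(predictions, target) if p == t)
    return correct / len(target)


def log_accuracy(tracker: Any, predictions: Sequence[int], target: Sequence[int]) -> float:
    """Compute accuracy and record it through the supplied tracker."""
    accuracy = accuracy_score(predictions, target)
    tracker.log_metric("accuracy", accuracy)
    tracker.log_param("time of prediction", str(datetime.now()))
    return accuracy


class RecordingTracker:
    """Hypothetical stand-in for tests: stores calls instead of contacting MLflow."""

    def __init__(self) -> None:
        self.metrics: Dict[str, float] = {}
        self.params: Dict[str, str] = {}

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = value

    def log_param(self, key: str, value: str) -> None:
        self.params[key] = value


if __name__ == "__main__":
    tracker = RecordingTracker()
    log_accuracy(tracker, predictions=[0, 1, 2, 2], target=[0, 1, 1, 2])
    print(tracker.metrics)
```

The trade-off is one extra argument per node in exchange for tests that run without a tracking server.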
Initiate MLflow environment variables
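Create a bash script that sets the MLflow environment variables, then load it into your current shell with source config.sh so that the variables persist for the commands that follow.
Note: Replace the path values below with your own values.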
config.sh
#!/usr/bin/env bash

# result of 'which conda' ending in "anaconda3"
export MLFLOW_CONDA_HOME="/Users/user/your/path/to/anaconda3"

# result of 'pwd' in project root plus "mlruns"
export MLFLOW_TRACKING_URI="/Users/user/your/project/mlruns"
export MLFLOW_EXPERIMENT_NAME="getting-started"
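If you prefer to keep this configuration in Python, for example when launching the pipeline from a script or notebook, the same three variables can be set through os.environ. This is a rough sketch: the helper names `mlflow_env` and `apply_mlflow_env` are hypothetical, and it assumes the variables are applied before MLflow starts a run.

```python
import os
from pathlib import Path
from typing import Dict, MutableMapping, Optional


def mlflow_env(project_root: str, conda_home: str, experiment_name: str) -> Dict[str, str]:
    """Build the three variables from config.sh for a given project root."""
    return {
        "MLFLOW_CONDA_HOME": conda_home,
        # Tracking data goes to an "mlruns" folder inside the project root.
        "MLFLOW_TRACKING_URI": str(Path(project_root) / "mlruns"),
        "MLFLOW_EXPERIMENT_NAME": experiment_name,
    }


def apply_mlflow_env(
    values: Dict[str, str], environ: Optional[MutableMapping[str, str]] = None
) -> None:
    """Copy the values into the process environment (or a mapping supplied for tests)."""
    target = os.environ if environ is None else environ
    target.update(values)


if __name__ == "__main__":
    # Placeholder paths taken from config.sh; replace them with your own.
    apply_mlflow_env(
        mlflow_env(
            project_root="/Users/user/your/project",
            conda_home="/Users/user/your/path/to/anaconda3",
            experiment_name="getting-started",
        )
    )
    print(os.environ["MLFLOW_TRACKING_URI"])
```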
Run MLflow
From the project root, run mlflow run . and then start the tracking UI with mlflow ui to check out the results in your browser at localhost:5000.
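The run step can also be scripted, which is handy in CI or when several experiments are launched in sequence. The sketch below simply shells out to the mlflow command-line tool, passing the project directory as the project URI. The function names and the `runner` parameter are hypothetical, and the sketch assumes mlflow is installed and on your PATH.

```python
import os
import subprocess
from typing import Any, Callable, Dict, List, Optional


def mlflow_run_command(project_dir: str = ".") -> List[str]:
    """Build the CLI call that runs the MLproject found in project_dir."""
    return ["mlflow", "run", project_dir]


def run_project(
    project_dir: str = ".",
    extra_env: Optional[Dict[str, str]] = None,
    runner: Callable[..., Any] = subprocess.run,
) -> Any:
    """Run the project with the current environment plus any extra variables."""
    env = dict(os.environ)
    env.update(extra_env or {})
    # check=True makes a failed pipeline run raise CalledProcessError.
    return runner(mlflow_run_command(project_dir), env=env, check=True)


if __name__ == "__main__":
    print("Would execute:", " ".join(mlflow_run_command(".")))
```

Calling `run_project(".", {"MLFLOW_EXPERIMENT_NAME": "getting-started"})` from the project root would then start the run. The `runner` parameter exists only so the call can be replaced in tests.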
Deployment
As for deployment, there are several options. One that we found works particularly well is deploying on Databricks. This is because Databricks is so closely integrated with MLflow.
To do this, follow the steps outlined in the Kedro documentation for deploying to Databricks. When you invoke the Kedro project's main() function, do the following:
import mlflow
from my_project_name.run import main

with mlflow.start_run(experiment_id="MY_EXPERIMENT_ID"):
    main()
And that’s it! You’ll find that the “Runs” tab on Databricks contains your metrics over several runs. Alternatively, you can create an MLflow experiment on Databricks and use the experiment ID to consolidate runs across multiple notebooks.
Conclusion
This was a very simple workflow showing how Kedro and MLflow can be used together. We believe it could be optimised further, and it has been exciting to see different community approaches to this integration. These include a workflow, documented here, that creates an MLflowDataSet class for logging artifacts, an mlflow.yaml file for parameterising all MLflow features through configuration, and a new kedro command that makes it easy to share runs with co-workers.
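To give a feel for the data set idea, here is a rough sketch of a class that writes data locally and then logs the file as an artifact. It is not the implementation from the community workflow mentioned above. The class name `ArtifactLoggingDataSet` is hypothetical, and `tracker` is any object with a `log_artifact(local_path)` method, which the `mlflow` module provides. In a real Kedro project such a class would presumably subclass Kedro's abstract data set rather than stand alone.

```python
import json
from pathlib import Path
from typing import Any


class ArtifactLoggingDataSet:
    """Hypothetical data set: saves JSON locally, then logs the file as an artifact."""

    def __init__(self, filepath: str, tracker: Any) -> None:
        self._filepath = Path(filepath)
        self._tracker = tracker

    def save(self, data: Any) -> None:
        # Write the file first so the tracker has something to upload.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        self._filepath.write_text(json.dumps(data))
        self._tracker.log_artifact(str(self._filepath))

    def load(self) -> Any:
        return json.loads(self._filepath.read_text())
```

With a class like this, artifact logging moves out of the node functions and into the data layer, so nodes such as train_model() stay free of tracking code.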
We hope that this article has helped demystify how Kedro functionality could be extended to include experiment tracking. If you’re interested in learning how we think about experiment tracking and deployed model performance monitoring then we also encourage you to check out PerformanceAI.
Sources
- Kedro: A New Tool For Data Science
- Introducing Kedro: The open source library for production-ready Machine Learning code
- Introducing MLflow: An Open Source Machine Learning Platform
- Tracking and sustaining the performance of predictive models
- Use MLflow for better versioning and collaboration
Special Thanks
This article would not have been possible without contributions from Bruce Philp, Jessica Fan, Prashant Rathi and Yetunde Dada.