Deploying and Versioning Data Pipelines at Scale

Tom Goldenberg, Data Engineer, and Musa Bilal, Machine Learning Engineer, QuantumBlack

QuantumBlack’s open-source release of Kedro in June has generated excitement in the data science and data engineering communities. Whether through the GitHub repository (currently over 1,600 stars), on Hacker News, or via social media, we get a lot of questions about what the library does and does not do. One trending question from our growing community of Kedro users has been: “How does Kedro integrate with experiment tracking tools?”

We encourage extending core Kedro functionality with experiment tracking tools. While we leverage an internal advanced experiment tracking tool called PerformanceAI, which you can read about here, we also work with open-source technology. This article answers that community question using MLflow, an open-source model tracking tool created by Databricks, and suggests a possible workflow for combining the two.


Kedro? MLflow? What are these tools?

We describe Kedro as: “A development workflow framework that implements best practices for data pipelines with an eye towards productionising machine learning models.” It acts as a starter template for data and ML pipeline projects written in Python, with functionality to construct environment-agnostic pipelines with data abstraction, built-in coding standards for developing robust code, pipeline visualisation and much more. Kedro is actively maintained by QuantumBlack and has been used on more than 100 QuantumBlack and McKinsey projects across many industries.
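To make this concrete, here is a minimal sketch of a Kedro pipeline definition, written against the 0.15-era API; the node functions here are hypothetical placeholders rather than anything from the starter project:

from kedro.pipeline import Pipeline, node


def preprocess(raw_data):
    # Hypothetical node: drop incomplete rows from the raw input
    return raw_data.dropna()


def train(features):
    # Hypothetical node: stand-in for a real training step
    return {"n_rows_seen": len(features)}


# Each node declares its inputs and outputs by name; Kedro resolves those
# names against the Data Catalog, so the functions themselves stay I/O-agnostic.
pipeline = Pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="features"),
        node(train, inputs="features", outputs="model"),
    ]
)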

MLflow is an open-source library created by Databricks. It is described as: “An open-source platform to manage the ML lifecycle.” It has three main APIs: Tracking, Projects, and Models.
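For a flavour of the Tracking API, which is the one we lean on most below, a minimal example looks like this:

import mlflow

# Everything logged inside the context manager is attached to a single run,
# which MLflow records under the active tracking URI.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)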

Do They Play Nice Together?

On the surface, it might appear that Kedro and MLflow have redundant or conflicting functionality.

For example:

  • Both have commands to run projects (kedro run vs. mlflow run)
  • Both have model or artifact versioning
  • Both have deployment approaches (kedro package vs. mlflow deploy)

To address these doubts, we explored how Kedro and MLflow would work together in an AI architecture within Databricks. We found that while there is some overlap in functionality, the two tools complement each other.

A Combined Experience

Kedro and MLflow can be used together because they solve orthogonal problems. To use the analogy of a car assembly factory: Kedro provides the assembly line architecture and structure, while MLflow is the tracking system you use in your factory to record metrics and visualise them in order to fine-tune the segments of your assembly line.

We were able to leverage the MLflow APIs (Tracking, Projects, Models) within a Kedro project and found that both tools are enhanced:

  • Kedro provides a seamless developer experience through the data abstraction and code organisation not offered in MLflow.
  • MLflow provides model tracking (and visualisation of metrics) beyond that offered by Kedro.

Below is a table that compares the different features of the tools:

[Table: Kedro vs. MLflow feature comparison]

Getting Started

Let’s go over how you would incorporate MLflow in a Kedro project.

Setup

Start by creating a new Kedro project (run kedro new) following the Hello World example here. For the rest of this walkthrough, call your project Mlflow Test, which gives you a Python package named mlflow_test.

Next, install MLflow with pip install mlflow. Then create two files, MLproject and conda.yaml, and save them to the root of your Kedro project directory.

MLproject

name: My Project

conda_env: conda.yaml

entry_points:
  main:
    command: "kedro run"

conda.yaml

name: mlprojects
channels:
  - defaults
dependencies:
  - python=3.6
  - pip:
      - kedro==0.15.1
      - databricks-connect>=5.4
      - mlflow==1.0.0
      - scikit-learn

Adding MLflow code

In src/mlflow_test/nodes/example.py you will find the Python functions that make up this Kedro pipeline. You will need to add MLflow code to the train_model() and report_accuracy() nodes. The rest of the Kedro pipeline remains the same; the new MLflow additions are marked with # MLflow comments below:

src/mlflow_test/nodes/example.py

import logging
import os
from datetime import datetime
from typing import Any, Dict

import mlflow
import numpy as np
import pandas as pd
from mlflow import sklearn


def _sigmoid(z):
    # Logistic function helper, as defined in the Kedro starter project
    return 1 / (1 + np.exp(-z))


def train_model(
    train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
) -> np.ndarray:
    num_iter = parameters["example_num_train_iter"]
    lr = parameters["example_learning_rate"]
    X = train_x.values
    Y = train_y.values

    # Add bias to the features
    bias = np.ones((X.shape[0], 1))
    X = np.concatenate((bias, X), axis=1)

    # MLflow: log the raw data as an artifact of this run
    mlflow.log_artifact(local_path=os.path.join("data", "01_raw", "iris.csv"))

    weights = []
    # Train one model for each class in Y
    for k in range(Y.shape[1]):
        # Initialise weights
        theta = np.zeros(X.shape[1])
        y = Y[:, k]
        for _ in range(num_iter):
            z = np.dot(X, theta)
            h = _sigmoid(z)
            gradient = np.dot(X.T, (h - y)) / y.size
            theta -= lr * gradient
        # Save the weights for each model
        weights.append(theta)

    # Return a joint multi-class model with weights for all classes
    model = np.vstack(weights).transpose()

    # MLflow: log the trained model
    sklearn.log_model(sk_model=model, artifact_path="model")
    return model


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    # Get true class index
    target = np.argmax(test_y.values, axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: {0:.2f}%".format(accuracy * 100))

    # MLflow: record the metric, a parameter and a tag against the run
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_param("time of prediction", str(datetime.now()))
    mlflow.set_tag("Population", 2019)

Initiate MLflow environment variables

Create and run a bash script that initiates MLflow environment variables.

Note: Replace the example paths below with your own values.

config.sh

#!/usr/bin/env bash

# result of `which conda`, ending in "anaconda3"
export MLFLOW_CONDA_HOME="/Users/user/your/path/to/anaconda3"

# result of `pwd` in the project root, plus "mlruns"
export MLFLOW_TRACKING_URI="/Users/user/your/project/mlruns"

export MLFLOW_EXPERIMENT_NAME="getting-started"
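Remember to load the script with source config.sh rather than executing it directly, so that the exported variables persist in the shell session from which you launch MLflow.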

Run MLflow

Run mlflow run . to execute the pipeline, then launch the tracking UI with mlflow ui and check out the results in your browser at localhost:5000.

[Screenshot: the MLflow tracking UI]

Deployment

As for deployment, there are several options. One that we found works particularly well is deploying on Databricks, because Databricks is tightly integrated with MLflow.

To do this, follow the steps outlined in the Kedro documentation for deploying to Databricks. When you invoke the Kedro project’s main() function, do the following:

import mlflow
from my_project_name.run import main

with mlflow.start_run(experiment_id="MY_EXPERIMENT_ID"):
    main()

And that’s it! You’ll find that the “Runs” tab on Databricks contains your metrics over several runs. Alternatively, you can create an MLflow experiment on Databricks and use its experiment ID to consolidate runs across multiple notebooks.
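A minimal sketch of that second approach, with main() imported as shown above, might look like this; the workspace path used as the experiment name is a hypothetical placeholder:

import mlflow

# Assumption: replace this hypothetical workspace path with your own
experiment_id = mlflow.create_experiment("/Users/you@example.com/kedro-mlflow")

with mlflow.start_run(experiment_id=experiment_id):
    main()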

Conclusion

This was a very simple workflow for combining Kedro and MLflow. We believe it could be optimised further, and it has been exciting to see different approaches to this integration emerge from the community. One example, documented here, creates an MLflowDataSet class for logging artifacts, an mlflow.yaml for parameterising all MLflow features through a configuration file, and a new kedro command that makes it easy to share runs with co-workers.

We hope that this article has helped demystify how Kedro functionality can be extended to include experiment tracking. If you’re interested in learning how we think about experiment tracking and deployed model performance monitoring, we also encourage you to check out PerformanceAI.

Sources

  1. Kedro: A New Tool For Data Science
  2. Introducing Kedro: The open source library for production-ready Machine Learning code
  3. Introducing MLflow: An Open Source Machine Learning Platform
  4. Tracking and sustaining the performance of predictive models
  5. Use MLflow for better versioning and collaboration

Special Thanks

This article could not have been possible without contributions from Bruce Philp, Jessica Fan, Prashant Rathi and Yetunde Dada.

QuantumBlack, AI by McKinsey

We are the AI arm of McKinsey & Company. We are a global community of technical & business experts, and we thrive on using AI to tackle complex problems.