Introducing Kedro Hooks

Simplifying the process of extending the framework


Lim Hoang, Software Engineer, Kiyohito Kunii, Software Engineer, Jo Stichbury, Technical Writer and Editor, QuantumBlack

Kedro is an open source development workflow framework that implements software engineering best practice for data pipelines. It’s built upon the collective knowledge gathered by QuantumBlack, whose teams routinely deliver real-world machine learning applications as part of McKinsey.

In this article, we assume that you are familiar with basic Kedro functionality, so we won’t explain what it is. You can read our introductory article to find out more, and we also provide links to other articles and podcasts in our documentation.

We are instead going to describe a new feature, Hooks, that we’ve introduced into our latest release. We will describe the design thinking process we used and show you how to use Hooks with a simple example. Finally, we will point you towards resources to find out more about Hooks.

Photo by Clint Adair on Unsplash

The motivation for Hooks

Over the past few months, a number of semi-related problems have surfaced surrounding the architecture and capability of Kedro as a framework.

One common theme was that users had to understand the low-level implementation details of Kedro, and its execution context, to extend Kedro’s default behaviour. For example, PerformanceAI, developed internally at QuantumBlack to add a machine learning model tracking capability similar to MLflow to a Kedro pipeline, had to override the entire run method of KedroContext, which greatly complicated its implementation. The difficulty in extending Kedro was also preventing it from being integrated with other tools in the data science ecosystem.

User-centric design thinking

The design process of every Kedro feature is user-centric, and Hooks were no exception. In the ideation phase of the feature, we first collected all the major use cases where users have to extend Kedro. These ranged from pipeline visualisation and deployment to data validation.

For each use case, we visualised the Kedro execution timeline and how a user’s extensions would interact with it, and we noted down their pain points.

User journey mapping for Kedro execution timeline

Having the visualisation of the use cases and their pain points in one place gave us a holistic view of the problem and helped us notice a common thread: all extensions we examined need to interact with the lifecycle of different components in the Kedro execution timeline.

For example, for data visualisation, the extension needs to access the input dataset of a node before the node runs. On the other hand, for machine learning model tracking, the extension needs to access the model output of a model-training node after the node runs.

From that insight, we were able to map out all major lifecycle points in the execution timeline that Kedro users currently need to interact with (shown as red points in the figure below):

Lifecycle points in the execution timeline
  1. After the data catalog is created
  2. Before a pipeline run
  3. Before a node run
  4. After a node run
  5. After a pipeline run

We concluded that, at each of these points, we needed to provide a mechanism for users to hook in their extensions and provide custom behaviour. Thus, Hooks were born.

What are Hooks?

Hooks are a mechanism to allow a user to extend Kedro by injecting additional behaviour at certain lifecycle points in Kedro’s main execution. The following lifecycle points, known as Hook Specifications, are provided in kedro.framework.hooks:

  • after_catalog_created
  • before_pipeline_run
  • before_node_run
  • after_node_run
  • after_pipeline_run

You might have noticed that the names of these Hooks map directly to the numbered lifecycle points that we discovered during our design process. Besides these “happy path” Hooks, we have also introduced a couple of Hooks for the “unhappy path”, namely:

  • on_node_error
  • on_pipeline_error

This is the minimum set of Hooks that we identified in the ideation phase to address all existing use cases. More Hooks may be introduced in the future as more use cases emerge.
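To make the ordering of these lifecycle points concrete, here is a toy sketch in plain Python. It is not Kedro's implementation: `call_hook`, `toy_run` and `TimelineHooks` are hypothetical names invented for illustration, and the nodes are simplified to zero-argument functions. It only shows the idea that a runner calls each Hook at the matching point, and that a Hook class implements only the points it cares about.

```python
from typing import Any, Callable, Dict, List


def call_hook(hooks: List[Any], name: str, **kwargs: Any) -> None:
    """Call the method `name` on every Hook object that implements it."""
    for hook in hooks:
        implementation = getattr(hook, name, None)
        if implementation is not None:
            implementation(**kwargs)


def toy_run(nodes: Dict[str, Callable[[], Any]], hooks: List[Any]) -> Dict[str, Any]:
    """Run each node in order, firing a Hook at every lifecycle point."""
    catalog: Dict[str, Any] = {}
    call_hook(hooks, "after_catalog_created", catalog=catalog)
    call_hook(hooks, "before_pipeline_run", catalog=catalog)
    try:
        for node_name, func in nodes.items():
            call_hook(hooks, "before_node_run", node=node_name)
            try:
                output = func()
            except Exception as error:
                # "Unhappy path": report the failing node, then re-raise.
                call_hook(hooks, "on_node_error", node=node_name, error=error)
                raise
            catalog[node_name] = output
            call_hook(hooks, "after_node_run", node=node_name, output=output)
    except Exception as error:
        call_hook(hooks, "on_pipeline_error", error=error)
        raise
    call_hook(hooks, "after_pipeline_run", catalog=catalog)
    return catalog


class TimelineHooks:
    """Hypothetical Hook class that records the lifecycle points it sees."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def before_pipeline_run(self, **kwargs: Any) -> None:
        self.events.append("before_pipeline_run")

    def before_node_run(self, node: str, **kwargs: Any) -> None:
        self.events.append(f"before_node_run:{node}")

    def after_node_run(self, node: str, **kwargs: Any) -> None:
        self.events.append(f"after_node_run:{node}")

    def on_node_error(self, node: str, **kwargs: Any) -> None:
        self.events.append(f"on_node_error:{node}")

    def after_pipeline_run(self, **kwargs: Any) -> None:
        self.events.append("after_pipeline_run")
```

Calling `toy_run({"a": lambda: 1}, [TimelineHooks()])` fires the pipeline-level Hooks once and the node-level Hooks once per node. `TimelineHooks` does not implement `after_catalog_created` or `on_pipeline_error`, so those points are simply skipped for it.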

Below are some examples of the extensions the user can add to Kedro execution using Hooks:

  • Adding a transformer after the data catalog is loaded.
  • Adding data validation to the inputs, before a node runs, and to the outputs, after a node has run. This makes it possible to integrate with other tools like Great Expectations.
  • Adding machine learning metrics tracking, e.g. using MLflow, throughout a pipeline run.
  • Adding pipeline monitoring with StatsD and Grafana.
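As a rough illustration of the data validation case, the sketch below checks a node's inputs before it runs and its outputs after it runs. The class name `DataValidationHooks` and the "no dataset may be None" rule are assumptions made for this example; a real project would more likely delegate the check to a tool such as Great Expectations. The fallback definition of `hook_impl` is only there so that the snippet also runs where Kedro is not installed.

```python
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # stand-in so the sketch runs without Kedro installed
    def hook_impl(func):
        return func


class DataValidationHooks:
    """Hypothetical Hooks that reject missing data around every node."""

    @staticmethod
    def _check(datasets: Dict[str, Any], stage: str, node: Any) -> None:
        # Placeholder rule: a dataset that is None counts as invalid.
        missing = [name for name, data in datasets.items() if data is None]
        if missing:
            raise ValueError(f"{stage} of node {node!r} are None: {missing}")

    @hook_impl
    def before_node_run(self, node: Any, inputs: Dict[str, Any]) -> None:
        """Validate the inputs before the node runs."""
        self._check(inputs, "inputs", node)

    @hook_impl
    def after_node_run(self, node: Any, outputs: Dict[str, Any]) -> None:
        """Validate the outputs after the node has run."""
        self._check(outputs, "outputs", node)
```

Because the validation lives in a Hook rather than inside the nodes, the same checks apply to every node in the pipeline without changing any node code.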

How to use Hooks

The general process you should follow to add Hooks to your project is:

  1. Identify the Hook Specification(s) you need to use.
  2. Provide Hook implementations for those Hook Specifications.
  3. Register Hook Implementations in ProjectContext.

Overview of Hooks registration process

Example: Using Hooks to integrate Kedro with MLflow

The following section illustrates this process by walking through an example of using Hooks to integrate Kedro with MLflow, an open-source tool that adds model and experiment tracking to your Kedro pipeline. (Previous versions of Kedro required you to hard-code the MLflow integration logic inside your nodes, as described in this article.)

We will now show how to use Hooks to achieve the same integration with a more flexible interface and better reusability of the tracking code. In this example, we will:

  • Log the parameters after the data splitting node runs.
  • Log the model after the model training node runs.
  • Log the model’s metrics after the model evaluating node runs.

To follow along with this tutorial, we assume that you have a Kedro project in place with Kedro >= 0.16.1, as well as MLflow installed.

Step 1: Identify what Hook Specifications we need to use

To identify which Hook Specifications are needed, we need to think about the lifecycle points in the Kedro execution timeline that we need to interact with.

In this case:

  • We need to start an MLflow run before the Kedro pipeline runs, by implementing the before_pipeline_run Hook Specification.
  • We want to add tracking logic after the data splitting, model training and model evaluating nodes run, so we need to implement the after_node_run Hook Specification.
  • After the Kedro pipeline runs, we also need to end the MLflow run, by implementing the after_pipeline_run Hook Specification.

Step 2: Provide Hook implementations

Having identified the necessary specifications, we need to implement them. In the Kedro project we create a Python package called hooks in the same directory as the nodes and pipelines and then create a module called hooks/model_tracking_hooks.py with the following content:
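A minimal sketch of such a module might look like this. Several details are assumptions made for illustration: the node names ("split_data", "train_model", "evaluate_model"), the dataset names ("model", "metrics"), the model being a scikit-learn estimator, and the evaluation node returning a dictionary of metric names to numbers. The optional `tracker` argument is also an addition for this sketch; it defaults to the `mlflow` module and makes the class easy to exercise without a tracking server.

```python
# hooks/model_tracking_hooks.py (illustrative sketch)
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # stand-in so the sketch runs without Kedro installed
    def hook_impl(func):
        return func


class ModelTrackingHooks:
    """Groups all model-tracking Hook Implementations in one class."""

    def __init__(self, tracker: Any = None) -> None:
        if tracker is None:
            import mlflow
            import mlflow.sklearn  # assumes a scikit-learn model

            tracker = mlflow
        self._tracker = tracker

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]) -> None:
        """Start an MLflow run and log the Kedro run parameters."""
        self._tracker.start_run(run_name=run_params.get("run_id"))
        self._tracker.log_params(run_params)

    @hook_impl
    def after_node_run(
        self, node: Any, inputs: Dict[str, Any], outputs: Dict[str, Any]
    ) -> None:
        """Add tracking after specific nodes, matched by (assumed) node name."""
        if node.name == "split_data":
            # Log every "params:" input, dropping the prefix from the key.
            params = {
                name.split(":", 1)[1]: value
                for name, value in inputs.items()
                if name.startswith("params:")
            }
            self._tracker.log_params(params)
        elif node.name == "train_model":
            self._tracker.sklearn.log_model(outputs["model"], "model")
        elif node.name == "evaluate_model":
            self._tracker.log_metrics(outputs["metrics"])

    @hook_impl
    def after_pipeline_run(self) -> None:
        """End the MLflow run once the Kedro pipeline has finished."""
        self._tracker.end_run()
```

Each method declares only the arguments it needs from the corresponding Hook Specification, so `after_pipeline_run` takes none at all.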

Notice that Hook Implementations are created by using the @hook_impl decorator, and related Hook Implementations should be grouped in the same class.

Step 3: Register Hook implementations in ProjectContext

After defining Hook Implementations with model-tracking behaviour, the next step is to register them in the ProjectContext in run.py as follows:
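A sketch of that registration is shown below. The project name and version values are placeholders, and the `ModelTrackingHooks` class is stubbed out here only to keep the snippet self-contained; in a real project you would import it from the hooks package created in Step 2. The fallback `KedroContext` is likewise a stand-in for environments without Kedro installed.

```python
# run.py (illustrative sketch)
try:
    from kedro.framework.context import KedroContext
except ImportError:  # stand-in so the sketch runs without Kedro installed
    class KedroContext:
        hooks = ()


# In a real project, import the class from Step 2 instead, e.g.
# from <package_name>.hooks.model_tracking_hooks import ModelTrackingHooks
class ModelTrackingHooks:
    """Placeholder for the class defined in hooks/model_tracking_hooks.py."""


class ProjectContext(KedroContext):
    project_name = "my_project"  # placeholder
    project_version = "0.16.1"  # placeholder: the Kedro version of the project

    # Register Hook Implementations as a tuple of instances.
    hooks = (ModelTrackingHooks(),)
```

Note that `hooks` holds instances rather than classes, and that the trailing comma is what makes a single-element tuple.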

Step 4: Run the pipeline

Now we are ready to run the pipeline that has been extended with the MLflow machine learning tracking capability. Run the pipeline with kedro run and open the MLflow UI to see the tracking results. This is an example of a model tracking run.

The parameters are those we use to run Kedro and the Artifact is the model produced by that run

If you want more inspiration for your MLflow integration, you can check out the MLflow plugin made by Kedro community user Galileo-Galilei.

Find out more!

Further examples of using Hooks to implement data validation and pipeline monitoring can be found in our GitHub repo for Kedro examples.

We are very excited about the release of this feature in Kedro 0.16.0. We believe that it will help our users extend Kedro execution and accelerate the creation of additional, useful integrations for the ecosystem.

If you have any questions or feedback, please do let us know by raising a GitHub issue.


QuantumBlack, AI by McKinsey

An advanced analytics firm operating at the intersection of strategy, technology and design.