Introducing Kedro Hooks
Simplifying the process of extending the framework
Lim Hoang, Software Engineer, Kiyohito Kunii, Software Engineer, Jo Stichbury, Technical Writer and Editor, QuantumBlack
Kedro is an open source development workflow framework that implements software engineering best-practice for data pipelines. It’s built upon a collective knowledge gathered by QuantumBlack, whose teams routinely deliver real-world machine learning applications as part of McKinsey.
In this article, we will assume that you are familiar with basic Kedro functionality, so we won’t explain further what it is (but you can read our introductory article to find out more; we also provide links to other articles and podcasts in our documentation).
We are instead going to describe a new feature, Hooks, that we’ve introduced into our latest release. We will describe the design thinking process we used and show you how to use Hooks with a simple example. Finally, we will point you towards resources to find out more about Hooks.
The motivation for Hooks
Over the past few months, a number of semi-related problems have surfaced surrounding the architecture and capability of Kedro as a framework.
One common theme was the fact that users have to understand the low-level implementation details of Kedro, and its execution context, to extend Kedro’s default behaviour. For example, PerformanceAI, developed internally at QuantumBlack to add a machine learning model tracking capability similar to MLflow to a Kedro pipeline, had to overwrite the entire run method of KedroContext, which greatly complicated its implementation. The difficulty of extending Kedro was also preventing it from being integrated with other tools in the data science ecosystem.
User-centric design thinking
The design process of every Kedro feature is user-centric, and Hooks were no exception. In the ideation phase of the feature, we first collected all major use cases where users have to extend Kedro. The use cases we collected ranged from pipeline visualisation and deployment to data validation.
For each use case, we visualised the Kedro execution timeline and how a user’s extensions would interact with it, and we noted down their pain points.
Having the visualisation of the use cases and their pain points in one place gave us a holistic view of the problem and helped us notice a common thread: all extensions we examined need to interact with the lifecycle of different components in the Kedro execution timeline.
For example, for data validation, the extension needs to access the input dataset of a node before the node runs. On the other hand, for machine learning model tracking, the extension needs to access the model output of a model-training node after the node runs.
From that insight, we were able to map out all major lifecycle points in the execution timeline that Kedro users currently need to interact with (shown as red points in the figure below):
- After the data catalog is created
- Before a pipeline run
- Before a node run
- After a node run
- After a pipeline run
We came to the conclusion that, at each of these points, we needed to provide a mechanism for users to hook their extensions into and provide custom behaviour. Thus, Hooks were born.
What are Hooks?
Hooks are a mechanism to allow a user to extend Kedro by injecting additional behaviour at certain lifecycle points in Kedro’s main execution. The following lifecycle points, known as Hook Specifications, are provided in kedro.framework.hooks:
after_catalog_created
before_pipeline_run
before_node_run
after_node_run
after_pipeline_run
You might have noticed that the names of these Hooks map directly to the lifecycle points that we discovered during our design process. Besides these “happy path” Hooks, we also introduced a couple of Hooks for the “unhappy path”, namely:
on_node_error
on_pipeline_error
This is the minimum set of Hooks that we identified in the ideation phase to address all existing use cases. More Hooks may be introduced in the future as more use cases emerge.
Below are some examples of extensions that users can add to Kedro’s execution using Hooks:
- Adding a transformer after the data catalog is loaded.
- Adding data validation to the inputs, before a node runs, and to the outputs, after a node has run. This makes it possible to integrate with other tools like Great Expectations.
- Adding machine learning metrics tracking, e.g. using MLflow, throughout a pipeline run.
- Adding pipeline monitoring with StatsD and Grafana.
How to use Hooks
The general process you should follow to add Hooks to your project is:
- Identify the Hook Specification(s) you need to use.
- Provide Hook Implementations for those Hook Specifications.
- Register the Hook Implementations in ProjectContext.
Example: Using Hooks to integrate Kedro with MLflow
The following section illustrates this process by walking through an example of using Hooks to integrate Kedro with MLflow, an open-source tool for tracking machine learning models and experiments. (Previous versions of Kedro required hard-coding the MLflow integration logic inside your nodes, as previously described in this article.)
We will now show how to use Hooks to achieve the same integration with a more flexible interface and better reusability of tracking code. In this example, we will:
- Log the parameters after the data splitting node runs.
- Log the model after the model training node runs.
- Log the model’s metrics after the model evaluation node runs.
To follow along with this tutorial, we assume that you have a Kedro project in place with Kedro >= 0.16.1, as well as MLflow installed.
Step 1: Identify what Hook Specifications we need to use
To identify which Hook Specifications are needed, we need to think about the lifecycle points in the Kedro execution timeline that we want to interact with.
In this case:
- We will need to start an MLflow run before the Kedro pipeline runs, so we implement the before_pipeline_run Hook Specification.
- We want to add tracking logic after a model training node runs, so we implement the after_node_run Hook Specification.
- After the Kedro pipeline runs, we also need to end the MLflow run, so we implement the after_pipeline_run Hook Specification.
Step 2: Provide Hook implementations
Having identified the necessary specifications, we need to implement them. In the Kedro project, we create a Python package called hooks in the same directory as the nodes and pipelines packages, and then create a module called hooks/model_tracking_hooks.py with the following content:
Notice that Hook Implementations are created using the @hook_impl decorator, and that related Hook Implementations should be grouped in the same class.
Step 3: Register Hook implementations in ProjectContext
After defining the Hook Implementations with model-tracking behaviour, the next step is to register them in the ProjectContext in run.py as follows:
Step 4: Run the pipeline
Now we are ready to run the pipeline that has been extended with the MLflow machine learning tracking capability. Run the pipeline with kedro run and open the MLflow UI to see the tracking results. This is an example of a model tracking run.
If you want more inspiration for your MLflow integration, you can check out the MLflow plugin made by Kedro community user Galileo-Galilei.
Find out more!
Further examples of using Hooks to implement data validation and pipeline monitoring can be found in our GitHub repo of Kedro examples.
We are very excited about the release of this feature in Kedro 0.16.0. We believe that it will help our users extend Kedro execution and accelerate the creation of additional, useful, integrations for the ecosystem.
If you have any questions or feedback, please do let us know by raising a GitHub issue.