Unboxing Kedro 0.17.0 and Kedro-Viz 3.8.0 šŸŽ

'Twas the night before the holidays began and all through the house…there was much excitement because the elves in QuantumBlack Labs released a new version of Kedro.

QuantumBlack, AI by McKinsey · Dec 17, 2020

"Top View Photography of Blackboard Between Baubles" (with edit) by Giftpundits.com, used under license

This article explains what's new in Kedro 0.17.0 and Kedro-Viz 3.8.0, how to unwrap them and what to expect. Don't forget to read the manual before operation. Batteries not included.

First things first.

What is Kedro?

Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts (like modularity, separation of concerns and versioning) from software engineering and applies them to machine-learning code.

At QuantumBlack, we routinely deliver real-world machine learning applications as a part of our work. We created Kedro to build upon the knowledge and experience we had gathered. It is designed to eliminate delays in collaborative projects and guarantee production-ready analytics code.

Since we open-sourced it in July 2019, Kedro has been used on 500+ internal and open-source projects. It won Best Technical Tool or Framework for AI in the 2019 Awards AI competition, received a merit award in the 2020 UK Technical Communication Awards, and is listed on the 2020 ThoughtWorks Technology Radar and the 2020 Data & AI Landscape.

This article doesn't explain further what Kedro is, so if you are new to the framework, head over to the documentation or take a look at the Kedro code on GitHub. There are also a host of articles, podcasts, talks and Kedro showcase projects in the kedro-community GitHub repository.

What is in Kedro 0.17.0?

TL;DR We are super-enthusiastic about this new release of Kedro because we have focussed on decoupling the framework components. This gets us closer to making Kedro more stable with fewer breaking changes and provides an easier model for extensibility. You can also use the Data Catalog as an individual component while easily transitioning into the rest of the Kedro framework. We have added a chunk of improvements and fixed a few bugs along the way.

The full list of changes is, as always, in the release notes on the Kedro GitHub repository.

Towards a stable release

In a significant change, we have introduced KedroSession, which is responsible for managing the lifecycle of a Kedro run and is the home for any state that needs to be maintained for the lifetime of a run. This change allows us to have a much simpler KedroContext with fewer responsibilities, making it easier for plugin developers and advanced users to understand and modify. It also opens up possibilities for frequently asked but more advanced features in 2021 (stay tuned!). We've also brought Kedro into line with Python best practice by adding settings.py and pyproject.toml (to replace .kedro.yml) for project configuration.
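As a minimal sketch of the new entry point, a run is now managed through a KedroSession (here "my_project" is a placeholder for your own package name):

from kedro.framework.session import KedroSession

# Create a session for the project package and run the default pipeline.
with KedroSession.create(package_name="my_project") as session:
    session.run()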

A clearer deployment model for Kedro pipelines

We have minimised dependencies between framework components to make a clear separation between project metadata and project settings. This change enables a much clearer deployment model for Kedro projects and opens the door to easy deployments onto many more platforms and workflow orchestrators. Plugin developers will also benefit, since the chances of their plugins being broken by a new Kedro release are now dramatically reduced. We have also extended our documentation with a whole new section on deployment to Kubeflow, Argo, Databricks, AWS Sagemaker, AWS Batch, Prefect and more. You should also check out a guide to deploying Kedro pipelines on AWS Sagemaker, published on the AWS Open Source Blog.

The DataCatalog as a standalone component

We conducted user research to understand how to transition users into the full Kedro framework, and we would like to thank everyone who participated in this exercise.

One insight we received was that users want the freedom to use specific Kedro components without committing to the broader framework. The feature most frequently mentioned was the Data Catalog.

We have introduced Mini-Kedro to make it possible to configure and explore data sources in a Jupyter notebook using the Data Catalog. The Data Catalog allows you to specify data sources that you interact with for loading and saving purposes using a YAML API. The advantage of this approach is that you never need to specify file paths for loading or saving data in your Jupyter notebook.

See an example:

# In conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace

This makes it possible to use something like df = catalog.load("example_dataset_1") and catalog.save("example_dataset_2", df) to interact with data in a Jupyter notebook. You can see an example of this below.

Using the DataCatalog as a standalone component in a Jupyter notebook
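In code, a minimal sketch of this standalone usage might look as follows, assuming the conf/base/catalog.yml shown above and any credentials in conf/local/credentials.yml:

from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Load catalog and credentials configuration from the conf/ folder.
conf_loader = ConfigLoader(["conf/base", "conf/local"])
conf_catalog = conf_loader.get("catalog*", "catalog*/**")
conf_credentials = conf_loader.get("credentials*", "credentials*/**")

catalog = DataCatalog.from_config(conf_catalog, conf_credentials)

df = catalog.load("example_dataset_1")  # no file paths needed in the notebook
catalog.save("example_dataset_1", df)   # save back through the same entry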

What is in Kedro-Viz 3.8.0?

Kedro-Viz shows the structure of your Kedro pipeline. It displays data and machine-learning pipelines in an informative way, and emphasises the connections between datasets and nodes. In this release, we showcase some exciting improvements, such as a new metadata side-panel for nodes, datasets and parameters, a new graph layout engine, and lazy loading for long lists of nodes and datasets.

Kedro-Viz and the new node, dataset and parameter metadata side-panel

An enhanced developer experience

Kedro-Viz is growing up to be a tool to support the developer experience. Over time, we'll release a set of new features that help your workflow: the first in this series is the node, dataset and parameter metadata side-panel.

By clicking on a graph element, you can see additional information about that element, such as the parameters used in a node or the code description of a node. Future additions to this panel will include a way to see relevant code blocks.

A new graph layout engine

Our visualisation wizards have been working to improve how lines and graph elements are arranged on a pipeline visualisation. Pipeline layout becomes a significant problem when a pipeline visualisation has more than 500 components.

This feature was released in beta in an earlier version of Kedro-Viz, but we're now ready to make it the default representation for your pipeline visualisations.

We are still looking for feedback on this layout engine. If you run into any problems using it, you can disable the new layout and revert to the old one by adding ?oldgraph=true to the end of the URL in your browser when running Kedro-Viz, e.g. http://localhost:4141/?oldgraph=true.

Loading time support for large pipeline visualisations

Some pipelines may have hundreds of elements and tags, slowing down the loading time of Kedro-Viz.

We have added lazy list rendering and memoisation to the sidebar. This improves time to first paint and first interaction when you load a pipeline with a large number of elements.

This is a beta feature and needs to be enabled with a flag. Add ?lazy=true to the end of the URL in your browser when running Kedro-Viz, e.g. http://localhost:4141/?lazy=true, to try it out.

Migration guide

"Danger" by Nick Youngson, CC BY-SA 3.0, Alpha Stock Images

The Kedro 0.17.0 release contains some breaking changes. If you update Kedro to 0.17.0 and then try to work with projects created against earlier versions of Kedro, you may encounter some issues when trying to run kedro commands in the terminal for that project.

Here's a step-by-step guide to getting your projects running against the new version of Kedro.

As always, if you hit any problems, please check out our documentation to find out more about Kedro and get your questions answered.

To get an existing Kedro project to work after you upgrade to Kedro 0.17.0, we recommend that you create a new project against Kedro 0.17.0 and move the code from your existing project into it. You will want to create a new virtual environment to perform these changes as existing plugins might create conflicts in terms of dependencies.

Let's go through the necessary steps…

The first change you'll see in this new release is that, if you create a new Kedro project with Kedro 0.17.0, you are no longer asked whether you want to include the boilerplate code for the Iris dataset example. We've removed this option; you should now use a Kedro starter if you want to create a project that is pre-populated with code. If you simply want a blank Kedro project to drop your existing code into, you can create one, as always, with kedro new.

1. Update pyproject.toml

Copy the following three keys from the .kedro.yml of your existing Kedro project into the pyproject.toml file of your new Kedro 0.17.0 project:

[tool.kedro]
package_name = "<package_name>"
project_name = "<project_name>"
project_version = "0.17.0"

2. Check your source directory

If you defined a different source directory (source_dir), make sure you also move that to pyproject.toml.

3. Copy files from your existing project

  • Copy subfolders of project/src/<package_name>/pipelines from your existing project into the new project
  • Copy subfolders of project/src/tests/pipelines from your existing project into the new project
  • Copy the requirements your project needs into requirements.txt and/or requirements.in.
  • Copy the project configuration from the conf folder. Take note of the new locations needed for modular pipeline configuration data (move it from conf/base/pipeline_name/catalog.yml to conf/base/catalog/pipeline_name.yml and likewise for parameters.yml).
  • Copy from the data/ folder of your existing project, if needed, into the same location in your new project.
  • Copy any Hooks from src/<package_name>/hooks.py.

4. Update project settings

Copy settings in src/<package_name>/settings.py to the new project as needed.

5. Update your new projectā€™s README and docs as necessary

6. Update settings.py

For example, if you specified additional Hook implementations in hooks, or listed plugins under disable_hooks_by_plugin in your .kedro.yml, you will need to move them to settings.py accordingly:

from <package_name>.hooks import MyCustomHooks, ProjectHooks

HOOKS = (ProjectHooks(), MyCustomHooks())

DISABLE_HOOKS_FOR_PLUGINS = ("my_plugin1",)

7. Migration for node names

From 0.17.0, the only allowed characters for node names are letters, digits, hyphens, underscores and full stops. If you previously defined node names containing spaces or other special characters that are no longer permitted, you will need to rename those nodes.
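For example, a minimal sketch of a compliant node definition (preprocess_companies is a hypothetical node function):

from kedro.pipeline import node

# Before (no longer valid in 0.17.0): name="preprocess companies!"
# After: only letters, digits, hyphens, underscores and full stops.
clean_node = node(
    func=preprocess_companies,
    inputs="companies",
    outputs="preprocessed_companies",
    name="preprocess_companies_node",
)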

8. Copy changes to kedro_cli.py

If you previously customised the kedro run command or added more CLI commands to your kedro_cli.py, you should copy-paste them into <project_root>/src/<package_name>/cli.py. Note, however, that the new way to run a Kedro pipeline is via a KedroSession, rather than using the KedroContext:

with KedroSession.create(package_name=...) as session:
    session.run()

9. Copy changes made to ConfigLoader

If you have defined a custom class, such as TemplatedConfigLoader, by overriding ProjectContext._create_config_loader, you should move the contents of that function into src/<package_name>/hooks.py, under the register_config_loader Hook.
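As a minimal sketch (the globals_pattern argument is illustrative):

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        # Return the ConfigLoader instance the project should use.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")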

10. Copy changes made to DataCatalog

Likewise, if you have a DataCatalog defined with ProjectContext._create_catalog, you should copy-paste the contents into the register_catalog Hook.
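A minimal sketch, extending the same ProjectHooks class as above:

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog

class ProjectHooks:
    @hook_impl
    def register_catalog(self, catalog, credentials, load_versions, save_version, journal):
        # Build and return the DataCatalog instance the project should use.
        return DataCatalog.from_config(
            catalog, credentials, load_versions, save_version, journal
        )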

11. Upgrade Kedro-Viz to the latest version

Kedro 0.17.0 will not work with previous versions of Kedro-Viz. You have to use Kedro-Viz 3.8.0.

What are the other changes we have made?

Here's a brief summary of the key changes in Kedro 0.17.0, but take a look at the full release notes on GitHub for details.

  • We introduced KedroSession to manage the lifecycle of a Kedro run.
  • There are additional hook specifications in the DatasetSpecs class to run Hooks before and after datasets are loaded from/saved to the catalog (see the sketch after this list).
  • We added kedro catalog create. For a registered pipeline, it creates a <conf_root>/<env>/catalog/<pipeline_name>.yml configuration file with MemoryDataSet entries for each dataset that is missing from DataCatalog.
  • We replaced .kedro.yml with settings.py and pyproject.toml.
  • We removed ProjectContext from src/<package_name>/run.py.
  • We made registration Hooks the mandatory way to customise the ConfigLoader or the DataCatalog used in a project.
  • We made a set of breaking changes and bug fixes, documented in the release notes.
  • We deleted the deprecated kedro.cli and kedro.context modules and gave notice of a set of upcoming deprecations in Kedro 0.18.0 (kedro.framework.context.load_context, kedro.framework.cli.get_project_context, the decorator API for both node and pipeline and the Transformers API).
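As a minimal sketch of the new dataset Hooks, a project or plugin could log every catalog load and save (DatasetLoggingHooks is an illustrative name):

import logging

from kedro.framework.hooks import hook_impl

class DatasetLoggingHooks:
    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        logging.getLogger(__name__).info("Loading dataset: %s", dataset_name)

    @hook_impl
    def after_dataset_saved(self, dataset_name, data):
        logging.getLogger(__name__).info("Saved dataset: %s", dataset_name)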

Written by: Dmitrii Deriabin — Machine Learning Engineer, Ivan Danov — Machine Learning Engineer & Tech Lead, Jo Stichbury — Technical Writer and Editor, Liam Brummitt — Front-End Engineer, Lim Hoang — Software Engineer, Lorena Balan — Software Engineer, Merel Theisen — Software Engineer, Yetunde Dada — Product Manager
