Unboxing Kedro 0.17.0 and Kedro-Viz 3.8.0
'Twas the night before the holidays began and all through the house… there was much excitement, because the elves in QuantumBlack Labs released a new version of Kedro.
This article explains what's new in Kedro 0.17.0 and Kedro-Viz 3.8.0, how to unwrap them and what to expect. Don't forget to read the manual before operation. Batteries not included.
First things first.
What is Kedro?
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts (like modularity, separation of concerns and versioning) from software engineering and applies them to machine-learning code.
At QuantumBlack, we routinely deliver real-world machine learning applications as part of our work. We created Kedro to build upon the knowledge and experience we had gathered. It is designed to eliminate delays in collaborative projects and guarantee production-ready analytics code.
Since we open-sourced it in July 2019, Kedro has been used on 500+ internal and open-source projects. It won Best Technical Tool or Framework for AI in the 2019 Awards AI competition and a merit award in the 2020 UK Technical Communication Awards, and is listed on the 2020 ThoughtWorks Technology Radar and the 2020 Data & AI Landscape.
This article doesn't explain further what Kedro is, so if you are new to the framework, head over to the documentation or take a look at the Kedro code on GitHub. There is also a host of articles, podcasts, talks and Kedro showcase projects in the kedro-community GitHub repository.
What is in Kedro 0.17.0?
TL;DR We are super-enthusiastic about this new release of Kedro because we have focussed on decoupling the framework components. This gets us closer to making Kedro more stable with fewer breaking changes and provides an easier model for extensibility. You can also use the Data Catalog as an individual component while easily transitioning into the rest of the Kedro framework. We have added a chunk of improvements and fixed a few bugs along the way.
The full list of changes is, as always, in the release notes on the Kedro GitHub repository.
Towards a stable release
In a significant change, we have introduced KedroSession, which is responsible for managing the lifecycle of a Kedro run and is the home for any state that needs to be maintained for the lifetime of a run. This change allows us to have a much simpler KedroContext with fewer responsibilities, making it easier for plugin developers and advanced users to understand and modify. It also opens up possibilities for frequently requested, more advanced features in 2021 (stay tuned!). We've also brought Kedro into line with Python best practice by adding settings.py and pyproject.toml (to replace .kedro.yml) for project configuration.
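To illustrate the idea, here is a plain-Python sketch of the pattern a session object follows: it owns all run-scoped state and discards it when the run ends. The class and attribute names below are invented for this example and are not Kedro's actual implementation; see the Kedro documentation for the real KedroSession API.

```python
# Illustrative sketch only: a "session owns run-scoped state" pattern.
# All names here are hypothetical, not Kedro's real API.
import uuid


class Session:
    """Holds state for the lifetime of a single run."""

    def __init__(self, package_name):
        self.package_name = package_name
        self.session_id = uuid.uuid4().hex  # unique identifier per run
        self.store = {}  # run-scoped state lives here, not in a global

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.store.clear()  # state is discarded when the run ends
        return False

    def run(self):
        # A real session would resolve the context and execute the pipeline.
        self.store["status"] = "completed"
        return self.store["status"]


with Session(package_name="my_project") as session:
    result = session.run()
```

Because the state lives on the session rather than on a long-lived context object, the context itself can stay small, which is exactly what makes the slimmer KedroContext possible.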
A clearer deployment model for Kedro pipelines
We have minimised dependencies between framework components to make a clear separation between project metadata and project settings. This change enables a much clearer deployment model for Kedro projects and opens the door to easy deployments onto many more platforms and workflow orchestrators. Plugin developers will also benefit, since the chances of their plugins being broken by a new Kedro release are now dramatically reduced. We have also extended our documentation with a whole new section on deployment to Kubeflow, Argo, Databricks, AWS Sagemaker, AWS Batch, Prefect and more. You should also check out a guide to deploying Kedro pipelines on AWS Sagemaker, published on the AWS Open Source Blog.
The DataCatalog as a standalone component
We conducted user research to understand how to transition users into the full Kedro framework, and we would like to thank everyone who participated in this exercise.
One insight we received was that you want the freedom to use specific Kedro components without committing to the broader framework. The feature most frequently mentioned was the Data Catalog.
We have introduced Mini-Kedro to make it possible to configure and explore data sources in a Jupyter notebook using the Data Catalog. The Data Catalog allows you to specify data sources that you interact with for loading and saving purposes using a YAML API. The advantage of this approach is that you never need to specify file paths for loading or saving data in your Jupyter notebook.
See an example:
# In conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace
This makes it possible to use df = catalog.load("example_dataset_1") and catalog.save("example_dataset_2", df) to interact with data in a Jupyter notebook.
What is in Kedro-Viz 3.8.0?
Kedro-Viz shows the structure of your Kedro pipeline. It displays data and machine-learning pipelines in an informative way, and emphasises the connections between datasets and nodes. In this release, we can showcase some exciting improvements, such as a new node, dataset and parameter metadata side-panel, graph layout engine and lazy loading for long lists of nodes and datasets.
An enhanced developer experience
Kedro-Viz is growing into a tool that supports the developer experience. Over time, we'll release a set of new features that help your workflow: the first in this series is the node, dataset and parameter metadata side-panel.
By clicking on a graph element, you can see additional information about that element. Some examples include the parameters that are used in a node and the code description of a node. Future additions to this panel will include a way to see relevant code blocks.
A new graph layout engine
Our visualisation wizards have been working to improve how lines and graph elements are arranged on a pipeline visualisation. Pipeline layout becomes a significant problem when a pipeline visualisation has more than 500 components.
This feature was released in beta in an earlier version of Kedro-Viz, but we're now ready to make it the default representation for your pipeline visualisations.
We are still looking for feedback on this layout engine. If you run into any problems using it, you can disable the new layout and revert to the old one by adding ?oldgraph=true to the end of the URL in your browser when running Kedro-Viz, e.g. http://localhost:4141/?oldgraph=true.
Loading time support for large pipeline visualisations
Some pipelines may have hundreds of elements and tags, slowing down the loading time of Kedro-Viz.
We have added lazy list rendering and memoisation to the sidebar. This improves time to first paint and first interaction when you load a pipeline with a large number of elements.
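Memoisation simply caches the result of an expensive computation so that repeating the same request, such as re-filtering the same sidebar list, returns instantly. A minimal Python illustration of the technique (this is a generic sketch, not Kedro-Viz's actual implementation, and the node names are made up):

```python
from functools import lru_cache

calls = {"count": 0}  # track how often the expensive work actually runs


@lru_cache(maxsize=None)
def filter_nodes(search_term):
    # Imagine this walks a graph with hundreds of elements.
    calls["count"] += 1
    return tuple(
        n for n in ("split", "train_model", "evaluate") if search_term in n
    )


filter_nodes("model")  # computed once
filter_nodes("model")  # served from the cache, no recomputation
```

Lazy rendering applies the complementary trick on the display side: only the list items currently scrolled into view are rendered, so first paint does not wait for hundreds of off-screen rows.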
This is a beta feature and needs to be enabled with a flag. Add ?lazy=true to the end of the URL in your browser when running Kedro-Viz, e.g. http://localhost:4141/?lazy=true, to try it out.
Migration guide
The Kedro 0.17.0 release contains some breaking changes. If you update Kedro to 0.17.0 and then try to work with projects created against earlier versions of Kedro, you may encounter some issues when trying to run kedro commands in the terminal for that project.
Here's a step-by-step guide to getting your projects running against the new version of Kedro.
As always, if you hit any problems, please check out our documentation to find out more about Kedro and get your questions answered.
To get an existing Kedro project to work after you upgrade to Kedro 0.17.0, we recommend that you create a new project against Kedro 0.17.0 and move the code from your existing project into it. You will want to create a new virtual environment to perform these changes as existing plugins might create conflicts in terms of dependencies.
Let's go through the necessary steps…
The first change you'll see in this new release is that, if you create a new Kedro project with Kedro 0.17.0, you are no longer asked whether you want to include the boilerplate code for the Iris dataset example. We've removed this option and you should now use a Kedro starter if you want to create a project that is pre-populated with code. If you simply want a blank Kedro project to drop your existing code into, you can create one, as always, with kedro new.
1. Update pyproject.toml
Copy the following three keys from the .kedro.yml of your existing Kedro project into the pyproject.toml file of your new Kedro 0.17.0 project:
[tool.kedro]
package_name = "<package_name>"
project_name = "<project_name>"
project_version = "0.17.0"
2. Check your source directory
If you defined a different source directory (source_dir), make sure you also move that to pyproject.toml.
3. Copy files from your existing project
- Copy subfolders of project/src/project_name/pipelines from the existing project to the new project.
- Copy subfolders of project/src/test/pipelines from the existing project to the new project.
- Copy the requirements your project needs into requirements.txt and/or requirements.in.
- Copy the project configuration from the conf folder. Take note of the new locations needed for modular pipeline configuration data (move it from conf/base/pipeline_name/catalog.yml to conf/base/catalog/pipeline_name.yml, and likewise for parameters.yml).
- Copy the data/ folder of your existing project, if needed, into the same location in your new project.
- Copy any Hooks from src/<package_name>/hooks.py.
4. Update project settings
Copy settings in src/<package_name>/settings.py to the new project as needed.
5. Update your new project's README and docs as necessary
6. Update settings.py
For example, if you specified additional Hook implementations in hooks, or listed plugins under disable_hooks_by_plugin in your .kedro.yml, you will need to move them to settings.py accordingly:
from <package_name>.hooks import MyCustomHooks, ProjectHooks
HOOKS = (ProjectHooks(), MyCustomHooks())
DISABLE_HOOKS_FOR_PLUGINS = ("my_plugin1",)
7. Migration for node names
From Kedro 0.17.0, the only allowed characters in node names are letters, digits, hyphens, underscores and full stops. If you previously defined node names containing spaces or other characters that are no longer permitted, you will need to rename those nodes.
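If you want to check which of your existing node names would be rejected before renaming anything, a quick stand-alone check against the allowed character set might look like this (the regex below is our own reading of the rule above, not Kedro's validation code, and the node names are invented examples):

```python
import re

# Allowed characters per the Kedro 0.17.0 rule described above:
# letters, digits, hyphens, underscores and full stops.
VALID_NODE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def invalid_node_names(names):
    """Return the node names that would need renaming."""
    return [n for n in names if not VALID_NODE_NAME.match(n)]


# "eval v2!" contains a space and an exclamation mark, so it is flagged.
invalid_node_names(["train_model", "eval v2!", "split.data"])
```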
8. Copy changes to kedro_cli.py
If you previously customised the kedro run command or added more CLI commands to your kedro_cli.py, you should copy-paste them into <project_root>/src/<package_name>/cli.py. Note, however, that the new way to run a Kedro pipeline is via a KedroSession, rather than using the KedroContext:
from kedro.framework.session import KedroSession

with KedroSession.create(package_name=...) as session:
    session.run()
9. Copy changes made to ConfigLoader
If you have defined a custom class, such as TemplatedConfigLoader, by overriding ProjectContext._create_config_loader, you should move the contents of the function into src/<package_name>/hooks.py, under register_config_loader.
10. Copy changes made to DataCatalog
Likewise, if you have a DataCatalog defined with ProjectContext._create_catalog, you should copy-paste the contents into register_catalog.
11. Upgrade Kedro-Viz to the latest version
Kedro 0.17.0 will not work with previous versions of Kedro-Viz. You have to use Kedro-Viz 3.8.0.
What are the other changes we have made?
Here's a brief summary of the key changes in Kedro 0.17.0, but take a look at the full release notes on GitHub for details.
- We introduced KedroSession to manage the lifecycle of a Kedro run.
- There are additional Hook specifications in the DatasetSpecs class to run Hooks before and after datasets are loaded from/saved to the catalog.
- We added kedro catalog create. For a registered pipeline, it creates a <conf_root>/<env>/catalog/<pipeline_name>.yml configuration file with MemoryDataSet entries for each dataset that is missing from DataCatalog.
- We replaced .kedro.yml with settings.py and pyproject.toml.
- We removed ProjectContext from src/<package_name>/run.py.
- We made registration Hooks the mandatory way to customise the ConfigLoader or the DataCatalog used in a project.
- We made a set of breaking changes and bug fixes, documented in the release notes.
- We deleted the deprecated kedro.cli and kedro.context modules and gave notice of a set of upcoming deprecations in Kedro 0.18.0 (kedro.framework.context.load_context, kedro.framework.cli.get_project_context, the decorator API for both node and pipeline, and the Transformers API).
Written by: Dmitrii Deriabin (Machine Learning Engineer), Ivan Danov (Machine Learning Engineer & Tech Lead), Jo Stichbury (Technical Writer and Editor), Liam Brummitt (Front-End Engineer), Lim Hoang (Software Engineer), Lorena Balan (Software Engineer), Merel Theisen (Software Engineer), Yetunde Dada (Product Manager)