New in Kedro this Month!

In May 2020, we’re celebrating the releases of Kedro 0.16.1 and Kedro-Viz 3.3.0, important milestones in the data science workflow.

Written by: Jo Stichbury — Technical Writer and Editor, Yetunde Dada — Product Manager, Lim Hoang — Software Engineer, Kiyohito Kunii — Software Engineer, QuantumBlack

Photo by Luca Bravo on Unsplash

Kedro is an open source workflow development framework that implements software engineering best practice for data and machine learning (ML) pipelines. It was designed to eliminate delays in collaborative projects and guarantee production-ready analytics code. Kedro is built upon the collective knowledge gathered by QuantumBlack, whose teams routinely deliver real-world machine learning applications as part of McKinsey.

In this article, we won’t explain further what Kedro is, but you can read our introductory article to find out more, and find links to other articles and podcasts in our documentation.

What’s in Kedro 0.16.1 and Kedro-Viz 3.3.0?

Kedro was designed to make it easier to build data and ML pipelines, and with this release, we’ve focused on helping you to create reusable pipelines. We’ve also made it easier to extend Kedro, and we have improved the overall user experience.

TL;DR

  • We’re doubling down on how Kedro is used to create reusable analytics code stores; you will experience an improved workflow for creating and sharing pre-built pipelines.
  • It’s easier to add new tooling like MLflow, Great Expectations and more to your workflow using Hooks, which let you extend Kedro’s standard execution and add extra behaviours.
  • You can now tell the story of how you processed data, from source to actionable insights, using the Data Layers functionality on Kedro-Viz.
  • We have made the core library of Kedro much smaller by separating optional dependencies from core library installation.

Reusable analytics code stores

We’ve noticed that our users often use Kedro to create reusable analytics code stores. So we’re now introducing a workflow that makes it possible for you to interact with pre-built pipelines for your specific domain. The goal is efficiency; if you can reuse much of your analytics code, you won’t need to start from scratch when working on a new but similar project.

The pipeline group of commands on Kedro’s CLI

The new kedro pipeline group of commands supports the way you create and share reusable pipelines. We’ll keep iterating on this functionality.

Extending Kedro with Hooks

We have added Hooks, which allow you to extend Kedro by injecting additional behaviour at certain points in the main pipeline execution.

The following execution touch-points, known as Hook specifications, are provided in kedro.framework.hooks. We prototyped a few of our most requested touch points in the table below.

You can learn more about Hooks in our upcoming introductory article, as well as in the Kedro documentation.
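To make the idea concrete, here is a minimal sketch of a Hook class that times each node. It assumes the before_node_run and after_node_run Hook specifications and the hook_impl marker from kedro.framework.hooks; the class name NodeTimerHooks and its attributes are hypothetical. The fallback decorator exists only so the sketch can run where Kedro is not installed.

```python
import time

try:
    # Marker that flags a method as a Hook implementation (Kedro 0.16+).
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Assumption for illustration only: a no-op stand-in so this sketch
    # can be read and exercised without Kedro installed.
    def hook_impl(func):
        return func


class NodeTimerHooks:
    """Hypothetical Hook class that records how long each node takes."""

    def __init__(self):
        self._started = {}   # node name -> start time
        self.durations = {}  # node name -> elapsed seconds

    @hook_impl
    def before_node_run(self, node):
        # Called just before a node runs; remember when it started.
        self._started[node.name] = time.perf_counter()

    @hook_impl
    def after_node_run(self, node):
        # Called after a node finishes; store the elapsed time.
        started = self._started.pop(node.name)
        self.durations[node.name] = time.perf_counter() - started
```

In a Kedro 0.16 project, you would typically register an instance of such a class through the hooks attribute of your ProjectContext; check the Hooks section of the documentation for the exact registration steps in your version.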

Data processing as part of your pipeline visualisation

Late last year we introduced Kedro’s pipeline visualisation tool, Kedro-Viz, with an article about how it’s used to collaborate with business stakeholders when solving an ML use case. In Kedro-Viz 3.3.0, we’re expanding on this ability with the Data Layers functionality, so you can apply your data processing convention to your pipeline visualisation.

At QuantumBlack, the Data Engineering Convention is our style guide for describing the different pre-processing steps that you would encounter in a data and ML pipeline. We start from raw, immutable source data, and work our way through data cleaning and feature engineering until we get to actionable insights. In industry, there is no standard convention, as this tweet from Jeremy Stanley (Founder and CTO, Stealth, and former VP of Data Science at Instacart) reveals.

This is why we’ve made the design flexible for whatever definitions you use. The data processing attribute is defined as a layer in the Data Catalog.

For example, your catalog.yml entry for an interaction with a SQL Database as one of your raw data sources looks like:

scooters:
  type: pandas.SQLTableDataSet
  credentials: scooters_credentials
  table_name: scooters
  load_args:
    index_col: ['name']
    columns: ['name', 'gear']
  save_args:
    if_exists: 'replace'
  layer: raw

This would allow you to create a pipeline visualisation that looks like the following:

Pipeline visualisation on Kedro-Viz with data processing layers enabled
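As a rough illustration of how a layer attribute can drive grouping, the sketch below sorts catalog entries into layers. It is not how Kedro-Viz is implemented. The catalog is written as a plain Python dictionary, and the layer order, the dataset names other than scooters, and the group_by_layer helper are all assumptions made for this example; substitute whatever convention you use.

```python
from collections import defaultdict

# Hypothetical layer order; replace it with your own convention.
LAYER_ORDER = ["raw", "intermediate", "primary", "feature", "reporting"]

# Catalog entries as a plain dict, mirroring the shape of catalog.yml.
# Only "scooters" comes from the example above; the rest are made up.
catalog = {
    "scooters": {"type": "pandas.SQLTableDataSet", "layer": "raw"},
    "scooters_clean": {"type": "pandas.CSVDataSet", "layer": "intermediate"},
    "ride_features": {"type": "pandas.ParquetDataSet", "layer": "feature"},
    "scratch": {"type": "pandas.CSVDataSet"},  # no layer declared
}


def group_by_layer(entries, layer_order=LAYER_ORDER):
    """Group dataset names by their declared layer, in convention order."""
    groups = defaultdict(list)
    for name, config in entries.items():
        groups[config.get("layer", "unassigned")].append(name)

    # Known layers first, in the order given by the convention.
    ordered = {
        layer: sorted(groups[layer]) for layer in layer_order if layer in groups
    }
    # Any other layers (including "unassigned") follow alphabetically.
    for layer in sorted(groups):
        if layer not in ordered:
            ordered[layer] = sorted(groups[layer])
    return ordered


layers = group_by_layer(catalog)
for layer, names in layers.items():
    print(f"{layer}: {', '.join(names)}")
```

Because the layer is just a label on each catalog entry, changing your convention means editing the labels and the order, not the pipeline code.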

We have also made some UI and UX improvements to Kedro-Viz. In future releases, look out for a side-panel for node and dataset metadata as well as a solution to help you visualise reusable pipelines.

A lighter core library

Our users told us that Kedro was carrying too many dependencies when it was first installed. The size of the library also made it challenging to deploy Kedro pipelines to some environments that restrict the size of the deployment bundle, such as AWS Lambda. In response to this feedback, we are introducing a modular installation experience from Kedro 0.16.1. You can now choose to use certain components based on what you’re trying to do and leave out others. For instance, for a workflow dependent on PySpark, you can install the core library of Kedro and the dataset dependencies for using PySpark:

pip install "kedro[pyspark]"

You can also be more granular. For example, if your workflow only uses CSV and Parquet files, you can choose to install only the dependencies related to those datasets:

pip install "kedro[pandas.CSVDataSet]"
pip install "kedro[pandas.ParquetDataSet]"

Integration with platforms such as AWS is now easier thanks to a pure Python core library, with other extensions brought in only when they are needed.
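With optional dependencies installed separately, it can be handy to check what is available in an environment before a pipeline runs. The following is a small sketch using only the Python standard library. The mapping of dataset names to modules is an assumption for illustration, not Kedro’s own dependency list, and missing_modules is a hypothetical helper.

```python
from importlib.util import find_spec

# Hypothetical mapping from a dataset to the top-level modules it needs.
# Illustrative only; consult Kedro's own requirements for the real list.
REQUIRED_MODULES = {
    "pandas.CSVDataSet": ["pandas"],
    "pandas.ParquetDataSet": ["pandas", "pyarrow"],
    "spark.SparkDataSet": ["pyspark"],
}


def missing_modules(dataset, requirements=REQUIRED_MODULES):
    """Return the modules a dataset needs that cannot be imported here."""
    return [
        module
        for module in requirements.get(dataset, [])
        if find_spec(module) is None  # None means the module was not found
    ]


for dataset in REQUIRED_MODULES:
    missing = missing_modules(dataset)
    status = "ready" if not missing else "missing " + ", ".join(missing)
    print(f"{dataset}: {status}")
```

A check like this is useful in size-constrained deployments, where you want to confirm that only the dependencies you actually need were installed.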

Redesigning the user experience

We have made the command line experience 50% faster. We’ve also improved error handling; if you make a typo on the command line, it now suggests the commands you may have intended to type, git-style.
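One common way to produce git-style suggestions is fuzzy matching against the list of known commands. The sketch below uses difflib from the Python standard library; it illustrates the technique and is not necessarily how Kedro’s CLI does it. The command list is an illustrative subset, and both helper functions are hypothetical.

```python
from difflib import get_close_matches

# Illustrative subset of command names, not a complete list.
KNOWN_COMMANDS = ["run", "new", "install", "pipeline", "test", "lint", "jupyter"]


def suggest_commands(typed, known=KNOWN_COMMANDS, max_suggestions=3):
    """Return known commands that closely resemble what the user typed."""
    return get_close_matches(typed, known, n=max_suggestions, cutoff=0.6)


def unknown_command_message(typed, known=KNOWN_COMMANDS):
    """Build an error message, adding suggestions when there are any."""
    message = f"No such command '{typed}'."
    matches = suggest_commands(typed, known)
    if matches:
        suggestions = "\n".join(f"    {match}" for match in matches)
        message += f"\n\nDid you mean one of these?\n{suggestions}"
    return message


print(unknown_command_message("piepline"))
```

The cutoff value controls how forgiving the matching is: a lower value suggests more distant matches, at the cost of noisier suggestions.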

There are a number of documentation improvements in the new release, including a section on the new Hooks feature, and we have simplified the getting started guide. We have also improved our processes, and have reworked the community code of conduct.

Check out the release notes!

Take a look at RELEASE.md for details of API breaks, bug fixes and a basic migration guide. It contains a comprehensive list of changes; highlights include:

  • We have added support for Pandas 1.x.
  • We’ve enabled Python 3.8 compatibility, but we are no longer supporting Python 3.5.
  • conda installations are now allowed, thanks to a community contribution.
  • We’ve added an ImageDataSet for interacting with images, a GeoJSONDataSet for working with geospatial data and an APIDataSet for handling API requests.

In conclusion

If you’re not yet a Kedro user and you want to find out more, we recommend that you start with our documentation. There’s also a neat presentation available from PyConDE & PyData Berlin 2019 if you want to be guided through an overview.

If you are already using Kedro and you’re inspired to contribute, take a look at our guide on GitHub, which describes how to work with the team behind Kedro to contribute feature requests, bug reports and documentation.

QuantumBlack, AI by McKinsey

An advanced analytics firm operating at the intersection of strategy, technology and design.