Getting Started with Kedro
A Python library for production-ready data pipelines
Jo Stichbury originally published this article in Towards Data Science. This article has been updated to reflect the latest version of Kedro.
In this post, I will introduce Kedro, a new open-source tool for data scientists and data engineers. After a brief description of what it is and why it is likely to become a standard part of every professional’s toolchain, I will describe how to use it in a tutorial that you should be able to complete in fifteen minutes. Strap in for a spaceflight to the future!
Suppose you are a data scientist working for a senior executive who makes key financial decisions for your company. She asks you to provide an ad-hoc analysis, and when you do, she thanks you for delivering useful insights for her planning. Great!
Three months down the line, the newly promoted executive, now your CEO, asks you to re-run the analysis for the next planning meeting… and you cannot. The code is broken because you’ve overwritten some of the key sections of the file, and you cannot remember the exact environment you used at the time. Or maybe the code is OK, but it’s in one big Jupyter Notebook with all the file paths hardcoded, meaning that you have to laboriously check and change each one for the new data inputs. Not so great!
Some everyday principles of software engineering are unfamiliar to data scientists, not all of whom have an extensive background in programming. In fact, many data scientists are self-taught, learning to program as part of a research project or when necessary in their jobs. And, maybe, a few data scientists will argue that their code does not need to be of a high standard because they aren’t working on a production system.
Any code that feeds some business decision process should be considered as production code
Data scientists may not consider the primary output of their work to be code, but this doesn’t mean that the code they write shouldn’t follow the standards expected by a software engineering team. In fact, it should have the following, minimum, characteristics:
- Versioning — Use git or a similar tool to save your changes regularly whether you work alone or in a team.
- Reproducibility — You should be able to transfer a project to another computer and run it without significant effort or reconfiguration.
- Follow standards — A common project structure, standard coding conventions and tools.
- Documentation — Automated documentation to keep your documentation up-to-date with your code.
- Modularity — Code can be executed and tested easily.
For more detail on these principles, take a look at a useful blog post by Thomas Huijskens from QuantumBlack.
What is Kedro?
Kedro is a development workflow tool that allows you to create portable data pipelines. It applies software engineering best practices to your data science code so it is reproducible, modular and well-documented. If you use Kedro you can worry less about how to write production-ready code (Kedro does the heavy lifting for you) and you’ll standardise the way your team collaborates, allowing you all to work more efficiently.
Many data scientists need to perform the routine tasks of data cleaning, processing and compilation that may not be their favourite activities but form a large percentage of their day to day tasks. Kedro makes it easier to build a data pipeline to automate the “heavy lifting” and reduce the amount of time spent on this kind of task.
Kedro was created by QuantumBlack, an advanced analytics firm that was acquired by the consultancy giant McKinsey & Company in 2015. QuantumBlack have used Kedro on more than 70 projects with McKinsey and have now decided to open source it. Their goal is to enable staff, clients and third-party developers to use Kedro and build upon it. By extending this open tool for their own needs, data professionals are able to complete their regular tasks effectively and potentially share their enhancements with the community.
Is Kedro a workflow scheduler?
Kedro is not a workflow scheduler like Airflow or Luigi. Kedro makes it easy to prototype your data pipeline, while Airflow and Luigi are complementary frameworks that are great at managing deployment, scheduling, monitoring and alerting. A Kedro pipeline is like a machine that builds a car part. Airflow or Luigi tell different machines to switch on or off in order to work together and produce a car. QuantumBlack has built a Kedro-Airflow plugin, providing faster prototyping time and reducing the barriers to entry associated with moving pipelines to Airflow.
What do I need to know?
Kedro’s documentation has been designed for beginners to get started creating their own Kedro projects (in Python 3.5+). If you have only a very basic knowledge of Python, you might find the learning curve more challenging, but bear with it and follow the documentation for guidance.
A fifteen-minute spaceflight with Kedro 0.15.0
My easy-to-follow fifteen-minute tutorial will be based on the following scenario:
It is 2160 and the space tourism industry is booming. Globally, there are thousands of space shuttle companies taking tourists to the moon and back. You have been able to source three (fictional) datasets about the amenities offered in each space shuttle, customer reviews and company information. You want to train a linear regression model that uses the datasets to predict the price of shuttle hire. However, before you can train the model, you need to do some data engineering: preparing the data for model building by creating a master table.
Getting set up
The first step is to pip install kedro==0.15.0 into a new Python 3.5+ virtual environment (we recommend conda), and confirm that it is correctly installed by typing:
kedro info
You should see the Kedro graphic as shown below:
This will additionally give you the version of Kedro that you are using. This tutorial was written using version 0.15.0 in August 2019 and, since this is an active open source project, there may, over time, be changes to the codebase which cause minor changes to the example project. We will endeavour to keep the example code (although not this blog post) up to date, so you should consult the Kedro documentation and release notes if you hit upon any inconsistencies.
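To recap, a typical setup sequence in your terminal might look like the following (the environment name here is purely illustrative):

conda create --name kedro-tutorial python=3.6 -y
conda activate kedro-tutorial
pip install kedro==0.15.0
kedro info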
Kedro workflow
When building a Kedro project, you will typically follow a standard development workflow as shown in the diagram below. In this tutorial, I will walk through each of the steps.
Set up the project template
To keep things simple, the QuantumBlack team have provided all the code you need, so you simply need to clone the example project from its GitHub repo. You do not need to set up a project template in this tutorial.
Once you have the code for the Spaceflights project, to make sure you have the necessary dependencies for it, please run the following in the Python 3.5+ virtual environment:
kedro install
Set up the data
Kedro uses configuration files to make a project’s code reproducible across different environments, when it may need to reference datasets in different locations.
To set up data for a Kedro project, you will typically configure the registry of data sources that manages the loading and saving of data.
This tutorial makes use of three datasets for spaceflight companies shuttling customers to the moon and back, and uses two data formats: .csv and .xlsx. The files are stored in the data/01_raw/ folder of the project directory.
To work with the datasets provided, all Kedro projects have a conf/base/catalog.yml file which acts as a registry of the datasets in use. Registering a dataset is as simple as adding a named entry into the .yml file to include file location (path), parameters for the given dataset, type of data and versioning. Kedro comes with support for a few types of data, such as CSV, and if you look at the catalog.yml file for the Spaceflights tutorial, you will see the following entries for the datasets used:
companies:
  type: CSVLocalDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: CSVLocalDataSet
  filepath: data/01_raw/reviews.csv

shuttles:
  type: ExcelLocalDataSet
  filepath: data/01_raw/shuttles.xlsx
To check whether Kedro loads the data correctly, and to inspect the first five rows of data, open a terminal window and start an IPython session with the kedro ipython CLI command in the Kedro project directory (just type exit() when you want to end the session and continue with the tutorial). Then enter the following:
context.catalog.load("companies").head()
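Under the hood, a CSVLocalDataSet simply wraps pandas, so as a rough mental model (not the actual implementation), the call above is roughly equivalent to:

import pandas as pd

# Approximately what context.catalog.load("companies") does for the CSVLocalDataSet entry above
companies = pd.read_csv("data/01_raw/companies.csv")
print(companies.head())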
Create and run the pipeline
The next part of the workflow is to create a pipeline from a set of nodes, which are Python functions that perform distinct, individual tasks. In a typical project, this stage comprises the following steps:
- Write and document your node functions
- Build your pipeline(s) by specifying inputs and outputs for each of its nodes
- Run Kedro, choosing whether to run the nodes sequentially or in parallel
- Visualise the pipeline
Data engineering pipeline
We have reviewed the raw datasets for the Spaceflights project, and it is now time to consider the data engineering pipeline that processes the data and prepares it for the model within the data science pipeline.
The data engineering pipeline preprocesses two datasets and merges them with a third into a master table.
In the file data_engineering.py inside the nodes folder, you’ll find the preprocess_companies and preprocess_shuttles functions (specified as nodes within the pipeline in pipeline.py). Each function takes in a dataframe and outputs preprocessed data, preprocessed_companies and preprocessed_shuttles respectively.
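To give a flavour of what such a node looks like, here is a minimal sketch of a preprocessing function. The exact transformations in the tutorial’s data_engineering.py differ, and the column name used here is an assumption about the fictional data:

import pandas as pd

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Preprocess the companies data (illustrative sketch only)."""
    companies = companies.copy()
    # Hypothetical clean-up: strip a '%' suffix and convert the rating to a float
    companies["company_rating"] = (
        companies["company_rating"].str.replace("%", "").astype(float)
    )
    return companies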
When Kedro runs the data engineering pipeline, it determines whether datasets are registered in the data catalog (conf/base/catalog.yml). If a dataset is registered, it is persisted automatically to the path specified without a need to specify any code in the function itself. If a dataset is not registered, Kedro stores it in memory for the pipeline run and removes it afterwards.
In the tutorial, the preprocessed data is registered in conf/base/catalog.yml:
preprocessed_companies:
  type: CSVLocalDataSet
  filepath: data/02_intermediate/preprocessed_companies.csv

preprocessed_shuttles:
  type: CSVLocalDataSet
  filepath: data/02_intermediate/preprocessed_shuttles.csv
CSVLocalDataSet is chosen for its simplicity, but you can choose any other available dataset implementation class to save the data, for example, to a database table or cloud storage (like S3, Azure Blob Storage and others).
The data_engineering.py file also includes a function, create_master_table, which the data engineering pipeline uses to join together the companies, shuttles and reviews dataframes into a single master_table.
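Conceptually, create_master_table is a couple of dataframe joins followed by dropping incomplete rows. A simplified sketch, in which the join keys are assumptions about the fictional datasets:

import pandas as pd

def create_master_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Join the three dataframes into a single master table (illustrative sketch)."""
    # Assumed keys: each review references a shuttle, each shuttle references a company
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    master_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    return master_table.dropna()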
Data science pipeline
The data science pipeline builds a model that uses the datasets: a price prediction model based on a simple LinearRegression implementation from the scikit-learn library. The code is found in the file src/kedro_tutorial/nodes/price_prediction.py.
The test size and random state parameters for the prediction model are specified in conf/base/parameters.yml, and Kedro feeds them into the catalog when the pipeline is executed.
You can inspect these parameters by opening a terminal window, starting an IPython session with the kedro ipython CLI command in the Kedro project directory and entering the following:
context.config_loader.get("**/parameters.yml")
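For reference, the entries in conf/base/parameters.yml are plain key-value pairs; the values shown below are illustrative:

test_size: 0.2
random_state: 3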
Combining the pipelines
The project’s pipelines are specified in pipeline.py:
def create_pipeline(**kwargs):
    de_pipeline = Pipeline(
        [
            node(
                preprocess_companies,
                "companies",
                "preprocessed_companies",
            ),
            node(
                preprocess_shuttles,
                "shuttles",
                "preprocessed_shuttles",
            ),
            node(
                create_master_table,
                ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                "master_table",
            ),
        ],
        name="data engineering",
    )
    ds_pipeline = Pipeline(
        [
            node(
                split_data,
                ["master_table", "parameters"],
                ["X_train", "X_test", "y_train", "y_test"],
            ),
            node(train_model, ["X_train", "y_train"], "regressor"),
            node(evaluate_model, ["regressor", "X_test", "y_test"], None),
        ],
        name="data science",
    )
    return de_pipeline + ds_pipeline
The de_pipeline will preprocess the data, then use create_master_table to combine preprocessed_shuttles, preprocessed_companies and reviews into the master_table dataset. The ds_pipeline will then create features, train and evaluate the model. The first node of the ds_pipeline outputs four objects: X_train, X_test, y_train and y_test, which are then used to train the model and, in a final node, to evaluate it.
The two pipelines are merged together in de_pipeline + ds_pipeline. The order in which you add the pipelines together is not significant, since Kedro automatically detects the correct execution order for all the nodes. The same overall pipeline would result if you specified ds_pipeline + de_pipeline.
Both pipelines will be executed when you invoke kedro run, and you should see output similar to the following:
If you have any problems getting the tutorial code up and running, you should check that you are working in a Python 3.5+ environment and have all the dependencies installed for the project. If there are still problems, head over to Stack Overflow and use the #kedro tag for guidance from the community and, if the behaviour you’re observing appears to be a problem with Kedro, please feel free to raise an issue on Kedro’s GitHub repository.
Kedro runners
There are two different ways to run a Kedro pipeline. You can specify:
- SequentialRunner — runs your nodes sequentially: once a node has completed its task then the next one starts
- ParallelRunner — runs your nodes in parallel: independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores, and be more efficient when there are independent branches in your pipeline. Note that ParallelRunner performs task parallelisation (i.e. it can run multiple tasks at once), but does not parallelise the calculations within each task.
By default, Kedro uses a SequentialRunner when you invoke kedro run. Switching to the ParallelRunner is as simple as providing an additional flag as follows:
kedro run --parallel
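If you ever need to run a pipeline programmatically rather than through the CLI, the runner classes can also be used directly from Python. A rough sketch, assuming you already have the project’s pipeline and catalog objects in scope:

from kedro.runner import ParallelRunner

# 'pipeline' and 'catalog' are assumed to be the project's Pipeline and DataCatalog objects
runner = ParallelRunner()
runner.run(pipeline, catalog)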
Visualising your data pipeline
Kedro-Viz shows you how your Kedro data pipelines are structured. With Kedro-Viz you can:
- See how your datasets and Python functions (nodes) are resolved in Kedro so that you can understand how your data pipeline is built
- Get a clear picture when you have lots of datasets and nodes by using tags to visualise sub-pipelines
- Search for nodes and datasets
Kedro-Viz is developed as a Kedro plugin, so the first step is to pip install kedro-viz in your virtual environment.
Finally, run the following command from the project directory to visualise your pipeline:
kedro viz
This command will run a kedro_viz.server on http://127.0.0.1:4141 and will open up your visualisation in a browser. You should be able to see the following:
That’s the end of the tutorial to run through the basics of Kedro. There’s plenty more documentation on the various configuration options and optimisations you can use in your own projects, for example:
- Installation guide
- Longer version of this tutorial
- User guide
- API reference docs
- Blog post by QuantumBlack
Contributions
If you start working with Kedro and want to contribute an example or changes to the Kedro project, we would welcome the chance to work with you. Please consult the contribution guide.
Acknowledgements
The Spaceflights example is based on a tutorial written by the Kedro team at QuantumBlack Labs (Yetunde Dada, Ivan Danov, Dmitrii Deriabin, Lorena Balan, Gordon Wrigley, Kiyohito Kunii, Nasef Khan, Richard Westenra and Nikolaos Tsaousis), who have kindly given me early access to their code and documentation in order to produce this tutorial for TowardsDataScience.