Learn you a Kedro*
Write reproducible, maintainable and modular data science code
*A tribute to “Learn you a Haskell for Great Good!”
Grab a coffee and sit down. This is a long post!
In this article, I introduce Kedro, an open-source Python framework for creating reproducible, maintainable and modular data science code. After a brief description of what it is and why it is likely to become a standard part of every data scientist’s toolchain, I describe some technical Kedro concepts and illustrate how to use them with a tutorial.
Overall, you should be able to read and digest this article in about 30 minutes. But the benefits of using Kedro will last you a professional lifetime.
Suppose you are a data scientist working for a senior executive who makes key financial decisions for your company. She asks you to provide an ad-hoc analysis, and when you do, she thanks you for delivering useful insights for her planning. Great!
Three months down the line, the newly-promoted executive, now your CEO, asks you to re-run the analysis for the next planning meeting… and you can’t. The code is broken because you’ve overwritten some of the key sections of the file, and you don’t remember the exact environment you used at the time. Or maybe the code is OK, but it’s in one big Jupyter notebook with all the file paths hard-coded, meaning you have to laboriously check and change each one for the new data inputs. Oops! That’s not so great!
It happens. Probably more often than it should. When it does, the usual argument is that the original project wasn’t planned to a high standard because it wasn’t designed for a production system.
But… any code that feeds some business decision process should be considered production code!
Of course, if you’re working as a data scientist, you may not consider the primary output of your work to be code. But your project still uses code, so why not build on the hard-won experience of software engineers? Look to future-proof your project with the following minimum characteristics:
- It should be versioned — Use git or a similar tool to save your changes regularly whether you work alone or in a team.
- It should be reproducible — You should be able to transfer a project to another computer and run it without significant effort.
- It should follow standards — Stick to a common project structure, standard coding conventions and tools for future collaboration and better teamwork.
- It should be documented — Use automated documentation to keep your documentation up-to-date with your code.
- It should be modular — Break your code into small units so it can be executed and tested easily.
For more detail on these principles, take a look at a useful blog post by Thomas Huijskens.
Introducing Kedro
Kedro helps you create reproducible, maintainable and modular data science code. It borrows concepts from software engineering, such as modularity, separation of concerns and versioning, and applies them to machine-learning code.
Kedro concepts
First, let’s examine the most basic elements of Kedro with a traditional “Hello World” example that’s broken into sections. You can find out more about these concepts in the Kedro glossary.
Node
A node is a wrapper for a Python function that names the inputs and outputs of that function. Nodes are the building blocks of a pipeline and you can link them such that the output of one node is the input of another.
Here, a node called return_greeting_node has no inputs and names a single output (my_salutation). It wraps the function return_greeting:
# Prepare first node
def return_greeting():
    return "Hello"

return_greeting_node = node(func=return_greeting, inputs=None, outputs="my_salutation")
Here, a second node called join_statements_node names a single input (my_salutation) and a single output (my_message). It wraps the function join_statements:
# Prepare second node
def join_statements(greeting):
    return f"{greeting} Kedro!"

join_statements_node = node(join_statements, inputs="my_salutation", outputs="my_message")
Note that my_salutation is the output of return_greeting_node and also the input of join_statements_node.
Pipeline
A Kedro pipeline organises the dependencies and execution order of a collection of nodes. In this example, the pipeline executes return_greeting_node before it executes join_statements_node:
# Assemble nodes into a pipeline
pipeline = Pipeline([return_greeting_node, join_statements_node])
DataCatalog
The Kedro DataCatalog is a registry of all data sources that the project can use. It maps the names of node inputs and outputs to DataSet instances; DataSet is a Kedro class that can be specialised for different types of data storage. Kedro uses a MemoryDataSet for data that is simply stored in memory, and it also provides a range of built-in datasets for different file types and file systems, so you don’t have to write the logic for reading and writing data:
# Prepare a data catalog
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})
Runner
The Runner is an object that runs the pipeline. Kedro resolves the order in which to execute the nodes:
- Kedro first executes return_greeting_node. This runs return_greeting, which takes no input but outputs the string “Hello”.
- Kedro stores the output string in the MemoryDataSet named my_salutation.
- Kedro then executes the second node, join_statements_node. This loads the my_salutation dataset and injects it into the join_statements function.
- The function joins the input salutation with “Kedro!” to form the output string “Hello Kedro!”
- Kedro returns the output of the pipeline in a dictionary, with key my_message.
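The run step itself is only a couple of lines. Here is a minimal sketch using Kedro’s built-in SequentialRunner (the full script appears in the next section):

# Create a runner to run the pipeline
runner = SequentialRunner()

# Run the pipeline and print the returned outputs
print(runner.run(pipeline, data_catalog))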
Hello Kedro!
It’s now time to stitch everything together. Here is the full example:
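The complete script, assembled from the snippets above and run with Kedro’s SequentialRunner, looks something like this sketch (written against the Kedro version current at the time of writing):

# hello_kedro.py
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

# Prepare a data catalog
data_catalog = DataCatalog({"my_salutation": MemoryDataSet()})

# Prepare first node
def return_greeting():
    return "Hello"

return_greeting_node = node(return_greeting, inputs=None, outputs="my_salutation")

# Prepare second node
def join_statements(greeting):
    return f"{greeting} Kedro!"

join_statements_node = node(join_statements, inputs="my_salutation", outputs="my_message")

# Assemble nodes into a pipeline
pipeline = Pipeline([return_greeting_node, join_statements_node])

# Create a runner and run the pipeline
runner = SequentialRunner()
print(runner.run(pipeline, data_catalog))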
If you put this example into a file, e.g. hello_kedro.py, and run it in the terminal (python hello_kedro.py), you should see {'my_message': 'Hello Kedro!'} printed to the console.
Get started!
Now that you have a basic understanding of the concepts, why not install Kedro and try it out? Most people start with the Kedro spaceflights tutorial, which you can find in the documentation, but to simplify it, I’ve created a shortened version below.
Consult the installation prerequisites in the Kedro documentation. Once you are ready, run the following to install Kedro from the Python Package Index (PyPI):

pip install kedro

To check that Kedro is installed:

kedro info
You should see an ASCII art graphic and the Kedro version number.
If you do not see the graphic displayed, or have any issues with your installation, see the frequently asked questions or Kedro community support on Discord. And for more information, check out the documentation about installing Kedro.
Kedro spaceflights tutorial
In the tutorial you will learn each step of the Kedro project development workflow with an example that constructs nodes and pipelines for the following price-prediction scenario:
It is 2160 and the space tourism industry is booming. Globally, there are thousands of space shuttle companies taking tourists to the Moon and back. You have been able to source data on the amenities offered in each space shuttle, customer reviews and company information.
You want to construct a model for predicting the price for each trip to the Moon and the corresponding return flight.
You will need approximately 20 minutes to work through this tutorial, which uses a Kedro starter to create the spaceflights project for you, then walks through the most important sections of code to relate them to the concepts you learned above.
Set up the project
We assume at this point that you have installed Kedro, as described above. First, create your project within your preferred working directory, using the Kedro starter for the spaceflights tutorial:

kedro new --starter=spaceflights
You’ll have a fully-working project when you use this, with all the code and data set up and ready to run.
Feel free to name your project as you like, but here we assume it’s called Kedro Tutorial.
Keep the default names for the repo_name (kedro-tutorial) and python_package (kedro_tutorial) when prompted.
Next, use Kedro to install the project-specific dependencies. Navigate to the root directory of the project and run:

kedro install
Kedro manages your project’s dependencies to make it easier for others to run the project. It avoids version conflicts: Kedro ensures that you all use the same Python packages and versions. You can find out more about how to work with project dependencies in the Kedro documentation.
At this point, you can just run the project to see what happens. So if you want to skip ahead, scroll down to the “Test the pipeline” section.
Set up the data
The spaceflights tutorial makes use of fictional datasets of companies shuttling customers to the Moon and back. You will use the data to train a model to predict the price of shuttle hire. However, before you get to train the model, you will need to prepare the data for model building.
The spaceflights tutorial has three files and uses two data formats: .csv and .xlsx. You will find the raw data (companies.csv, reviews.csv and shuttles.xlsx) in the data/01_raw/ folder of your project directory.
The datasets need to be registered so Kedro can load them. All Kedro projects have a conf/base/catalog.yml file, and in this example the starter has already added each dataset to that file, specifying:
- File location (path)
- Parameters for the given dataset
- Type of data
- Versioning
For the csv datasets:

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv

For the xlsx dataset:

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
To confirm that Kedro can load the data correctly, open a Kedro IPython session (kedro ipython) and run this:
companies = catalog.load("companies")
companies.head()

shuttles = catalog.load("shuttles")
shuttles.head()
These commands load each dataset; the pandas head method then displays the first five rows of each dataframe.
When you have finished, close the IPython session:

exit()
Next, we will set up two modular pipelines for the project:
- data_processing_pipeline to preprocess the data
- data_science_pipeline to create features, train and evaluate the model
Data processing modular pipeline
Let’s look at the example nodes used to pre-process the three input datasets to create a master table and prepare the data for modelling.
In src/kedro_tutorial/pipelines/data_processing/nodes.py, you’ll see two functions (preprocess_companies and preprocess_shuttles) that each take a raw dataframe as input and output a dataframe containing pre-processed data:
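The starter generates the exact code for you, but the two functions look roughly like the sketch below; the private helpers and the column names are assumptions based on the spaceflights starter data:

import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    # Convert "t"/"f" flags to booleans
    return x == "t"


def _parse_percentage(x: pd.Series) -> pd.Series:
    # "95%" -> 0.95
    return x.str.replace("%", "", regex=False).astype(float) / 100


def _parse_money(x: pd.Series) -> pd.Series:
    # "$1,325.0" -> 1325.0
    return x.str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies


def preprocess_shuttles(shuttles: pd.DataFrame) -> pd.DataFrame:
    shuttles["d_check_complete"] = _is_true(shuttles["d_check_complete"])
    shuttles["moon_clearance_complete"] = _is_true(shuttles["moon_clearance_complete"])
    shuttles["price"] = _parse_money(shuttles["price"])
    return shuttles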
You’ll find nodes for both functions, and a modular pipeline for data processing, in create_pipeline() within src/kedro_tutorial/pipelines/data_processing/pipeline.py:
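In outline, create_pipeline() wires the functions up as nodes, something like this sketch (the node names are illustrative):

from kedro.pipeline import Pipeline, node

from .nodes import create_master_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_master_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="master_table",
                name="create_master_table_node",
            ),
        ]
    )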
Note: companies and shuttles refer to the datasets defined in conf/base/catalog.yml. These are inputs to the preprocess_companies and preprocess_shuttles functions. The Kedro pipeline uses the named node inputs (and outputs) to determine interdependencies between the nodes and, hence, their execution order.
The nodes each output a new dataset (preprocessed_companies and preprocessed_shuttles). Kedro’s DataCatalog will take care of saving the datasets automatically (in this case as csv data).
In the pipeline above, you’ll see another node, create_master_table(), which joins the three dataframes into a single master table. The function is defined in src/kedro_tutorial/pipelines/data_processing/nodes.py:
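A sketch of that function is shown below; the join keys (shuttle_id and company_id) are assumptions based on the spaceflights starter data:

import pandas as pd


def create_master_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    # Join shuttles to their reviews, then attach the company information
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    with_companies = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    # Drop incomplete rows so the model only sees complete records
    return with_companies.dropna()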
Kedro calls create_master_table() with the data loaded from datasets preprocessed_shuttles, preprocessed_companies, and reviews and saves the output to dataset master_table.
Modular pipeline for data science
Now let’s look at the data science pipeline for price prediction, which uses the LinearRegression implementation from the scikit-learn library.
You’ll see three data science functions in src/kedro_tutorial/pipelines/data_science/nodes.py:
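They split the master table, train the regressor and evaluate it. Here is a sketch of what they look like, assuming the starter’s function names split_data, train_model and evaluate_model:

import logging
from typing import Dict, Tuple

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: Dict) -> Tuple:
    # Select the feature columns listed in parameters.yml and the target column
    X = data[parameters["features"]]
    y = data["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor


def evaluate_model(regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series) -> None:
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logging.getLogger(__name__).info("Model has a coefficient R^2 of %.3f on test data.", score)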
The input parameters are in the configuration file conf/base/parameters.yml:
test_size: 0.2
random_state: 3
features:
- engines
- passenger_capacity
- crew
- d_check_complete
- moon_clearance_complete
- iata_approved
- company_rating
- review_scores_rating
These are the parameters fed into the DataCatalog when the pipeline executes. The parameters test_size and random_state are used as part of the train-test split, and features gives the names of columns in the master table to use as features.
The trained model is saved as a dataset (see conf/base/catalog.yml):
regressor:
  type: pickle.PickleDataSet
  filepath: data/06_models/regressor.pickle
  versioned: true
The modular pipeline for data science is created within src/kedro_tutorial/pipelines/data_science/pipeline.py:
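A sketch of that create_pipeline() function, wiring the three data science functions together (node names are illustrative):

from kedro.pipeline import Pipeline, node

from .nodes import evaluate_model, split_data, train_model


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["master_table", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                name="split_data_node",
            ),
            node(
                func=train_model,
                inputs=["X_train", "y_train"],
                outputs="regressor",
                name="train_model_node",
            ),
            node(
                func=evaluate_model,
                inputs=["regressor", "X_test", "y_test"],
                outputs=None,
                name="evaluate_model_node",
            ),
        ]
    )

Note that "parameters" is how a node asks the DataCatalog for the dictionary defined in parameters.yml.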
Merge the pipelines
The data processing and data science pipelines are added to the project in register_pipelines() within src/kedro_tutorial/pipeline_registry:
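A sketch of what that registry function looks like in a generated project:

from typing import Dict

from kedro.pipeline import Pipeline

from kedro_tutorial.pipelines import data_processing as dp
from kedro_tutorial.pipelines import data_science as ds


def register_pipelines() -> Dict[str, Pipeline]:
    data_processing_pipeline = dp.create_pipeline()
    data_science_pipeline = ds.create_pipeline()
    return {
        "__default__": data_processing_pipeline + data_science_pipeline,
    }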
The two modular pipelines are merged together into a project default pipeline by the __default__ key, used in "__default__": data_processing_pipeline + data_science_pipeline.
Note: The order in which you add the pipelines together is not significant, and data_science_pipeline + data_processing_pipeline will result in the same pipeline, since Kedro automatically detects the correct execution order for all the nodes in the resulting pipeline.
Test the pipeline
To run the pipeline:

kedro run
You should see logging output as Kedro runs each node in turn and confirms that the pipeline ran successfully (there may well be some warnings along the way).
Summary
In this tutorial, we illustrated a standard Kedro development workflow:
1. Project setup
We created a new project with kedro new and installed project dependencies with kedro install (the full tutorial talks about configuration too).
2. Data
We added data to the data/ folder and referenced the input datasets for the project in conf/base/catalog.yml.
3. Pipelines
We created the data transformation steps as Python functions (nodes), constructed modular pipelines and merged them into a project pipeline. For brevity, we didn’t cover how to visualise your pipelines.
Wow, that was some long tutorial!
Sorry, yes, it was! There’s even more detail over in the Kedro documentation, which you’ll find useful if you want to dig deeper into what’s going on, or if you have any problems running the example:
- Introduction to Kedro
- Longer version of this tutorial
- API reference
- Kedro GitHub repo
- Kedro community update
Acknowledgements
The Spaceflights example is based on a tutorial written by the Kedro team at QuantumBlack Labs back in 2019. I couldn’t have done any of this without them. So many thanks are due to Yetunde Dada, Ivan Danov, Dmitrii Deriabin, Lorena Balan, Kiyohito Kunii, Merel Theisen and others in QB Labs who have since joined the team.