How to build models as products using MLOps? Part 2 | Machine Learning Pipelines with Kedro

Ricardo Raspini Motta
Indicium Engineering
10 min read · Oct 6, 2021

You downloaded your data from a database, then did exploratory data analysis (EDA), removed the invalid data, defined the prediction problem, did several tests, and finally chose a nice model with an accuracy above eight thousand (with fine-tuning of hyperparameters, of course).

But what now?

How to turn this extremely organized notebook into a product?

How I feel working on my notebooks — Photo by Wonderlane on Unsplash

First of all, it isn't easy to stay organized using a notebook. There are tools for documentation, but life can go on without making good use of them. Also, the structure of notebooks has some limitations depending on the use.

Typically, their results are extremely useful for one-off analyses, but they lack functionality for continuous use. In this context, integration, maintenance, and performance are keywords. After all, more robust applications need more robust frameworks.

The algorithms that predict the best time to send you a notification are not running on a notebook — Photo by Jonas Leupe on Unsplash

Our suggestion of the day is the Kedro library, which the official documentation defines as follows:

Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It borrows concepts from software engineering best practices and applies them to machine learning code; the concepts applied include modularity, separation of concerns, and version control.

This paragraph describes in an almost musical way what Kedro is all about. Let’s go through the main points in parts.

  • Open source Python framework: Kedro is free and maintained by its community, making it easy to use.
  • Reproducible code: with Kedro, you can create data pipelines and use them for different data sources.
  • Maintainable code: Kedro’s framework makes it easy to maintain the code and for a team to work collaboratively on the same pipeline.
  • Modular code: functions can be used in different parts of the pipeline and different pipelines.

Next, we present a quick guide to applying Kedro, based on the official documentation for the module. In the end, we will also show an example application project.

Installation

For the installation of Kedro, it is highly recommended to use a virtual environment. We will not go into details here, but there are several online sources on how to do it. After activating the virtual environment, the installation of Kedro is done using the PyPI package; run, in a terminal:

$ pip install kedro

To verify the installation, run:

$ kedro info

If all goes well, you will see the Kedro banner in the terminal, followed by the installed version.

Creating your first Kedro project

To create your first Kedro project, run the following command on your terminal:

$ kedro new

Next, you will have to answer three questions: (1) the name of the project (use a readable name, with words separated by underscores); (2) the project folder name (you can leave it blank, and Kedro will use the project name); (3) the name of the Python package (which can also be left blank). You should also initialize Git versioning with:

$ git init

We will not be exploring Git concepts here, but it is highly recommended that the Kedro project follow code-versioning best practices. Kedro ships with README and .gitignore files by default, and there are also local folders that should not be pushed to the version control repository.

Example Project

As mentioned in the first post of this series, we will use a standard project to apply the frameworks we will work on: the complete Pokémon dataset, available on Kaggle. The repository for this project is available here, and the datasets can be found at:

https://www.kaggle.com/rounakbanik/pokemon

https://www.kaggle.com/vishalsubbiah/pokemon-images-and-types

Standard Kedro structure

Kedro is defined by a structure of nodes and pipelines. Nodes are the functions that perform any operations on the data. A set of nodes executed in a sequence is called a pipeline. The most common pipelines are data engineering and data science pipelines.

The data pipelines and the nodes performing the operations — Photo by Sigmund on Unsplash

In practice, Kedro uses standard folders to define its nodes and pipelines. The most important Kedro default folders are:

  • /conf/ — contains the files catalog.yml and parameters.yml, which hold the settings for all the files and parameters used in the models; it has two subfolders, base and local (the latter is ignored by Git).
  • /src/ — contains the files nodes.py and pipeline.py, which define the functions (the nodes) and the pipelines that specify the inputs and outputs of each node; its main subfolder is named after the project.
  • /data/ — Kedro creates a folder for saving data (both raw and processed) and models; the default subfolders are already defined and numbered in the order in which the data is usually used in data science pipelines.
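Putting these folders together, a freshly created project looks roughly like this (the exact list of folders varies slightly between Kedro versions, so take this as an approximate sketch):

pokemon_ml_project/
├── conf/
│   ├── base/            # catalog.yml, parameters.yml (versioned)
│   └── local/           # credentials and local overrides (ignored by Git)
├── data/
│   ├── 01_raw/
│   ├── 02_intermediate/
│   ├── ...
│   └── 06_models/
├── notebooks/
└── src/
    └── pokemon_ml_project/
        └── pipelines/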

The subfolders of the conf and src folders should have the names of the pipelines. For example, consider the following subfolders:

/conf/base/data_engineering/

Holds information about the input and output data (save locations, formats, etc.) and the parameters used in the data engineering functions.
Note: The root folder (/conf/) holds information about data that has not yet been processed.

/src/pokemon_ml_project/data_engineering/

Stores the nodes file, with the functions executed in the project’s data engineering step, and the pipeline file, which defines the order of operations.

Kedro has a specific file for registering the pipelines, together with terminal commands for running specific pipelines. The registry file declares all pipelines in the project and is located at this path:

/src/pokemon_ml_project/pipeline_registry.py

The function register_pipelines() inside this file defines all pipelines in the project.

Note: In previous versions of Kedro, the file was named hooks.py, but the file structure is the same.
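As an illustration, a registry file typically looks something like the sketch below. The pipeline names follow the folders used in this project, and it assumes that each pipeline package exposes a create_pipeline() function, as in the standard Kedro template; treat it as a sketch rather than a copy of the repository file.

from typing import Dict

from kedro.pipeline import Pipeline

from pokemon_ml_project.pipelines import data_engineering, data_science, pre_processing


def register_pipelines() -> Dict[str, Pipeline]:
    """Declare every pipeline of the project and the default run order."""
    pre_processing_pipeline = pre_processing.create_pipeline()
    data_engineering_pipeline = data_engineering.create_pipeline()
    data_science_pipeline = data_science.create_pipeline()

    return {
        "pre_processing": pre_processing_pipeline,
        "data_engineering": data_engineering_pipeline,
        "data_science": data_science_pipeline,
        # "__default__" is what a plain `kedro run` executes
        "__default__": pre_processing_pipeline
        + data_engineering_pipeline
        + data_science_pipeline,
    }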

Pipelines and Nodes

The heart of the Kedro project is the nodes and pipelines. These files contain all the information about how to manipulate the data and in what order.

As described earlier, the nodes are the functions to be applied to the data; they are, in fact, Python functions stricto sensu. The main advantage of using functions for the operations is their modularity. For example, datasets with identical columns can all have their models trained in the same way, and therefore the same node can be reused on different datasets. You can find an example node in the following file:

pokemon_ml_project/src/pokemon_ml_project/pipelines/pre_processing/nodes.py

You can see, for example, that the preprocessing nodes file contains functions for data cleaning and basic operations. You can browse through the repository and understand, step by step, the operations that lead up to training the model.
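To give a flavor of what such a node looks like, here is a minimal, hypothetical cleaning function (the function and column names are illustrative, not copied from the repository):

import pandas as pd


def clean_pokemon_data(pokemon: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning: normalize column names, drop duplicates and missing targets."""
    cleaned = pokemon.copy()
    # Normalize column names so downstream nodes can rely on them
    cleaned.columns = [col.strip().lower() for col in cleaned.columns]
    return cleaned.drop_duplicates().dropna(subset=["type1"])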

Now that we have the functions, all that is left is to deal with inputs and outputs, and this is where the magic of pipelines comes in. The pipeline file describes the order of the nodes to be executed, their names, the function (from the nodes.py file) that each node will execute, and the inputs and outputs of that function.

The example of the preprocessing pipeline, the same as the nodes mentioned above, can be found in:

pokemon_ml_project/src/pokemon_ml_project/pipelines/pre_processing/pipelines.py
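To make the wiring concrete, here is a minimal, hypothetical sketch of a create_pipeline() function; the node is the cleaning function sketched above, and the dataset and node names are illustrative, not copied from the repository:

from kedro.pipeline import Pipeline, node

from .nodes import clean_pokemon_data


def create_pipeline(**kwargs) -> Pipeline:
    """Wire the preprocessing nodes; inputs and outputs are catalog entry names."""
    return Pipeline(
        [
            node(
                func=clean_pokemon_data,
                inputs="pokemon_database",       # declared in catalog.yml
                outputs="cleaned_pokemon_data",  # another catalog entry (or kept in memory)
                name="clean_pokemon_data_node",
            ),
        ]
    )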

When you open the pipelines.py file, you may notice that the data is referred to only by name, with no indication of the file path, for example. Kedro provides an intuitive structure for this: the configuration files catalog.yml and parameters.yml, which we detail below.

The catalog.yml and parameters.yml files

Kedro uses files with the .yml extension for project settings. The main ones are the catalog and parameters files. This standard format is easy to read, and the standardization makes it much easier for everyone on the project to work with the settings.

These files also provide modularity that allows integration between the various environments of the project: development, production, and even in the testing environment of notebooks. This ensures that the whole team is working on the same files and that there is no need to look for them in some directory (or run the risk of accessing the wrong file).

The catalog contains the information needed to save and, therefore, to load files. It specifies the file path and file type and includes other information, such as access credentials (which are kept in a separate yml file, credentials.yml). For example, the file pokemon_ml_project/conf/base/catalog.yml declares the location of the raw dataset:

pokemon_database:
  type: pandas.CSVDataSet
  filepath: data/01_raw/pokemon.csv

This file specifies the dataset name in the context of Kedro, its type, and the save path. Once declared in any catalog of the project, the files can be handled by Kedro itself in the pipeline files, as discussed in the previous section. In this case, the file is read and saved as a pandas CSV dataset, but there is a long list of file types supported by Kedro, available in the official documentation.

So to load the file data/01_raw/pokemon.csv, we only need to refer to "pokemon_database", and Kedro handles the rest.

The parameters.yml file, on the other hand, makes it possible to change function parameters in an easy way, such as the hyperparameters for fine-tuning a machine learning model. An example of a parameters file can be found at the path:

pokemon_ml_project/conf/base/pipelines/data_science/parameters.yml

test_data_ratio: 0.3

# Models
primary_type_models:
  sk_learn_dt:
    classifier: sklearn.tree.DecisionTreeClassifier
    params: {
      'max_depth': [10, 50, 100, 200],
      'min_samples_leaf': [5, 15, 25, 30]
    }
  sk_learn_rf:
    classifier: sklearn.ensemble.RandomForestClassifier
    params: {
      'n_estimators': [50, 100, 200],
      'max_depth': [10, 50, 100, 200],
      'min_samples_leaf': [5, 15, 25, 30]
    }

In this file, we can see the description of some common parameters in ML, such as the training and testing fraction and the most common parameters in a decision tree classification model. These are arguments to Kedro’s node functions and will be passed to the models within the nodes.
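To illustrate how these values reach the code (the actual repository may organize this differently, so the function below is hypothetical), a training node could receive the primary_type_models section as a plain dictionary and build a grid search from it:

from importlib import import_module

import pandas as pd
from sklearn.model_selection import GridSearchCV


def train_primary_type_models(X_train: pd.DataFrame, y_train: pd.Series, model_specs: dict) -> dict:
    """Fit one grid-searched model per entry of the `primary_type_models` parameter."""
    fitted = {}
    for name, spec in model_specs.items():
        # "sklearn.tree.DecisionTreeClassifier" -> module path + class name
        module_path, class_name = spec["classifier"].rsplit(".", 1)
        classifier = getattr(import_module(module_path), class_name)()
        search = GridSearchCV(classifier, param_grid=spec["params"], cv=3)
        fitted[name] = search.fit(X_train, y_train)
    return fitted

In the pipeline definition, this node would list "params:primary_type_models" among its inputs, which is how Kedro injects a section of parameters.yml into a node.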

Jupyter notebooks in Kedro

Another standard folder is notebooks. Kedro has built-in integration with Jupyter, and the user can use notebooks for exploratory analysis and function testing. The notebooks can be started via the command:

$ kedro jupyter lab

This command starts a Jupyter server and returns a link to access it through a browser. Kedro has its own commands for interacting with data and functions, and it can load datasets and models with just one line of code. You can load every object declared in a catalog (from a trained model to a pandas DataFrame) by using the following command:

catalog.load("file_name_in_catalog")
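For example, inside a session opened with kedro jupyter lab (where the catalog variable is already defined), something along these lines works; the "cleaned_pokemon_data" entry is hypothetical and would have to exist in catalog.yml:

df = catalog.load("pokemon_database")    # returns a pandas DataFrame (pandas.CSVDataSet)
df.head()

# Persist an exploratory result back through the catalog
catalog.save("cleaned_pokemon_data", df)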

Jupyter notebooks are highly recommended for:

  • Testing ideas and proofs of concept before they are implemented in the pipeline;
  • Performing EDA on the data.

If you are looking for more commands for the Jupyter notebook, in the context of Kedro, you can find them in its official documentation.

The requirements file

Kedro creates a file with the project requirements, i.e., which modules are used, in which versions, and so on. This prevents a module update from breaking your functions. You fill in the requirements.txt file (located in the /src/ folder) and run the following command to update Kedro’s dependencies:

$ kedro build-reqs

Then, to install all the dependencies correctly, the command is:

$ kedro install

Pipeline visualization

Kedro also provides a visualization tool for the pipelines called Kedro-Viz. It automatically connects the nodes and provides an interesting visualization for both technical professionals and presentations to non-technical audiences. To install the extension, use:

$ pip install kedro-viz

The visualization is generated from the command (yes, that’s it):

$ kedro viz
The Kedro-Viz pipeline for the Pokémon project

Running your project

To run your Kedro project, just run the following command:

$ kedro run

It is also possible to run specific pipelines or nodes by running:

$ kedro run --pipeline=data_engineering

$ kedro run --node=clean_data

Since just one command can perform several tasks, Kedro works efficiently with APIs and other code execution automation frameworks.

All node outputs will be saved to the locations specified in the catalog.yml files, which can be either local or in the cloud.
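For instance, a scheduler or an API could trigger a run from Python instead of the shell. The sketch below assumes a recent Kedro release; the exact KedroSession API has changed between versions, so check the documentation for yours:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Point to the project root (the folder containing the Kedro project configuration)
project_path = Path.cwd()
bootstrap_project(project_path)

with KedroSession.create(project_path=project_path) as session:
    # Equivalent to `kedro run --pipeline=data_engineering`
    session.run(pipeline_name="data_engineering")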

Extra: best practices

Yes, Kedro is great for managing data science pipelines. The default framework is already a strong incentive for the project to stay on track. However, you have to be disciplined and careful when writing a Kedro pipeline to take full advantage of this framework.

  • The DRY principle

There is a mantra in software development that says, “do not repeat yourself.” If you are repeating operations in a file, it is probably because it should become a function.

Keep each of your functions at a single logical level, and delegate the lower-level work to other functions. If those functions need further levels of logic, have them call functions of their own. This will make your nodes file more readable, more modular, and easier to maintain or change.
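A minimal, hypothetical illustration of this idea in a nodes file, with the node-level function staying at a single logical level and delegating the details to private helpers:

import pandas as pd


def _drop_invalid_rows(pokemon: pd.DataFrame) -> pd.DataFrame:
    return pokemon.drop_duplicates().dropna(subset=["type1"])


def _add_total_stats(pokemon: pd.DataFrame) -> pd.DataFrame:
    pokemon = pokemon.copy()
    pokemon["total_stats"] = pokemon[["hp", "attack", "defense", "speed"]].sum(axis=1)
    return pokemon


def prepare_features(pokemon: pd.DataFrame) -> pd.DataFrame:
    """Node-level function: reads like a summary, details live in the helpers."""
    return _add_total_stats(_drop_invalid_rows(pokemon))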

  • Use a standard pipeline structure.

The names of your pipelines should intuitively indicate to the reader what role they play in the project structure. Our suggestion is to use these standard names:

— pre_processing: used for the initial cleaning of the data.

— data_engineering: used for feature creation, separation of training and test sets, X and y separation, etc.

— data_science: ML modeling.

— metrics: nodes for evaluating model results (we will explore this further in the next part of this series).

  • Think about modularity

Whenever you write a function, try to make it as generic as possible. Imagine that your data may change, such as the number of columns, column names, etc. Writing code that is easy to update will save you a lot of work in the future.
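For example (a hypothetical sketch), instead of hard-coding the target column of the Pokémon dataset, let it arrive as an argument fed from parameters.yml:

from typing import Tuple

import pandas as pd


def split_features_and_target(data: pd.DataFrame, target_column: str) -> Tuple[pd.DataFrame, pd.Series]:
    """Generic X/y split: works for any dataset and any target column."""
    return data.drop(columns=[target_column]), data[target_column]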

  • Document the processes: other people will work on the same project as you

Leave comments in every function you create: what the function does and what its essential inputs and outputs are, so that the next people to use your code can understand it and make any necessary changes. You will thank your past self when you need to change something in the project a few months later.

  • Seek help from a developer if your programming experience is limited.

If you do not have extensive programming experience — like me, in my first Kedro project — seek the help of a developer to review your code and suggest changes.

  • Don’t be afraid of debugging; it is your best friend.

Learn to use the debugging features of your IDE (Kedro is easily debuggable via VS Code, for example) and make extensive use of them. Whenever the results differ from what you expected, or an unexpected error occurs, it is essential to find the problem and fix it quickly.

Next steps

Kedro is not capable of tracking project metrics, nor can it record the changes made during the project’s lifetime. Therefore, other tools can be used as Kedro plugins to add this functionality.

In the next post, we will introduce the MLFlow plugin for Kedro and how you can track the performance of the models with each new iteration.
