Published in Life at Telkomsel

How We Build a Production-Grade Data Pipeline

Image source: https://xkcd.com/2054/

Imagine this: you do some data analytics in a Jupyter notebook. You write a line of code, add a new cell, import pandas, read a CSV, transform the values, then write 400 other cells before you get the result you want and save it as a CSV.

A week later, you want to reproduce that very same result, but you have completely forgotten which piece of code you should run first, and which cells you should NOT run. You will also need to press Enter 400-ish times before you get the result you are looking for, which will take you half a day just to reproduce it.

Also, imagine this: the input data you need is generated by someone else’s Jupyter notebooks. You would need to ask them to generate the data first, which means running their error-prone notebooks. Another half-day wasted.

If only there were a tool that made your scripts reproducible, maintainable, and modular; a tool that helped you with separation of concerns and versioning; and a tool that helped you deliver real-world ML applications, your life would be easier.

Enter Kedro

https://unsplash.com/photos/Skf7HxARcoc

“Creative output depends on creative input” — Robert C. Martin

Kedro (https://github.com/quantumblacklabs/kedro) is an open-source Python framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration, and pipeline assembly. By using kedro, we write our data analytics/machine learning scripts with software engineering principles in mind.

We would need to do some installation and setup before we can use the kedro framework. We will not go into detail about how to install kedro; the official documentation covers it.

To create a new kedro project, we can run kedro new on the CLI, then we need to input the project name, repository name, and python package name. Let us use bmi for all the names.

$ kedro new
Project Name:
=============
Please enter a human readable name for your new project.
Spaces and punctuation are allowed.
[New Kedro Project]: bmi
Repository Name:
================
Please enter a directory name for your new project repository.
Alphanumeric characters, hyphens and underscores are allowed.
Lowercase is recommended.
[new-kedro-project]: bmi
Python Package Name:
====================
Please enter a valid Python package name for your project package.
Alphanumeric characters and underscores are allowed.
Lowercase is recommended. Package name must start with a letter or underscore.
[new_kedro_project]: bmi
Generate Example Pipeline:
==========================
Do you want to generate an example pipeline in your project?
Good for first-time users. (default=N)
[y/N]: N
Change directory to the project generated in /home/user/bmi

This is how a kedro project is usually structured:

bmi                     # Parent directory of the template
├── conf                # Project configuration files
├── data                # Local project data
├── docs                # Project documentation
├── logs                # Project output logs
├── notebooks           # Project related Jupyter notebooks
├── README.md           # Project README
├── setup.cfg           # Configuration options for `pytest`
└── src                 # Project source code

There are several important concepts in kedro, but we will be focusing on the 3 most important concepts: DataCatalog, Node, and Pipeline.

DataCatalog

DataCatalog is the registry of all data sources that the project can use. It is a powerful concept: we can find all data sources and sinks in one place, as opposed to Jupyter notebooks or plain Python scripts, where data definitions are scattered everywhere.

The DataCatalog is stored in a YAML file named catalog.yml under the conf/bmi/ folder. For example, if we have three CSV files as the data inputs/outputs, we can define all of them in one file:
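A minimal sketch of such a catalog.yml (the file paths and the load/save arguments are illustrative, not taken from the original project):

```yaml
freshman_bmi:
  type: pandas.CSVDataSet
  filepath: data/01_raw/freshman_bmi.csv
  load_args:
    sep: ","

freshman_with_height:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/freshman_with_height.csv

freshman_bmi_summary:
  type: pandas.CSVDataSet
  filepath: data/03_primary/freshman_bmi_summary.csv
  save_args:
    index: false
```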

The topmost level of the YAML key is the catalog name (e.g. freshman_bmi); this name will be used by a Node later as a reference to the input/output data. We should also define the type and file path of the data.

In the example we use the pandas.CSVDataSet type, but we can also use other types like SparkDataSet, pandas.ExcelDataSet, pandas.SQLQueryDataSet, and many more. You can find the complete list of DataSet types in the kedro documentation.

We can also define the arguments for load/save, like which CSV separator we use, or whether we overwrite or append the file, and so on.

Node

A Node in kedro is a wrapper around a Python function that names the inputs and outputs of that function. We can link one node to another by setting the output of one node as the input of another.

For example, let us say we have two tasks:

  1. Calculate the height of a person based on the weight & BMI information, and save it to CSV
  2. Calculate the average weight, height, and BMI for each gender, and save it to CSV

We should first create a function for each of them:
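A sketch of what these two functions could look like, assuming the input CSV has weight, bmi, and gender columns (since BMI = weight / height², height = √(weight / BMI)):

```python
import numpy as np
import pandas as pd


def calculate_height(df: pd.DataFrame) -> pd.DataFrame:
    """Derive height (in metres) from weight and BMI.

    BMI = weight / height**2, so height = sqrt(weight / BMI).
    """
    df = df.copy()
    df["height"] = np.sqrt(df["weight"] / df["bmi"])
    return df


def calculate_avg_by_gender(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the average weight, height, and BMI per gender."""
    return df.groupby("gender", as_index=False)[["weight", "height", "bmi"]].mean()
```

Note that these are plain pandas functions with no kedro-specific code in them; reading and writing the CSVs is left entirely to the DataCatalog.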

A node takes several parameters:

  • func: A function that corresponds to the node logic.
  • inputs: The name or the list of names of variables used as inputs to the function. We can put the catalog name that we defined in catalog.yml here.
  • outputs: The name or the list of names of variables used as outputs of the function. We can put the catalog name that we defined in catalog.yml here.
  • name: Optional node name to be used when displaying the node in logs or any other visualisations.
  • tags: Optional set of tags to be applied to the node.

In this example, we would then need to create 2 nodes:

As we can see, the first node, named calculate_height, takes freshman_bmi as its input and saves its output to freshman_with_height.

Both freshman_bmi and freshman_with_height are defined in catalog.yml as CSV files with their file paths, so the functions will read/write the data based on the types and file paths defined there.

The second node takes the output of the first node as its input, then saves its result as a CSV to freshman_bmi_summary, as defined in catalog.yml.

Pipeline

A pipeline organizes the dependencies and execution order of your collection of nodes and connects inputs and outputs while keeping your code modular.

The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed in. The pipeline contains one or more nodes.

In this example, we can leverage the above code to create a pipeline as follows:

As a best practice, we usually split the pipeline and the nodes into different Python files, so we will have nodes.py and pipeline.py under the src/bmi/pipelines/ folder:

There are two kinds of pipelines: the main pipeline and the sub-pipelines. The one we created above is a sub-pipeline. The main pipeline combines all the sub-pipelines in the project, and its file is automatically generated when we create a new kedro project. In this example, the main pipeline should be available at src/bmi/pipeline.py.

We need to “register” the sub-pipelines that we created by importing them and calling their create_pipeline() functions in the main pipeline file, as follows:

the kedro main pipeline

Executing Kedro Command

After we create a pipeline, we can use kedro run command to run the whole pipeline:

$ kedro run --env bmi --pipeline bmi_pipeline
2021-01-15 17:06:03,333 - kedro.io.data_catalog - INFO - Loading data from `freshman_bmi` (CSVDataSet)...
2021-01-15 17:06:03,344 - kedro.pipeline.node - INFO - Running node: calculate_height: calculate_height([freshman_bmi]) -> [freshman_with_height]
2021-01-15 17:06:03,365 - numexpr.utils - INFO - NumExpr defaulting to 4 threads.
2021-01-15 17:06:03,374 - kedro.io.data_catalog - INFO - Saving data to `freshman_with_height` (CSVDataSet)...
2021-01-15 17:06:03,390 - kedro.runner.sequential_runner - INFO - Completed 1 out of 2 tasks
2021-01-15 17:06:03,391 - kedro.io.data_catalog - INFO - Loading data from `freshman_with_height` (CSVDataSet)...
2021-01-15 17:06:03,398 - kedro.pipeline.node - INFO - Running node: calculate_avg_by_gender: calculate_avg_by_gender([freshman_with_height]) -> [freshman_bmi_summary]
2021-01-15 17:06:03,409 - kedro.io.data_catalog - INFO - Saving data to `freshman_bmi_summary` (CSVDataSet)...
2021-01-15 17:06:03,417 - kedro.runner.sequential_runner - INFO - Completed 2 out of 2 tasks
2021-01-15 17:06:03,418 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.

The --env argument defines which configuration we use. Since we put our catalog in the conf/bmi folder, we pass bmi as the value. We also specify the pipeline name in the --pipeline argument.

Another thing worth mentioning: the pipeline, which contains two nodes and generates two CSV files, runs in less than a second!

Apart from running a whole pipeline, we can also use --node and --tag:

$ kedro run --env bmi --node calculate_height

2021-01-15 17:17:12,367 - kedro.io.data_catalog - INFO - Loading data from `freshman_bmi` (CSVDataSet)...
2021-01-15 17:17:12,375 - kedro.pipeline.node - INFO - Running node: calculate_height: calculate_height([freshman_bmi]) -> [freshman_with_height]
2021-01-15 17:17:12,399 - numexpr.utils - INFO - NumExpr defaulting to 4 threads.
2021-01-15 17:17:12,404 - kedro.io.data_catalog - INFO - Saving data to `freshman_with_height` (CSVDataSet)...
2021-01-15 17:17:12,419 - kedro.runner.sequential_runner - INFO - Completed 1 out of 1 tasks
2021-01-15 17:17:12,420 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.

$ kedro run --env bmi --tag bmi_pipeline
2021-01-15 17:18:37,251 - kedro.io.data_catalog - INFO - Loading data from `freshman_bmi` (CSVDataSet)...
2021-01-15 17:18:37,257 - kedro.pipeline.node - INFO - Running node: calculate_height: calculate_height([freshman_bmi]) -> [freshman_with_height]
2021-01-15 17:18:37,277 - numexpr.utils - INFO - NumExpr defaulting to 4 threads.
2021-01-15 17:18:37,285 - kedro.io.data_catalog - INFO - Saving data to `freshman_with_height` (CSVDataSet)...
2021-01-15 17:18:37,307 - kedro.runner.sequential_runner - INFO - Completed 1 out of 2 tasks
2021-01-15 17:18:37,308 - kedro.io.data_catalog - INFO - Loading data from `freshman_with_height` (CSVDataSet)...
2021-01-15 17:18:37,315 - kedro.pipeline.node - INFO - Running node: calculate_avg_by_gender: calculate_avg_by_gender([freshman_with_height]) -> [freshman_bmi_summary]
2021-01-15 17:18:37,324 - kedro.io.data_catalog - INFO - Saving data to `freshman_bmi_summary` (CSVDataSet)...
2021-01-15 17:18:37,333 - kedro.runner.sequential_runner - INFO - Completed 2 out of 2 tasks
2021-01-15 17:18:37,334 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.

Kedro Viz

Another cool feature of kedro is Kedro-Viz, which shows you how your data pipelines are structured; it is launched with the kedro viz command. With Kedro-Viz you can:

  • See how your datasets and Python functions (nodes) are resolved in Kedro so that you can understand how your data pipeline is built
  • Get a clear picture when you have lots of datasets and nodes by using tags to visualise sub-pipelines
  • Search for nodes and datasets

This is what our BMI data pipeline looks like. As we can see clearly on the viz, we have three DataCatalog entries and two nodes, as well as the connections between them:

kedro-viz for bmi_pipeline

Kedro at Telkomsel

We heavily use kedro in several data-related projects within Telkomsel. We use kedro in our production environment which consumes tens of TBs of data, runs hundreds of feature engineering tasks, and serves dozens of Machine Learning models.

We have been using kedro for more than a year now and things are going smoothly. Some of the benefits we get from using kedro include:

  • The collaboration between the Data Engineering and Data Science team has never been this smooth
  • Have a single source of truth of data sources & sinks, feature logic, and configurations
  • Ability to run complex end-to-end data pipeline with just a few commands
  • Fancy data pipeline visualization helps us a lot when debugging and explaining the pipeline to the business user
  • High-quality data science scripts assured by unit tests & integration tests

If you are wondering, here is what one of our kedro data pipelines looks like:

quite big, isn’t it?

Ghifari Dwiki Ramadhan

Data Engineering @ Telkomsel — Solving imaginary scaling issues