Migrate Kedro Pipeline on Vertex AI
An experimental approach
All right, you talk, I’ll listen!
Do you have any ideas about Data Science on Google Cloud you would like to see? Please let me know by filling out this form. This would help me with future blog posts =)
Disclaimer
What do you know about Kedro? In case you are not familiar with it, I leave here a short introductory video and a Medium article. I also assume that you know Vertex AI; if not, there is this article where developer advocates share more about it. Finally, this article is based on the kedro-kubeflow plugin, which recently introduced Vertex AI Pipelines support. That support is still EXPERIMENTAL, but I thought it would be interesting to explore.
Premises
Recently I had the opportunity to talk with several data science teams, and we had a frank discussion about the main challenges they face in producing production-ready code. It turns out that time to delivery generally gets longer because of:
- Management challenges. Teams are composed of data engineers and data scientists with different backgrounds and, in most cases, both lack software engineering skills. Add a chaotic workplace with strict deadlines and distractions, and the result is weak processes, inefficient collaboration and poor code quality.
- Experiments. On the journey to the “best” ML approach, you run several experiments with different data, modeling approaches and parameter configurations. As a consequence of the management challenges above, the team does not produce proper documentation and therefore cannot guarantee reproducibility and reusability of the code.
- Dependencies. Unlike traditional software, ML code has both data and system dependencies. They erode the boundaries that encapsulation and modular design would otherwise create to keep the code maintainable. At the same time, they make testing and monitoring really challenging.
- Lack of standardization. In the end, once you find the best model, the team needs to re-engineer the code to meet IT specifications and build a repeatable pipeline that automatically retrains and deploys models.
For those reasons, most of the teams decided to work on what they call a “template”, “workflow framework” or “interface”. Some of them built one from scratch; others adopted Cookiecutter or, as in this case, Kedro. In the end, the goal is to introduce a layer of standardization on top of their data science platform SDKs in order to build reproducible ML projects that follow software engineering best practices.
At that point, you might be wondering:
How does Vertex AI fit in this story? What’s the link between Kedro and Vertex AI?
Let’s see.
Our Scenario
Assume that you are part of an innovation unit working in a small consulting company for the agriculture sector. A larger consortium of farmers would like to automate the classification of three different varieties of wheat: Kama, Rosa and Canadian. In order to validate the ML approach, high-quality visualizations of the internal kernel structure were obtained using a soft X-ray technique, in collaboration with an important institute of agrophysics, and the resulting data was provided for the proof of concept.
The goal is to build a solid classifier for these three types of wheat. In order to speed up the ML experimentation phase and deliver production-ready code, your team adopted Kedro as its project framework. Because of the number of team members and the amount of data to process, you were looking for a managed service that allows you to scale up easily. Indeed, when you have to ship models to production, Kedro provides a robust pipeline and packaging framework using Docker or Airflow, which means that your Kedro pipelines can be deployed on a Kubernetes cluster (Kubeflow operator) or an Airflow cluster (DAG), among other options. But you still need to figure out the resources they need in order to do their job. So here is the thing:
What if you can find a managed way to deploy a Kedro pipeline?
The only answer that came to your mind at that point was Vertex AI Pipelines*. Based on the public documentation, Kedro pipelines can be converted into Kubeflow Pipelines, which means they can be deployed on the Vertex AI service. At least in theory, which is enough for the team to give it a shot in this proof of concept.
Our Dataset
The dataset derives from UCI’s seeds dataset provided by Małgorzata Charytanowicz, Jerzy Niewczas, Piotr Kulczycki, Piotr A. Kowalski, Szymon Łukasik, and Sławomir Zak from University of Lublin and Cracow University of Technology. A soft X-ray technique and the GRAINS package were used to construct the following seven real-valued geometric parameters of wheat kernels from a total of 210 samples: area A, perimeter P, compactness C, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove.
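As a quick illustration of how one of these features relates to the others, the sketch below assumes the definition of compactness given in the UCI dataset description, C = 4πA/P². The function name is my own choice for illustration and is not part of the project.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2 (definition assumed from the UCI description)."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return 4 * math.pi * area / perimeter ** 2


# Sanity check: a circle is the most compact shape, so C should be 1.
radius = 2.0
circle_c = compactness(math.pi * radius ** 2, 2 * math.pi * radius)
print(round(circle_c, 6))
```

Any kernel outline that is not a perfect circle gives a value below 1, which is why the feature helps to separate varieties with different shapes.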
From Kedro to Vertex AI passing through plug-ins and simple arrangements
Let’s assume that you have already built your pipeline as described in the Kedro documentation. If you did things right, you should have the following folder structure.
.
├── README.md
├── conf
│ ├── README.md
│ ├── base
│ └── local
├── data
│ ├── 01_raw
│ ├── 02_intermediate
│ ├── 03_primary
│ ├── 04_feature
│ ├── 05_model_input
│ ├── 06_models
│ ├── 07_model_output
│ └── 08_reporting
├── docs
│ ├── build
│ └── source
├── info.log
├── logs
│ ├── info.log
│ └── journals
├── notebooks
│ ├── 1_kedro_vertex_wheat_beam_clf_xgb.ipynb
│ └── notebook_template.ipynb
├── pyproject.toml
├── setup.cfg
└── src
    ├── requirements.in
    ├── requirements.txt
    ├── setup.py
    ├── tests
    └── wk_classification
where 1_kedro_vertex_wheat_beam_clf_xgb.ipynb is the notebook I used to experiment with the ML approach. In this case I ended up with an XGBoost model.
Below you can see the pipeline visualization I have
Starting from here, I can summarize the steps I took to deploy a Kedro pipeline to Vertex AI in the following order: Package. Adapt. Convert. Arrange. Run.
Package.
In order to ship the Kedro project as a Vertex AI pipeline, I installed the kedro-docker plugin as described in the documentation and built the Docker image.
The plugin comes with nice command line capabilities. You just need to run
kedro docker init
It generates the Dockerfile, .dockerignore and .dive-ci files for your project. To make it work, I just changed the BASE_IMAGE in the Dockerfile and extended the .dockerignore so that the data folder is included. Below is the Dockerfile:
ARG BASE_IMAGE=google/cloud-sdk:latest
FROM $BASE_IMAGE

# install project requirements
COPY src/requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt && rm -f /tmp/requirements.txt

# add kedro user
ARG KEDRO_UID=999
ARG KEDRO_GID=0
RUN groupadd -f -g ${KEDRO_GID} kedro_group && \
    useradd -d /home/kedro -s /bin/bash -g ${KEDRO_GID} -u ${KEDRO_UID} kedro

# copy the whole project except what is in .dockerignore
WORKDIR /home/kedro
COPY . .
RUN chown -R kedro:${KEDRO_GID} /home/kedro
USER kedro
RUN chmod -R a+w /home/kedro

EXPOSE 8888

CMD ["kedro", "run"]
And the .dockerignore
##########################
# Kedro PROJECT
# ignore Dockerfile and .dockerignore
Dockerfile
.dockerignore
# ignore potentially sensitive credentials files
conf/**/*credentials*
# ignore all local configuration
conf/local
!conf/local/.gitkeep
# ignore everything in the following folders
# data
logs
notebooks
references
results
# except the following
!logs/.gitkeep
!notebooks/.gitkeep
!references/.gitkeep
!results/.gitkeep
!data/01_raw
Now we should have all we need to containerize the Kedro project and deploy it using Kubeflow Pipelines. According to the original documentation, you need to run the provided script, which converts Kedro nodes and dependencies into the corresponding workflow spec. Some of the limitations I faced are related to KFP v1 and ContainerOp, which the script uses and which are being deprecated. I also ran some tests, and the generated spec did not respect the format required by the Vertex AI Pipelines service.
So I started to dive into that conversion challenge and found out that, even if we converted the pipeline as it is, it would work neither on a self-managed Kubernetes cluster nor on the Vertex AI Pipelines service. Why? Because:
- When you run a Kedro pipeline on Kubeflow, nodes do not share memory. MemoryDataSets therefore do not work, and every artifact needs to be stored as a file.
- Because Vertex AI Pipelines is a serverless service, a different set of parameters has to be passed in order to deploy the pipeline successfully.
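To make the first point concrete, here is a minimal sketch, in plain Python and without importing Kedro, of how you could spot the datasets that would silently fall back to memory. The node layout is a hypothetical stand-in (the node names and the params:test_size entry are assumptions, not taken from the project); the idea is simply that any dataset produced or consumed by a node and missing from the catalog needs a file-based entry.

```python
from typing import Dict, List, Set

# Hypothetical node layout: node name -> inputs and outputs.
NODES: Dict[str, Dict[str, List[str]]] = {
    "preprocess": {"inputs": ["seeds"], "outputs": ["preprocessed_seeds"]},
    "split": {
        "inputs": ["preprocessed_seeds", "params:test_size"],
        "outputs": ["X_train", "X_test", "y_train", "y_test"],
    },
    "train": {"inputs": ["X_train", "y_train"], "outputs": ["xgboost_classifier"]},
}

# Names that already have a file-based catalog entry.
# y_test is left out on purpose to show what the check reports.
CATALOG: Set[str] = {
    "seeds", "preprocessed_seeds", "X_train", "X_test", "y_train", "xgboost_classifier",
}


def find_in_memory_datasets(
    nodes: Dict[str, Dict[str, List[str]]], catalog: Set[str]
) -> List[str]:
    """Return dataset names used by nodes but absent from the catalog."""
    used: Set[str] = set()
    for spec in nodes.values():
        used.update(spec["inputs"])
        used.update(spec["outputs"])
    # Parameters are not datasets, so they never need a catalog entry.
    datasets = {n for n in used if n != "parameters" and not n.startswith("params:")}
    return sorted(datasets - catalog)


missing = find_in_memory_datasets(NODES, CATALOG)
print(missing)
```

An empty list means every artifact exchanged between nodes is persisted as a file, which is what a run with isolated containers requires.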
Luckily for us, the kedro-kubeflow plugin recently introduced an extension that converts a Kedro pipeline into a KFP format compatible with the Vertex AI service. Although it is still experimental, I found that it addresses some of those challenges with a few arrangements.
Adapt.
Once I installed the plugin, I created a new gcp configuration folder in order to set up Vertex AI Pipelines as the running infrastructure. Then, because the service is serverless, a distinct set of parameters has to be passed in order to deploy the pipeline successfully. You will find the full description of them in the plugin documentation. I also modified the catalog to deal with the limitation that nodes do not share memory. Below you find the catalog:
seeds:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/seeds.csv
  layer: raw

preprocessed_seeds:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/preprocessed_seeds.csv
  layer: processing

X_train:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/X_train.csv
  layer: processing

X_test:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/X_test.csv
  layer: processing

y_train:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/y_train.csv
  layer: processing

y_test:
  type: pandas.CSVDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/data/y_test.csv
  layer: processing

xgboost_classifier:
  type: pickle.PickleDataSet
  filepath: /gcs/${KEDRO_CONFIG_BUCKET}/models/xgboost
  backend: joblib
  versioned: true
  layer: model
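The /gcs/ prefix in these paths relies on Vertex AI exposing Cloud Storage buckets as a mounted file system. As a small sketch, assuming you start from gs:// URIs, a helper like the following converts them into the mounted form used in the catalog. The function name and the bucket name are hypothetical.

```python
def to_fuse_path(gcs_uri: str) -> str:
    """Convert a gs://bucket/object URI into its /gcs/bucket/object mounted path."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {gcs_uri}")
    remainder = gcs_uri[len(prefix):].lstrip("/")
    if not remainder:
        raise ValueError("the URI does not contain a bucket name")
    return "/gcs/" + remainder


# "my-bucket" is a placeholder, not the bucket used in this project.
print(to_fuse_path("gs://my-bucket/data/seeds.csv"))
```

Writing catalog paths this way lets the usual file-based Kedro datasets read and write bucket objects as if they were local files.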
And the configuration
host: vertex-ai-pipelines
project_id: ${KEDRO_CONFIG_PROJECT_ID}
region: ${KEDRO_CONFIG_REGION}
run_config:
  root: ${KEDRO_CONFIG_BUCKET}/pipelines
  image: gcr.io/${KEDRO_CONFIG_PROJECT_ID}/classify_wheat_kernel
  experiment_name: classify-wheat-kernel
  run_name: classify-wheat-kernel
where the ${…} placeholders rely on the TemplatedConfigLoader class, which fills in template values based on the configuration.
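To illustrate what that substitution does, the sketch below emulates it in plain Python. It is not Kedro's implementation, only an approximation of the idea: each ${NAME} placeholder found in a nested configuration is replaced with a value from a dictionary. The values are made up, and raising KeyError on an unknown placeholder is my own choice; Kedro's behaviour may differ.

```python
import re
from typing import Any, Dict

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(config: Any, values: Dict[str, str]) -> Any:
    """Return a copy of config with every ${NAME} replaced by values[NAME]."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for placeholder {name}")
        return values[name]

    if isinstance(config, dict):
        return {key: render(value, values) for key, value in config.items()}
    if isinstance(config, list):
        return [render(item, values) for item in config]
    if isinstance(config, str):
        return PLACEHOLDER.sub(substitute, config)
    return config


template = {
    "project_id": "${KEDRO_CONFIG_PROJECT_ID}",
    "run_config": {
        "root": "${KEDRO_CONFIG_BUCKET}/pipelines",
        "image": "gcr.io/${KEDRO_CONFIG_PROJECT_ID}/classify_wheat_kernel",
    },
}
# Placeholder values, not the real project settings.
values = {"KEDRO_CONFIG_PROJECT_ID": "my-project", "KEDRO_CONFIG_BUCKET": "my-bucket"}

rendered = render(template, values)
print(rendered["run_config"]["image"])
```

Keeping the project, region and bucket out of the YAML files this way means the same configuration can be reused across environments.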
So far, everything is set up. Time to convert the pipeline
Convert. Arrange. Run.
Honestly, the conversion itself is simple and straightforward. With the Dockerfile and .dockerignore in place, I build the image, tag it properly and push it to the project’s Container Registry.
kedro docker build --docker-args="--no-cache" --base-image="google/cloud-sdk:latest"
docker tag classify-wheat-kernel gcr.io/$PROJECT_ID/classify_wheat_kernel
docker push gcr.io/$PROJECT_ID/classify_wheat_kernel
Then I run the command
kedro kubeflow -e gcp compile -o ../pipeline/kedro-vertex-classify-wheat-kernel-pipeline.json
That’s it. The command generates a PipelineSpec which is supposed to be compatible with the Vertex AI service. In my experience, though, I had to arrange the output a bit to make it work. Below is the list of changes I made:
- Replace “system.Dataset” with “system.Model” in order to store the model artifact with the right metadata type
- Delete the --params arguments from the kedro run command line of each node
- Correct the cp path so that the pipeline artifact is moved under the correct metadata path. For instance, I changed /home/kedro//gcs/ to /home/kedro/gcs/
Update: Please check the “Additional Remarks” section to get further clarifications.
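These three changes can be made by hand, but they can also be scripted. The sketch below is only an illustration on a toy dictionary: the layout of the spec, the node and executor names and the command strings are assumptions, and the real compiled JSON may be organized differently. It walks a nested structure, fixes the doubled slash, strips --params arguments from command strings and retypes a single named artifact.

```python
import re
from typing import Any

# Matches "--params value" or "--params=value", with an optionally quoted value.
PARAMS_ARG = re.compile(r"""\s*--params(?:\s+|=)(?:"[^"]*"|'[^']*'|\S+)""")


def fix_string(text: str) -> str:
    """Apply the path fix and drop any --params argument from one string."""
    text = text.replace("/home/kedro//gcs/", "/home/kedro/gcs/")
    return PARAMS_ARG.sub("", text)


def patch(node: Any, model_name: str) -> Any:
    """Return a patched copy of a nested spec, leaving the input untouched."""
    if isinstance(node, str):
        return fix_string(node)
    if isinstance(node, list):
        return [patch(item, model_name) for item in node]
    if isinstance(node, dict):
        result = {key: patch(value, model_name) for key, value in node.items()}
        model = result.get(model_name)
        if isinstance(model, dict) and isinstance(model.get("artifactType"), dict):
            model["artifactType"]["schemaTitle"] = "system.Model"
        return result
    return node


# Toy spec with an assumed layout, for illustration only.
spec = {
    "components": {"comp-train": {"outputDefinitions": {"artifacts": {
        "xgboost_classifier": {"artifactType": {"schemaTitle": "system.Dataset"}},
        "X_test": {"artifactType": {"schemaTitle": "system.Dataset"}},
    }}}},
    "executors": {"exec-train": {"args": [
        "kedro run --node train --params a:1",
        "cp /home/kedro//gcs/my-bucket/models/xgboost /tmp/output",
    ]}},
}

patched = patch(spec, "xgboost_classifier")
print(patched["executors"]["exec-train"]["args"])
```

With a real file you would load it with json.load, call patch and write the result back with json.dump, then compare the two files before submitting anything.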
Once that was done, I just submitted the pipeline to Vertex AI using a custom script and the Vertex AI SDK. Below you can see the execution graph of the converted pipeline on Vertex AI².
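For reference, a minimal sketch of such a submission script might look like the following. It assumes the google-cloud-aiplatform package and its PipelineJob class, and it is not necessarily the exact script used here. The helper names, the bucket name, the gs:// pipeline root and the choice to disable caching are all assumptions.

```python
from typing import Any, Dict


def build_job_args(
    spec_path: str, bucket: str, name: str = "classify-wheat-kernel"
) -> Dict[str, Any]:
    """Collect the keyword arguments for aiplatform.PipelineJob."""
    return {
        "display_name": name,
        "template_path": spec_path,
        # Assumed to mirror the "root" entry of the plugin configuration.
        "pipeline_root": f"gs://{bucket}/pipelines",
        "enable_caching": False,
    }


def submit(project_id: str, region: str, job_args: Dict[str, Any]) -> None:
    """Submit the compiled pipeline and wait for it to finish."""
    # Imported here so build_job_args can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    job = aiplatform.PipelineJob(**job_args)
    job.run(sync=True)


args = build_job_args(
    "../pipeline/kedro-vertex-classify-wheat-kernel-pipeline.json", "my-bucket"
)
print(args["pipeline_root"])
# submit("my-project", "us-central1", args)  # needs GCP credentials and a real project
```

The submit call is commented out because it needs valid credentials, an existing project and a region where Vertex AI Pipelines is available.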
And I can happily say: CONVERSION COMPLETE!
What’s Next
In this article, I used a simple pipeline example to show how it is possible to convert a Kedro pipeline into a Vertex AI pipeline. I leveraged some of the greatest plugins of the Kedro ecosystem, such as kedro-docker and kedro-kubeflow.
As I mentioned at the beginning, the entire approach is EXPERIMENTAL, as is the kedro-kubeflow plugin's support for Vertex AI Pipelines. Some adjustments were needed and, for example, I was not able to figure out how to track InputParameters and OutputParameters, as you can see in the figure above. But the idea of submitting a few commands in order to convert an entire project into a scalable pipeline that is almost ready for production remains very valuable to me. That’s why I’m looking forward to working with the production release of the plugin.
In the meantime, I hope you found the article interesting. If so, clap for it or leave a comment. And feel free to reach out to me on LinkedIn or Twitter for further discussion or just to have a chat about Data Science on Google Cloud!
Thank you to Tuba Islam, Janos Bana, Manuel Hurtado, Mariusz Strzelecki and Marek Wiewiorka for their great feedback.
Reference
- https://kedro.readthedocs.io/en/stable/index.html
- https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html
- https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
- https://github.com/quantumblacklabs/kedro-docker
- https://kedro-kubeflow.readthedocs.io/en/0.4.4/index.html
- https://cloud.google.com/python/docs/reference/aiplatform/latest
- https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-0.2.0/
Additional Remarks
In the meantime, I had the opportunity to talk with Mariusz Strzelecki and Marek Wiewiorka, who are among the maintainers of the kedro-kubeflow plugin. They provided great feedback about the issues I faced. In order:
- The latest release (0.4.4) supports the Model artifact as long as the artifact’s layer is set to “models”. As a side note, the plugin was originally created for compatibility with the KFP v1 protocol (AI Platform Pipelines). That version didn’t distinguish artifact types. Now it supports Dataset and Model.
- Further investigation is required. The plugin supports parameters. I’m going to open an issue on GitHub.
- It was related to the fact that the Vertex AI documentation suggests /gcs/ as the prefix, while the plugin code handles it differently.
*Some of you might ask why I do not suggest Cloud Composer. Below is a note from the “Best practices for implementing machine learning on Google Cloud” article:
“While you could consider other orchestrators like Cloud Composer (see Airflow), pipelines is a better choice because it includes built-in support for common ML operations and tracks ML-specific metadata and lineage. Lineage is especially important for validating that your pipelines are operating correctly in production.”