
Unleashing the Power of Kubeflow: A Guide to Training Machine Learning Pipelines (Part 1: Introduction)

Ines Benameur
Gnomon Digital
Published in
10 min read · Jan 18, 2024



Introduction

As organizations embrace the need for reproducibility, scalability, and automation in their machine learning pipelines, Kubeflow emerges as a robust solution for achieving scalable and reproducible results.

This is what several articles and docs will tell you about Kubeflow, without going into much detail. And as a consulting company that provides expertise and guidance to our clients, we figured: why not put Kubeflow to the test ourselves?

We wanted to make sure it’s the perfect fit for our clients and their journey in tackling machine learning challenges in production.

Our own journey to master Kubeflow and grasp its fundamental concepts was a bit of a bumpy ride, as the available resources were often lacking in detail, especially when it came to addressing complex and specific use cases.

Therefore, the idea behind this series of articles is to consolidate all the insights we’ve gathered about this tool, and answer two main questions:

  • To Kubeflow or not to Kubeflow?
  • How to Kubeflow?

This series of articles serves as a comprehensive guide to unlocking the full potential of Kubeflow for training machine learning pipelines.

From the foundational concepts of Kubeflow Pipelines (Part 1) to hands-on examples of building and running pipelines (Part 2) and deploying an ML model with KServe (Part 3), we will explore how this platform empowers data scientists and machine learning engineers to focus on model development while abstracting away the complexities of infrastructure management, and which considerations you should take into account before deploying Kubeflow.

Kubeflow Overview

Kubeflow is an open-source platform designed to simplify and streamline the deployment, monitoring, and management of machine learning workflows on Kubernetes. It is not just a tool; it’s an ecosystem designed to address the unique challenges of machine learning in production.

Evolution And Milestones

  1. Introduction: The Kubeflow project was introduced at KubeCon + CloudNativeCon North America 2017 by Google engineers David Aronchick, Jeremy Lewi, and Vishnu Kannan. It aimed to address the perceived lack of flexible options for constructing production-ready machine learning systems and originated from Google’s initiative to open-source their internal TensorFlow operations.
  2. First release: Kubeflow rapidly gained momentum, with its first release, Kubeflow 0.1, announced at KubeCon + CloudNativeCon Europe 2018, quickly becoming one of the top 2% of GitHub projects.
  3. Stability: Kubeflow 1.0 was unveiled in March 2020, marking the graduation of many components to a “stable status”, signaling their readiness for production usage.
  4. Application to CNCF: In October 2022, Google applied for Kubeflow to join the Cloud Native Computing Foundation, and by July 2023, the foundation voted to accept Kubeflow as an incubating stage project.

Why use Kubeflow?

Our consulting company recently undertook a project for a client aiming to enhance their Machine Learning Operations (MLOps), and we chose to implement Kubeflow because it was specifically adapted to our client’s use case and deeply rooted in their operational landscape:

  1. Operating within the Google Cloud Platform (GCP) and leveraging Kubernetes for container orchestration, our client’s technological backbone became a pivotal factor in selecting the MLOps solution. So, Kubeflow, being the cool kid on the Kubernetes block, just made sense!
  2. A data scientist’s goal is to train machine learning models quickly and efficiently. Kubernetes, an orchestration tool that automates the deployment, scaling, and management of containerized applications, is often employed for this purpose: it simplifies running applications on a cluster of servers called nodes. However, orchestrating and scheduling containers on physical or virtual machines involves creating components (nodes, pods, services, etc.), which is not a data scientist’s primary concern. Kubeflow comes to the rescue here, as it automates much of this work.
  3. Traditional machine learning workflows often involve a series of disparate tools and manual interventions, making them prone to errors and challenging to reproduce. Kubeflow aims to solve these issues by providing a unified platform, enabling users to define, deploy, and manage entire machine learning pipelines as code.
  4. Another important thing to consider was that it enables versioning and monitoring of the produced models, making it easier to track changes and revert to previous states.

Limitations to consider

When setting up the project for our client, we realized that while Kubeflow offers a comprehensive solution for data scientists, there are important aspects that users need to be mindful of:

  1. Deployment Challenges: Deploying Kubeflow across different Kubernetes environments may pose challenges, as documented steps may be incomplete or require specific configurations. When deploying Kubeflow on Google Kubernetes Engine (GKE) for the first time, we encountered several hurdles that were not explicitly covered by the deployment guide, including allowing access to the Kubeflow API inside notebooks and specifying limits and requests for all Kubernetes Pods running the different Kubeflow components (Pipelines, TensorBoard, etc.).
  2. Model Deployment Complexity: Deploying models in Kubernetes involves scalability considerations, framework dependencies, access management, monitoring, and contingency planning. And even though we mentioned earlier that Kubeflow facilitates the use of Kubernetes without burdening data scientists with the complexity of these configuration steps, we can’t really assert that debugging container execution issues would be a piece of cake for them: for a data scientist using Kubeflow for the first time, with little to no prior knowledge of Kubernetes, this can be a challenging journey.
  3. Cost Implications: While Kubeflow is open source, it incurs costs associated with maintaining infrastructure, including container environments and computing resources. This upfront investment and the ongoing expenses might not be feasible for all companies, as deploying the full suite of Kubeflow components and add-ons requires considerable resource allocation (roughly 30 Pods in the Kubeflow namespace alone!).

Understanding Kubeflow’s Global Architecture

The architecture consists of several key components, each serving a specific purpose:

  1. Kubeflow Pipelines defines and manages machine learning workflows.
  2. For model serving, KServe ensures seamless and scalable deployment.
  3. Katib, on the other hand, handles hyperparameter tuning.

The entire ecosystem is designed to be modular, allowing for flexibility and extensibility, and it embraces standardization to enhance interoperability with other tools.

Let’s explore the key elements that make up the global architecture of Kubeflow, highlighting the essential solutions and integrated components:

Kubernetes Layer

Kubernetes, the container orchestration platform, lies at the heart of Kubeflow’s architecture. It serves as the underlying infrastructure, managing the deployment, scaling, and operation of containerized applications. Let’s go through the Kubernetes objects shown in the diagram:

  • Dex: Deployed as a Kubernetes pod, Dex helps manage user authentication and authorization, allowing users to log in securely and access Kubeflow components with appropriate permissions.
  • MinIO: Deployed as a Pod, MinIO integrates with Kubeflow Pipelines to store and retrieve artifacts (see Part 2 for more details) during the machine learning workflow. It provides a scalable and distributed storage solution for data used in ML experiments.
  • Secrets: Secrets are employed in Kubeflow to securely store and manage sensitive information required by different components. For instance, API keys or credentials necessary for accessing external services like databases or cloud storage might be stored as secrets.
  • Pods: Pods are the fundamental units of deployment in Kubernetes, encapsulating one or more containers. In Kubeflow, pods are utilized to run different components, such as training jobs, serving containers, and various microservices. For example, jobs can run in pods to execute machine learning model training.
  • Metadata Store: Though not a specific Kubernetes object, it is a crucial part of managing metadata related to machine learning workflows. Metadata Stores are employed to track and manage information about experiments, pipeline runs, and associated artifacts. While specific implementations may vary, a Metadata Store is typically used to store information about the state and history of machine learning workflows.

Kubeflow Components Layer

  • Central Dashboard: The Central Dashboard provides an intuitive web-based interface for users to interact with and monitor their machine learning experiments, pipelines, and deployed models. It exposes the User Interfaces (UIs) of Kubeflow components running in your cluster.
  • Notebook Servers: Integrated notebook servers allow data scientists to create and experiment with their machine learning models interactively. These notebooks are based on Jupyter.
  • Kubeflow Pipelines: KFP is a critical component that facilitates the end-to-end orchestration of machine learning workflows. Our focus in this article is on creating and managing pipelines within Kubeflow; we will explore this component in detail in the following section.
  • Kubeflow Katib: Katib focuses on hyperparameter tuning, automating the optimization of machine learning models by efficiently searching through hyperparameter spaces.
  • Kubeflow KServe: KServe within Kubeflow handles the deployment, scaling, and management of machine learning models as real-time serverless inference services.

Now that we have gone through Kubeflow’s global architecture, we will focus, for the remainder of this article and the following one, on Kubeflow’s main component: Kubeflow Pipelines.

Overview of Kubeflow pipelines

Kubeflow Pipelines is at the heart of Kubeflow’s workflow orchestration capabilities.

A pipeline is a declarative and reproducible description of an end-to-end machine learning workflow. It defines a series of steps, each encapsulating a specific task or operation within the machine learning process, such as data preprocessing, model training, or evaluation.

DAG visualization in Kubeflow UI

How does a pipeline work in Kubeflow?

  1. Component-Based Structure: A pipeline is composed of individual components, each representing a discrete unit of work. These components are encapsulated in containers or as Python functions, making them modular, reusable, and easily maintainable.
  2. Directed Acyclic Graph (DAG): The components in a pipeline are connected in a directed acyclic graph (DAG), specifying the flow of data and dependencies between the different steps. This ensures a logical and efficient sequence of operations during the execution of the pipeline.
  3. Input and Output Artifacts: Components within a pipeline communicate through well-defined input and output artifacts. Artifacts can be datasets, models, or any other data produced or consumed by a step. This enables a clear specification of data dependencies between pipeline components.
  4. Parameterization: Pipelines support parameterization, allowing users to define configurable parameters for components. This flexibility enables the same pipeline to be reused with different input configurations, facilitating experimentation and hyperparameter tuning.
  5. Pipeline DSL and YAML Specification: Kubeflow Pipelines provide a Domain-Specific Language (DSL) for defining pipelines programmatically in Python. Additionally, the pipeline configuration can be represented in YAML format, serving as a platform-neutral Intermediate Representation (IR) that ensures cross-platform portability.
  6. Execution on Kubernetes: Kubeflow Pipelines leverage the underlying Kubernetes infrastructure for container orchestration. Each pipeline step is executed as a containerized workload on a Kubernetes pod.
  7. Logging and Monitoring: During pipeline execution, Kubeflow Pipelines generate logs and metrics that are essential for monitoring and troubleshooting. The Kubeflow Pipelines UI provides a user-friendly interface for visualizing the progress, logs, and metrics of each step in real-time.
  8. Reproducibility and Versioning: Pipelines contribute to reproducibility by capturing the entire workflow as code. Additionally, the ability to version control pipeline definitions ensures traceability and reproducibility across different iterations of the machine learning workflow.

Key advantages of Kubeflow pipelines

From this technical description of the pipeline mechanism, we can deduce the following advantages:

  • Author Comprehensive ML Workflows in Python: Kubeflow Pipelines (KFP) facilitates the seamless creation of end-to-end machine learning workflows directly in Python, enabling a familiar and expressive programming environment.
  • Build Customized ML Components or Leverage an Extensive Ecosystem: Users can craft fully customized machine learning components tailored to their specific needs or choose from a rich ecosystem of pre-existing components.
  • Effortlessly Manage, Track, and Visualize: KFP provides robust tools for the effortless management, tracking, and visualization of pipeline definitions, runs, experiments, and machine learning artifacts.
  • Optimize Resource Utilization: With KFP, users can efficiently utilize compute resources by leveraging parallel task execution and implementing caching mechanisms, effectively eliminating redundant executions.
  • Ensure Cross-Platform Pipeline Portability: KFP ensures the portability of machine learning pipelines across different platforms, which promotes seamless deployment and execution across diverse environments.
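To illustrate the caching point in isolation: an orchestrator can fingerprint each step by its name and inputs and reuse the stored result on a hit. This is plain Python mimicking the behaviour conceptually, not the KFP implementation itself.

```python
import hashlib
import json

_cache: dict[str, object] = {}
executions = []  # records which steps actually ran

def run_step(name, fn, **inputs):
    """Run a step, or reuse a cached result if the same step was
    already executed with identical inputs (KFP-style caching)."""
    key = hashlib.sha256(
        json.dumps({"step": name, "inputs": inputs}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        executions.append(name)
        _cache[key] = fn(**inputs)
    return _cache[key]

clean = lambda rows: rows - 2
first = run_step("preprocess", clean, rows=100)
second = run_step("preprocess", clean, rows=100)  # cache hit: not re-run
third = run_step("preprocess", clean, rows=50)    # new inputs: runs again

print(executions)  # -> ['preprocess', 'preprocess'] (only two real runs)
```

In a real cluster the payoff is larger than in this toy: a cached training step can save hours of compute when only a downstream evaluation step has changed.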

Conclusion

In conclusion, we’ve explored various facets of using Kubeflow as an end-to-end platform for building and managing machine learning pipelines. We showcased its capabilities, discussed potential challenges, and highlighted crucial considerations for practitioners. To summarize, Kubeflow emerges as a comprehensive, open-source solution offering seamless integration with Kubernetes, robust experiment tracking, and efficient component orchestration.

In the next article, we will go through a hands-on exploration of Kubeflow Pipelines, putting into practice the theoretical insights discussed in this first part.

References

Kubeflow official documentation: https://www.kubeflow.org/docs/

Wikipedia: https://en.wikipedia.org/wiki/Kubeflow
