Kubeflow Summit Europe 2024 ✨

Malek ZAAG
Published in Ubuntu AI
5 min read · Apr 9, 2024

In this blog post, I highlight some of the keynotes from Kubeflow Summit Europe 2024, which was held this year in Paris. Unfortunately, I couldn’t attend in person, but I recently watched the CNCF playlist on YouTube and put together a short wrap-up.

💻What is Kubeflow?

Kubeflow is a Kubernetes-native, open-source framework for developing, managing, and running machine learning (ML) workloads. Kubeflow is an AI/ML platform that brings together several tools covering the main AI/ML use cases: data exploration, data pipelines, model training, and model serving.

What is Kubeflow used for?

Kubeflow solves many of the challenges involved in orchestrating machine learning pipelines by providing a set of tools and APIs that simplify the process of training and deploying ML models at scale.

Kubeflow pipeline
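
To make the idea of a pipeline a bit more concrete, here is a minimal sketch in plain Python of a workflow expressed as a graph of dependent steps. It does not use the Kubeflow Pipelines SDK; the step names are hypothetical and only illustrate how an orchestrator can work out a valid execution order from the dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each step maps to the set of steps it depends on.
steps = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "evaluate_model": {"train_model"},
    "deploy_model": {"evaluate_model"},
}


def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(steps)
print(order)
```

In a real pipeline each step would run in its own container, and the orchestrator would schedule those containers on the cluster in an order like this one.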

Now that we have defined what Kubeflow is, let’s look at the keynotes and what they brought us this year:

🤖Scalable Platform for Training and Inference Using Kubeflow at CERN

The European Organization for Nuclear Research, known as CERN, is an intergovernmental organization that operates the largest particle physics laboratory in the world.

This talk went into the details of how a Kubeflow-based machine learning platform handles every step, from data preparation and interactive analysis to distributed training and inference.

The requirements at CERN:

  • The platform should manage the full machine learning lifecycle: using multiple services can be confusing and hard to integrate.
MLOps lifecycle
  • The platform needs to be integrated with CERN systems: auth, storage systems, etc.
  • The platform should be centralized to ensure easy and efficient access to GPUs and other accelerators.
Reasons for centralizing resources
  • The platform should be easy to use: many scientists are not infrastructure experts.

⚛How are MLOps and Kubeflow used at CERN?

ATLAS is one of two general-purpose detectors at the Large Hadron Collider (LHC). It investigates a wide range of physics, from the Higgs boson to extra dimensions and particles that could make up dark matter.

The use case was framed from a physicist’s point of view: “I want to find Higgs bosons in the recorded collisions to study them.”

This was their pipeline workflow for studying Higgs bosons:

CERN ATLAS pipeline

Salt: General-purpose software to train multi-modal, multi-task transformer models.

Katib: Used within Kubeflow to tune model hyperparameters.

Kubeflow Notebooks: Store notebooks to be run in containers.

Ceph: an open-source, distributed storage system.

Transforming Data Science at PepsiCo: The Kubeflow Revolution

Kubeflow is also used at PepsiCo, for several reasons:

  • We already have K8s clusters and an infrastructure team to maintain them
  • Lots of data deserves lots of models
  • Hyperparameter tuning -> Katib
  • Model serving -> KServe
  • Model training -> Training Operator
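
To give an idea of what serving a model with KServe looks like, here is a small sketch that builds an InferenceService manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The service name, model format and storage URI are made-up placeholders, not something shown in the talk.

```python
import json


def inference_service(name, model_format, storage_uri, namespace="default"):
    """Build a minimal KServe InferenceService manifest (v1beta1 API)."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    # Tells KServe which serving runtime to pick.
                    "modelFormat": {"name": model_format},
                    # Where the trained model artifact is stored.
                    "storageUri": storage_uri,
                }
            }
        },
    }


# Hypothetical values, for illustration only.
manifest = inference_service(
    name="demand-forecast",
    model_format="sklearn",
    storage_uri="s3://example-bucket/models/demand-forecast",
)
print(json.dumps(manifest, indent=2))
```

Once applied to a cluster where KServe is installed, a manifest like this asks KServe to pull the model from storage and expose it behind an HTTP endpoint.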

Why the need for Kubeflow?

There were several reasons for using Kubeflow at PepsiCo:

  • Production is PAINFUL
  • With all the gaps, Data Science was left to fend for itself.
  • A lot of inefficient work: going to production (or even staging) is a slog.

This led to the creation of several solutions that work with Kubeflow to bring the best to the AI/ML ecosystem, such as a monorepo for all Data Science/AI projects:

Monorepo benefits

Or even the Prometheus CLI that is built on top of the kfp SDK:

Prometheus CLI features

🔄Culture Shift at PepsiCo

  • None of the code we built matters without rethinking our relationship to Kubeflow.
  • If all we built was better tooling for a broken workflow, there would be no fundamental change.

The Good, the Bad, and the Missing Parts of Kubeflow

Kubeflow

😄The good parts:

  • Pipelines
  • Notebooks, Katib, KServe

😞The bad parts:

  • Documentation, tutorials, installation

🤔The missing parts:

  • Monitoring models
  • Model registry
  • Initial setup

What’s coming for Kubeflow?

  • Finish CNCF graduation
  • Establish a TOC (Technical Oversight Committee)
  • ARM64 support
  • Conformance testing

🧠AutoML & Training Working Group Updates

The AutoML working group (WG) is responsible for all aspects of AutoML features in Kubeflow, with Katib as its sub-project. Katib is a Kubernetes-native project with rich support for hyperparameter tuning, neural architecture search, and early stopping algorithms.
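
As an illustration of hyperparameter tuning with Katib, here is a sketch that builds a Katib Experiment manifest in Python and prints it as JSON. The experiment name, container image, `train.py` script and `accuracy` metric are assumptions of mine; with Katib's default metrics collector, the training script would be expected to print the metric to stdout (for example `accuracy=0.93`).

```python
import json


def katib_experiment(name, image, max_trials=12, parallel_trials=3):
    """Build a Katib Experiment (v1beta1) that random-searches a learning rate."""
    # Each trial runs as a plain Kubernetes Job; Katib substitutes the
    # ${trialParameters.*} placeholder with the value it sampled.
    trial_spec = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "training-container",
                "image": image,  # hypothetical training image
                "command": ["python", "train.py",
                            "--lr=${trialParameters.learningRate}"],
            }],
        }}},
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": "kubeflow"},
        "spec": {
            "objective": {"type": "maximize", "goal": 0.99,
                          "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "random"},
            "parallelTrialCount": parallel_trials,
            "maxTrialCount": max_trials,
            "maxFailedTrialCount": 3,
            # Search space: Katib expects min/max as strings.
            "parameters": [{
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.01", "max": "0.05"},
            }],
            "trialTemplate": {
                "primaryContainerName": "training-container",
                "trialParameters": [{
                    "name": "learningRate",
                    "description": "Learning rate for the model",
                    "reference": "lr",
                }],
                "trialSpec": trial_spec,
            },
        },
    }


experiment = katib_experiment("lr-random-search", "example.org/train:latest")
print(json.dumps(experiment, indent=2))
```

Katib then launches up to `maxTrialCount` trials, `parallelTrialCount` at a time, and keeps track of the best value of the objective metric.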

Katib features

Katib architecture

Katib’s future

Training Operator overview

Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, TensorFlow, XGBoost, and others.

Users can integrate other ML libraries such as Hugging Face, DeepSpeed, or Megatron with the Training Operator to orchestrate their ML training on Kubernetes.

Training Operator features
Example of distributed training for PyTorch
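
Since the original example was a slide, here is a rough sketch of my own of what a distributed PyTorch training job looks like for the Training Operator: a PyTorchJob manifest with one master and several workers, built in Python and printed as JSON. The job name, image and worker count are placeholders; the container is named `pytorch` because that is the name the operator looks for in a PyTorchJob.

```python
import json


def pytorch_job(name, image, workers=2, namespace="kubeflow"):
    """Build a minimal PyTorchJob manifest with 1 master and N workers."""

    def replica(count):
        # Master and workers run the same image; the operator injects the
        # environment variables PyTorch needs to set up distributed training.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [
                {"name": "pytorch", "image": image}
            ]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job name and image, for illustration only.
job = pytorch_job("mnist-ddp", "example.org/pytorch-mnist:latest", workers=3)
print(json.dumps(job, indent=2))
```

Scaling the training out is then mostly a matter of changing the number of worker replicas.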

Training Operator roadmap:

Conclusion:

With the current AI trend and the need for performant, cost-effective deployment strategies for ML models, Kubeflow can be an interesting option for companies that haven’t already migrated to cloud-native environments.


Was this helpful? Confusing? If you have any questions, feel free to contact me!
