
Kubeflow: Simplified, Extended and Operationalized

yaron haviv
Towards Data Science
13 min read · Nov 12, 2020


The success and growth of companies can be deeply intertwined with the technologies they use in their tech stack. Nowhere is this more apparent than in the case of developing ML pipelines.

The nature of delivering robust ML models and data pipelines to production is a complex business. Every data science team is faced with the same limitations and constraints, regardless of the scope of a given project:

  • Feature analysis and preparation at scale is hard
  • Running many ML training tasks, each with different parameters or algorithms, takes too much time and requires expensive resources
  • Tracking and reproducing experiments, along with the right versions of code, data and results, is complex
  • Transferring data science notebook code to production pipelines takes a lot of effort
  • Production pipelines need to handle many operational tasks for scalability, availability, automated model drift detection, integration with live data, security, etc.

Building a model is far from the end of the journey. To deploy AI-enabled applications to production, you’ll need an army of developers, data engineers, DevOps practitioners and data scientists, ideally collaborating on a single platform.

There aren’t many solutions in this emerging field that cover this particular space, and Kubeflow has gained momentum as the open-source leader in the past few years. Kubeflow is a wonderful tool for orchestrating complicated workflows running on Kubernetes, but it also poses challenges, especially for data scientists who aren’t accustomed to working with these types of solutions. Many ML practitioners find it complex to use, encounter various usability issues and become frustrated with functionality gaps.

Luckily, it’s possible to extend the capabilities of Kubeflow and transform it into a comprehensive, easy-to-use MLOps platform. At Iguazio, we’ve done just that. We’ve embraced Kubeflow. Our approach has been to add it to our managed services catalog and bridge the functionality gaps that exist, wrapping and extending its capabilities with new open-source frameworks we’ve developed.

Why Do We Need an MLOps Solution?

The Life of a Data Scientist: Tool Wrangler

Data scientists generally go through the following process when developing models:

  1. Data collection from CSV or dense formats (e.g. Parquet) which require manual extraction from external sources. Tools used: ETL and data exporting tools.
  2. Data labelling, exploration and enrichment to identify potential patterns and features. Tools used: Interactive tools for data labelling, exploration and model development; analytics tools/databases to aggregate, join and pivot data at scale.
  3. Model training and validation. Tools used: Parallel model training frameworks.
  4. Model evaluation using automation and/or visualization tools. Tools used: CI/CD frameworks, visualization libraries.
  5. Go back to step 1 and repeat until the desired outcomes have been achieved.
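The loop above can be sketched end to end in plain Python. This is a deliberately tiny, standard-library-only illustration (the CSV data, column names and least-squares "training" are all invented for the example); a real pipeline replaces each function with the tools listed in the steps.

```python
# Toy sketch of the 5-step loop: collect -> explore/enrich -> train ->
# evaluate -> repeat. Standard library only; every name here is made up.
import csv, io

RAW = """sqft,price
1000,200
1500,300
2000,390
2500,510
"""

def collect(raw):
    """Step 1: 'extract' rows from a CSV source."""
    return [dict(r) for r in csv.DictReader(io.StringIO(raw))]

def explore(rows):
    """Step 2: enrich each row with a derived feature."""
    for r in rows:
        r["sqft"], r["price"] = float(r["sqft"]), float(r["price"])
        r["price_per_sqft"] = r["price"] / r["sqft"]
    return rows

def train(rows):
    """Step 3: fit price ~ sqft by ordinary least squares."""
    xs = [r["sqft"] for r in rows]
    ys = [r["price"] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def evaluate(model, rows):
    """Step 4: mean absolute error (step 5 is looping back from here)."""
    slope, intercept = model
    errs = [abs(slope * r["sqft"] + intercept - r["price"]) for r in rows]
    return sum(errs) / len(errs)

rows = explore(collect(RAW))
model = train(rows)
mae = evaluate(model, rows)
```

Each step here is trivially small, but the shape of the loop is the same at any scale — which is exactly why the repeated manual iteration becomes expensive.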

This results in a lot of time-consuming and complexity-snowballing legwork. It’s imperative that data science teams introduce some method into this madness if they have any intention of ever getting a model out of the lab.

It’s Just the Beginning

But this long process is just the first step! You’ll still need to operationalize the model, tie it to applications, and automate your feature engineering process. This can take a year or more and requires an army of ML practitioners attacking the problem with brute force.


Once we have a model, there are still many steps to go through to deliver AI/ML-based applications to production, including packaging, scaling, performance tuning, instrumentation and automation.

Kubeflow is the ML Toolkit for Kubernetes

Kubeflow is an open-source project originated by Google that brings together leading ML frameworks on Kubernetes. It provides common installation tooling and validated integration between its components, though each tool can also be deployed standalone. It acts as the brick and mortar for model development, focusing on automating, scaling and tracking ML model development pipelines over Kubernetes.

Notable Kubeflow Components:

Containerized Jupyter Notebooks

Kubeflow provides a Jupyter Notebook service over Kubernetes, so users can request a Jupyter server with given resource requirements and have the containers provisioned for them.

Kubeflow Pipelines (KFP)

What is Kubeflow Pipelines? Simply put, it’s a description of an ML workflow with all of its components. Each component is responsible for a different step in the ML process, such as data processing, transformation, model training or validation.

KFP enables you to describe a workflow with individual steps (where each step is a container microservice or serverless function) through a Python SDK, execute that workflow on the cluster, and track the progress and experiment results.

KFP enables users to automate ML pipelines instead of running individual jobs manually.
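To make the idea concrete, here is a toy stand-in for what KFP does, using only the standard library. This is not the KFP SDK (in real KFP each step is a container or serverless function and the SDK compiles and submits the graph to the cluster); it only sketches the mechanism: steps are functions, a dependency map forms the DAG, and a tiny scheduler executes them in order while tracking results.

```python
# Toy pipeline runner: functions as steps, a dict of dependencies as
# the DAG, topological execution with result tracking. All step names
# and outputs here are invented for illustration.
import graphlib  # stdlib topological sorter, Python 3.9+

def ingest():     return {"rows": 100}
def transform(c): return {"rows": c["rows"], "features": 12}
def train(c):     return {"model": "v1", "accuracy": 0.91}
def validate(c):  return {"passed": c["accuracy"] > 0.9}

steps = {"ingest": ingest, "transform": transform,
         "train": train, "validate": validate}
deps = {"ingest": set(), "transform": {"ingest"},
        "train": {"transform"}, "validate": {"train"}}

def run_pipeline(steps, deps):
    """Execute steps in dependency order, passing upstream outputs in
    and recording every step's result (the 'experiment tracking')."""
    results = {}
    for name in graphlib.TopologicalSorter(deps).static_order():
        inputs = {k: v for d in deps[name] for k, v in results[d].items()}
        results[name] = steps[name](inputs) if deps[name] else steps[name]()
        print(f"step {name}: {results[name]}")
    return results

results = run_pipeline(steps, deps)
```

KFP does the same thing at production scale: each node becomes a container on Kubernetes, and the run history lands in the experiment tracker instead of a local dict.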

Scalable Training Operators

Application lifecycle and scaling in Kubernetes can be managed through an operator. The Kubeflow project hosts various operators to run scalable ML training tasks using different ML frameworks (TensorFlow, PyTorch, Horovod, MXNet, Chainer).

Model Serving Solutions

The Kubeflow project hosts a basic model serving framework (KFServing) and supports external serving frameworks (Seldon, Triton and Nuclio-serving). Iguazio’s leading open-source project, Nuclio, a serverless engine for real-time APIs, ML and data engineering tasks, has a model serving flavor with fast performance and a vast set of features (automated model deployment, canary rollouts, model monitoring, drift and outlier detection, ensembles, real-time pipelines, etc.).

Operationalizing Kubeflow: Key Challenges

Kubeflow is itself a component of a broader system of ML operationalization. Kubeflow contains some powerful components that make it a frontrunner in its space — it’s Kubernetes native, highly customizable, open-source and loosely coupled. But it has some key limitations that need to be addressed by an ML team.

Challenge 1: Complexity

Kubeflow is an ecosystem of tools rather than a holistic or integrated solution, and it relies on many underlying platform services and tools (Kubernetes, user management, data services, data versioning, monitoring, logging, API gateways, etc.). Without a managed solution, companies need a team of developers, DevOps and ML engineers to integrate the various Kubeflow and add-on components into a complete service with unified portals and security; manage the deployment; support internal users; upgrade each of the sub-services; build tools and expertise for troubleshooting; and follow the community’s work on an ongoing basis. This may make sense if you already have a large DevOps organization and inexpensive resources, but it won’t for many leaner organizations that want to focus on their business applications.

Let’s assume you want to run hyperparameter jobs using a tool like Katib, together with Kubeflow Pipelines and a distributed ML operator like MPIJob, and then track or compare the results. You would need to stitch all of those tools together manually, even though they are all child projects under Kubeflow.
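For a sense of what that manual stitching involves, here is a toy grid search with hand-rolled result tracking (the parameters and the scoring function are invented for the example); this is the bookkeeping that Katib, KFP and a training operator would otherwise have to share among themselves.

```python
# Toy hyperparameter sweep: enumerate a parameter grid, "train" once
# per combination, record every run, then pick the best. In a real
# stack each train_once call is a distributed job you must launch,
# monitor and collect results from yourself.
import itertools, json

def train_once(lr, depth):
    """Stand-in training job: returns a fabricated score per combo."""
    return round(1.0 - abs(lr - 0.1) - 0.01 * abs(depth - 6), 4)

grid = {"lr": [0.01, 0.1, 0.5], "depth": [4, 6, 8]}
runs = []
for lr, depth in itertools.product(grid["lr"], grid["depth"]):
    runs.append({"lr": lr, "depth": depth, "score": train_once(lr, depth)})

best = max(runs, key=lambda r: r["score"])
print(json.dumps(best))
```

The sweep itself is nine lines; the hard part in practice is everything around it — launching each run on the cluster, persisting the `runs` list somewhere durable, and comparing results across team members.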

Installing Kubeflow properly means running and managing more than a dozen individual services, some of which are not production grade. ML practitioners who like the concept of a single AI/ML framework over Kubernetes find themselves struggling to deploy, manage and secure it. Once they do, they notice their internal users are reluctant to use it due to usability and integration issues.

Challenge 2: Kubeflow Speaks the Language of Engineers, Not Scientists

The core focus of Kubeflow is advanced ML engineering using Kubernetes. It was originally designed by Kubernetes community members for DevOps and MLOps engineers, rather than for data scientists.

Like every other Kubernetes functionality, it all starts with Docker containers, YAML files and scripts. Data scientists are the main consumers of a data science platform. They don’t want to manage YAML files; they’d rather use a simple UI, write their data engineering or ML logic in Python, and have something run it for them on a cluster and collect the results, ideally in a scalable and managed way.

While running ML tasks in distributed pipelines is powerful, data scientists usually start off using a local IDE like Jupyter, PyCharm or VSCode. They want to develop, debug, run and track their work locally without creating Kubernetes resources or complex pipelines, and only after the individual runs work successfully do they (or an ML engineer on the team) want to transition to running on a scalable cluster or using multi-stage pipelines. This is unfortunately not possible in Kubeflow, as every tracked execution requires building a full pipeline and Kubernetes resources.

Data science teams would like to organize and manage their work based on logical ML projects, with clear project-level membership, project-specific resources and billing, the ability to isolate resources and security credentials between projects, etc. In Kubeflow, users can create jobs with elevated privileges, and there is no isolation, monitoring or tracking at the user or project level.

And again, many tasks are complex and manual. For example, if you want to view the results of a pipeline run, you must write code to store and visualize the run artifacts; most commercial or managed tools just do that for you.

The data science team should be able to use familiar tools and simple UI portals, and have resources be provisioned and monitored for them under the hood.

Challenge 3: Kubeflow Is A Partial MLOps Solution

The Kubeflow toolkit applies mainly to the model development stage, meaning teams need several other mandatory services, adding further overhead. To get to business value fast (or at all), teams need to extend Kubeflow functionality to the rest of the data science pipeline and manually integrate with external data and identity management systems.

The MLOps Stack and Kubeflow (image by author)

Kubeflow focuses on the model development pipeline for running and tracking ML experiments. However, most users need additional services, as outlined in the picture above: scalable data and feature engineering, model management, production/real-time pipelines, versioned and managed data repositories, managed runtimes, authentication/security, etc.

Iguazio developed open-source frameworks (like Nuclio, MLRun) and additional data services and management layers which complement and extend Kubeflow into a fully functional, managed data science platform. This is further described in the section below.

Delivering an End-to-End Data Science Platform, Kubeflow Inside

Much of the complexity in delivering data-science-based products to production comes from the integration between different organizational and technology silos. Taking individual “best of breed” components and integrating them manually requires a significant amount of resources, preserves the organizational silos, and leads to huge technical debt, since you end up maintaining all the integration and glue layers yourself.

We believe that organizations should deploy a data science solution which spans the data engineering, data science, ML engineering and DevOps space. We need to abstract away much of the complexity while enabling a high-performing, scalable and secure platform which can be deployed in any cloud or on-prem.

Here’s a list of the additional functionality found on the Iguazio platform (much of it is open source):

Integrated Solution with User-Friendly Portal, APIs and Automation

Usability is one of the main challenges in Kubeflow, compounded by the fact that Kubeflow only addresses parts of the solution (no offline development, data engineering or production deployments). Users find themselves working against multiple portals and UIs, and adding a lot of glue code.

When all the data science elements and metadata are managed by one service, you can create a feature-set object in the feature store, pass it as an object to training, validation or model serving logic, use it to validate your model’s behavior and detect concept drift, and update it and the corresponding labeled data directly from the production pipeline. Achieving the same with glue logic and scripts is almost impossible and leads to significant extra development and ML engineering effort.

The open-source framework called MLRun covers the entire flow from data ingestion to production pipelines. It delivers a comprehensive web UI, CLI and API/SDK, and is tightly integrated with the underlying Kubeflow components (operators, pipelines, etc.).

With MLRun, users can work through the SDK and IDE plug-ins from anywhere using their native IDE (Jupyter, PyCharm, VSCode, etc.). There’s no need to run on the Kubernetes cluster, and you can run individual tasks or complete batch and real-time pipelines with a few simple commands.
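The “develop locally first” workflow can be illustrated with a toy run dispatcher. This is not the MLRun SDK (whose API differs); it only sketches the pattern: the same handler runs in-process while you debug and is tracked as a run, and flipping a flag would later hand it to a cluster backend without touching the handler code.

```python
# Toy "run locally, scale out later" dispatcher. Every name here is
# invented for illustration; only the pattern matters: the handler is
# ordinary Python, and the execution/tracking wrapper decides where it
# runs and records the result.
import time, uuid

def run(handler, params, local=True):
    """Execute a handler and record the run. local=False would submit
    the same handler to a cluster scheduler (not wired up in this toy)."""
    if not local:
        raise NotImplementedError("cluster backend not implemented here")
    start = time.time()
    output = handler(**params)
    return {"uid": uuid.uuid4().hex, "params": params,
            "output": output, "seconds": round(time.time() - start, 3)}

def train(lr=0.1, epochs=3):
    """Stand-in training handler with a fabricated accuracy."""
    return {"accuracy": 0.9 + 0.01 * epochs - lr / 10}

record = run(train, {"lr": 0.05, "epochs": 5}, local=True)
print(record["output"])
```

The key design point is that `train` never knows or cares where it executes — that is what makes the later transition from laptop to cluster a configuration change rather than a rewrite.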

Execution, experiment, data and model tracking and automated deployment are handled automatically by MLRun’s serverless runtime engines. MLRun maintains a project hierarchy with strict membership and cross-team collaboration (rather than the flat namespace used in Kubeflow).

Example: MLRun project dashboard screen (image by author)

You can deploy the various components as described in the MLRun documentation, or alternatively use Iguazio’s managed offering, which adds end-to-end security and a fully managed service experience on top of MLRun and Kubeflow, plus an additional set of data and management services.

Feature Stores

Feature engineering can be one of the most time-intensive tasks in building ML pipelines. Without a shared catalog of all features, teams risk wasting time on redevelopment and duplicated work across teams and projects. Features need to be computed regularly using data engineering frameworks (Spark, Presto, Dask, etc.) and stream processing engines (Flink, Nuclio, etc.), and delivered through offline/batch APIs for training and through real-time APIs and databases for serving. One key benefit of feature stores is the delivery of a consistent feature set to both the training and serving layers, ensuring that trained models maintain their performance in production.
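That training/serving consistency guarantee can be sketched in a few lines. This is a toy, dict-backed store (not any real feature store API): one authoritative feature definition feeds both the offline path used to build training sets and the online path used at inference time, so the two can never drift apart on transformation logic.

```python
# Toy feature store: a single registered definition per feature, read
# through an offline (batch/training) path and an online (serving)
# path. Feature and key names are invented for the example.
class FeatureStore:
    def __init__(self):
        self.definitions, self.online = {}, {}

    def register(self, name, fn):
        """One authoritative transformation per feature name."""
        self.definitions[name] = fn

    def ingest(self, name, key, raw):
        """Compute and cache the online value for low-latency serving."""
        self.online.setdefault(name, {})[key] = self.definitions[name](raw)

    def offline(self, name, raw_rows):
        """Batch path: materialize the feature for a training set."""
        return [self.definitions[name](r) for r in raw_rows]

    def serve(self, name, key):
        """Online path: key-value lookup at inference time."""
        return self.online[name][key]

store = FeatureStore()
store.register("spend_per_visit", lambda r: r["spend"] / r["visits"])

history = [{"spend": 100, "visits": 4}, {"spend": 30, "visits": 3}]
train_vals = store.offline("spend_per_visit", history)
store.ingest("spend_per_visit", "user42", {"spend": 100, "visits": 4})
# Same raw input yields the same value on both paths:
assert store.serve("spend_per_visit", "user42") == train_vals[0]
```

Real feature stores add the hard parts this sketch omits — scheduled recomputation, versioning, statistics and a low-latency database — but the single-definition principle is the core of the consistency claim.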

Furthermore, feature stores are a critical element for storing versioned feature metadata and statistics, which are essential to conducting model monitoring, validation, concept drift detection and continuous retraining.

Building your own feature store is a luxury usually reserved for tier-one web companies due to the complex stack it entails. Going without one means the team will invest a significant amount of manual data engineering work to build and maintain features.

Real-Time Pipelines, Model Management and Monitoring

Once models are ready, they need to turn into running microservices that accept feature vectors through APIs or data streams and produce prediction results for production business applications. In reality, production pipelines are far more complex than a simple serving endpoint: they may include steps for feature retrieval, transformations, validation, outlier detection, serving of multiple models as an ensemble or a more complex graph, alerting, etc.

Nuclio and MLRun-serving open-source frameworks enable teams to build complex real-time data processing and model serving pipelines from the training results and data from the real-time feature store in a matter of minutes, and deploy or upgrade to production with a single API/UI command.
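A real-time serving graph of this kind can be sketched as a chain of step functions. This is a toy illustration only (Nuclio and MLRun-serving build, scale and deploy such graphs as real microservices; all names and the scoring logic here are invented): each request passes through validation, feature enrichment, a small model ensemble and an alerting check.

```python
# Toy serving graph: a request event flows through an ordered chain of
# steps, each annotating the event. In production each step can be a
# scaled microservice; here they are plain functions.
def validate(event):
    assert "user" in event, "missing user id"
    return event

def enrich(event):
    """Feature retrieval/derivation for this request (fabricated)."""
    event["features"] = [len(event["user"]), event.get("amount", 0)]
    return event

def ensemble(event):
    """Average two stand-in models over the feature vector."""
    models = [lambda f: 0.1 * f[1], lambda f: 0.02 * sum(f)]
    event["score"] = sum(m(event["features"]) for m in models) / len(models)
    return event

def alert(event):
    """Flag high-risk predictions for downstream alerting."""
    event["flagged"] = event["score"] > 5.0
    return event

def serve(event, steps=(validate, enrich, ensemble, alert)):
    for step in steps:
        event = step(event)
    return event

result = serve({"user": "u1", "amount": 120})
```

The graph shape is the point: adding a drift-detection or logging step is one more function in the chain, not a redesign of the service.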

Getting a model to production is only the beginning of its lifecycle. A model’s predictions and behavior must be logged, they must be monitored for concept drift, and sometimes retrained on fresh data under shifting circumstances. Data requires governance and auditing. Without a centralized system in place, model maintenance can quickly become unmanageable and overly expensive.

Managed Services and Serverless Runtimes

Significant effort goes into creating and managing various software packages and turning them into production services with auto scaling, configuration management, upgrades and security. A typical data science stack can consist of dozens of services like data analysis tools, ML/AI frameworks, development tools, dashboards, security and auth services, and logging. You could hire a large team to build a stack manually, or you could use a platform with managed services that allow users to choose a service, specify params and click deploy.

Furthermore, turning code into production services is a time-consuming project involving many team members (see figure below). You need to package the code into containers, re-engineer it for resiliency, scale, performance and hardened security, add instrumentation, etc. Many data scientists aren’t proficient in software development and DevOps, so these tasks can be even more challenging for them. The use of serverless technologies enables a team to automate all of these tasks.

MLRun runtimes and Nuclio can take a piece of code and automatically turn it into an elastic and fully managed service, and this can be done from a UI or from within your development environment (like Jupyter) with a single line of code.

The serverless functions can run as standalone or can be plugged in as steps in a larger Kubeflow pipeline without any additional development. MLRun projects enable automated CI/CD pipelines. A workflow can be triggered through a Git hook every time code, data or configuration changes, and members of the project can review the results and approve automated deployment into test or production systems.

Effort of migrating from development to production (image by author)

Function Marketplace and AutoML

You can develop your own functions or use a set of prebaked functions from a marketplace and plug them into your pipeline with zero development effort. The MLRun function marketplace provides a set of AutoML functions for feature analysis, detection and automated ML training. The functions are designed for scale, performance and are ready for production with instrumentation and visual reports.

Teams can implement a local function marketplace/repository, which maximizes collaboration and reuse. One member can build a function that is then used in another project, or improved by another team member and returned to the marketplace (under a different version tag).

Production-Ready Managed Data Services

Every data science solution works on data, and data comes in many forms (files, tables, streams, time series, etc.) that require integration with several underlying data services. Data access should be controlled, with every job or user confined to specific datasets and specific access permissions; furthermore, data must be versioned for reproducibility.

You can obtain and integrate managed data services from the different cloud providers. Iguazio has an industry-leading high-performance multi-model database and a low-latency distributed file system, with unique technology that uses flash/SSD storage but performs like an in-memory DB at a fraction of the cost and with much higher scale. These ingredients enable faster time to deployment, faster performance, larger scale and reduced costs.

End-to-End Security

Data science platforms can contain very sensitive data, aggregated from multiple users. A multi-layered security approach must be implemented, including:

  1. Authentication and identity management: Determine which user or service is accessing a piece of data or another service, using industry standard protocols such as LDAP, OAuth2 and OIDC.
  2. Role Based Access Control (RBAC): Enforce the access rights of every user, task or role — which resources they can access and which operations they are allowed to perform.
  3. Data security: Control the access and privacy of data assets.
  4. Secrets management: Protect secrets and access credentials, avoid storing those secrets inside the code or the data assets, but make them available to specific users and jobs at execution time.
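Point 4 can be illustrated with a minimal sketch: the job resolves its credential from the environment at execution time (here we simulate the platform injecting it; the variable name is invented for the example), so the secret never lives in the code or in data assets.

```python
# Minimal secrets-at-execution-time sketch: the job reads an injected
# environment variable and fails loudly if the credential was never
# provisioned, rather than embedding the secret in source.
import os

def get_secret(name, default=None):
    """Fetch a secret injected into the job's environment."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"secret {name!r} was not provided to this job")
    return value

# Simulate the platform's secret store injecting the value before the
# job starts; in production the runtime does this, not the job itself.
os.environ["DB_PASSWORD"] = "s3cret"
password = get_secret("DB_PASSWORD")
```

The same code runs unchanged whether the secret comes from a Kubernetes Secret, a vault integration or a managed platform — only the injection mechanism differs.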

MLRun supports integration with security services. You can deploy and integrate them yourself, or use the Iguazio managed platform, where all of this is solidified and fully managed for you.

Summary

At Iguazio, we’re big believers in the potential of Kubeflow, which is why we’ve chosen to wrap it and extend it into a complete end-to-end MLOps solution. We also believe in breaking the silos between data engineering, data science and DevOps practitioners, so that complexity is abstracted away and organizations aren’t left with heavy technical debt. With some additional services and functionality, Kubeflow can be the right solution to the challenges of operationalizing machine learning — giving data scientists a high-performing, scalable and secure technology which can be deployed in any cloud, on-prem or in hybrid environments.
