HANDS-ON: MLOps

This is how you set up an MLOps platform on AWS EKS with Kubeflow and MLflow

Give your data science team the MLOps tools they need

Bartłomiej Poniecki-Klotz
Ubuntu AI

--

Kubeflow and MLflow architecture diagram deployed on top of AWS EC2 instances (on-demand and GPU spot). The Juju controller is bootstrapped on Kubernetes.
MLOps stack based on AWS EKS, Kubeflow and MLflow

Why do I need anything more than SageMaker for data science?

AWS SageMaker is a fantastic tool for data science which, together with other AWS services, supports the complete MLOps lifecycle. Unfortunately, it has a few shortcomings.

First, when you create a machine learning model on AWS, you use it on AWS only. This limitation makes it harder to deploy trained models across multiple clouds and implement multi-cloud strategies.

Second, we cannot use Confidential Computing Virtual Machines with SageMaker. Such VMs provide an additional layer of trust when handling sensitive data.

Third, SageMaker does not support training machine learning models at the edge. You can only train models in the cloud and then deploy them to edge devices. Unfortunately, data gravity sometimes forces data scientists to train models at the edge, and they cannot use SageMaker for this.

Using the open-source tools from this article, I provide environments for customers in industries like telco, public sector, banking and energy. We use AWS EKS for managed Kubernetes with on-demand and spot instances to provide GPUs cost-effectively.

Follow the article and get your hands dirty! We will build an open-source environment ready for data scientists to solve business problems.

At the end of it, you will have:

  • AWS EKS cluster
  • Charmed Kubeflow with Charmed MLflow
  • User onboarded with integrations to Minio, MLflow and Kubeflow Pipelines
  • Access to cheap GPUs using AWS Spot instances
  • Running code examples for training models using MLflow and Minio, scheduling Kubeflow Pipeline tasks on spot instances

Why Kubeflow?

Kubeflow can be deployed on any CNCF-compliant Kubernetes cluster, including the flavours provided by public clouds. This makes it possible to train models on one cloud and deploy the inference endpoints in multiple clouds, on-premises and at the edge. This flexibility is one of the critical features of open-source MLOps tools. At the same time, you keep access to all the goodies from the public cloud provider, primarily managed services and a rich ecosystem of APIs.

You are in the right place if your strategy involves using an AWS-managed Kubernetes cluster for a Data Science training environment. With capacity planning, Kubeflow runs workloads like notebooks, model training pipelines and inference endpoints efficiently. Additionally, spot instances are a great way to reduce the cost of providing GPUs for experimentation and accelerated workloads, starting from $0.2 per hour.

Why MLflow?

MLflow is an excellent addition to Kubeflow because it covers model and experiment management. Kubeflow, on the other hand, excels at providing experimentation environments, automated pipelines and model deployments. Additionally, models trained as part of MLflow experiments can be easily deployed to production using Seldon Core or KServe from Kubeflow. Together, they offer a complete end-to-end MLOps experience and stack.

We have a lot of exciting things to do. Let’s dive in!

Install tools

First, install all the needed CLI tools and packages: AWS CLI, eksctl, kubectl, Juju and jq.

There are incompatibility issues between newer versions of kubectl and an AWS EKS cluster running Kubernetes API version 1.24. To avoid them, install kubectl version 1.23, which works perfectly with this cluster version. Kubernetes API version 1.24 is the highest officially supported version for Kubeflow, and the community is currently working on supporting higher versions.
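The exact installation commands depend on your workstation. Below is a minimal sketch assuming an Ubuntu machine with snap; the snap channels and the eksctl download URL are assumptions, so adjust them to your setup:

# Juju CLI (the 2.9 track matches the controller version used later)
sudo snap install juju --classic --channel=2.9/stable

# kubectl pinned to 1.23 to stay compatible with the EKS 1.24 control plane
sudo snap install kubectl --classic --channel=1.23/stable

# AWS CLI and jq
sudo snap install aws-cli --classic
sudo apt-get install -y jq

# eksctl from the upstream GitHub releases
curl -sL "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin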

Fingers crossed!

AWS account and configuration

Second, you need an IAM user in an AWS account. eksctl uses this user to create and manage the AWS EKS cluster. Additionally, kubectl uses it to get credentials to access Kubernetes resources via the Kubernetes API.

When creating the AWS IAM user, you must attach proper roles. You can find the list of required permissions in the eksctl prerequisite documentation. Next, create AWS CLI user access credentials and configure the AWS CLI.

Additionally, eksctl uses a local public key to allow SSH access to the Kubernetes nodes. By default, it uses the key stored under ~/.ssh/id_rsa.pub (on Linux). If you do not have one yet, create it before using eksctl. Alternatively, you can supply a different public key during cluster creation.
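For reference, both steps boil down to two commands (the key type and path simply match the eksctl defaults mentioned above):

# Store the IAM user's access key, secret key and default region
aws configure

# Create the default SSH key pair used by eksctl, if it does not exist yet
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""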

If you prefer cloud-agnostic Kubernetes distribution, check how to create a Kubernetes cluster on any cloud here.

Create an EKS cluster

The third step is to create an EKS cluster using eksctl, which uses AWS CloudFormation to deploy resources. An AWS EKS cluster consists of a control plane and multiple node groups. AWS provides the control plane as part of the managed service. The node group configuration in the create cluster command translates into auto-scaling groups of EC2 instances. The cluster recognises each VM in the group and adds it as a node.

Kubeflow has high requirements for CPU, memory and disk: four cores, 32 GB of memory and 50 GB of disk. We create a cluster with a node group of t3.xlarge instances, using the default of two instances per node group, each with an 80 GB disk.

eksctl create cluster --name=bpk-ckf \
--region=eu-west-1 \
--version=1.24 \
--with-oidc \
--instance-types=t3.xlarge \
--ssh-access

Kubeflow is not a stateless workload; it creates volumes for notebooks, pipeline runs and object storage. Out of the box, the EKS cluster cannot provision persistent volumes, so to make it fully functional you need a proper storage configuration. The AWS EBS CSI plugin solves it in a two-step process. First, you create a role which the EC2 instances use to manage EBS volumes. Second, you enable the EKS add-on. That's all.

Remember to update the service account role in the script below. The ARN contains your AWS account number, so it’s different for each account.

eksctl create iamserviceaccount  \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster bpk-ckf \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve \
--role-only \
--role-name AmazonEKS_EBS_CSI_DriverRole

eksctl create addon --name aws-ebs-csi-driver \
--cluster bpk-ckf \
--service-account-role-arn arn:aws:iam::xxxxxxxxxx:role/AmazonEKS_EBS_CSI_DriverRole \
--force

After a few minutes, your EKS cluster is fully operational. It contains two nodes in Ireland (eu-west-1). The Kubernetes API version is 1.24.

kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-43-58.eu-west-1.compute.internal Ready <none> 9m44s v1.24.13-eks-0a21954
ip-192-168-90-196.eu-west-1.compute.internal Ready <none> 9m44s v1.24.13-eks-0a21954

The auto-scaling group created two t3.xlarge instances in different Availability Zones. To know which VM is which node, compare the instances' private IPs with the node names.
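One quick way to do the comparison without leaving the terminal is to list the nodes with their internal IPs, instance types and zones:

# Node names contain the private IPs; well-known labels expose type and zone
kubectl get nodes -o wide
kubectl get nodes -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone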

AWS console EC2 list with two t3.xlarge instances

Deploy Charmed Kubeflow

Charmed Kubeflow is a distribution of upstream Kubeflow wrapped in automation, provided and supported by Canonical. It uses Juju, Canonical's orchestration engine for software operators, to simplify deployment, integration between applications and day-2 operations like backups.

Juju supports deployment on bare metal servers, public cloud and Kubernetes clusters. The first step of every deployment using Juju is selecting the cloud and creating the controller. The controller provides an API to manage your deployment and executes your commands on the cloud. You can reuse the controller between deployments for efficiency.

Juju integrates with the AWS public cloud and uses the AWS CLI credentials to add a selected Kubernetes cluster as a cloud. The new cloud is then available to bootstrap a controller on. While bootstrapping a controller on top of the Kubernetes cluster, Juju creates a new namespace and a Pod, which acts as the API endpoint the Juju CLI uses to deploy related applications.

juju add-k8s --eks bpk-ckf-eks
juju bootstrap bpk-ckf-eks

You bootstrapped the controller and are ready to start the deployment. In the Juju ecosystem, single applications are called Charms. A set of related applications saved in YAML format is a Bundle. For more information about charms and bundles, check the Juju documentation.

I prepared a Bundle for you: the Charmed Kubeflow bundle integrated with MLflow. For convenience, I also filled in the static user and password. Remember to change the static password, because your Kubeflow installation will be available under a public URL.

Dex supports integration with LDAP and OIDC so that you can integrate it with your user management system.

juju config dex-auth static-password=xxxxxx

In the bundle, we also declare relations, which we use to integrate applications with each other. The relation between MLflow and object storage provides a way to store and view models logged as part of an experiment. The relation between MLflow and Istio allows you to access the MLflow UI under the same URL as Kubeflow.
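Once the bundle is deployed, you can list the established relations directly from the CLI:

# Show the applications together with the relations between them
juju status --relations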

Now, download the bundle and deploy it in the “kubeflow” model (Kubernetes namespace).

wget https://gist.githubusercontent.com/Barteus/bcd7e3d37d093a3b556cebec4cb3adfe/raw/251d7f3e39a2b40e694790a2270313e4f32c1a5c/ckf-mlflow-bundle.yaml
juju add-model kubeflow
juju deploy ./ckf-mlflow-bundle.yaml --trust

Juju CLI provides a convenient way to check deployment status.


watch -c juju status --color

The Charmed Kubeflow deployment needs one manual intervention. After the istio-pilot charm reaches the active/idle state, one of the other charms may still report that the gateway is missing. Check whether the gateway exists in the "kubeflow" namespace. If not, trigger its creation manually; the gateway is then created within a few seconds.

kubectl get gateway -n kubeflow
#empty


juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"

kubectl get gateway -n kubeflow
NAME AGE
kubeflow-gateway 6s

The last step before you see the Kubeflow login screen is updating the public-url. Dex and OIDC Gatekeeper use the public URL to allow access only via a known IP or DNS domain. This required security configuration helps prevent impersonation. Additionally, Kubeflow's internal services use the same URL for communication.

In the deployment, a Service of type LoadBalancer called istio-ingressgateway-workload exposes Kubeflow outside the cluster. EKS creates an Elastic Load Balancer (ELB) for you. The ELB has a DNS record assigned, which you use as the public URL for the Kubeflow Dashboard.

PUBLIC_URL="http://$(kubectl get svc istio-ingressgateway-workload -n kubeflow -o json | jq -r '.status.loadBalancer.ingress | first | .hostname')"
echo PUBLIC_URL: $PUBLIC_URL

juju config dex-auth public-url=$PUBLIC_URL
juju config oidc-gatekeeper public-url=$PUBLIC_URL

You configured Charmed Kubeflow, so it's time to go to the login page, which is available under the PUBLIC_URL. When you go to this URL, you see the Dex login page.

Kubeflow login page with email address and password fields
Kubeflow login page

For now, do not log in. We will use a Juju action to initialise your user profile first.

Initiate admin profile

Juju actions help with day-2 operations like initialising profiles in Kubeflow or database backup and restore. They do not require you to install or configure additional applications, because the charm provides everything in the required versions.

You will use one of the Juju actions defined in Charmed Kubeflow to initialise your user profile with Kubeflow Notebook integrations and a secret for Seldon Core. The action adds PodDefaults to the user namespace. PodDefaults let the user inject MLflow and MinIO credentials as environment variables into a notebook during its creation. No more passwords in the notebooks! This step is optional but makes your Data Scientists' lives easier.

Each Juju action has a schema and documentation attached to the deployed charm. You can check the details on Charmhub, in the charm repository, or via the Juju CLI. Below, you can see how to list all actions available in the charm and which properties they accept.

$ juju actions kubeflow-profiles --schema
create-profile:
  description: Create a new profile under an authenticated user and apply configurations
    to the profile to allow using Minio, MLFlow, and Seldon.
  properties:
    profilename:
      description: the name of the new profile to be created
      type: string
    resourcequota:
      description: (Optional) resource quota for the new profile
      type: string
    username:
      description: the name of the authenticated user under which the new profile
        will be added
      type: string
  required:
  - username
  - profilename
  title: create-profile
  type: object
initialise-profile:
  description: Apply configuration to an existing profile to allow using Minio, MLFlow,
    and Seldon.
  properties:
    profilename:
      description: the name of the existing profile to be configured
      type: string
  required:
  - profilename
  title: initialise-profile
  type: object

If you want to see the already created profiles, run the following command. Each profile has a Kubernetes namespace linked with it. If you remove the namespace, the Kubeflow Profiles Controller recreates it, but you lose all namespaced resources like Secrets, Pods and Volumes.

$ kubectl get profile
No resources found

There are no profiles, so let's create a new one. Actions are executed by Juju units; when deploying applications on top of Kubernetes, units are Pods. The next step is to provide the properties of the action. You create a profile for the user called "admin"; the namespace's name is also admin.

All actions run asynchronously by default. Asynchronous execution is handy for long-running operations like backups. The action we use finishes quickly, so we add "--wait" at the end to wait for the results. When executed in asynchronous mode, actions return an action_id instead of results; you use the action_id to check whether the action has finished and to fetch its results.
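If you skip the "--wait" flag, the same action returns an id that you can query later. A sketch for the Juju 2.9 CLI used here; the id placeholder is whatever Juju prints back:

# Queue the action asynchronously; Juju prints an action id
juju run-action kubeflow-profiles/0 create-profile profilename=admin username=admin

# Later, check the status and results using the returned id
juju show-action-output <action-id>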

$ juju status kubeflow-profiles
Model Controller Cloud/Region Version SLA Timestamp
kubeflow bpk-ckf-eks-eu-west-1 bpk-ckf-eks/eu-west-1 2.9.42 unsupported 11:17:46+02:00

App Version Status Scale Charm Channel Rev Address Exposed Message
kubeflow-profiles active 1 kubeflow-profiles 1.7/stable 269 10.100.161.205 no

Unit Workload Agent Address Ports Message
kubeflow-profiles/0* active idle 192.168.54.198

$ juju run-action kubeflow-profiles/0 create-profile profilename=admin username=admin --wait
unit-kubeflow-profiles-0:
UnitId: kubeflow-profiles/0
id: "4"
log:
- 2023-05-26 11:27:53 +0200 CEST Running action create-profile with parameters username=admin,
profile_name=admin, resource_quota=None
- 2023-05-26 11:27:53 +0200 CEST Profile doesnt exist, action will proceed to create
profile.
- 2023-05-26 11:27:54 +0200 CEST Profile admin created.
results: {}
status: completed
timing:
completed: 2023-05-26 09:27:57 +0000 UTC
enqueued: 2023-05-26 09:27:51 +0000 UTC
started: 2023-05-26 09:27:52 +0000 UTC

PodDefaults enable you to integrate external tools with Kubeflow Notebooks. Users can have multiple Profiles and namespaces, each with different PodDefaults, which makes working on various projects a lot easier.

$ kubectl get profiles
NAME AGE
admin 76s
$ kubectl get poddefaults -n admin
NAME AGE
access-minio 85s
access-ml-pipeline 86s
access-mlflow 84s
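If you are curious what exactly these PodDefaults inject, you can inspect one of them directly; the output should show the environment variables and the label selector used to match notebook Pods:

# Inspect the MLflow PodDefault created by the action
kubectl get poddefault access-mlflow -n admin -o yaml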

Log into Kubeflow

Hurray! It’s testing time. Let’s see if everything works as expected!

You log in using the static username and password configured in the bundle.

Kubeflow Dashboard visible after log in with menu on the left with Notebooks, AutoML and Kubeflow Pipeline tabs.
Kubeflow Dashboard

Jupyter Notebook is one of the most frequently used tools in Kubeflow. You can create a new one in the "Notebooks" tab, selecting one of the proposed notebook images or using your own. Inside a notebook, you can only install "pip" and "conda" packages. If you need additional tools like Java or OS-level packages for OpenCV, you need to build your own Docker image. You can find more details in the Kubeflow documentation.
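If you do need OS-level packages, a custom image can be just a few lines on top of an upstream notebook image. A minimal sketch, assuming the upstream kubeflownotebookswg/jupyter-scipy base image and a registry reachable from the cluster (both are assumptions, adjust to your environment):

# Hypothetical Dockerfile adding an OS-level dependency for OpenCV
cat > Dockerfile <<'EOF'
FROM kubeflownotebookswg/jupyter-scipy:v1.7.0
USER root
RUN apt-get update && apt-get install -y libgl1 && rm -rf /var/lib/apt/lists/*
USER jovyan
EOF

# Build and push the image, then reference it when creating the notebook
docker build -t <your-registry>/custom-notebook:v1 .
docker push <your-registry>/custom-notebook:v1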

PodDefaults visible as Configurations during Kubeflow Notebook creation

In the advanced options, expand "Configurations" and select the options allowing access to MinIO and MLflow. Next, select "Create" and wait for your notebook to be ready. You connect to it from the list of available Notebooks. Each notebook is a Kubernetes Pod in the user namespace, and no user can connect to another user's notebook because of the authorisation policies enforced by Istio.

In the newly created notebook, you clone the Kubeflow Examples git repository — https://github.com/canonical/kubeflow-examples.git. There is a long list of examples you can run in Charmed Kubeflow. Feel free to try!

In the "notebook-integrations" folder, you find an mlflow-integration notebook. This notebook trains a few model versions using MinIO (object storage) and MLflow. It saves the training parameters and metadata into an MLflow experiment, then serialises the trained model and persists it in the object storage.

The MLflow UI is available under the same public URL as Kubeflow, with the suffix "mlflow/". The trailing "/" is essential for Istio to route traffic to MLflow instead of the Kubeflow Dashboard.

MLflow UI deployed on AWS EKS and exposed for logged users

As you can see, the MLflow and S3-compatible object storage integration works correctly. Now it's time to run a simple Kubeflow Pipeline to validate the rest of the MLOps stack. We use an example from the "eks-spot-instance" folder: 01-base-workload.ipynb.

The pipeline downloads two CSV files and merges them into one. Argo Workflows, the workflow engine behind Kubeflow Pipelines, saves the resulting CSV file in the object storage. When you click the link in the Kubeflow Dashboard, it downloads the file to your local disk.

Running Kubeflow Pipeline — downloading and merging datasets

Kubeflow Pipelines also works — time to spice it up and add some GPUs to the mix. There is no Data Science without GPUs!

Add GPU spot instances

Spot instances in the public cloud are VMs from the cloud's excess capacity. Using spot instances means you might lose the instance at any point, interrupting your processing. The process of AWS taking the VM back is called eviction. Additionally, GPU-powered instances are often unavailable as spot instances because of high demand.

The estimated monthly cost difference between GPU on-demand and spot instances

Spot instances help lower the Total Cost of Ownership (TCO) of AI projects thanks to their price-performance; the average price difference is around 70%, based on Amazon statistics. Unfortunately, AWS can evict them at any time, so they are unreliable. They are best for stateless workloads that persist data in external storage like AWS S3. If the workload is long-running, consider saving checkpoints and restoring work from them. Avoid using spot instances for hosting APIs, databases or notebooks.
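Before adding spot capacity, you can get a feel for the current prices with the AWS CLI; the query below lists recent spot prices for the two GPU instance types used in the next step:

# Recent spot prices for the GPU instance types in eu-west-1
aws ec2 describe-spot-price-history \
  --region eu-west-1 \
  --instance-types g4dn.xlarge p3.2xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[0:10].[AvailabilityZone,InstanceType,SpotPrice]' \
  --output table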

To use spot instances, you add a new node group to the cluster; EKS takes care of the rest. In the example below, I provided two instance types to have a higher chance of spot capacity being available. The p3 instance type is significantly more expensive than g4dn, so remove it if you want to avoid the cost.

eksctl create nodegroup --cluster=bpk-ckf \
--spot \
--instance-types=p3.2xlarge,g4dn.xlarge \
--name gpu-spot \
--nodes 1

This adds a new node to the cluster. Meanwhile, the node controller marks the new node as spot using the "eks.amazonaws.com/capacityType=SPOT" label. In AWS EKS, any workload can run on spot instances by default, without "tolerations". If you want to use your GPU instance mainly for accelerated workloads, add a "taint" to it. Taints are a Kubernetes mechanism signalling the scheduler to avoid placing Pods on the node. For more information, check the Kubernetes documentation.

You are building an AI-ready Kubernetes cluster and want the GPUs used fully for accelerated workloads like machine learning pipelines and training jobs. Taint the node with PreferNoSchedule so the Kubernetes scheduler avoids putting Kubeflow operators on it. The "Prefer" part still allows the scheduler to assign Pods to the node when the cluster lacks resources.

$ kubectl get nodes --show-labels | grep SPOT

$ kubectl taint nodes ip-192-168-55-136.eu-west-1.compute.internal spot-instance=true:PreferNoSchedule

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints --no-headers
ip-192-168-30-5.eu-west-1.compute.internal <none>
ip-192-168-55-136.eu-west-1.compute.internal [map[effect:PreferNoSchedule key:spot-instance value:true]]
ip-192-168-64-60.eu-west-1.compute.internal <none>

If you run the 01-base-workload notebook from the "eks-spot-instance" folder again, you get the same results: all pipeline tasks run on the CPU nodes. Now look at the 03-spot-retries-workload notebook and compare it to the previous one.

All the tasks in the pipelines are the same. A pipeline running on spot instances requires a few additional configurations:

  • A toleration to allow scheduling on the tainted nodes
  • An affinity to force the scheduler to use a SPOT node
  • A retries configuration on tasks, so a Pipeline Run does not fail on spot instance eviction
  • A backoff duration on steps, to wait a few minutes for a new SPOT instance

If you run the 03-spot-retries-workload notebook now, your Pipeline Run executes on spot nodes. The Kubeflow Dashboard shows the Pod and Event details, but searching through this long YAML is tedious.

Kubeflow Pipelines executed successfully on the spot instance

I prefer to use kubectl. First, I check which node carries the spot-instance taint. Next, I find the Pods with the Pipeline Run prefix in the user namespace. All Pods in a run follow the naming convention <pipeline-name>-<pipeline-id>-<task-id>, which makes it easy to filter the Pods of the pipeline.

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints --no-headers
ip-192-168-30-5.eu-west-1.compute.internal <none>
ip-192-168-55-136.eu-west-1.compute.internal [map[effect:PreferNoSchedule key:spot-instance value:true]]
ip-192-168-64-60.eu-west-1.compute.internal <none>

$ kubectl get po -n admin -o wide | grep -i "base-pipeline-pxmsj"
base-pipeline-pxmsj-1137348039 0/2 Completed 0 12m 192.168.39.212 ip-192-168-55-136.eu-west-1.compute.internal <none> <none>
base-pipeline-pxmsj-4168395925 0/2 Completed 0 12m 192.168.37.29 ip-192-168-55-136.eu-west-1.compute.internal <none> <none>

Kubernetes scheduler placed the Pod on the GPU spot instance node — confirmed!

Cleanup

Running this demo in AWS costs around $550 per month. Remember to clean up the deployment if you want to pay only for what you use.

The cleanup process needs to account for removing:

  • Charmed Kubeflow and Juju controller
  • EKS cluster and all resources created by eksctl

juju destroy-controller bpk-ckf-eks-eu-west-1 --destroy-storage --destroy-all-models --force
eksctl delete cluster bpk-ckf

If you have issues deleting the EKS cluster, go to CloudFormation in the AWS console; there you get more details about the error. If needed, remove orphaned resources manually and rerun the eksctl delete command. Finally, check in the EC2 view that no EBS volumes or snapshots are left. They do not cost much, but the charge can accumulate quickly.
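A quick way to double-check for leftovers from the CLI (assuming eu-west-1, as in the rest of the article):

# Unattached EBS volumes left after deleting the cluster
aws ec2 describe-volumes --region eu-west-1 \
  --filters Name=status,Values=available \
  --query 'Volumes[].[VolumeId,Size,CreateTime]' --output table

# EBS snapshots owned by your account
aws ec2 describe-snapshots --region eu-west-1 --owner-ids self \
  --query 'Snapshots[].[SnapshotId,VolumeSize,StartTime]' --output table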

Let’s wrap up!

Your MLOps stack in AWS is ready for data scientists. They can now focus on solving business problems using an ecosystem of open-source tools.

You used Juju to deploy the stack on top of AWS EKS, and now it will help you with day-2 operations. Check Charmhub for more applications you need for AI and data projects.

Building efficient, production-grade pipelines without technical debt is not a trivial task.

Next steps

Here are a few ideas you can add to your MLOps stack to improve it:

  • Add an SSL certificate and friendly URL using AWS Route53
  • Integrate with AWS S3
  • Integrate with AWS Cognito (requires SSL certificate first)
  • Add data sources for your Data Scientists to use, like MongoDB or Postgres
  • Add Kafka for data ingestion

Which of the steps do you want to implement? Let me know in the comments below.

For more MLOps hands-on guides, tutorials and code examples, follow me on Medium and contact me via social media.
