Run MLflow Project in EKS

Jagane Sundar, InfinStor
Apr 26, 2022

MLflow Project is a format for packaging Python data science code. The resulting package can be reused, run repeatedly, shared, and so on. Recently, MLflow added the capability to run MLflow Projects in a Kubernetes cluster. There are some nuances when running this in an Amazon Elastic Kubernetes Service (EKS) cluster. This is a step-by-step guide to running MLflow Projects in Amazon EKS. We will be using Amazon ECR (Elastic Container Registry) and a few other AWS services as well.

Prerequisites:

You will need the following capabilities in order to run MLflow Projects in EKS.

  1. A terminal with the ability to run the docker CLI, the aws CLI, and the mlflow CLI
  2. A functional MLflow tracking server, ideally reachable at a public IP address/hostname
  3. A repository in a public Docker registry. We will use Amazon ECR in this article
  4. An EKS cluster

Prereq 1: Docker, AWS, and MLflow CLIs

Install the AWS CLI by following the instructions at https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

Install mlflow by typing:

pip install mlflow

Install docker for your operating system. Instructions are at: https://docs.docker.com/get-docker/
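To confirm that all three tools are installed and on your PATH, you can check their versions (the exact output will vary by platform and version):

$ aws --version
$ docker --version
$ mlflow --version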

Prereq 2: MLflow in a public IP/Hostname

The MLflow tracking service will be accessed by code running in several locations, such as your own laptop/desktop and a container in the EKS cluster. Hence, the MLflow server must be available at a public IP address/hostname. Running ‘mlflow server’ on your laptop/desktop will not serve this purpose. Here are some popular ways to get access to hosted MLflow:

  1. Sign up for InfinStor Free MLflow service (you need to bring your own S3 bucket) https://mlflowui.free.infinstor.com/register.html
  2. Create your own MLflow service in AWS: https://aws.amazon.com/blogs/machine-learning/managing-your-machine-learning-lifecycle-with-mlflow-and-amazon-sagemaker/
  3. Create your own MLflow service using nginx as a reverse proxy: https://towardsdatascience.com/managing-your-machine-learning-experiments-with-mlflow-1cd6ee21996e

In this article we will mostly be using the InfinStor free MLflow service, but the instructions will work fine with any MLflow service.
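Whichever option you choose, point the MLflow client at your server by setting the tracking URI. The hostname below is a placeholder for your own server:

export MLFLOW_TRACKING_URI=https://your-mlflow-server.example.com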

Prereq 3: Docker Registry with a public repository

Let’s create an ECR public registry and a repository in this registry. Note that AWS Elastic Container Registry’s private repositories and public repositories are two entirely different things. We will be creating an ECR public repository here. Here’s a screen capture taken while creating a public repository called mlflow-projects-demo/full-image. This repository is for storing the Docker container that contains the MLflow Project’s environment and all of the code for the project.

Creating an ECR public repository for storing the Docker container of the full image, i.e. the environment plus project code for the MLflow Project
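If you prefer the CLI to the console, the same repository can be created with the aws CLI (note that the ECR Public API is served from us-east-1):

aws ecr-public create-repository --repository-name mlflow-projects-demo/full-image --region us-east-1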

Now, we need to perform a ‘docker login’ using credentials obtained from the aws cli. Here’s the command:

aws ecr-public get-login-password | docker login --username AWS --password-stdin public.ecr.aws

Note that for the above command to succeed, you must have the aws cli correctly configured with region information and access key id/secret access key.
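If your default region is not us-east-1, you may need to pass the region explicitly, since the ecr-public API is served only from us-east-1:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws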

Prereq 4: Create EKS cluster

We will create the EKS cluster using the eksctl tool. Note that we will create the cluster in two steps: first the EKS control plane, then the node group. This is because we want to be able to ssh into the node group VMs and view the container logs.

First, we create the cluster without a nodegroup:

$ eksctl create cluster --name=jaganes-test-cluster --without-nodegroup

Now we create the nodegroup:

$ eksctl create nodegroup --cluster jaganes-test-cluster --node-type m5.large --name jaganes-nodegroup --nodes 2 --nodes-min 2 --nodes-max 3 --ssh-access --ssh-public-key my-own-key

In the above example, my-own-key is the name of an EC2 keypair that you can use to ssh into the nodes. Container logs are stored in /var/log/containers on the nodes, and this is the easiest way to view them.
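Once the node group is up, eksctl will have written a context for the new cluster into ~/.kube/config by default; you can verify that the nodes have joined the cluster with kubectl:

$ kubectl get nodes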

Edit the sample Kubernetes config files

The Kubernetes support in MLflow uses two configuration files: kubernetes_config.json and kubernetes_job_template.yaml. Example files are provided in the <mlflow_src>/examples/docker directory.

Here is the kubernetes_config.json that I used for EKS:

{
  "kube-context": "iam-root-account@jaganes-test-cluster.us-east-1.eksctl.io",
  "kube-job-template-path": "./kubernetes_job_template.yaml",
  "repository-uri": "public.ecr.aws/l9n7x1v8/mlflow-projects-demo/full-image"
}

Here are the notable settings in the above file:

kube-context is set to iam-root-account@jaganes-test-cluster.us-east-1.eksctl.io to match the name of the context in ~/.kube/config. Note that I created the cluster in EKS using my AWS account root’s access key ID and secret access key, hence the context refers to the root account.
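To find the exact context name to use on your own machine, list the contexts in your kubeconfig:

$ kubectl config get-contexts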

kube-job-template-path refers to the config file described in the next section of this article.

repository-uri refers to the Docker repository created earlier in this article. Note that you will need to have logged into the docker registry as described above.

The second configuration file kubernetes_job_template.yaml is as follows:

apiVersion: batch/v1
kind: Job
metadata:
  name: "{replaced with MLflow Project name}"
  namespace: default
spec:
  ttlSecondsAfterFinished: 100
  backoffLimit: 0
  template:
    spec:
      containers:
        - name: "{replaced with MLflow Project name}"
          image: "{replaced with URI of Docker image created during Project execution}"
          command: ["{replaced with MLflow Project entry point command}"]
          resources:
            limits:
              memory: 512Mi
            requests:
              memory: 256Mi
      restartPolicy: Never

The only change I made to the file above was to change the namespace from mlflow to default, since I do not have a namespace called mlflow in the EKS cluster.
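Alternatively, if you would rather keep the mlflow namespace from the sample file, create it in the cluster first:

$ kubectl create namespace mlflow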

Finally, we run the MLflow Project in the EKS cluster by invoking one of the following two commands. In the first case, we simply use MLflow's built-in Kubernetes backend. This will work fine if your MLflow service does not need authentication.

mlflow run . -b kubernetes --backend-config kubernetes_config.json -Palpha=0.5

The second version of this command specifies infinstor-backend as the backend. The infinstor-backend is identical to the built-in MLflow Kubernetes backend, but it adds two additional capabilities:

  1. It exports the InfinStor authentication token to the EKS cluster as a secret
  2. It exports the AWS credentials from ~/.aws/credentials to the EKS cluster as a secret

mlflow run . -b infinstor-backend --backend-config kubernetes_config.json -Palpha=0.5

Results!

The CLI prints out something like this:

2022/04/25 14:18:15 INFO mlflow.projects.docker: === Building docker image public.ecr.aws/l9n7x1v8/mlflow-projects-demo/full-image:43237e9 ===
2022/04/25 14:18:16 INFO mlflow.projects.kubernetes: === Pushing docker image public.ecr.aws/l9n7x1v8/mlflow-projects-demo/full-image:43237e9 ===
2022/04/25 14:18:23 INFO mlflow.projects.utils: === Created directory /tmp/tmpot6u9jn4 for downloading remote URIs passed to arguments of type 'path' ===
2022/04/25 14:18:23 INFO mlflow.projects.kubernetes: === Creating Job docker-example-2022-04-25-14-18-23-623311 ===

Let’s go look at the pods created in the EKS cluster:

(base) jagane@jaganeworkstation:~/tmp/eks/mlflow/examples/docker$ kubectl get pods
NAME                                              READY   STATUS              RESTARTS   AGE
docker-example-2022-04-25-14-18-23-623311-zms5g   0/1     ContainerCreating   0          6s

Now, let's get the logs from the container:

(base) jagane@jaganeworkstation:~/tmp/eks/mlflow/examples/docker$ kubectl logs -f docker-example-2022-04-25-14-18-23-623311-zms5g

2022-04-25 21:18:50,279 - 1 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
Elasticnet model (alpha=0.500000, l1_ratio=0.100000):
RMSE: 0.7947931019036528
MAE: 0.6189130834228138
R2: 0.1841166871822183

Finally, let’s go see the results of this execution on the MLflow web page:

MLflow Webpage

Footnotes:

Container Logs: It is very useful to see container logs when something goes wrong. The way I have EKS set up, you can simply ssh into the node:

$ ssh -i <your_key.pem> ec2-user@your_eks_node_ip_from_ec2_console
$ sudo su -
[root@ip-192-168-43-108 containers]# cd /var/log/containers
[root@ip-192-168-43-108 containers]# ls
aws-node-d4sln_kube-system_aws-node-51bedd4cfca6171112e3860c5fb0a5ffe3cb1394be8a96a0651d52b51d5a285e.log
aws-node-d4sln_kube-system_aws-vpc-cni-init-d6d5a3da326305b97951b0b38c5fed733abf5763f5e8bbc52e80a18169ce9cc8.log
coredns-66cb55d4f4-5dvcb_kube-system_coredns-e2287541e25ac30ae6e7d395ba213f61dd18dc218407aefc0a8a59492fd9215e.log
docker-example-2022-04-25-14-57-37-031554-fznzv_default_docker-example-ece4a30352d1c7712e6a0dbd469ad61b8969c1d1dd8c6ec83d9a717bb37b0efa.log
kube-proxy-4sn5r_kube-system_kube-proxy-d51dba686919d53ef948c276fdc19a9a73a2d678a7302e68dc35dd27cc67b480.log

Turn Off EKS: The EKS control plane costs $0.10 per hour, which is roughly $72 per month. Remember to turn it off. My preferred way of turning it off is to go to the AWS CloudFormation console and delete the two relevant stacks. In this example, the two stacks are eksctl-jaganes-test-cluster-nodegroup-jaganes-nodegroup and eksctl-jaganes-test-cluster-cluster. These two stacks need to be deleted: the nodegroup stack first, then the cluster stack.

CloudFormation Console for EKS
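Alternatively, you can let eksctl tear down the same resources; it deletes the nodegroup stack and then the cluster stack for you:

$ eksctl delete cluster --name jaganes-test-cluster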
