SAKK: An Open-Source Tool to Deploy EKS clusters with Kubeflow at Scale
Introduction
Over the past few years, ML adoption has continued its steady growth across a multitude of industries. As the number of production ML projects grows, so does the adoption of running ML in the cloud. Kubeflow, one of the most notable ML toolkits that companies select for their AI/ML stack, is seeing growing adoption and contributions from renowned tech giants.
Originally open-sourced in December 2017, Kubeflow has gained significant momentum over the last two years. Amazon, one of the major cloud platforms, offers Amazon EKS (Elastic Kubernetes Service) as a flexible way to run and scale applications on Kubernetes in AWS. Amazon EKS became generally available in 2018 and is now one of the leading environments for Kubernetes, which is often used to run Kubeflow.
However, despite such a powerful set of tools, for many enterprise ML teams getting a production Kubeflow cluster on EKS up and running quickly and effectively can still be a challenge, since there is no clear path to that goal. At Provectus, we have built a simple open-source tool based on the best practices and knowledge our ML teams have gained over the years, and we would like to share it with everyone.
Why Deploy Amazon EKS with Kubeflow at Scale?
Many companies that work on ML/AI projects (either client or in-house) often have to deploy multiple similar Kubeflow clusters, and then manage and augment them with new resources as their training and inference workload grows. When several teams are working on a project, they might also need isolated access to Kubeflow pipelines for efficiency and security reasons.
While you can have several pipelines in a single Kubeflow instance, there is currently no way in Kubeflow to grant isolated secure access to teams working on a single Kubeflow cluster. Also, each Kubeflow instance requires a separate Amazon EKS cluster. The existing ways of setting up and deploying a Kubeflow EKS cluster require working with several native tools for Kubeflow and Kubernetes that don’t allow you to add non-native resources to the deployed clusters later.
For example, if your company works on a number of ML/AI projects, it needs at least several clusters with Kubeflow instances (dev, test, production environments, client stands, etc.) for each project, and often for each project team. All these deployments require you to install several CLI tools, configure them separately to describe all the Kubernetes and Kubeflow resources you want to deploy, and repeat the process for each of your clusters.
This means you will likely have to allocate highly qualified engineers to handle ML infrastructure deployment and maintenance. They will set up multiple cluster configurations either manually or by writing a bunch of bash scripts. The same goes for further cluster support and upgrades.
What Is Swiss Army Kube for Kubeflow?
Swiss Army Kube for Kubeflow bridges the gap by offering a straightforward blueprint based on the best DevOps practices. It spares you the effort required to separately deploy and manage Kubeflow, Amazon EKS, and other resources you may need to bring to your cluster for your desired ML workflow.
The example folder of the SAKK repository serves as a ready-made project template. All you need to do is check out the branch and configure it each time you want to deploy Amazon EKS with Kubeflow. ML teams can configure the cluster with everything required at once by setting variables in a single main.tf file at the root of the repository, and deploy it with a couple of Terraform commands. This simple process can be replicated as many times as you want to get multiple ML clusters up and running in minutes, using Terraform as a single entry point.
After deployment, clusters can be managed with the Argo CD CLI or UI. Their states are stored in a GitHub repository with an organized CI/CD pipeline. This approach not only lets users enjoy all the portability and standardization benefits of Kubernetes and Amazon EKS, but also lets them add resources to their clusters beyond the restrictions of the native Kubeflow, Kubernetes, or Amazon EKS CLI tools, without the need to write custom code.
Using SAKK for cluster configuration and deployment does not require extensive knowledge of DevOps tools, freeing up DevOps engineers' time for more important tasks. It provides an opportunity to significantly reduce the time, resources, and costs associated with deploying and maintaining ML infrastructure, as well as with training people to manage it. Some key highlights are:
Cluster Automation with Argo CD
SAKK uses Argo CD to automate application state management and enable a GitOps approach. The state of your cluster is described as code and stored in a GitHub repository and an S3 bucket.
Advantage of Terraform
Terraform is used to unify and standardize cloud infrastructure deployment through Infrastructure as Code (IaC), a go-to standard and currently one of the best practices for ML DevOps (MLOps). At the moment, Terraform is the leading tool in this space. The HCL (HashiCorp Configuration Language) syntax of Terraform configurations is easy to learn and is a better alternative to ad-hoc scripts for dealing with clusters. Moreover, Terraform has great documentation and a vibrant community.
Built-In Identity Management with Amazon Cognito
For identity management on AWS, SAKK uses Cognito User Pools. In the quickstart below, Cognito is used to create a secure environment, with all access permissions managed in one place. However, SAKK doesn't lock you into this choice: it can work with any other identity provider.
Quickstart: Deploy an EKS Cluster with Kubeflow
Deploying Kubeflow on EKS using Swiss Army Kube is very straightforward. Aside from prerequisites, it takes just a couple more steps:
- Configure your cluster deployment (set up ~5 variables in one file)
- Deploy your cluster with two Terraform commands (init and apply)
After that, you will get a cluster ready for access and further management.
Prerequisites
1. For this short tutorial, you need an AWS account with an IAM user and the AWS CLI installed. If you don't have them yet, please use these official guides from AWS:
2. Next, install Terraform using this official guide:
3. Fork and clone the Swiss Army Kube for Kubeflow official repository:
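For reference, verifying the prerequisites and cloning your fork might look like this (the fork URL is a placeholder; substitute your own GitHub account):
aws --version
aws sts get-caller-identity   # confirms your IAM user credentials are configured
terraform version
git clone https://github.com/<your-github-account>/sak-kubeflow.git
cd sak-kubeflow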
That’s it! Now let’s configure and deploy the Amazon EKS Kubeflow cluster.
1. Configure Cluster Deployment
You set up your cluster in a single Terraform file, main.tf. The minimal set of things to configure here is the following:
- cluster_name (name of your cluster)
- mainzoneid (main Route53 zone id)
- domains (names of endpoint domains)
- admin_arns (ARNs of users who will have admin permissions)
- cert_manager_email (email for LetsEncrypt notifications)
- cognito_users (list of users for the Cognito user pool)
Example configuration of main.tf:
terraform {
  backend "s3" {}
}

module "sak_kubeflow" {
  source = "git::https://github.com/provectus/sak-kubeflow.git?ref=init"

  cluster_name = "simple"
  owner        = "github-repo-owner"
  repository   = "github-repo-name"
  branch       = "branch-name"

  # Main Route53 zone id if it exists (change it)
  mainzoneid = "id-of-route53-zone"

  # Names of domains aimed for endpoints
  domains = ["sandbox.some.domain.local"]

  # ARNs of users who will have admin permissions
  admin_arns = [
    {
      userarn  = "arn:aws:iam::<aws-account-id>:user/<username>"
      username = "<username>"
      groups   = ["system:masters"]
    }
  ]

  # Email that will be used for LetsEncrypt notifications
  cert_manager_email = "info@some.domain.local"

  # An optional list of users for the Cognito user pool
  cognito_users = [
    {
      email    = "qa@some.domain.local"
      username = "qa"
      group    = "masters"
    },
    {
      email    = "developer@some.domain.local"
      username = "developer"
    }
  ]

  argo_path_prefix = "examples/simple/"
  argo_apps_dir    = "argocd-applications"
}
In most cases, you'll also need to override the variables related to the GitHub repository (repository, branch, owner) in main.tf.
Next, you might want to configure backend.hcl, which defines where Terraform stores its state. Example configuration of backend.hcl:
bucket = "bucket-with-terraform-states"
key = "some-key/kubeflow-sandbox"
region = "region-where-bucket-placed"
dynamodb_table = "dynamodb-table-for-locks"
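Since the backend "s3" block in main.tf is left empty, Terraform's partial backend configuration can pick these settings up at initialization time. Assuming the file above is saved as backend.hcl next to main.tf, it is typically passed like this:
terraform init -backend-config=backend.hcl
This is standard Terraform behavior rather than anything SAKK-specific; adjust it to however your team manages remote state.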
2. Deploy Your AWS EKS Kubeflow Cluster
Deploy the cluster you've just configured by running the commands below. The first two are Terraform commands, and the third is an AWS CLI command that sets up kubectl access to the new cluster:
terraform init
terraform apply
aws --region <region> eks update-kubeconfig --name <cluster-name>
These commands let you:
- Initialize Terraform and download all remote dependencies
- Create a clean EKS cluster with all required AWS resources (IAM roles, ASGs, S3 buckets, etc.)
- Update your local kubeconfig file to access your newly created EKS cluster in the configured context
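To confirm that your machine can actually reach the new cluster, a couple of standard kubectl checks can help (namespaces may vary depending on what has been synced so far):
kubectl get nodes        # node status of the new EKS cluster
kubectl get namespaces   # argocd and kubeflow should appear once Argo CD syncs the applications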
These Terraform commands will also generate a few files in the default apps folder of the repository. You need to commit and push them to your GitHub repository before you start deploying services to your EKS cluster.
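A minimal sequence for that might look like this (the commit message is just an example; the apps folder and branch name come from your configuration):
git add apps
git commit -m "Add generated Argo CD applications"
git push origin <branch-name>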
Note that Argo CD is pre-configured to track changes of the current repository. When new changes are made to its apps folder, they trigger the synchronization process, and all objects placed in this folder get created.
After that, you can manage your Kubernetes cluster with either the Argo CD CLI/UI or kubectl. To start using kubectl (the Kubernetes CLI for cluster management), install and configure it following this official guide:
3. Access and Manage Your Amazon EKS Kubeflow Cluster
Now you have your cluster deployed and ready for work. During the deployment process, two service access endpoints were created in accordance with the domains variable settings in your main.tf file:
- Argo CD: https://argocd.some.domain.local
- Kubeflow: https://kubeflow.some.domain.local
Check the email addresses you provided in the cognito_users variable for access credentials and use them to log in.
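Besides the web UIs, you can also log in with the Argo CD CLI and inspect the applications it manages. A quick sketch, assuming the Argo CD CLI is installed and the endpoint matches the one above:
# With Cognito as the identity provider, browser-based SSO login is the usual route
argocd login argocd.some.domain.local --sso
# List the applications Argo CD is tracking and check their sync status
argocd app list
argocd app get <app-name>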
To learn more about Kubeflow and Argo CD, you can check out their respective official documentation:
Start Using Kubeflow Pipelines
Once you have successfully logged into your Amazon EKS cluster via kubectl, accessed the Kubeflow UI, and passed all the configuration screens, you'll see the Kubeflow dashboard:
In the Pipelines section, Kubeflow offers a few samples to let you try pipelines quickly. To learn more about using Kubeflow on AWS, please check the official Kubeflow documentation.
Alternatively, you can upload your own pipelines using AWS SageMaker and Kubeflow. For instance, let’s upload a demo module with one of the built-in AWS SageMaker algorithms.
1. Create a folder for managing separate Terraform states (with resources related to pipeline executions) and add a main.tf file with this code:
module "kmeans_mnist" {
  source = "path/to/kmeans-mnist-pipeline/folder/at/root/of/the/project"

  cluster_name = "<your-cluster-name>"
  username     = "<your-kubeflow-username>"
}
2. Run Terraform:
terraform init
terraform apply
Terraform will generate a training_pipeline.yaml file and create a Kubernetes service account that matches your Kubeflow username and has all the AWS permissions required to run the pipeline.
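If you want to double-check the result, the service account can be looked up with kubectl. The namespace here is an assumption: Kubeflow profile namespaces typically match the username.
kubectl get serviceaccount <your-kubeflow-username> -n <your-profile-namespace>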
3. Upload the training pipeline to Kubeflow through the Pipelines section of the Kubeflow UI:
4. Now that you have your first pipeline and a prepared Kubernetes service account, specify them in the form to start a run:
That’s it! Now you have a pipeline executing in Kubeflow.
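If you also want to follow the run from the command line, the pipeline steps show up as pods; the namespace is the same assumption as above (your Kubeflow profile namespace):
kubectl get pods -n <your-profile-namespace> --watch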
Moving Forward with the Roadmap
SAKK will continue to evolve according to the Roadmap. Upcoming plans include making more resources configurable via Terraform (in main.tf):
- Further AWS Integration. More AWS features will become configurable via Terraform: RDS (Postgres), ElastiCache (Redis), S3 (Minio), etc. will be moved out of Kubernetes and managed by AWS.
- Upgrading Product Versions. It will become possible to set product versions (Argo CD, Kubeflow, Kubeflow Pipelines) via Terraform.
- Setting AWS IAM roles for Kubeflow. Setting Kubeflow users’ roles and permissions to enable their work with AWS will move to Terraform. Users will be able to generate Kubeflow profiles and resources that will be stored in the GitHub repository and used as a part of the GitOps process.
- Kubeflow Pipelines Management. It will be possible to store the state of Kubeflow Pipelines. Users will be able to deploy Kubeflow with ready-made pipelines: preload them from the GitHub repository or upload default AWS pipelines.
- Integration With Other Cloud Platforms. Currently, SAKK is available for Amazon EKS (Elastic Kubernetes Service) only, but the long-term plan is to expand to other cloud platforms.
Conclusion
In this post, we shared our internal tool SAKK, which we use to establish an effective workflow for our ML teams, and showed how to get started with Kubeflow after deployment. The tool is based on the best MLOps practices put together by our ML and DevOps engineers over several years. We hope it will be as helpful for your ML teams as it has been for ours, and that it saves you time and effort in serving and managing production ML workloads.
We believe that any organization or engineer using ML should be able to focus on their ML applications and pipelines without having to worry too much about infrastructure deployment.