Hybrid Cloud for ML, Simplified

Set up a hybrid ML infrastructure with a single-line command using VESSL Clusters

  1. Dynamically allocate GPU resources to Kubernetes-backed ML workloads, specifically, notebook servers and training jobs on hybrid infrastructure.
  2. Monitor the GPU usage and node status of on-prem clusters.

The Cost of Cloud in ML

With the rising cost of ML, companies are moving away from the cloud and taking a hybrid approach:

  • a growth-stage ADAS company with <20 machine learning engineers spending up to $1M every month on AWS
  • a 100+ person AI/ML team at a Fortune 500 company looking for an alternative local cloud provider to lower their dependency on Google Cloud

Leveraging hybrid cloud strategy with Kubernetes

Kubernetes is helping the transition to hybrid cloud and has emerged as a key component of the software stack for ML infrastructure

Bare-metal machines are only a small part of setting up modern ML infrastructure. Many of the challenges teams face as they transition to hybrid cloud are in fact in the software stack. The complex compute backends and system details abstracted away by AWS, Google Cloud, and Azure have to be configured manually. The challenge becomes even greater once you take ML-specific topics into account:

  • Mount and cache high-volume datasets across hybrid environments
  • Containerize current progress and move between on-prem and cloud seamlessly
  • Optimize compute resources — Match and scale Kubernetes pods automatically based on the required hardware configurations and monitor resource consumption down to each pod.
  • Improve fault tolerance — Self-healing replaces containers that fail user-defined health checks, and taints and tolerations prevent schedulers from using physically unusable nodes.
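As a sketch of the last two points, the pod spec below requests a GPU through Kubernetes resource limits and tolerates a taint on dedicated GPU nodes. This is an illustrative example, not VESSL's actual manifest — the pod name, image, and taint key are ours:

```shell
# Hypothetical example, not VESSL's actual manifest: write a pod spec that
# requests one GPU via resource limits and tolerates a "dedicated=gpu" taint,
# so the scheduler only places it on nodes that expose a GPU.
cat > gpu-job.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.01-py3
      resources:
        limits:
          nvidia.com/gpu: 1      # schedule only where a GPU is exposed
  tolerations:
    - key: dedicated             # example taint applied to GPU nodes
      operator: Equal
      value: gpu
      effect: NoSchedule
EOF
# kubectl apply -f gpu-job.yaml  # apply once a cluster is reachable
```

Self-healing then kicks in automatically: if the trainer container fails its health checks, Kubernetes restarts it without manual intervention.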

Hybrid ML infra with VESSL Clusters

VESSL abstracts the complex compute backends of hybrid ML infra into an easy-to-use interface

Despite these benefits, Kubernetes infrastructure is hard. Everything that an application typically has to take into consideration — security, logging, redundancy, and scaling — is built into the Kubernetes fabric, making it heavy and inherently difficult to learn, especially for data scientists who do not have an engineering background. Even for organizations that have the luxury of a dedicated DevOps or MLOps team, it may take up to six months to implement and distribute the tools and workflows to every MLE.

  • Single-command integration — Integrate on-premise servers with a single-line command and use them with multiple cloud services including VESSL’s managed cloud.
  • Cluster management — Monitor real-time usage and incident status of each node on a shared dashboard and set an internal quota policy to prevent overuse.
  • Job scheduling and elastic scaling — Schedule training jobs and serving workloads and scale pods for optimum performance handling.

Step-by-step guide — cluster integration with VESSL

VESSL Clusters enables cluster integration in a single-line command

VESSL’s cluster integration is composed of the following primitives:

  • VESSL Cluster Agent sends information about the cluster and the workloads running on it, such as node resource specifications and model metrics.
  • Control plane node acts as the control tower of the cluster and hosts the computation, storage, and memory resources to run all the subsidiary worker nodes.
  • Worker nodes run specified ML workloads based on the runtime spec sent from the control plane node.

Try single-node cluster integration on your Mac

(1) Prerequisites

To connect your on-premise GPU clusters to VESSL, you should first have Docker and Helm installed on your machine. The easiest way on macOS is with Homebrew. Check out the Docker and Helm docs for more detailed installation instructions. Keep in mind that you have to sign in to Docker and keep it running while using VESSL Clusters.

brew install docker
brew install helm
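If you want to verify the prerequisites from a script, a small pre-flight check like the one below (our own sketch, not part of VESSL) reports whether each tool is on your PATH:

```shell
# Pre-flight check: report whether Docker and Helm are available on this machine.
for tool in docker helm; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```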

(2) Set up VESSL environment

Sign up for a free account on VESSL and pip install the VESSL SDK on your Mac. Then, set up a VESSL environment on your machine using vessl configure. This will open a new browser window asking you to grant access. Proceed by clicking Grant access.

pip install vessl --upgrade
vessl configure

(3) Connect your machine

You are now ready to connect your Mac to VESSL. The following single-line command connects your Mac, which will appear as macbook-pro-16 on VESSL. Note the --mode single flag, which specifies that you are installing a single-node cluster.

vessl cluster create --name '[CLUSTER_NAME_HERE]' --mode single

(4) Confirm integration

Use our CLI command to confirm your integration and try running a training job on your laptop. Your first run may take a few minutes while the Docker images are pulled onto your device.

vessl cluster list

Scale cluster integration on your team’s on-premise machines

Integrating more powerful, multi-node GPU clusters for your team is as easy as repeating the steps above. To make the process easier, we’ve prepared a single-line curl command that installs all the binaries and dependencies on your server. Make sure your server runs Ubuntu 18.04+ or CentOS 7.9+.
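You can verify the OS requirement before running the install script. The sketch below is our own illustration — the check_os function and the sample file are hypothetical — and it parses the standard /etc/os-release format (point it at /etc/os-release on a real server):

```shell
# Hypothetical pre-flight check: confirm the host OS is Ubuntu 18.04+ or
# CentOS 7.9+ by parsing an os-release formatted file passed as $1.
check_os() {
  . "$1"                          # sets ID and VERSION_ID, e.g. ID=ubuntu VERSION_ID="18.04"
  major="${VERSION_ID%%.*}"
  case "$ID" in
    ubuntu) [ "$major" -ge 18 ] ;;
    centos) [ "$major" -ge 7 ] ;; # note: 7.x specifically needs 7.9 or later
    *) return 1 ;;
  esac
}

# Demo with a sample file; on a real server call: check_os /etc/os-release
printf 'ID=ubuntu\nVERSION_ID="18.04"\n' > /tmp/os-release-example
check_os /tmp/os-release-example && echo "supported"
```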

(1) Prerequisites

Install all the dependencies using our magic curl command, which:

  • Installs and configures NVIDIA container runtime.
  • Installs k0s, a lightweight Kubernetes distribution, and designates and configures a control plane node.
  • Generates a token and a command for connecting worker nodes to the control plane node configured above.
curl -sSLf '[INSTALL_SCRIPT_URL_HERE]' | sudo bash -s -- --role=controller
curl -sSLf '[INSTALL_SCRIPT_URL_HERE]' | sudo bash -s -- --role=worker --token '[TOKEN_HERE]'
sudo k0s kubectl get nodes
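Before integrating with VESSL, it’s worth confirming that every node reports Ready. The snippet below runs that check against sample output baked into the script (node names and versions are illustrative); on a real server, pipe `sudo k0s kubectl get nodes` into the same awk filter:

```shell
# Illustrative check: fail if any node in `kubectl get nodes`-style output
# is not in the Ready state. Sample output is used here for demonstration.
nodes_output='NAME      STATUS   ROLES           AGE   VERSION
node-1    Ready    control-plane   10m   v1.27.2+k0s
node-2    Ready    <none>          8m    v1.27.2+k0s'

# Real cluster: sudo k0s kubectl get nodes | awk 'NR>1 && $2!="Ready" ...'
echo "$nodes_output" | awk 'NR>1 && $2!="Ready" {bad=1} END {exit bad}' \
  && echo "all nodes Ready"
```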

(2) Integrate your machine to VESSL

Once you have completed the steps above, you can now integrate the Kubernetes cluster with VESSL. Note the --mode multi flag.

pip install vessl --upgrade
vessl configure
vessl cluster create --name='[CLUSTER_NAME_HERE]' --mode=multi

(3) Confirm integration

You can once again use our CLI command or visit the Clusters page on VESSL to confirm your integration.

vessl cluster list

Wrap up

This post covered the recent trends in and motivation behind hybrid ML infrastructure, and walked through connecting your GPU clusters to VESSL using our magic, single-line command.


