Autoscaling deep learning clusters on AWS with Kubernetes and RiseML

Henning Peters
RiseML Blog
Feb 1, 2018

A little-known feature of the RiseML installer is its ability to create deep learning clusters with autoscaling support on AWS. Autoscaling can significantly reduce GPU bills and increase compute capacity during peak times by automatically launching and terminating EC2 instances based on demand.

Thanks to the flexibility of on-demand compute instances, interest in running deep learning experiments in the cloud is growing. While there is already healthy competition between cloud GPU vendors, prices remain high. For example, on AWS, a p3.16xlarge with eight Nvidia Tesla V100s currently costs about $24.48 per hour on-demand, which works out to roughly $588 per day. Forgetting to shut an instance down, or keeping one up just for convenience, can easily burn a hole in your wallet.

It is clear that manually keeping track of your team's instances wastes your budget.

How does autoscaling work?

It’s actually pretty simple: every few seconds, a RiseML cluster checks whether there are pending experiments to execute or unused instances left, and issues a scale-up or scale-down request to AWS accordingly, continually optimizing the number of nodes in the cluster. Behind the scenes we rely on Kubernetes, the Kubernetes cluster autoscaler component, and of course AWS Auto Scaling.
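
Concretely, the Kubernetes cluster autoscaler watches for pods that cannot be scheduled and resizes the matching AWS Auto Scaling group. As a rough sketch of how it is typically wired up (the ASG names, image version, and min/max values here are illustrative; the RiseML installer configures all of this for you):

containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/cluster-autoscaler:v1.1.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  # one --nodes flag per worker group: <min>:<max>:<ASG name>
  - --nodes=0:3:riseml-cpu-workers
  - --nodes=0:3:riseml-gpu-workers
  # terminate a node once it has been unneeded for this long
  - --scale-down-unneeded-time=10m

Pending pods trigger a scale-up of the matching worker group; nodes that sit idle past the scale-down threshold are terminated.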

How to use autoscaling?

A deep learning cluster can be set up on AWS with RiseML in 10 minutes, and autoscaling is enabled by default. Simply install RiseML using the installer, then choose the region, instance type, and min/max count of worker nodes:

$ bash -c "$(curl -fsSL https://get.riseml.com)"
Downloading RiseML installer...
############################################################ 100.0%
Configuring installation options
Choose a region or availability zone in which to install RiseML. If a region is chosen the cluster will be spread across all of the region's availability zones.
* AWS region or availability zone [default: us-east-1]: us-west-2
Configure CPU as well as GPU worker nodes. Make sure that the instance type is available in your region and that instance limits suffice. Autoscaling is enabled by default. Set min/max to the same value to disable autoscaling.
* CPU workers
min count [default: 0]:
max count [default: 3]: 0
instance type [default: m4.2xlarge]:
* GPU workers
min count [default: 0]:
max count [default: 3]:
instance type [default: p3.2xlarge]: p3.8xlarge
...
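
Once the cluster is up, you can watch worker nodes join and leave as the autoscaler reacts to demand (assuming your kubectl is pointed at the new cluster):

$ kubectl get nodes --watch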

This works for GPU and CPU workers alike, as both instance types are autoscaled independently of each other. Thus, if a CPU-only experiment is scheduled, only CPU workers are launched, and vice versa (see the CPU-only example further below).

You can configure whether an experiment requires GPUs by setting the gpus attribute in your riseml.yml:

project: imagenet
train:
  framework: tensorflow
  resources:
    cpus: 24
    gpus: 8
    mem: 24576
  run:
    - python train.py
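
Conversely, a CPU-only experiment uses the same format. Here is a hypothetical variant (we assume that setting gpus to 0, or simply omitting it, marks an experiment as CPU-only; preprocess.py is a made-up script name):

project: imagenet
train:
  framework: tensorflow
  resources:
    cpus: 8
    gpus: 0
    mem: 8192
  run:
    - python preprocess.py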

Once installed, RiseML will automatically scale your cluster based on the autoscaling settings and the resource demands of your experiments. For more detailed installation and usage instructions, please check out our docs.
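
Behind the scenes, the resources section presumably maps onto Kubernetes resource requests on the experiment's pod, which is exactly the demand signal the cluster autoscaler reacts to. A minimal sketch of such a pod spec fragment (illustrative only, not necessarily RiseML's exact output):

resources:
  requests:
    cpu: "24"
    memory: 24576Mi
  limits:
    nvidia.com/gpu: 8

A pending pod that requests nvidia.com/gpu can only run on a GPU node, so the autoscaler scales up the GPU worker group; idle nodes are terminated again after the scale-down timeout.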

Install RiseML with autoscaling now — it’s free!

If you are interested in running RiseML with autoscaling on Azure or Google Cloud, please contact us.

About RiseML

RiseML lets your team share your cluster's resources through an interface tailored for machine learning engineers, allowing them to automatically prepare, run, monitor, and scale experiments in parallel using the machine learning framework of their choice. Advanced techniques such as hyperparameter optimization and distributed training can also be enabled easily. We offer a free community edition for individual users, and professional and enterprise editions for teams. An open-source release is in preparation.
