Published in CASL Project

Cost-effective Hyper-parameter Tuning using AdaptDL with NNI

Authors: Petuum CASL Team | Acknowledgment: Microsoft NNI Team

[Source: Sanh et al., 2019]
  • Elastic job scheduling for trials: while each trial is running, AdaptDL dynamically adjusts the number of GPUs assigned to it. Each trial may be distributed across several machines, and AdaptDL ensures each trial is only allocated GPUs it can efficiently utilize. AdaptDL also automatically defragments the cluster to eliminate slow-downs due to network interference between concurrent trials.
  • Adaptive batch size and learning rate: AdaptDL scales the batch size and learning rate of each trial to maximize a metric called Goodput, leading to faster and more efficient training. AdaptDL automatically uses gradient accumulation when necessary to achieve larger batch sizes for higher throughput.
  • Auto-provisioning AWS Spot Instances: AdaptDL can optionally use cheaper AWS spot instances to opportunistically lower training/tuning costs. If the spot instance gets pre-empted, AdaptDL’s elastic scheduling automatically takes care of migration.
  • Integrations with the CASL open-source ecosystem: Tuun, AutoDist, Texar, Forte and Stave.
  • If you’re curious about the tech behind AdaptDL, please see our technical paper!
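
To make the gradient accumulation idea above a little more concrete, here is a minimal sketch in plain Python. It is not AdaptDL's implementation. The toy 1-D linear model and the function names (`batch_gradient`, `accumulated_gradient`) are assumptions made up for illustration. The sketch only shows why splitting a large batch into smaller micro-batches and summing their weighted gradients gives the same update as processing the large batch at once.

```python
from typing import Sequence, Tuple

# Hypothetical toy setup: (x, y) pairs for a 1-D linear model y ~ w * x.
Sample = Tuple[float, float]


def batch_gradient(w: float, batch: Sequence[Sample]) -> float:
    """Mean gradient of the loss 0.5 * (w * x - y) ** 2 over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(
    w: float, batch: Sequence[Sample], micro_batch_size: int
) -> float:
    """Compute the same mean gradient one micro-batch at a time.

    Each micro-batch gradient is weighted by its share of the full batch,
    so the accumulated total matches the full-batch mean gradient even
    when the last micro-batch is smaller than the others.
    """
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        total += batch_gradient(w, micro) * len(micro) / len(batch)
    return total


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
    w = 0.5
    print("full batch :", batch_gradient(w, data))
    print("accumulated:", accumulated_gradient(w, data, micro_batch_size=2))
```

The two printed values should agree up to floating-point rounding. In a real trial the micro-batch size would be limited by GPU memory, which is the situation where accumulation lets the effective batch size grow beyond what a single step can hold.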

Getting Started

  • a Kubernetes cluster, either in the cloud (AWS EKS, Azure AKS, etc.) or on premises, and
  • a local machine authorized to access the Kubernetes cluster.
  • Helm-install AdaptDL onto the Kubernetes instance:
$ helm install adaptdl adaptdl-sched \
--repo https://github.com/petuum/adaptdl/raw/helm-repo \
--namespace adaptdl --create-namespace \
--set docker-registry.enabled=true
  • Pip-install NNI onto your local machine:
$ python3 -m pip install --upgrade nni
$ git clone -b v2.0 https://github.com/Microsoft/nni.git
$ cd nni
  1. CIFAR-10: Configurations defined in examples/trials/cifar10_pytorch/config_adl.yml
  2. MNIST: Configurations defined in examples/trials/mnist-pytorch/config_adl.yml
CIFAR-10 configuration file
$ nnictl create --config examples/trials/cifar10_pytorch/config_adl.yml
An experiment viewed in the NNI GUI

What’s next

About CASL
