Cost-effective Hyper-parameter Tuning using AdaptDL with NNI
Hyper-parameter tuning (HPT) is an essential step in deep learning workflows, allowing teams to push a model’s quality higher by systematically evaluating many different sets of hyper-parameters (each called a “trial”) and picking the best outcome. HPT is appealing because it is easy to automate and requires little engineering or coding. At Petuum, we use HPT to tune our models for Healthcare Report Writing, Industrial Process Optimization and Image Annotation, running dozens of trials per deployed model.
However, HPT requires large amounts of compute, proportional to the number of trials you run, and quickly becomes expensive in both time and dollar cost. That’s especially challenging given the size of modern models. Take Natural Language Processing (NLP) as an example: the figure below shows recent language models reaching hundreds of millions to billions of parameters, with training times measured in thousands of GPU-hours or more. Multiply that by tens or hundreds of HPT trials, and the whole HPT workflow may take days or weeks to complete, not to mention thousands of dollars or more in cloud compute costs.
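To make the scaling concrete, here is a back-of-the-envelope sketch in Python. All numbers (trial count, GPU-hours per trial, hourly price, cluster size) are hypothetical placeholders chosen for illustration, not measurements from Petuum or prices from any cloud provider.

```python
def hpt_cost(num_trials, gpu_hours_per_trial, price_per_gpu_hour, num_gpus):
    """Estimate total GPU-hours, dollar cost, and wall-clock days for an HPT run.

    Assumes every trial costs the same and the cluster is perfectly utilized,
    which real workloads rarely achieve.
    """
    total_gpu_hours = num_trials * gpu_hours_per_trial
    total_cost = total_gpu_hours * price_per_gpu_hour
    wall_clock_days = total_gpu_hours / num_gpus / 24
    return total_gpu_hours, total_cost, wall_clock_days


# Hypothetical workload: 50 trials of 100 GPU-hours each at $3 per GPU-hour,
# running on a 16-GPU cluster.
gpu_hours, dollars, days = hpt_cost(50, 100, 3.0, 16)
print(f"{gpu_hours} GPU-hours, ${dollars:,.0f}, about {days:.1f} days")
```

Because cost grows linearly with the number of trials, any per-trial efficiency gain carries over to the whole workflow.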
To tackle the problem of long and expensive HPT workflows, our team at Petuum collaborated with Microsoft to integrate AdaptDL with NNI. AdaptDL is an open-source tool in the CASL (Composable, Automatic, and Scalable Learning) ecosystem. AdaptDL offers adaptive resource management for distributed clusters, and reduces the cost of deep learning workloads ranging from a few training/tuning trials to thousands. Specific benefits of AdaptDL include:
- Elastic job scheduling for trials: while each trial is running, AdaptDL dynamically adjusts the number of GPUs assigned to it. Each trial may be distributed across several machines, and AdaptDL ensures each trial is only allocated GPUs it can efficiently utilize. AdaptDL also automatically defragments the cluster to eliminate slow-downs due to network interference between concurrent trials.
- Adaptive batch size and learning rate: AdaptDL scales the batch size and learning rate of each trial to maximize a metric called Goodput, leading to faster and more efficient training. AdaptDL automatically uses gradient accumulation when necessary to achieve larger batch sizes for higher throughput.
- Auto-provisioning AWS Spot Instances: AdaptDL can optionally use cheaper AWS spot instances to opportunistically lower training/tuning costs. If the spot instance gets pre-empted, AdaptDL’s elastic scheduling automatically takes care of migration.
- Integrations with the CASL open-source ecosystem: Tuun, AutoDist, Texar, Forte and Stave.
If you’re curious about the tech behind AdaptDL, please see our technical paper!
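The interaction between GPU count, batch size, and gradient accumulation can be illustrated with simple arithmetic. The sketch below is not AdaptDL’s implementation; the function name and the per-GPU batch limit are assumptions made for illustration. It shows how a target global batch size might be reached on a given number of GPUs by accumulating gradients over several micro-batches when the per-GPU share would not fit in memory.

```python
import math


def plan_batch(target_global_batch, num_gpus, max_per_gpu_batch):
    """Split a global batch across GPUs, accumulating gradients if needed.

    Returns (per_gpu_micro_batch, accumulation_steps, effective_global_batch).
    max_per_gpu_batch is a hypothetical per-GPU memory limit.
    """
    per_gpu = math.ceil(target_global_batch / num_gpus)
    # Number of micro-batches each GPU processes before one optimizer step.
    accumulation_steps = math.ceil(per_gpu / max_per_gpu_batch)
    micro_batch = math.ceil(per_gpu / accumulation_steps)
    effective = micro_batch * accumulation_steps * num_gpus
    return micro_batch, accumulation_steps, effective


# As more GPUs are granted, fewer accumulation steps are needed.
for gpus in (1, 2, 4, 8):
    print(gpus, plan_batch(1024, gpus, 128))
```

A real system would also adjust the learning rate to match the new batch size, which this sketch leaves out.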
Neural Network Intelligence (NNI), from the Microsoft open-source community, is a toolkit for automatic machine learning (AutoML) and hyper-parameter tuning. NNI provides a frontend for managing AutoML experiments and a rich library of HPT and Neural Architecture Search (NAS) algorithms. Tuning algorithms generate trial jobs, and NNI dispatches and runs them to search for the best neural architecture and/or hyper-parameters. By running NNI trials using AdaptDL, we’ve been able to perform HPT 1.5x faster in our clusters, and 3x cheaper on AWS. If you’re already using NNI in your workflow (or thinking about it), you can now plug in AdaptDL to make HPT faster, more efficient, and cheaper!
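To illustrate what it means for a tuning algorithm to generate trials, the sketch below enumerates every combination in a small search space, the way an exhaustive grid search would. The `_type`/`_value` layout follows NNI’s search space format for `choice` parameters; the hyper-parameter names and values are hypothetical, and the enumeration is a plain-Python stand-in rather than NNI’s tuner code.

```python
import itertools

# Hypothetical search space written in NNI's JSON-style layout.
search_space = {
    "lr": {"_type": "choice", "_value": [0.1, 0.01, 0.001]},
    "batch_size": {"_type": "choice", "_value": [64, 128]},
}


def grid_trials(space):
    """Yield one hyper-parameter dict per trial, covering the full grid."""
    names = list(space)
    for spec in space.values():
        if spec["_type"] != "choice":
            raise ValueError("this sketch only handles 'choice' parameters")
    value_lists = [space[name]["_value"] for name in names]
    for combo in itertools.product(*value_lists):
        yield dict(zip(names, combo))


trials = list(grid_trials(search_space))
for i, params in enumerate(trials):
    print(f"trial {i}: {params}")
```

Each yielded dictionary corresponds to one trial; the number of trials is the product of the choice counts, which is why HPT cost grows so quickly with the size of the search space.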
It’s straightforward to get started. You will need:
- a Kubernetes cluster, either in the cloud (AWS EKS, Azure AKS, etc.) or on premises, and
- a local machine authorized to access the Kubernetes cluster.
If you don’t have a Kubernetes cluster and just want to try AdaptDL+NNI, you can follow this guide to set up a simple MicroK8s instance on your local machine.
- Helm-install AdaptDL onto the Kubernetes instance:
$ helm install adaptdl adaptdl-sched \
--repo https://github.com/petuum/adaptdl/raw/helm-repo \
--namespace adaptdl --create-namespace
Please refer to the AdaptDL installation page for detailed instructions.
- Pip-install NNI onto your local machine:
$ python3 -m pip install --upgrade nni
Please see the latest (2.0+) NNI release installation guide for detailed instructions.
Now AdaptDL+NNI should be ready to go! For more details, refer to the NNI AdaptDL Experiment page to verify the successful installations of both and get started with the examples.
Clone the NNI repository:
$ git clone -b v2.0 https://github.com/Microsoft/nni.git
$ cd nni
The NNI repository provides several AdaptDL examples for you:
- CIFAR-10: Configurations defined in examples/trials/cifar10_pytorch/config_adl.yml
- MNIST: Configurations defined in examples/trials/mnist-pytorch/config_adl.yml
The CIFAR-10 configuration file is shown below.
Before running the CIFAR-10 example, modify the configuration file by providing the IP address of your local machine in the “nniManagerIp” field. You will also need to choose an appropriate Kubernetes storage class so that AdaptDL can checkpoint the model. For example, if using MicroK8s, a storage class name of “microk8s-hostpath” can be used (as provided in config_adl.yml).
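If you prefer to script this edit, the following sketch rewrites the `nniManagerIp` line in the text of a configuration file. It assumes the field appears as a top-level `nniManagerIp: <value>` line, which you should check against your copy of config_adl.yml; the helper name and the sample text are hypothetical.

```python
import ipaddress
import re


def set_nni_manager_ip(config_text, ip):
    """Return config_text with the top-level nniManagerIp value replaced."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    new_text, count = re.subn(
        r"^nniManagerIp:.*$", f"nniManagerIp: {ip}", config_text, flags=re.MULTILINE
    )
    if count != 1:
        raise ValueError(f"expected one nniManagerIp line, found {count}")
    return new_text


# Hypothetical excerpt, not the full contents of config_adl.yml.
sample = "experimentName: demo\nnniManagerIp: 10.0.0.1\ntrialConcurrency: 2\n"
print(set_nni_manager_ip(sample, "192.168.1.20"))
```

Validating the address before substitution catches typos early, before the experiment fails to reach the NNI manager.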
Then use nnictl to start the HPT experiment:
$ nnictl create --config examples/trials/cifar10_pytorch/config_adl.yml
Open the NNI GUI to watch your experiment run with AdaptDL!
Beyond the convenience this integration provides for ML experiments, the CASL open-source community from Petuum has other projects that are already compatible with NNI and AdaptDL. For example, Tuun can work together with NNI to offer more flexible choices of tuning model. The Tuun documentation page has more details on how Tuun + NNI can work better together, including a couple of small Tuun + NNI examples.
CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that are built to work in unison, or to be used individually for specific tasks, providing flexibility and ease of use.
Thanks for reading! Please visit the CASL website (https://www.casl-project.ai/) to stay up to date on upcoming CASL and NNI announcements. If you’re interested in working professionally on CASL, visit our careers page at Petuum!