AdaptDL on Ray: Simple and Efficient Distributed Training

Petuum, Inc. · Published in CASL Project · Aug 17, 2022

Summary: In this tutorial blog post, we show how to run simple and efficient distributed training of deep learning models by integrating AdaptDL, a state-of-the-art resource scheduling and optimization library from the CASL ecosystem, with Ray.

Introduction

In recent years, due to the size of modern deep learning (DL) models, the expense of model training has grown rapidly in both time and dollar cost, with training times measured in thousands of GPU-hours or more. In response, parallel and distributed DL training has become increasingly important — however, it still remains difficult for users to easily set up and run distributed training, often requiring expertise in both infrastructure (e.g. distributed computing frameworks) and machine learning (e.g. scaling laws and training hyperparameters).

To enable both efficient DL training and easy setup of distributed training, our team at Petuum has combined AdaptDL with Ray. AdaptDL is a resource scheduling and optimization library that makes distributed DL training easy and efficient, particularly in environments such as shared clusters and the cloud. However, AdaptDL currently relies on Kubernetes, which can require a good deal of expertise to use. By combining AdaptDL with Ray, a system for simplifying distributed execution, users can get AdaptDL jobs working quickly, without the need to deploy Kubernetes. Furthermore, by using AdaptDL, we can decrease training costs by 2–20x compared to the default training procedures on Ray.

In the tutorial below, we first give brief background on both AdaptDL and Ray, show the performance gains of running AdaptDL on Ray, and then give details on how to implement our system.

Background

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework, and is part of the CASL open source project. The goal of AdaptDL is to make distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud. It supports batch size and node autoscaling to optimize training progress, provides fault tolerance, and can schedule multiple simultaneous jobs.

Since AdaptDL requires a preexisting Kubernetes instance with an attached distributed filesystem, its use can be intimidating for Data Scientists and ML Engineers not familiar with Kubernetes. Cluster creation, even with EKS or Terraform, can require some expertise, as can configuring a set of AdaptDL worker machines to use the distributed filesystem and getting the code to the workers’ Docker image.

The Petuum AI OS (Petuum's platform for composing, managing, and monitoring custom ML infrastructure on a single pane of glass) simplifies these and other aspects of running Kubernetes for ML workflows, reducing DevOps and deployment overheads. Even so, not all AI teams currently use Kubernetes. For these cases, we'd like an alternative way to run distributed training that still takes advantage of AdaptDL's benefits.

Ray is a fast and simple framework for distributed applications in Python. Given its ease of use for ML applications on Python, it is an increasingly popular method of managing clusters of distributed infrastructure.

The Ray ecosystem includes libraries that are useful to Data Scientists and ML Engineers in their large deep learning projects. In particular, Ray Train (formerly Ray SGD) is useful for running deep learning training workloads. However, training job scheduling across distributed Ray clusters leaves room for optimization of both training times and costs.

Our Solution: AdaptDL on Ray

As a solution, our team at Petuum has integrated AdaptDL with Ray, such that a user can run AdaptDL training using a Ray cluster (e.g. installed on a cloud platform such as AWS) as a distributed computing backend. This system allows a user to set up and deploy AdaptDL jobs quickly, without needing Kubernetes. Furthermore, by taking advantage of AdaptDL’s worker autoscaling and hyperparameter tuning (and combining it with Ray’s cluster rescaling) we see increased elasticity and training efficiency, and are thus able to decrease training costs by 2–20x compared to the default training procedures on Ray.

Before diving into the details of how to use our system, we show some results up front. As a testbed, we run distributed training of a ResNet-50 model on the Kaggle ImageNet subset.

We compare distributed model training using the default Ray Train procedures, on 4, 16, and 64 nodes, with our integration of AdaptDL on Ray, run on AWS spot instances.

For all methods, in the plots below we show (a) Top-5 accuracy vs. approximate money spent (USD), and (b) Top-5 accuracy vs. training time (seconds). We see that AdaptDL on Ray is able to reduce training costs by 2–20x and reduce training times by 10–30x to reach a given Top-5 accuracy. Furthermore, AdaptDL on Ray is easy to set up and use, requiring very little expertise in distributed infrastructure or model training.

Figure: (a) Top-5 accuracy vs. approximate money spent (USD); (b) Top-5 accuracy vs. training time (seconds), comparing Ray Train on 4, 16, and 64 nodes with AdaptDL on Ray.

In the following sections, we detail the three main implementation steps needed to set up and run AdaptDL on Ray:

  1. Setting up model training with AdaptDL.
  2. Setting up Ray Infrastructure.
  3. Running AdaptDL on Ray.

Read on to find out more!

Section 1: Setting up Model Training with AdaptDL

For the AdaptDL scheduler to automatically scale a deep learning training job to the right number of replicas, and to automatically adjust the batch size and learning rate, the training code needs a few changes. To illustrate the key changes, we start with the MNIST example from the official PyTorch repo. We show how to make the model distributed, make the PyTorch DataLoaders and training loop restart-safe, and in general add elasticity to the training job using the AdaptDL API.

First, convert the model to a distributed data-parallel (DDP) model so that it can utilize multiple GPUs and nodes.
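The snippet below is a minimal sketch of this step based on the AdaptDL API; the Net model, Adadelta optimizer, and StepLR scheduler come from the PyTorch MNIST example, and the hyperparameters are illustrative.

import torch
import adaptdl.torch

# Initialize the distributed process group (NCCL for GPU clusters, Gloo otherwise);
# AdaptDL picks up the replica count and ranks from its environment.
adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Net().to(device)  # Net is the CNN from the PyTorch MNIST example
optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.7)

# Wrap the model so AdaptDL can manage gradient synchronization,
# checkpoint-restart, and batch size / learning rate scaling across replicas.
model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, scheduler)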

This closely follows the PyTorch DDP API for initialization of the process group and converting the model to a distributed model.

Next, we need to make the DataLoaders restart-safe so that we don't lose our position within them in the event of a rescale. Essentially, we convert the PyTorch DataLoaders to AdaptiveDataLoaders. The last API below, train_loader.autoscale_batch_size(max_batch_size, local_bsz_bounds), is optional and adds automatic batch size scaling to the training. Here max_batch_size is the maximum allowable global batch size, and local_bsz_bounds is a tuple giving the range of per-GPU batch sizes that a single GPU can handle.
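A sketch of the DataLoader changes, using the MNIST datasets from the original example; the batch size, the maximum global batch size of 1024, and the local bounds of 32–128 are illustrative values.

import adaptdl.torch
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_dataset = datasets.MNIST("./data", train=False, transform=transform)

# Drop-in replacements for torch.utils.data.DataLoader that checkpoint
# their position, so a rescaled job can resume where it left off.
train_loader = adaptdl.torch.AdaptiveDataLoader(train_dataset, batch_size=64, shuffle=True, drop_last=True)
test_loader = adaptdl.torch.AdaptiveDataLoader(test_dataset, batch_size=64, shuffle=False)

# Optional: let AdaptDL grow the global batch size up to 1024 while keeping
# the per-GPU batch size between 32 and 128.
train_loader.autoscale_batch_size(1024, local_bsz_bounds=(32, 128))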

Because we don’t want to start from the beginning every time we rescale the training job, we use adaptdl.torch.remaining_epochs_until in the training loop so that we can reposition to the right epoch.
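For example, reusing the train and test functions from the MNIST example (and the model, loaders, and scheduler defined above), the epoch loop becomes roughly:

import adaptdl.torch

# remaining_epochs_until(N) yields epochs starting from wherever the last
# checkpoint left off, so a restarted replica does not repeat finished epochs.
for epoch in adaptdl.torch.remaining_epochs_until(14):  # 14 epochs, as in the MNIST example
    train(model, device, train_loader, optimizer, epoch)
    test(model, device, test_loader)
    scheduler.step()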

To collect and aggregate stats like loss and accuracy across multiple replicas, AdaptDL provides additional APIs; there is no direct PyTorch equivalent here. Use adaptdl.torch.Accumulator, a dict-like object whose values are summed across replicas inside a synchronized() context.
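A sketch of how the accumulator fits into the training loop; the loss bookkeeping is illustrative, and nll_loss matches the MNIST example.

import torch.nn.functional as F
import adaptdl.torch

stats = adaptdl.torch.Accumulator()

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    loss = F.nll_loss(model(data), target)
    loss.backward()
    optimizer.step()
    # Accumulate per-replica sums; they stay local until synchronized.
    stats["loss_sum"] += loss.item() * target.size(0)
    stats["total"] += target.size(0)

with stats.synchronized():
    # Inside this context, the values have been summed across all replicas.
    print("average training loss:", stats["loss_sum"] / stats["total"])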

With these changes, you are ready to train your model in an elastic, distributed data-parallel fashion on multiple GPUs across multiple nodes. The full source code for the modified MNIST example, set up to run with AdaptDL, is available in the AdaptDL repository.

Section 2: Setting up Ray Infrastructure

Now that the training code has been adapted for AdaptDL, you will need to deploy a Ray cluster if you don't already have one. In particular, we focus on setting up a Ray cluster on Amazon Web Services (AWS). See the Ray documentation for instructions on configuring and launching a Ray cluster on AWS.

When creating the cluster, you will need to configure a Docker image that includes both the pip requirements listed in ray/aws/requirements.txt and a working installation of pytorch-gpu. You will also need to configure sufficient disk space and a maximum number of worker nodes.

See this configuration file for an example of a full cluster configuration.

To ensure that the nodes have enough space for Docker to use, you will need to include something like the following BlockDeviceMapping configuration in all of the nodes:
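A sketch of what this can look like under a node type in the cluster YAML; the device name, volume size, and type are placeholders to adjust for your AMI and workload.

available_node_types:
  ray.worker.default:
    node_config:
      BlockDeviceMappings:
        - DeviceName: /dev/sdb          # additional EBS volume; the name depends on the AMI
          Ebs:
            VolumeSize: 200             # GiB; size it for your Docker images and datasets
            VolumeType: gp3
            DeleteOnTermination: true
  # Repeat the same BlockDeviceMappings for the head node type as well.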

Note that creating the EBS volume alone will not make it available to Docker. You will also need to format and mount the volume as part of the initialization commands:
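For example, via the cluster YAML's initialization_commands, which run on the host before the Ray containers start; the device path and mount point below are placeholders, so check the actual device name on your instances (e.g. with lsblk).

initialization_commands:
  - sudo mkfs -t xfs /dev/nvme1n1 || true   # format the extra EBS volume (placeholder device)
  - sudo mkdir -p /mnt/data
  - sudo mount /dev/nvme1n1 /mnt/data || true
  - sudo chmod a+rwx /mnt/data              # let the container user write to the volume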

If you find that your code does not have access to enough disk space, you can also mount an external volume (as provisioned above) to the runtime containers via:
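One way to do this is through the docker section's run_options in the cluster YAML; the image name and paths are illustrative.

docker:
  image: rayproject/ray-ml:latest-gpu   # illustrative image; use one matching your Ray version
  container_name: ray_container
  run_options:
    - "-v /mnt/data:/mnt/data"          # expose the host volume inside the container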

Finally, you will need to make sure that the permissions for the external volume are set properly.

Section 3: Running AdaptDL on Ray

Once the cluster has been deployed, you will need the address and port of the cluster head. Generally, this will be of the form <head-node-ip>:10001. Make sure that you have access to that port via the AWS subnet and inbound rules.

On your local machine, you will need the adaptdl-ray pip package, which provides the launcher script (generally installed to /usr/local/bin/adaptdl_on_ray_aws).

If your local Python version does not match the cluster's, the Ray client connection will not work. In this case, one option is to run the command within a Docker container. If you do, be sure to mount your code directory into the container, e.g. via -v.

To launch your job on the cluster, first install the AdaptDL-Ray CLI via

pip install adaptdl-ray

Once that is complete, run adaptdl_on_ray_aws, using the -f flag to point to your training script and the -d flag to point to the directory that contains it along with any other dependency files. For example, if your code is in /home/example/training.py, you can run:

adaptdl_on_ray_aws -f training.py -u "ray://<ray-cluster-ip>:10001" -m <maximum-number-of-workers> --gpus <gpus-per-worker> --cpus <cpu-cores-per-worker> -d "/home/example"

More specifically, for the example above, you can run:

/usr/local/bin/adaptdl_on_ray_aws -u "ray://<head-node-ip>:10001" -f training.py -m <maximum-number-of-workers> --cpus <cpus-per-worker> -d "/home/example"

In order to retrieve the results of the training, you will need to save output to an external store such as S3 or EFS.
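As an illustrative sketch (the bucket name and key are placeholders, and boto3 must be available in the worker image), the training script can save and upload its final checkpoint before exiting:

import boto3
import torch

# Save the trained weights locally, then copy them to S3 so the results
# survive after the Ray workers are released.
torch.save(model.state_dict(), "checkpoint.pt")
boto3.client("s3").upload_file("checkpoint.pt", "my-training-results", "mnist/checkpoint.pt")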

Conclusion

In this tutorial blog post, we have shown how to run simple and efficient distributed training of deep learning models by integrating AdaptDL, our state-of-the-art resource scheduling and optimization library from the CASL ecosystem, with Ray, a system for simplifying distributed execution. Via this combination, users can run AdaptDL jobs more easily, without the need to deploy Kubernetes, and can decrease training costs by 2–20x compared to the default training procedures on Ray.

After you have completed a training job on AdaptDL on Ray, compare the results (in terms of time, cost, and accuracy) with the same job before AdaptDL. In our experiments, even with a few nodes, there is a significant improvement in these metrics from the baseline. We’d love to know your experience!

Check out the documentation for further information, advanced uses, and adjustments, for example using large datasets and spot instances. This article is developed from the AdaptDL documentation on Ray on AWS, available here: https://adaptdl.readthedocs.io/en/latest/ray/aws_ray_adaptdl.html
