Faster and Cheaper PyTorch with RaySGD

Distributed training is annoying to set up and expensive to run. Here’s a library that makes distributed PyTorch model training simple and cheap.

Richard Liaw
Distributed Computing with Ray
5 min read · Apr 7, 2020


Distributing your deep learning model training has become a question of when you do it, not if. State-of-the-art ML models like BERT have hundreds of millions of parameters, and training these large networks can take days, if not weeks, on a single machine.

Language models are just getting bigger and bigger.

Fundamentally, you have two choices when it comes to training deep learning models:

Option 1: Tolerate the 20-hour training times, or stick to models small enough to train on a single node (or a single GPU), keeping things simple and letting you use standard tools like Jupyter notebooks.

Option 2: Go through a mountain of pain and try to distribute your training.

Photo by Maja Kochanowska on Unsplash

So, what does it take to distribute training today?

To take your training beyond a single node, you’re going to have to deal with:

  1. Messy distributed systems deployments (including setting up networking, containerization, credentials).
  2. A huge AWS bill for expensive nodes (current solutions don’t allow you to use cheap, preemptible instances).
  3. Losing access to your favorite tools such as Jupyter notebooks.

You can use one of the tools integrated into the frameworks, like PyTorch’s DistributedDataParallel or tf.distribute. While these are “integrated”, they are certainly not a walk in the park to use.

PyTorch’s AWS tutorial demonstrates the many setup steps you’ll have to follow just to get the cluster running, and TensorFlow 2.0 has a bunch of issues of its own.

You might look at something like Horovod, but Horovod will make you fight with antiquated frameworks like MPI and wait through long compilation times at launch.

Along with this complex setup, you’ll need to give up the typical tools you use, like Jupyter notebooks. On top of that, you’ll have to use expensive on-demand instances, because none of these frameworks is fault-tolerant.

In our own work, we identified these issues as roadblocks to simplifying distributed deep learning training. We set out to create our own solution that solves these key problems.

So, what’s the better way?

RaySGD — the simple distributed training solution

To solve the problems above, we built RaySGD.

It’s a lightweight Python library built on top of distributed PyTorch that not only makes deployment easy but also gives you the flexibility you need for cutting-edge research and development.

Now, we’re sure you’re asking yourself,

How is this better than what already exists?

RaySGD focuses on several key benefits:

  • Seamless Parallelization: Go from 1 GPU to 100 GPUs with a single argument.
  • Accelerate Training: Built-in support for mixed precision training with NVIDIA Apex.
  • Simple, Native Interfaces: We’ve kept the interface simple to make it easy to migrate existing training code and keep the mental overhead low — just learn a couple of new lines of code.
  • Fault Tolerance: Supports automatic recovery when machines on the cloud are preempted. Now, you can use spot instances to reduce costs by up to 90%.
  • Seamless Hyperparameter Tuning: RaySGD integrates with RayTune, a cutting-edge distributed hyperparameter tuning framework.

On top of all this, RaySGD matches the performance of specialized, state-of-the-art deep learning frameworks like Horovod.

Comparing Horovod vs. Ray (which uses PyTorch Distributed DataParallel under the hood) on p3dn.24xlarge instances. Horovod and Ray perform similarly across different scales.

RaySGD is also able to outperform the default torch.nn.DataParallel by up to 20% on 8 GPUs.

Comparing PyTorch DataParallel vs. Ray (which uses PyTorch Distributed DataParallel under the hood) on p3dn.24xlarge instances. Ray scales better with and without mixed precision, with up to 20% faster performance on 8 GPUs.

RaySGD is built on top of Ray, a framework for fast and simple distributed computing. Training on spot or preemptible instances for huge cost savings is as simple as changing a single line of code: `trainer.train(max_retries=100)`.
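To give a flavor of that foundation, Ray’s core API turns ordinary Python functions into tasks that run in parallel across the processes and machines of a cluster; RaySGD builds its training machinery on top of these primitives. A minimal Ray (not RaySGD) example, added here just for context:

```python
import ray

ray.init()  # starts Ray locally; on a cluster you would use ray.init(address="auto")

# The @ray.remote decorator turns a plain function into a task that Ray can
# schedule on any worker process in the cluster.
@ray.remote
def square(x):
    return x * x

# .remote() launches tasks asynchronously; ray.get() collects the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```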

How do I get started?

RaySGD provides a minimal API that gives you the customizability you’re already used to from TensorFlow or PyTorch. Here’s the minimum you need to run a multi-GPU training job.

  1. Run `pip install -U ray torch`.
  2. Run the script below (save it as pytorch.py).
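The original post embeds the script as a gist. Here is a minimal sketch of what it might look like, using the creator-function TorchTrainer API that RaySGD exposed around the time of this post; the exact argument names (model_creator, data_creator, use_fp16, and so on) are version-dependent, and the hyperparameters below are illustrative choices rather than the original gist’s values.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

import ray
from ray.util.sgd import TorchTrainer


def model_creator(config):
    # ResNet18 with a 10-way output head for CIFAR10.
    return torchvision.models.resnet18(num_classes=10)


def optimizer_creator(model, config):
    return torch.optim.SGD(
        model.parameters(), lr=config.get("lr", 0.1), momentum=0.9)


def data_creator(config):
    # Downloads CIFAR10 on first use and returns (train_loader, val_loader).
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465),
                             (0.2470, 0.2435, 0.2616)),
    ])
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform)
    val_set = torchvision.datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform)
    batch_size = config.get("batch_size", 128)
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(val_set, batch_size=batch_size))


if __name__ == "__main__":
    ray.init()  # later, on a cluster: ray.init(address="auto")

    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=nn.CrossEntropyLoss,
        config={"lr": 0.1, "batch_size": 128},
        num_workers=1,  # change to N to train with N GPUs/processes
        use_gpu=torch.cuda.is_available(),
        # use_fp16=True,  # mixed precision via NVIDIA Apex, if installed
    )

    for epoch in range(5):
        print(trainer.train())     # one pass over the training set
        print(trainer.validate())  # loss/accuracy on the validation set
    trainer.shutdown()
```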

This simple script will download CIFAR10 and use a ResNet18 model to do image classification. You can run on multiple GPUs with a single parameter change (`num_workers=N`).

OK then, how do I scale out PyTorch training across a cluster?

Don’t worry: it’s just 4 extra steps. I’ll demonstrate how to run RaySGD on AWS, but it’s just as easy to run on SLURM, Azure, GCP, or your local cluster.

  1. Download the YAML file below and the Python script from the previous section (saved as pytorch.py).
  2. Change `ray.init()` in the script to `ray.init(address="auto")`.
  3. Set `num_workers=16` in the `TorchTrainer` constructor.
  4. Run `ray submit ray-cluster.yaml pytorch.py --start --stop`. This automatically starts the preemptible cluster (16 V100 GPUs in total) and shuts it down as soon as training finishes. For 30 minutes, this will cost… $7.44.
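The cluster config is embedded as a gist in the original post. As a rough, illustrative sketch only (the schema shown is the pre-1.0 Ray autoscaler format, and the AMI placeholder, region, and spot-worker choice are my assumptions, not a copy of the original file), a `ray-cluster.yaml` for this setup might look something like:

```yaml
# Illustrative Ray autoscaler config; adjust fields, AMI, and region for your
# account and Ray version.
cluster_name: raysgd-pytorch

# One head node plus one worker, both p3dn.24xlarge, gives 16 V100 GPUs total.
min_workers: 1
max_workers: 1

provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: p3dn.24xlarge
    ImageId: ami-XXXXXXXX  # placeholder: a Deep Learning AMI for your region

worker_nodes:
    InstanceType: p3dn.24xlarge
    ImageId: ami-XXXXXXXX  # placeholder
    # Request spot (preemptible) capacity for the workers to cut costs.
    InstanceMarketOptions:
        MarketType: spot

setup_commands:
    - pip install -U ray torch torchvision
    # Optional: install NVIDIA Apex here to enable mixed precision (use_fp16).
```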
Save this as `ray-cluster.yaml`. Apex installation is optional and commented out for simplicity.

To run on GCP or Azure, simply change a couple of lines in the above YAML (more instructions here).

As you can see, getting up and running with RaySGD is simple: you’ve learned almost everything you need in this blog post. Now get started and let us know what you think!
