Large-Scale Distributed Training with TorchX and Ray

Published in PyTorch · Mar 23, 2022

Author(s): Mark Saroufim and Jules S. Damji

Ray was created at RISELab by the founders of Anyscale. It provides a rich set of native libraries for ML workloads and a general-purpose core for building distributed applications.

On top of Ray's native libraries sits a rich ecosystem of libraries and integrations that enable PyTorch users to achieve greater scale. Two great examples are the PyTorch Distributed and PyTorch Lightning integrations, which let users take advantage of PyTorch and Ray capabilities together.

This blog introduces TorchX's newly developed Ray Scheduler, which lets developers submit PyTorch jobs to a Ray cluster. Using the TorchX SDK and the Ray Job Submission SDK, developers can build and deploy PyTorch machine learning applications from R&D to production. TorchX can string together built-in components like hyperparameter optimization, model serving, and distributed data-parallel training into complex pipelines, while leveraging the most popular open source job schedulers.

A joint engineering effort between Meta AI's PyTorch team and Anyscale's ML team, this new Ray Scheduler component allows developers to run scalable, distributed PyTorch workloads without setting up infrastructure on their cloud of choice or changing their training scripts: everything can be done via the TorchX SDK or CLI.

You can run all of the examples below live in a Google Colab notebook.

TorchX Developer User Journey

To see how PyTorch developers can use the TorchX SDK to convert their scripts into jobs deployed on a remote Ray cluster via the new Ray Scheduler, let's examine some use cases and walk through some code examples.

In 5 steps, you can convert your PyTorch Python script into a TorchX job and submit it for execution on a Ray Cluster in your cloud.

Step 1: Install Ray and TorchX on your laptop:

pip install ray "torchx[dev]"

Step 2: Create your simple_ray_job.py in your IDE or editor, as you would any PyTorch training script (an example appears under "Writing a Distributed PyTorch job" below).

Step 3: Launch a Ray cluster (as shown below) using a <cloud>_cluster.yaml, which specifies the kind of Ray cluster to create: the number and type of nodes, GPUs vs. CPUs, and so on. Look at the example file in the repo; though comprehensive and expansive, the YAML file can be trimmed to the specific needs of your TorchX job.

Step 4: Submit a TorchX job to the Ray Cluster using the TorchX CLI, as shown below.

Step 5: Monitor the job’s progress or final status by fetching the logs, as shown below.

Submitting TorchX Jobs to Ray Scheduler

As a user, you only need to write your training script. We'll use a distributed data-parallel training script in the example below. You can also pass in other parameters, like your working directory or a requirements.txt. Then just call the TorchX CLI while specifying the Ray Scheduler; it will look for an available Ray cluster, start running your job, and stream the logs back to your local client. This decouples your training script from your infrastructure, so you can move to large multi-node, multi-GPU workloads without changing your code.

Submit TorchX Jobs to the Cloud of Your Choice

Submitting a job to the cloud of your choice is simple. For instance, to submit a TorchX job to a Ray cluster on AWS or GCP, assuming you have the target cloud's credentials set up as required by that provider, you can expand a simple aws_ray_cluster.yaml or gcp_ray_cluster.yaml file to match your compute node types and needs.
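Here is a minimal sketch of what such a cluster YAML can look like, trimmed down from the full example in the Ray repo. The cluster name, region, and instance types are illustrative assumptions; adjust them to your account and workload.

# aws_ray_cluster.yaml (sketch): a small CPU head node and two GPU workers
cluster_name: torchx-demo

provider:
  type: aws
  region: us-west-2  # assumption: pick the region your credentials target

max_workers: 2

available_node_types:
  head_node:
    node_config:
      InstanceType: m5.xlarge
  worker_node:
    min_workers: 2
    max_workers: 2
    node_config:
      InstanceType: p3.2xlarge  # assumption: one V100 GPU per worker

head_node_type: head_node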

Launching a cluster

With the cluster YAML in place, bringing the cluster up takes a single command; through TorchX's Ray Scheduler, you can just as easily pick the cloud of your choice to submit your job to.
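A sketch with the Ray cluster launcher:

# start or update the cluster; -y skips the confirmation prompt
ray up -y aws_ray_cluster.yaml

# tear the cluster down when you are done
ray down -y aws_ray_cluster.yaml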

Defining a TorchX component
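A TorchX component is simply a Python function that returns an AppDef describing what to run. For distributed data-parallel training, the built-in dist.ddp component is usually all you need, but here is a minimal sketch of a custom component to show the shape of the API (the trainer function, its defaults, and the image value are illustrative assumptions):

# a hypothetical custom component; built-ins like dist.ddp cover most cases
import torchx.specs as specs

def trainer(script: str, num_replicas: int = 2) -> specs.AppDef:
    """Run `script` with python on `num_replicas` replicas."""
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="trainer",
                image=".",  # assumption: the Ray scheduler treats this as the working dir
                entrypoint="python",
                args=[script],
                num_replicas=num_replicas,
            )
        ],
    )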

Writing a Distributed PyTorch job

The simplest possible distributed PyTorch job would compute the world size and make sure all nodes agree on that world size.
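Here is a minimal sketch of such a script. It assumes the launcher (for example, the dist.ddp component) sets the usual rendezvous environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT); every rank contributes 1 to an all-reduce, so the sum must equal the world size:

# simple_ray_job.py (sketch)
import os
import torch
import torch.distributed as dist

def main() -> None:
    # env:// rendezvous: reads RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    dist.init_process_group(backend="gloo")
    world_size = int(os.environ["WORLD_SIZE"])
    t = torch.ones(1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # each rank adds 1
    assert int(t.item()) == world_size, f"expected {world_size}, got {int(t.item())}"
    print(f"rank {dist.get_rank()}: all ranks agree on world size {world_size}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()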

Submitting a job to a cluster

Once the cluster launch command above has finished successfully, you can submit your TorchX job with the TorchX CLI. Through TorchX's Ray Scheduler, the job will be submitted to the Ray cluster launched above in the cloud of your choice. For example:
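A sketch of the submission command, assuming the cluster YAML from earlier; the exact -cfg keys (such as cluster_config_file and working_dir) can vary between TorchX versions, so check torchx runopts for your install:

# 2 nodes x 1 process per node, using the built-in dist.ddp component
torchx run -s ray \
    -cfg cluster_config_file=aws_ray_cluster.yaml,working_dir=. \
    dist.ddp -j 2x1 --script simple_ray_job.py

torchx run prints an app handle for the submitted job; keep it around, since the status and log commands below take it as their argument.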

Use TorchX SDK and Ray Jobs API

A common way to check the status of your submission is with the TorchX SDK and the Ray Jobs API. The sections below show the commands and results for tasks developers query often, such as examining a job's status and monitoring its progress.
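For programmatic checks, you can also talk to the Ray Jobs API directly. A small sketch, where the dashboard address and job id are placeholders:

# query a submitted job via the Ray Job Submission SDK
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # placeholder dashboard address
print(client.get_job_status("raysubmit_abc123"))  # e.g. PENDING, RUNNING, SUCCEEDED
print(client.get_job_logs("raysubmit_abc123"))    # stdout/stderr captured so far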

Examine Job Status

After submission, a job is queued in a pending state; from there it can succeed, fail because of an application error, or be interrupted by you.
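To check where your job is in that lifecycle, pass the app handle printed by torchx run to torchx status (the handle below is a placeholder):

torchx status ray://torchx/127.0.0.1:8265-raysubmit_abc123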

Monitor a Job’s Progress

Collecting logs works much the same way as fetching the job status. The key point is that the logs are actually distributed across multiple machines, yet they are all streamed back to your console without you having to worry about which machine holds which logs.
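Fetching the logs uses the same app handle (again a placeholder here):

torchx log ray://torchx/127.0.0.1:8265-raysubmit_abc123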

What’s Next

Any existing TorchX component can now run on top of Ray. Over time, we’re looking to add more reference projects to TorchX to let anyone bootstrap an end-to-end ML infrastructure and seamlessly run it and scale it on top of Ray.

Acknowledgements

We want to acknowledge the joint engineering efforts by the Meta AI and Anyscale teams for this endeavor.

Meta AI Team: Mark Saroufim, Aliaksandr Ivanou, Can Balioglu, Tristan Rice, Kiuk Chung, Geeta Chauhan

Anyscale Team: Amog Kamsetty, Jiao Dong, Jules S. Damji, Richard Liaw
