Multi Node Distributed Training with PyTorch Lightning & Azure ML

Aaron (Ari) Bornstein
Microsoft Azure
Oct 26, 2020

TL;DR: This post outlines how to distribute PyTorch Lightning training across multi-node clusters with Azure ML.

Full end-to-end implementations can be found in the official Azure Machine Learning GitHub repo.

If you are new to Azure, you can get started with a free subscription using the link below.

Azure ML and PyTorch Lightning

In my last few posts on the subject, I outlined the benefits of both PyTorch Lightning and Azure ML for simplifying deep learning training and logging. If you haven't read them yet, check them out!

Now that you are familiar with the benefits of both Azure ML and PyTorch Lightning, let's talk about how to take PyTorch Lightning to the next level with multi-node distributed model training.

Multi Node Distributed Training

Sample traditional distributed training considerations, from the Azure docs.

Multi Node Distributed Training is typically the most advanced use case of the Azure Machine Learning service. If you want a sense of why it is traditionally so difficult, take a look at the Azure Docs.

PyTorch Lightning makes distributed training significantly easier by managing all of the distributed data batching, hooks, gradient updates, and process ranks for us. Take a look at the video here to see how this works.

We only need to make one minor modification to our training script for Azure ML to enable PyTorch Lightning to do all the heavy lifting. In the following section, I will walk through the steps needed to run a distributed training job on a low-priority compute cluster, enabling faster training at an order-of-magnitude cost savings.

Getting Started

Step 1 — Set up Azure ML Workspace

Create an Azure ML Workspace from the Portal or with the Azure CLI.
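If you'd rather stay in Python, the SDK can also create the workspace. Here is a minimal sketch; the names and region below are placeholders:

from azureml.core import Workspace

# Placeholder names and region; create_resource_group provisions the
# resource group if it does not already exist
ws = Workspace.create(
    name="myworkspace",
    subscription_id="<azure-subscription-id>",
    resource_group="myresourcegroup",
    create_resource_group=True,
    location="eastus",
)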

Connect to the workspace with the Azure ML SDK as follows:

from azureml.core import Workspace

ws = Workspace.get(
    name="myworkspace",
    subscription_id="<azure-subscription-id>",
    resource_group="myresourcegroup",
)

Step 2 — Set up Multi GPU Cluster

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your GPU cluster (letters, digits, and hyphens only)
gpu_cluster_name = "gpu-cluster"

# Verify that the cluster does not already exist
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC12s_v3",
        vm_priority="lowpriority",  # low-priority nodes for the cost savings discussed above
        max_nodes=2,
    )
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_cluster.wait_for_completion(show_output=True)

Step 3 — Configure Environment

To run PyTorch Lightning code on our cluster, we need to configure our dependencies. We can do that with a simple YAML file.

channels:
  - conda-forge
dependencies:
  - python=3.6
  - pip:
    - azureml-defaults
    - mlflow
    - azureml-mlflow
    - torch
    - torchvision
    - pytorch-lightning
    - cmake
    - horovod  # optional, only needed for the Horovod backend

We can then use the AzureML SDK to create an environment from our dependencies file and configure it to run on any Docker base image we want.

from azureml.core import Environment

# Placeholders: point these at your environment name and dependencies file
environment_name = "pytorch-lightning"
environment_file = "environment.yml"
env = Environment.from_conda_specification(environment_name, environment_file)

# Specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)

Step 4 — Training Script

Create a ScriptRunConfig to specify the training script & arguments, environment, and cluster to run on.

We can use any example training script from the PyTorch Lightning examples or our own experiments.

Once we have our training script, we need to make one minor modification: add a function that sets all the required environment variables for distributed communication between the Azure nodes.
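A minimal sketch of this helper, following the conventions in the Azure ML examples repo and assuming the MPI environment variables (AZ_BATCH_MASTER_NODE, AZ_BATCHAI_MPI_MASTER_NODE, OMPI_COMM_WORLD_RANK) that Azure ML's MPI launcher sets on each node:

import os

def set_environment_variables_for_nccl_backend(single_node=False, master_port=6105):
    if not single_node:
        # AZ_BATCH_MASTER_NODE is "<master ip>:<port>" on multi-node MPI jobs
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        os.environ["MASTER_ADDR"] = master_node_params[0]
        if "MASTER_PORT" not in os.environ:
            os.environ["MASTER_PORT"] = str(master_port)
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = "54965"
    # Keep NCCL off the Docker and loopback interfaces
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    # MPI launches one process per node, so the MPI world rank is the node rank
    os.environ["NODE_RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]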

Then, after parsing the input arguments, call the above function:

args = parser.parse_args()

# -----------
# configure distributed environment
# -----------
set_environment_variables_for_nccl_backend(single_node=int(args.num_nodes) <= 1)

Hopefully, in the future this step will be abstracted away for us.

Step 5 — Run Experiment

For multi-node GPU training, specify the number of GPUs to train on per node (typically the number of GPUs in your cluster's SKU), the number of nodes (typically the number of nodes in your cluster), and the distributed mode, in this case DistributedDataParallel ("ddp"). PyTorch Lightning expects these as the arguments --gpus, --num_nodes, and --accelerator, respectively. See their multi-GPU training documentation for more information.
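On the training-script side, those arguments feed straight into the Lightning Trainer. A minimal sketch, assuming the model and any other Trainer options come from your own script:

import pytorch_lightning as pl

# Wire the parsed arguments into the Trainer; "ddp" selects
# DistributedDataParallel across all nodes and GPUs
trainer = pl.Trainer(
    gpus=args.gpus,
    num_nodes=args.num_nodes,
    accelerator=args.accelerator,
)
trainer.fit(model)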

Then set the distributed_job_config to a new MpiConfiguration with a process_count_per_node equal to one (since PyTorch Lightning manages all the distributed training) and a node_count equal to the --num_nodes you provided as input to the training script.
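Putting the previous steps together, a submission sketch might look like this; the source directory, script name, experiment name, and argument values below are placeholders:

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# One MPI process per node; PyTorch Lightning spawns the per-GPU workers itself
distr_config = MpiConfiguration(process_count_per_node=1, node_count=2)

src = ScriptRunConfig(
    source_directory="src",  # placeholder folder containing train.py
    script="train.py",
    arguments=["--gpus", 2, "--num_nodes", 2, "--accelerator", "ddp"],
    compute_target=gpu_cluster,
    environment=env,
    distributed_job_config=distr_config,
)

run = Experiment(ws, "pytorch-lightning-multinode").submit(src)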

We can view the run logs and details in real time with the following SDK commands:

from azureml.widgets import RunDetails

RunDetails(run).show()
run.wait_for_completion(show_output=True)

And there you have it: without needing to manage the complexity of distributed batching, CUDA, MPI, logging callbacks, or process ranks, PyTorch Lightning scales your training job to as many nodes as you'd like.

Pictured: a complete two-node, four-GPU run.

You shouldn’t run into any issues, but if you do, let me know in the comments.

Acknowledgements

I want to give a major shout out to Minna Xiao and Alex Deng from the Azure ML team for their support and commitment to a better developer experience with open source frameworks such as PyTorch Lightning on Azure.

About the Author

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real world problems with game changing technologies that are then documented, open sourced, and shared with the rest of the world.
