Multi Node Distributed Training with PyTorch Lightning & Azure ML

Aaron (Ari) Bornstein
Microsoft Azure
Oct 26, 2020

TL;DR: This post outlines how to distribute PyTorch Lightning training across multi-node clusters with Azure ML.

Full end-to-end implementations can be found in the official Azure Machine Learning GitHub repo.

If you are new to Azure, you can get started with a free subscription using the link below.

Azure ML and PyTorch Lightning

In my last few posts on the subject, I outlined the benefits of both PyTorch Lightning and Azure ML for simplifying deep learning training and logging. If you haven't read them yet, check them out!

Now that you are familiar with the benefits of both Azure ML and PyTorch Lightning, let's talk about how to take PyTorch Lightning to the next level with multi-node distributed model training.

Multi Node Distributed Training

Sample traditional distributed training considerations, from the Azure docs.

Multi Node Distributed Training is typically the most advanced use case of the Azure Machine Learning service. If you want a sense of why it is traditionally so difficult, take a look at the Azure Docs.

PyTorch Lightning makes distributed training significantly easier by managing all of the distributed data batching, hooks, gradient updates, and process ranks for us. Take a look at the video here to see how this works.

We only need to make one minor modification to our training script for Azure ML to enable PyTorch Lightning to do all the heavy lifting. In the following section, I will walk through the steps needed to run a distributed training job on a low-priority compute cluster, enabling faster training at an order-of-magnitude cost savings.

Getting Started

Step 1 — Set up Azure ML Workspace

Create an Azure ML Workspace from the Portal or with the Azure CLI.
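If you'd rather stay in Python, the SDK can also create the workspace. Here is a minimal sketch; the names and region below are placeholders:

from azureml.core import Workspace

# Placeholder names and region; create_resource_group provisions the
# resource group if it does not already exist
ws = Workspace.create(
    name="myworkspace",
    subscription_id="<azure-subscription-id>",
    resource_group="myresourcegroup",
    create_resource_group=True,
    location="eastus",
)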

Connect to the workspace with the Azure ML SDK as follows:

from azureml.core import Workspace

ws = Workspace.get(
    name="myworkspace",
    subscription_id="<azure-subscription-id>",
    resource_group="myresourcegroup",
)

Step 2 — Set up Multi GPU Cluster

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your GPU cluster (letters, digits, and hyphens only)
gpu_cluster_name = "gpu-cluster"

# Verify that the cluster does not already exist
try:
    gpu_cluster = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC12s_v3",
        vm_priority="lowpriority",  # low-priority nodes for the cost savings discussed above
        max_nodes=2,
    )
    gpu_cluster = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_cluster.wait_for_completion(show_output=True)

Step 3 — Configure Environment

To run PyTorch Lightning code on our cluster, we need to configure our dependencies. We can do that with a simple YAML file.

channels:
  - conda-forge
dependencies:
  - python=3.6
  - pip:
    - azureml-defaults
    - mlflow
    - azureml-mlflow
    - torch
    - torchvision
    - pytorch-lightning
    - cmake
    - horovod  # optional, only needed for the Horovod backend

We can then use the AzureML SDK to create an environment from our dependencies file and configure it to run on any Docker base image we want.

from azureml.core import Environment

# Placeholders: point these at your environment name and dependencies file
environment_name = "pytorch-lightning"
environment_file = "environment.yml"
env = Environment.from_conda_specification(environment_name, environment_file)

# Specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.2-cudnn8-ubuntu18.04"
)

Step 4 — Training Script

Create a ScriptRunConfig to specify the training script & arguments, environment, and cluster to run on.

We can use any example training script from the PyTorch Lightning examples or our own experiments.

Once we have our training script, we need to make one minor modification: add a function that sets all the required environment variables for distributed communication between the Azure nodes.
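A minimal sketch of this helper, following the conventions in the Azure ML examples repo and assuming the MPI environment variables (AZ_BATCH_MASTER_NODE, AZ_BATCHAI_MPI_MASTER_NODE, OMPI_COMM_WORLD_RANK) that Azure ML's MPI launcher sets on each node:

import os

def set_environment_variables_for_nccl_backend(single_node=False, master_port=6105):
    if not single_node:
        # AZ_BATCH_MASTER_NODE is "<master ip>:<port>" on multi-node MPI jobs
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        os.environ["MASTER_ADDR"] = master_node_params[0]
        if "MASTER_PORT" not in os.environ:
            os.environ["MASTER_PORT"] = str(master_port)
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = "54965"
    # Keep NCCL off the Docker and loopback interfaces
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    # MPI launches one process per node, so the MPI world rank is the node rank
    os.environ["NODE_RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]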

Then, after parsing the input arguments, call the above function:

args = parser.parse_args()

# -----------
# configure distributed environment
# -----------
set_environment_variables_for_nccl_backend(single_node=int(args.num_nodes) <= 1)

Hopefully, in the future this step will be abstracted away for us.

Step 5 — Run Experiment

For multi-node GPU training, specify the number of GPUs to train on per node (typically the number of GPUs in your cluster's SKU), the number of nodes (typically the number of nodes in your cluster), and the distributed mode, in this case DistributedDataParallel ("ddp"). PyTorch Lightning expects these as the arguments --gpus, --num_nodes, and --accelerator, respectively. See their multi-GPU training documentation for more information.
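On the training-script side, those arguments feed straight into the Lightning Trainer. A minimal sketch, assuming the model and any other Trainer options come from your own script:

import pytorch_lightning as pl

# Wire the parsed arguments into the Trainer; "ddp" selects
# DistributedDataParallel across all nodes and GPUs
trainer = pl.Trainer(
    gpus=args.gpus,
    num_nodes=args.num_nodes,
    accelerator=args.accelerator,
)
trainer.fit(model)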

Then set the distributed_job_config to a new MpiConfiguration with a process_count_per_node equal to one (since PyTorch Lightning manages all the distributed training) and a node_count equal to the --num_nodes you provided as input to the training script.
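Putting the previous steps together, a submission sketch might look like this; the source directory, script name, experiment name, and argument values below are placeholders:

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# One MPI process per node; PyTorch Lightning spawns the per-GPU workers itself
distr_config = MpiConfiguration(process_count_per_node=1, node_count=2)

src = ScriptRunConfig(
    source_directory="src",  # placeholder folder containing train.py
    script="train.py",
    arguments=["--gpus", 2, "--num_nodes", 2, "--accelerator", "ddp"],
    compute_target=gpu_cluster,
    environment=env,
    distributed_job_config=distr_config,
)

run = Experiment(ws, "pytorch-lightning-multinode").submit(src)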

We can view the run logs and details in real time with the following SDK commands:

from azureml.widgets import RunDetails

RunDetails(run).show()
run.wait_for_completion(show_output=True)

And there you have it: without needing to manage the complexity of distributed batching, CUDA, MPI, logging callbacks, or process ranks, PyTorch Lightning scales your training job to as many nodes as you'd like.

Pictured: a complete two-node, four-GPU run.

You shouldn’t run into any issues, but if you do, let me know in the comments.

Acknowledgements

I want to give a major shout out to Minna Xiao and Alex Deng from the Azure ML team for their support and commitment to a better developer experience with open source frameworks such as PyTorch Lightning on Azure.

About the Author

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real world problems with game changing technologies that are then documented, open sourced, and shared with the rest of the world.
