Distributed Deep Learning on Kubernetes with Polyaxon

In this short tutorial, we will go over a new feature in Polyaxon: distributed training.

Polyaxon currently supports and simplifies distributed training on the following frameworks: Tensorflow, MXNet, and Pytorch.

To parallelize your computations across processes and clusters of machines, you need to adapt your code and update your polyaxonfile to specify the cluster definition. Polyaxon then takes care of creating the tasks and exporting the necessary environment variables to enable the distributed training.

The current specification provides a section called environment to customize the computational resources and the distribution definition.
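For orientation, here is a rough, hypothetical sketch of where the environment section sits in a polyaxonfile; the exact top-level layout (and whether the image is set in a run or build section) varies across Polyaxon versions, so treat everything besides environment as an illustrative placeholder.

---
version: 1

environment:
  # resources and distribution (tensorflow / mxnet / pytorch) settings go here

run:
  # the image and command used to run the experiment, e.g.
  # image: tensorflow/tensorflow:1.4.1-py3
  cmd: python train.py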

By default, when no parallelism is specified, Polyaxon creates a single task to run the experiment defined in the polyaxonfile. Sometimes users wish to specify the resources for this master task, e.g. when using GPUs:

environment:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1

This same section is also used to specify the cluster definition for the distributed training. Since parallelism is handled differently by each framework, this guide describes how to define a cluster for each of the supported frameworks.

Distributed Tensorflow

To distribute Tensorflow experiments, the user needs to define a cluster, which is a set of tasks that participate in the distributed execution.

Tensorflow defines 3 different types of tasks: master, workers, and parameter servers.

To define a cluster in Polyaxon with a master, 2 parameter servers, and 4 workers, we need to add a tensorflow subsection to the environment section:

environment:
  ...

  tensorflow:
    n_workers: 4
    n_ps: 2
As mentioned before, Polyaxon always creates a master task, so this definition results in 7 tasks in total: 1 master, 4 workers, and 2 parameter servers.

You can have more control over the created tasks by defining the resources of each task the same way we defined the resources for the master.

Here’s the same example with custom resources:

environment:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1

  tensorflow:
    n_workers: 4
    n_ps: 2

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024
        gpu:
          requests: 1
          limits: 1

    worker:
      - index: 2
        resources:
          cpu:
            requests: 1
            limits: 2
          memory:
            requests: 256
            limits: 1024

    default_ps:
      resources:
        cpu:
          requests: 1
          limits: 1
        memory:
          requests: 256
          limits: 256

The master’s resources are defined in the resources section, i.e.

resources:
  cpu:
    requests: 1
    limits: 2
  memory:
    requests: 256
    limits: 1024
  gpu:
    requests: 1
    limits: 1

The third worker (worker with index == 2) has a specific resources definition:

worker:
  - index: 2
    resources:
      cpu:
        requests: 1
        limits: 2
      memory:
        requests: 256
        limits: 1024

And all the other workers have the same default worker resources definition, i.e.

default_worker:
  resources:
    cpu:
      requests: 1
      limits: 2
    memory:
      requests: 256
      limits: 1024
    gpu:
      requests: 1
      limits: 1

The same logic applies to the parameter servers, using the default_ps and ps sections.
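For instance, here is a hedged sketch of an index-specific override for the second parameter server (index == 1), mirroring the worker syntax above; the values are purely illustrative:

ps:
  - index: 1
    resources:
      cpu:
        requests: 1
        limits: 1
      memory:
        requests: 512
        limits: 512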

Distributed MXNet

In a similar fashion, MXNet defines 3 types of tasks: scheduler, workers, and parameter servers.

As explained before, the master task will in this case take the scheduler role; the rest of the cluster needs to be defined in an mxnet subsection:

environment:
  ...

  mxnet:
    n_workers: 3
    n_ps: 1

In a similar fashion, you can customize the resources of each task the same way as explained in the Tensorflow section.
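As an illustrative sketch, assuming the same resource keys shown for Tensorflow, an mxnet section with customized workers and parameter servers might look like this:

environment:
  ...

  mxnet:
    n_workers: 3
    n_ps: 1

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024

    default_ps:
      resources:
        cpu:
          requests: 1
          limits: 1
        memory:
          requests: 256
          limits: 256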

Distributed Pytorch

Distributed Pytorch is also similar but only defines a master task (worker with rank 0) and a set of worker tasks.

To define a Pytorch cluster in Polyaxon with a master and 3 workers, we need to add a pytorch subsection to the environment section:

environment:
  ...

  pytorch:
    n_workers: 3

You can customize the resources of each task the same way as explained in the Tensorflow section.
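As a sketch under the same assumptions, a pytorch section with a default worker resources definition might look like this; the values are illustrative:

environment:
  ...

  pytorch:
    n_workers: 3

    default_worker:
      resources:
        cpu:
          requests: 1
          limits: 2
        memory:
          requests: 256
          limits: 1024
        gpu:
          requests: 1
          limits: 1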

Examples

You can find examples of distributed experiments for these frameworks in our examples GitHub repo.