Advanced scheduling of experiments and jobs on Polyaxon

For some advanced use cases, users might need to have more control over where specific Polyaxon jobs should be deployed on their clusters.

Polyaxon provides a list of options to select which nodes should be used for the core platform, for the dependencies, and for the experiments. These options are Node selectors, Tolerations, and Affinity.

Node selectors

Polyaxon lets you set default node selectors that assign pods to nodes for four categories: core components, experiments, jobs, and builds.

  • core: the core Polyaxon platform
  • experiments: all users' experiments scheduled by Polyaxon
  • jobs: all users' jobs scheduled by Polyaxon
  • builds: all users' builds scheduled by Polyaxon

Additionally, you can specify node selectors for Polyaxon's dependencies: postgresql, rabbitmq, and redis.

By providing some or all of these values, you constrain the pods in each category to run only on nodes that carry the matching labels.

For example, if you have some GPU nodes, you might want to only use them for training your experiments. In this case you should label your nodes:

kubectl label nodes node1 node2 polyaxon.com=gpu-nodes

And then in your polyaxon_config.yml file you can update the node selectors for the experiments:

nodeSelectors:
  experiments:
    polyaxon.com: gpu-nodes
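
The same nodeSelectors section can carry one selector per category. The fragment below is a minimal sketch, assuming the nodes were labeled beforehand; the label values core-nodes and cpu-nodes are hypothetical, and only gpu-nodes comes from the example above.

```yaml
# polyaxon_config.yml (sketch): one default selector per pod category.
# core-nodes and cpu-nodes are hypothetical label values; apply them to
# your nodes with `kubectl label nodes ...` before deploying.
nodeSelectors:
  core:
    polyaxon.com: core-nodes
  experiments:
    polyaxon.com: gpu-nodes
  jobs:
    polyaxon.com: cpu-nodes
  builds:
    polyaxon.com: cpu-nodes
```

You can provide only some of these categories and leave the others out.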

Experiments, jobs, and builds node selectors

In some situations, whether or not you deployed Polyaxon with node selectors, you might still need control over where specific jobs run. For example, you might run a distributed TensorFlow experiment and want to schedule the workers on GPU nodes and the parameter servers on CPU nodes, do some exploration with a Jupyter Notebook on a node with a specific GPU, or schedule all builds on the same node.

Polyaxon has a node_selector subsection in the environment section that lets you override the default node selectors.

Distributed TensorFlow example:

---
version: 1

kind: experiment

environment:
  tensorflow:
    n_workers: 2
    n_ps: 1
    worker_default:
      node_selector:
        polyaxon: gpu-node

    ps:
      - index: 0
        node_selector:
          polyaxon: cpu-nodes

build:
  image: tensorflow/tensorflow:1.4.1-gpu-py3
  build_steps:
    - pip install --no-cache-dir -U polyaxon-helper
run:
  cmd: python run.py --train-steps=400 --sync

Jupyter Notebook example:

---
version: 1

kind: notebook

environment:
  node_selector:
    polyaxon: specific-gpu-node

build:
  image: tensorflow/tensorflow:1.4.1-gpu-py3
  build_steps:
    - pip3 install jupyter

Tolerations

To use tolerations, you first need to apply one or more taints to one or more of your nodes:

kubectl taint nodes node1 key=value:NoSchedule

Then apply a matching toleration to your deployment by updating your polyaxon_config.yml file.

You can provide default tolerations for the core components, experiments, jobs, and builds.

For example, to apply a toleration to the core Polyaxon components:

tolerations:
  core:
    - operator: "Exists"
      effect: "NoSchedule"

Because no key is given, this toleration matches every taint whose effect is NoSchedule, so the core components can be scheduled on any node carrying such a taint.
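
A narrower toleration can target only the taint created by the kubectl command above (key=value:NoSchedule). The fragment below is a minimal sketch that applies it to experiments rather than the core components; it assumes the experiments category takes the same list format as core, and it uses the standard Kubernetes toleration fields.

```yaml
# polyaxon_config.yml (sketch): tolerate only the taint key=value:NoSchedule.
# Assumes `experiments` accepts the same list format as `core`.
tolerations:
  experiments:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
```

With the Equal operator, the toleration matches only a taint with the same key, value, and effect, so experiments would still be kept off nodes that carry other NoSchedule taints.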

In the same way, you can add tolerations to one or more dependencies so that they can be scheduled on tainted nodes. For example, you can add a toleration to Postgres:

postgresql:
  ...
  tolerations:
    - key: "key1"
      operator: "Exists"
      effect: "NoSchedule"

Affinity

As with node selectors and tolerations, Polyaxon lets you provide a default affinity for the core components, experiments, jobs, and builds.

Polyaxon also ships with an affinity for the core components that ensures they are deployed on the same node. You can override this default behavior for the core components as well as for the dependencies.

You can use the affinity subsection in environment to control the scheduling of a specific experiment, job, or build.

---
version: 1

kind: experiment

environment:
  affinity:
    ...
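
As an illustration of what the affinity subsection might contain, the sketch below uses a standard Kubernetes node affinity rule. It assumes that Polyaxon passes the contents of affinity through unchanged as a Kubernetes affinity object, which is not confirmed here, and it reuses the polyaxon.com=gpu-nodes label from the node selector example.

```yaml
---
version: 1

kind: experiment

environment:
  # Assumption: this block is forwarded as-is as a Kubernetes affinity object.
  affinity:
    nodeAffinity:
      # Hard requirement: schedule only on nodes labeled polyaxon.com=gpu-nodes.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: polyaxon.com
                operator: In
                values:
                  - gpu-nodes
```

In Kubernetes, replacing the required rule with preferredDuringSchedulingIgnoredDuringExecution turns the constraint into a preference, which node selectors cannot express.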