Advanced scheduling of experiments and jobs on Polyaxon
For some advanced use cases, users might need to have more control over where specific Polyaxon jobs should be deployed on their clusters.
Polyaxon provides a list of options to select which nodes should be used for the core platform, for the dependencies, and for the experiments. These options are Node selectors, Tolerations, and Affinity.
Polyaxon comes with default node selectors to assign pods to nodes: core components, experiments, jobs, builds.
- core: the core Polyaxon platform
- experiments: all users' experiments scheduled by Polyaxon
- jobs: all users' jobs scheduled by Polyaxon
- builds: all users' builds scheduled by Polyaxon
Additionally, you can specify node selectors for Polyaxon's dependencies: postgresql, rabbitmq, and redis.
By providing these values, or some of them, you can constrain the pods belonging to that category to only run on particular nodes or to prefer to run on particular nodes.
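The underlying Kubernetes rule is that a pod with a node selector can only be scheduled on nodes whose labels contain every key/value pair in the selector. The following is a minimal Python sketch of that rule, not Polyaxon or Kubernetes code; the helper names `node_matches_selector` and `eligible_nodes`, as well as the node names and labels, are hypothetical and chosen for illustration only.

```python
def node_matches_selector(node_labels, node_selector):
    """Return True if the node carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(nodes, node_selector):
    """Return the sorted names of the nodes a pod with this selector may run on."""
    return sorted(
        name for name, labels in nodes.items()
        if node_matches_selector(labels, node_selector)
    )


# Hypothetical cluster: two labeled GPU nodes and one unlabeled node.
nodes = {
    "node1": {"polyaxon.com": "gpu-nodes"},
    "node2": {"polyaxon.com": "gpu-nodes"},
    "node3": {},
}

print(eligible_nodes(nodes, {"polyaxon.com": "gpu-nodes"}))  # ['node1', 'node2']
print(eligible_nodes(nodes, {}))  # an empty selector matches every node
```

Note that a node selector is a hard constraint: if no node carries the required labels, the pod stays unscheduled.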
For example, if you have some GPU nodes, you might want to only use them for training your experiments. In this case you should label your nodes:
kubectl label nodes node1 node2 polyaxon.com=gpu-nodes
And then in your polyaxon_config.yml file you can update the node selectors for the experiments:
Experiments, jobs, and builds node selectors
In some situations, whether or not Polyaxon was deployed with node selectors, you might still need control over specific jobs. For example, you might run a distributed TensorFlow experiment and want to schedule the workers on GPU nodes and the parameter servers on CPU nodes, do some exploration with a Jupyter Notebook on a node with a specific GPU, or schedule all builds on the same node.
Polyaxon has a node_selector subsection in the environment section that allows you to override the default node selectors.
Distributed TensorFlow example:
- index: 0
- pip install --no-cache-dir -U polyaxon-helper
cmd: python run.py --train-steps=400 --sync
Jupyter Notebook example:
- pip3 install jupyter
To use tolerations, you need to apply one or more taints to one or all of your nodes:
kubectl taint nodes node1 key=value:NoSchedule
And apply this toleration to your deployment by updating you
You can provide default tolerations for the core components, experiments, jobs, and builds.
For example, to apply a toleration to the core Polyaxon components:
- operator: "Exists"
This will allow the core components to be scheduled on any node that has any taint.
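To see why a toleration with only `operator: "Exists"` tolerates every taint, it helps to spell out the Kubernetes matching rule: a toleration matches a taint when their keys and effects agree and the operator is either `Exists`, or `Equal` with equal values. An empty key with `Exists` matches every key, and an empty effect matches every effect. The sketch below is a simplified Python model of that rule, not Kubernetes code. The function names `tolerates` and `can_schedule` are hypothetical, and the sketch treats every taint as a hard constraint (it ignores the softer `PreferNoSchedule` effect).

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")  # Kubernetes defaults to Equal
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")

    # An empty effect in the toleration matches every taint effect.
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint["key"]
    # Equal: both the key and the value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations, taints):
    """A pod may land on a node only if every taint is matched by some toleration."""
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in taints)


# The taint created by `kubectl taint nodes node1 key=value:NoSchedule`.
node_taints = [{"key": "key", "value": "value", "effect": "NoSchedule"}]

print(can_schedule([{"operator": "Exists"}], node_taints))  # True
print(can_schedule([], node_taints))  # False
```

A narrower toleration, for example one that names the key, value, and effect explicitly, only admits the pod to nodes carrying that specific taint.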
In the same way, you can allow one or more dependencies to be deployed on specific tainted nodes. For example, you can add a toleration to Postgres:
- key: "key1"
As with node selectors and tolerations, Polyaxon provides a default affinity for the core components, experiments, jobs, and builds.
It also comes with an affinity for the core components to ensure that they are deployed on the same node. You can override this default behavior for the core components as well as for the dependencies.
You can use the affinity subsection in environment to control the scheduling of a specific experiment, job, or build.