Slurm Cluster with Docker

Rodrigo Ancavil
Analytics Vidhya
Mar 1, 2021


This is a short and simple guide on how to set up a Slurm cluster using Docker.

The aim is to provide an environment to test, learn, and practice using and developing on a Slurm cluster (this is not a production environment).

GitHub repository

Slurm

According to its documentation, Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of all sizes.

Slurm stands for Simple Linux Utility for Resource Management (SLURM), and it is used by many of the world’s supercomputers and Linux clusters in general.

In simple words, Slurm allows us to execute jobs in the cluster using the resources of the nodes that belong to it.
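For instance, once a cluster like the one we build below is running, a single command can be dispatched to a compute node with srun, or a batch script can be queued with sbatch (just an illustration for now; we will do this for real later in this guide):

$ srun hostname      # run "hostname" as one task on an allocated compute node
$ sbatch job.sh      # queue a batch script for execution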

Slurm Architecture

We are going to create a Slurm cluster using docker-compose. Docker Compose allows us to create an environment from previously built Docker images: it creates the containers and the network that connects them in an isolated environment. Each container will be a component of the cluster.

slurmmaster will be the container with slurmctld (The central management daemon of Slurm).

slurmnode[1–3] are the containers with slurmd (compute node daemon for Slurm).

slurmjupyter will be the container with JupyterLab installed. As end users, we’ll interact with Slurm through JupyterLab in the browser, using it as the cluster client.

cluster_default is the network that docker-compose creates to connect the containers and keep them together; containers inside the network can see each other.

The following scheme shows how components interact.

Slurm cluster docker architecture

Creating the cluster

As I mentioned before, we are going to use docker-compose to create our Slurm Cluster. So we will write a docker-compose.yml file to declare and configure all cluster components.

To install docker-compose you have to execute:

$ pip3 install docker-compose
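You can verify the installation with (the exact version will vary):

$ docker-compose --version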

I created a directory called cluster and wrote my docker-compose.yml inside it:

$ mkdir cluster
$ cd cluster
$ vim docker-compose.yml

Note: you can use the editor you like instead of vim.

Now, write or copy and paste the following lines in your docker-compose.yml file.

services:
  slurmjupyter:
    image: rancavil/slurm-jupyter:19.05.5-1
    hostname: slurmjupyter
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 8888:8888
  slurmmaster:
    image: rancavil/slurm-master:19.05.5-1
    hostname: slurmmaster
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 6817:6817
      - 6818:6818
      - 6819:6819
  slurmnode1:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode1
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode1
    links:
      - slurmmaster
  slurmnode2:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode2
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode2
    links:
      - slurmmaster
  slurmnode3:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode3
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode3
    links:
      - slurmmaster
volumes:
  shared-vol:

Note: I am using a computer with two CPUs, so each container node will see two CPUs.

Now you can create the cluster by executing:

$ docker-compose up -d

Note: the -d option means we run the Slurm cluster in detached mode (as a daemon).
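If you want to watch the cluster coming up, or debug a component that does not start, you can follow a service’s logs; slurmmaster below is one of the service names from the docker-compose.yml:

$ docker-compose logs -f slurmmaster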

You can check that everything is OK:

$ docker-compose ps

Now, go to the following URL in your browser to access JupyterLab.

http://localhost:8888

You’ll see the JupyterLab environment.

JupyterLab with Slurm Queue and Client

The HPC Tools / Slurm Queue extension comes installed.

Push the button, and you’ll get the Slurm Queue Manager.

To get an overview of the cluster, open a Terminal from the Launcher tab.

In the Terminal, execute the command scontrol show node.

admin@slurmjupyter:~$ scontrol show node
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.88
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.5
   OS=Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021
   RealMemory=1 AllocMem=0 FreeMem=203 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar
   BootTime=2021-02-05T00:25:01 SlurmdStartTime=2021-02-28T21:25:18
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.88
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode2 NodeHostName=slurmnode2 Version=19.05.5
   OS=Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021
   RealMemory=1 AllocMem=0 FreeMem=203 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar
   BootTime=2021-02-05T00:25:01 SlurmdStartTime=2021-02-28T21:25:19
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=slurmnode3 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUTot=2 CPULoad=0.88
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode3 NodeHostName=slurmnode3 Version=19.05.5
   OS=Linux 4.15.0-135-generic #139-Ubuntu SMP Mon Jan 18 17:38:24 UTC 2021
   RealMemory=1 AllocMem=0 FreeMem=203 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar
   BootTime=2021-02-05T00:25:01 SlurmdStartTime=2021-02-28T21:25:18
   CfgTRES=cpu=2,mem=1M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

The cluster creates a Partition called slurmpar. Partitions in Slurm are sets of nodes with associated resources.
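You can also get a compact view of the partition and its nodes with sinfo; the output below is roughly what this three-node cluster should report:

admin@slurmjupyter:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmpar*    up   infinite      3   idle slurmnode[1-3]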

A first example

We are going to develop a small Python example to test how our cluster works.

Go to JupyterLab, create a new file, and rename it test.py.

And write the following code:

#!/usr/bin/env python3

import time
import os
import socket
from datetime import datetime as dt

if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))
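
You can sanity-check the script in a JupyterLab terminal before handing it to Slurm (it will simply run once, on the slurmjupyter container):

$ python3 test.py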

Now, we’ll write a job.sh script. Go to New File again, rename the file to job.sh, and write the following:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#
#SBATCH --ntasks=6
#
sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py

In the script, I define the output file result.out and set ntasks=6 because we have 3 nodes with 2 CPUs each.
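If you are unsure how many CPUs each node exposes, you can ask Slurm directly; the format string below just prints the hostname and CPU count of every node:

$ sinfo -N -o "%n %c"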

sbcast will transmit the file to the nodes allocated to a Slurm job.

srun will run the command in parallel, once per task.

Our test.py will be executed in parallel as 6 tasks.

Go to Submit Job in Slurm Queue Manager, and choose job.sh (path /home/admin).
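If you prefer the terminal to the graphical extension, the same job can be submitted and monitored from a terminal in /home/admin, where job.sh lives:

$ sbatch job.sh
$ squeue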

After submitting the job.sh script, push the Reload button and you’ll see the job in the queue.

After 15 secs, the results will be written in the file result.out.

Process started 2021-02-28 11:23:55.094187
NODE : slurmnode1
PID : 249
Executing for 15 secs
Process finished 2021-02-28 11:24:10.109268
Process started 2021-02-28 11:23:55.133633
NODE : slurmnode3
PID : 145
Executing for 15 secs
Process finished 2021-02-28 11:24:10.141112
Process started 2021-02-28 11:23:55.149958
NODE : slurmnode3
PID : 144
Executing for 15 secs
Process finished 2021-02-28 11:24:10.164342
Process started 2021-02-28 11:23:55.153752
NODE : slurmnode1
PID : 248
Executing for 15 secs
Process finished 2021-02-28 11:24:10.168402
Process started 2021-02-28 11:23:55.192345
NODE : slurmnode2
PID : 145
Executing for 15 secs
Process finished 2021-02-28 11:24:10.207377
Process started 2021-02-28 11:23:55.197817
NODE : slurmnode2
PID : 146
Executing for 15 secs
Process finished 2021-02-28 11:24:10.212361

Analyzing the result, we can see that test.py was executed 6 times in parallel, twice on each node, with all tasks starting and finishing at roughly the same time (each ran for 15 secs).
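If a job ever misbehaves, it can be cancelled from the terminal; the job id below is just a placeholder, use the one reported by squeue:

$ squeue             # find the JOBID of the running job
$ scancel 2          # cancel it (2 is a placeholder id)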

To stop the cluster:

$ docker-compose stop
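
When you are completely done with it, docker-compose down removes the containers and the network; adding -v also deletes the shared volume (and with it your files under /home/admin), so use it with care:

$ docker-compose down
$ docker-compose down -v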

Summary

This was a little tutorial on how to get a Slurm cluster for learning and practicing parallel programming. Enjoy it.
