Remote debugging with GPUs in distributed (SLURM) compute clusters

Antonis Krasakis
Jul 23, 2021

When developing code, PyCharm’s debugging capabilities can be extremely powerful: they save developers a lot of time and help them write better quality code. For instance, you can insert breakpoints while the program is running, inspect the value each variable gets, and interact directly with the Python shell to develop new code, rather than having to re-run the entire program from scratch each time something changes.

(Image credits: pxfuel.com)

But what happens when your code only runs on a GPU and your local machine does not have one? Or when you are too lazy to synchronize all Python packages between the local and remote environments (because, as we all know, .yaml files never work!)? Well, remote debugging is here to save the day!

What we will do

In practice, remote debugging enables you to use the computational resources (CPU/GPU) and environment (i.e. Python/conda packages) of a remote machine directly from your local IDE (e.g. PyCharm). To do so, the IDE connects to the remote machine through a Secure Shell (SSH) connection and uses the remote Python interpreter to run all computations there. This lets you insert breakpoints, inspect variable values, etc. in real time, even though all computation runs on the remote computer.

PyCharm’s Professional edition already offers remote debugging capabilities. In the simplest scenario, where you can connect directly to a dedicated machine that will execute your code, the official JetBrains tutorial on remote debugging should be sufficient.

However, it is common that the remote machine is a compute node in a cluster managed by a workload manager (e.g. SLURM) and cannot be reached directly from your local machine (it usually sits behind a firewall for security reasons). Nonetheless, it is possible to work around this and establish a communication (SSH) tunnel between the local and remote machine.

In most compute clusters, there is a head (central) node that manages the workload for different jobs and users (e.g. via SLURM), while all code execution/computation is expected to happen on one of the compute/worker nodes. Therefore, computers outside the cluster are only allowed to communicate with the head node, and the compute nodes are protected behind a firewall.

To reach the compute node from our local machine, we basically need to create two tunnels: one from the local machine to the head node, and another from the head node to the compute node.

How-to (step-by-step)

Connecting to compute nodes with SSH

First, we need to figure out the addresses/hostnames of the computers we will use. To find the head node’s hostname, we can connect (ssh) to the head node over the terminal and run:

>> hostname
cluster.X.uva.nl

We also need to know the hostnames of the compute nodes. You can either do this by running squeue on the head node and inspecting the last column (named NODELIST(REASON)), or by ssh-ing from the head node to a compute node and running the hostname command again. Let’s assume the hostname of the compute node we will use is:

>> hostname
compute-node-001

Most likely, there will be multiple compute nodes, named compute-node-001, compute-node-002, etc.
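If you go the squeue route, the node hostnames appear in the last column. A hypothetical output (job IDs, usernames and node names are made up for illustration) might look like this:

>> squeue
JOBID PARTITION NAME    USER    ST TIME    NODES NODELIST(REASON)
123455 gpu      bash    alice   R  1:02:13 1     compute-node-002
123456 gpu      bash    bob     R  0:42    1     compute-node-001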

Now, we need to edit the SSH configuration file in our local machine:

>> nano ~/.ssh/config

and add the following lines:

Host {compute_node_name}*
    User {username_cluster}
    ProxyCommand ssh -t -W %h:%p -q {username_cluster}@{head_node_address}

# example:
# Host compute-node-*
#     User antonis
#     ProxyCommand ssh -t -W %h:%p -q antonis@cluster.X.uva.nl
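To sanity-check that the config block actually matches, you can ask ssh to print the configuration it would use for a compute node (a quick check using the example hostname from above; no running job is needed for this):

>> ssh -G compute-node-001 | grep -i proxycommand
# should print the ProxyCommand line that hops through the head node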

Reserving a compute node that will execute our code

Before connecting over SSH, we need to fire up a compute node (your connection might be rejected if you have no running job on that particular compute node). To do so, we need to log in to (the head node of) the cluster and start a job on a certain machine.

Using the SLURM workload manager, the following command requests a machine with 24 CPU cores, 32 GB of memory and 1 GPU (located in the gpu partition of the cluster) for 3 hours. The --pty bash at the end means that the job itself is an interactive bash shell.

srun -c 24 --mem=32gb --gres=gpu:1 -p gpu --time=3:00:00 --pty bash

Tip: you can also nest the bash job in a tmux or screen session to make sure that the remote job keeps running in case of inactivity or connection issues (see the sketch below).
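A minimal sketch of the tmux variant, assuming tmux is available on the head node (the session name debug is arbitrary):

# on the head node: start a named tmux session, then request the node inside it
tmux new -s debug
srun -c 24 --mem=32gb --gres=gpu:1 -p gpu --time=3:00:00 --pty bash
# detach with Ctrl-b d; re-attach later with: tmux attach -t debug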

Now, you can use your local terminal to connect to the compute node over SSH using:

ssh {username}@{compute_node_name}
# in the example above this would be:
# ssh antonis@compute-node-001

Hopefully, your terminal should be connected to the compute node. To verify run:

>> hostname
# should return:
# compute-node-001
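While you are at it, you can also check that the GPU reserved with --gres is visible on this node:

>> nvidia-smi
# should list the GPU(s) requested with --gres=gpu:1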

Creating a new tunnel for PyCharm

Now that we have a tunneled connection over the terminal, we need to set up PyCharm.

Unfortunately, we cannot ssh antonis@compute-node-001 from PyCharm yet, as PyCharm uses an internal SSH configuration file. That’s why we need one additional step: instead of connecting directly from PyCharm to the compute node, we will add an extra stop. We will open one last SSH tunnel that forwards one of our local ports to the compute node. Then we can instruct PyCharm to connect to that local port directly, and we are all set!

The following command shall do the trick:

ssh -L {local_port}:{compute_node_name}:22 {username}@{head_node_address}
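With the example hostnames used earlier and 12343 as the local port (the default used further below), this would be:

ssh -L 12343:compute-node-001:22 antonis@cluster.X.uva.nl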

Since we will be reusing this command every time we want to debug something remotely, we can wrap it in a small function and put it in our (local) .bash_profile. Simply paste these lines:

connect_compute_node(){
    port=${2:-12343}
    echo "Connecting to Slurm compute node $1 at local port $port";
    ssh -L "$port":{compute_node_name}-$1:22 {username}@{head_node_address};
}

# Or, according to the example:
# connect_compute_node(){
#     port=${2:-12343}
#     echo "Connecting to Slurm compute node $1 at local port $port";
#     ssh -L "$port":compute-node-$1:22 antonis@cluster.X.uva.nl;
# }

Now, if we have a job running on node 001, we can open the last tunnel using connect_compute_node 001. The function uses port 12343 by default, but we can change that, e.g. to port 6263, by running connect_compute_node 001 6263 instead.
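For instance, assuming the function above has been sourced (e.g. by opening a new terminal):

connect_compute_node 001
# Connecting to Slurm compute node 001 at local port 12343
connect_compute_node 001 6263
# Connecting to Slurm compute node 001 at local port 6263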

Configuring PyCharm

Now we are almost ready to go! All we need to do is configure a remote interpreter in PyCharm. In this section I will briefly describe the setup I use, but I also encourage you to experiment yourself and check other tutorials (such as the official ones from JetBrains).

First, we need to create a remote server configuration between PyCharm and the remote host, as explained in the PyCharm documentation. To do so, go to PyCharm Preferences > Build, Execution, Deployment > Deployment and fill in the connection details of the local port we just opened (the host is 127.0.0.1 and the port is the one used for the tunnel, e.g. 12343):

Once this is done and you have verified that “Test Connection” does not give an error, we are ready to set up the remote Python interpreter.

To do this, go to the Add Interpreter menu (under Project: X > Project Interpreter > Add, or via the interpreter selection menu on the bottom right). On the left bar, select SSH Interpreter and pick the remote connection we just created.

At this point, you need to configure the interpreter path on the remote machine. Choose the Python executable of your preference (e.g. if Anaconda is used, this would look something like /home/antonis/anaconda3/envs/test/bin/python).

Additionally, you need to select how to synchronize files between the local and remote machine. My personal preference is to do this over git rather than through PyCharm. In that case, you need to untick “Automatically upload project files to the server”.

After all that, you should be able to see the pip packages installed on the remote machine.

You can also define path mappings linking a remote path to a local one, and ignore certain files during the sync (e.g. model checkpoints or logs) if you use the PyCharm sync option. For additional information on these subjects, please refer to the official PyCharm documentation.

Testing and Usage

Finally, everything is set!

We can now start a terminal SSH session from PyCharm via Tools > Start SSH Session. Again, to verify we are on the remote compute node we can run the hostname command, or use nvidia-smi to check our GPU stats.
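As a quick end-to-end sanity check from that SSH session, you can also query the GPU from the remote Python environment (this assumes PyTorch is installed there and that python points to the right environment; any GPU-aware library would do):

>> hostname
compute-node-001
>> python -c "import torch; print(torch.cuda.is_available())"
True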

To execute and debug Python code, all we need to do is edit the Run/Debug configurations and select the appropriate remote interpreter and working directories. Then we can define breakpoints and debug our code as if it were running locally, using our remote environment and the more powerful CPUs or GPUs of the compute nodes!

Conclusion

In this post, I demonstrated how to do remote debugging from your (local) PyCharm IDE, using remote compute resources that are protected behind firewalls in a compute cluster with a SLURM workload manager. The biggest benefits of debugging directly on the remote compute nodes are that (a) we do not need to synchronize packages between the local and remote environment, and (b) we can directly use the computational resources of the cluster, which for instance allows us to debug GPU code from a MacBook using the cluster’s GPUs.

To do so, we created a chain of tunnels from our local machine (which runs the IDE), through the head node of the compute cluster, to the compute node that eventually runs our heavy computations on its CPU or GPU.

Acknowledgements

I would like to thank Auke Fokkers from the FEIOG team of the University of Amsterdam and Vaishali Pal for their help in debugging and figuring things out!
