How to run your R script with Compute Canada

Julie Fortin · Published in the nature of food · Oct 16, 2020

You may be like me: a researcher who has little formal training in computer science, but who still does a fair amount of programming in their day-to-day. You may have learned how to program in R, because that’s what everyone in your field uses. But sometimes, you may have pretty demanding scripts that are not ideally suited for your laptop.

Well, it turns out you can have access to more computing power. For free.*

*Caveat: you must be a researcher at a Canadian institution. See Compute Canada for more details on eligibility.

The great folks at Compute Canada have state-of-the-art high performance computing (HPC) facilities, and their main job is to help you, a Canadian scientist, use their infrastructure and services to improve your research!

When I learned about these services, I was eager to use them, but intimidated by all the jargon. Thankfully, Compute Canada has excellent help documentation, offers various training workshops, and hires experts and technical support specialists across the country whom you can contact.


Still, I spent many, many hours lost in the documentation. In hindsight, rather than trying to learn everything about HPC from A to Z, it would have been more helpful to learn strictly what I needed to know to run my scripts. So I decided to lay out a tutorial with that in mind for others who may be in a similar position.

For this piece, I assume you have:

  • An R script you would like to run (I use R examples throughout, but many of the tips apply to other languages too!)
  • Some familiarity with the command line
  • No knowledge of using servers/remote computers

If you’re on the same page as me, let’s dive in to learn how to take advantage of the amazing resources available through Compute Canada!

1. Get an account with Compute Canada.

This is a multi-step process. If you are working outside of a university, you need to contact Compute Canada. If you are at a Canadian university, read on:

  1. The Principal Investigator (PI) (i.e. any faculty member) must get a Compute Canada account. (This requires having an up-to-date Canadian Common CV — you have been warned)
  2. The PI can sponsor any number of lab members/collaborators. Each person must register with the Compute Canada Database using a code provided by the PI.
  3. The PI will receive an email within ~2 days of the request, and can confirm the user.

Note that PIs who have specific computational needs for their work can apply for large amounts of resources through a Resource Allocation Competition. However, each PI also gets a default allocation of 2TB of storage and “opportunistic use” of resources. This “opportunistic use” may well be enough to satisfy your needs, and is what I’m using for this tutorial.

2. Have a script in R that runs, but takes a lot of time on your computer.

Potential solution #1: make your code run faster. E.g. avoid loops and use the apply functions when you can, or parallelize your tasks.
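
As a small illustration, here is a hypothetical R snippet (the matrix and its size are made up) comparing an explicit loop to apply and to a built-in vectorized function:

m <- matrix(rnorm(1e6), ncol = 100)  # hypothetical large matrix

# Explicit loop: works, but more code and often slower
means_loop <- numeric(ncol(m))
for (i in seq_len(ncol(m))) {
  means_loop[i] <- mean(m[, i])
}

# apply over columns, or the even faster built-in colMeans
means_apply <- apply(m, 2, mean)
means_fast <- colMeans(m)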

Potential solution #2: use something other than R. Python and Julia are also open-source, but quicker in some benchmarking tasks.

Alas, as a researcher with certain priorities and time constraints, it doesn’t always make sense to spend a lot of time optimizing code, or learning a whole new language. So let’s assume we need to run the script that we have.

3. Make sure all the input/output files referred to in your code have relative paths. This means relative to the folder in which the script is located.

./file.csv — this is a file in the same directory as the script
../file.csv — this is a file in the script directory’s parent directory

This is important because when we run the script on a Compute Canada computer, the file paths can't point to folders that only exist on your personal computer.

I recommend using the R package "here" when defining input and output paths: it finds your project's root folder, then builds your paths from that point.
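
For instance, a sketch assuming your project keeps inputs in a data/ folder and outputs in an output/ folder (both folder names are placeholders):

library(here)

# here() finds the project root, so these paths work on
# your laptop and on the cluster alike
input <- read.csv(here("data", "file.csv"))
write.csv(input, here("output", "results.csv"))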

4. Connect to Compute Canada computers.

Compute Canada resources are physically located in data centres across the country, and they are given cool names to make them easier to remember/refer to. For example, there is one called “Cedar” at SFU. These data centres are essentially big rooms (with specialized cooling systems) full of stacks upon stacks of computers.


So how do you use one of those computers without physically going there? You can access them using your Compute Canada login information! You'll need a tool called "Secure Shell" (SSH), which lets you connect to remote computers securely. You can use SSH from Terminal (Mac/Linux) or through a third-party program like PuTTY (Windows).

  1. Open Terminal
  2. Write the following command:
    ssh -Y yourusername@cedar.computecanada.ca
    (I use Cedar in this example, but you could log into any of the Compute Canada clusters)
  3. Enter your password

Ta-da! You are in the “login node”. The login node is a computer at a Compute Canada centre that lets you send jobs to the other computers.

5. Install any packages you need

Just as you would need to install a package before using it in R on your computer, you need to install packages on the Compute Canada computer. But first, we need to run R. And we unfortunately can’t just double-click on the R logo to do that. In the login node, type:

  1. module spider r
    This will list versions of R you could use.
  2. Then, look at the version you want specifically to see if there are any other modules you need to load in order to run it. For example, if I want to use R 4.0.2, I would run:
    module spider r/4.0.2
    This will show any other modules & versions you need to load before you can run R.
  3. module load nixpkgs/16.09 gcc/7.3.0 r/4.0.2
    This loads the two modules required to run R version 4.0.2, then loads R itself. It is important that they be in the correct order (nixpkgs and gcc before r).
  4. R
    This opens an interactive R session.
  5. > install.packages("tidyverse", repos="https://mirror.rcg.sfu.ca/mirror/CRAN/")
    Install the packages you need, specifying the repos argument.
  6. > q()
    Close the interactive R session.

If you try installing packages and receive error messages about specific modules needing to be loaded, check which versions you need with module spider and load them before installing (and add them to your job script too). For example, to use the package raster, you need to load modules for gdal and proj.
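
For instance, something along these lines (the exact versions available differ by cluster, so treat this as a sketch and check module spider first):

module spider gdal    # list available gdal versions
module load gdal proj    # load them (with versions, if required)
R    # then install.packages("raster") in the R session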

6. Get your files set up on a Compute Canada computer

Your script and input files are sitting on your computer. In order to run your script on a Compute Canada computer, you need to copy the files there.

You can do this via Globus File Transfer:

  1. Log into Globus using your Compute Canada username and password.
  2. Create an endpoint (i.e. link to a computer you want to transfer files from/to) on your computer and give it a clear name
  3. Copy the set-up key
  4. Install Globus Connect Personal on your computer (Mac/Windows/Linux)
  5. During set-up, paste the set-up key you copied earlier
    (you only need to do steps 1–5 once!)
  6. Go to the Compute Canada Globus portal
  7. Select your personal endpoint (username#endpointname)
  8. Select a Compute Canada endpoint (computecanada#cedar-dtn)
  9. Navigate to the files/folders you want to copy from/to
  10. Select files and click ‘Start’

You’ll eventually get an email confirming the file transfer is complete.

Great, now the files you need should be on your Compute Canada account. Make sure you’ve maintained the same folder structure as on your computer so that when you run your script, it knows where to put the input and output files.

7. Create a job script

A job script is a file that contains information about the job you want to run: which program to use (i.e. R), which file to run, and how much time, memory, and how many cores it needs.

Via the “login node”, you send this job script to the “scheduler”. The scheduler then takes your job script, decides when to run your job (based on available resources), and sends it to the “computing nodes” (i.e. the computers that actually do the work).

  1. Still in Terminal, while in the login node, navigate to your user’s folder using the cd command. You should see the files we imported earlier using Globus.
  2. Create a job script file:
    nano jobname.sh (this opens a text editor that will save to jobname.sh)
  3. Write the job script (see the example below).
  4. Press ctrl+X, then Y (and Enter to confirm), to save your file and exit the editor.
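
Here is a minimal example job script (a sketch: the account name, job name, resource requests, and script filename are placeholders to adapt; the module line matches step 5):

#!/bin/bash
#SBATCH --account=def-yourpi    # your PI's allocation (placeholder)
#SBATCH --job-name=myanalysis
#SBATCH --time=02:00:00    # wall time (HH:MM:SS)
#SBATCH --mem=4G    # memory
#SBATCH --cpus-per-task=1    # number of cores

module load nixpkgs/16.09 gcc/7.3.0 r/4.0.2    # same modules as step 5
Rscript myscript.R    # run the script non-interactively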

How do I know how much time I need?

  • Different Compute Canada centres have different time limits, ranging from 24 hours to 28 days. If you can, try running your script on a subset of data and seeing what it might scale up to (while being aware that not all processes scale linearly!)
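
One rough approach in R (process_one_unit() is a hypothetical stand-in for one chunk of your analysis):

# Time one unit of work, then extrapolate with caution
t <- system.time(result <- process_one_unit())
t["elapsed"] * 38    # rough estimate if you have 38 similar units
# Remember: memory pressure and I/O can make scaling non-linear!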

How do I know how much memory I need?

  • Start large (4GB) on a test script. Then:
  • While the script is running, check how much memory is used in real time by typing sstat yourjobID.batch --format="JobID,MaxRSS"
  • Or, when the script is done running, check how much was used by typing sacct -o MaxRSS -j yourjobID
  • If you check the slurm.out file and you’re getting “oom-kill” errors, you need to request more memory
  • If you’re using less than you asked for, it’s beneficial to reduce the memory in --mem or --mem-per-cpu (this way your job will get scheduled sooner)
  • Resubmit your job with your new estimate.

How do I know how many CPUs I need?

  • If you didn’t explicitly do anything in your code to make your script run in parallel, it is set up to run on 1 core.
  • If you have, then you can request multiple cores. How many? As many as your script can actually put to work at once (see the sketch after this list).
  • As an example, I once ran a script processing 38 years of daily climate data. Each year would take just under 1.5 hours to run. So, I parallelized my script and requested 38 cores for 1.5 hours (each core processing one year of data). It took less than 10 minutes for the job to be scheduled by SLURM. So I had my results for all 38 years in under 2 hours.
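
A sketch of that kind of setup with the base parallel package (process_year() is a hypothetical function standing in for the real analysis):

library(parallel)

process_year <- function(year) {
  # ... read, process, and summarize one year of data ...
  year  # placeholder return value
}

years <- 1981:2018  # 38 years
# mc.cores should match --cpus-per-task in your job script
results <- mclapply(years, process_year, mc.cores = 38)

Note that mclapply() works by forking processes, which is fine on the Linux machines Compute Canada runs; on a Windows laptop you would use parLapply() instead.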

8. Submit the job

We have all the information we need in our shell script. Now we just need to submit it to the scheduler. In Terminal, while in your user folder in the login node, write the following command:

sbatch jobname.sh

Now it’s out of your hands! The scheduler will see how many resources you are asking for, and try to slot in your job whenever there are some compute nodes available.
You can check up on your job by typing squeue -u yourusername in Terminal.

You can check any outputs from the R console in the slurm-yourjobID.out file. Anything that you put in a print() command in your R script should appear in the slurm.out file, which can help with debugging!
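
For example, a few breadcrumbs like these (a trivial sketch) make it much easier to see how far a job got:

print("Loading data...")
# ... load data ...
print(paste("Started main computation at", Sys.time()))
# ... main computation ...
print("Done!")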

9. Once the job is done, transfer your output files back to your personal/lab computer.

Once again, you can use Globus File Transfer to copy the files back over to your computer.

Voila! You have learned how to run your R script on a Compute Canada node! Hopefully it managed to run more quickly than it would have on your personal computer, leaving you more time to focus on the research itself.

This barely scratches the surface of what Compute Canada can do for you to help you bring your research to the next level. But it might be a good first step to learning how to harness high performance computing resources.

Disclaimer: I am not a Compute Canada expert. I recommend reading the Compute Canada documentation or contacting them when troubleshooting! However, as of August 16th, 2021, this post has been reviewed by folks at ACENET (Compute Canada Atlantic regional partner) who kindly corrected a few inaccuracies (specifically regarding how to load R packages).

Julie Fortin is a data scientist and lab manager with the LUGE lab at the University of British Columbia.