How to use TensorFlow on TACC supercomputer

Li Yin
Jul 22, 2017 · 6 min read

I tried to figure this out and could not find a tutorial by googling. But I discovered that there are plenty of GPU resources on the Maverick system at TACC, and TensorFlow modules are already installed there.

First, you need an account on Maverick.

  1. Log in to Maverick: ssh username@maverick.tacc.utexas.edu
  2. Run idev to get an interactive session on one compute node. Use nvidia-smi to see the GPU resources.
  3. Load the TensorFlow module (GPU or CPU version). Run module spider modulename (e.g. module spider tensorflow-gpu) to get its information.

To load the GPU-enabled module: this module defines the environment variables TACC_TENSORFLOW_DIR, TACC_TENSORFLOW_LIB, and TACC_TENSORFLOW_BIN for the locations of the main TensorFlow directory, the libraries, and the binaries. The library location is added to your LD_LIBRARY_PATH and to your PYTHONPATH.

module reset
# Python 3:
module load gcc/4.9.1 cuda/8.0 cudnn/5.1 python3/3.5.2 tensorflow-gpu/1.0.0
# or Python 2:
module load gcc/4.9.1 cuda/8.0 cudnn/5.1 python/2.7.12 tensorflow-gpu/1.0.0
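Once the module is loaded, a quick sanity check is to print the variables it defines (the actual paths depend on the module version; the fallback text is mine):

```shell
# Print the TensorFlow-related variables the module should define;
# the fallback text appears if the module is not loaded yet.
echo "TensorFlow dir: ${TACC_TENSORFLOW_DIR:-<module not loaded>}"
echo "Libraries:      ${TACC_TENSORFLOW_LIB:-<module not loaded>}"
echo "Binaries:       ${TACC_TENSORFLOW_BIN:-<module not loaded>}"
```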

TACC Maverick queues

4. Tutorials

5. Now start using TensorFlow:
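A minimal sketch to verify that the loaded module's TensorFlow is importable (the version printed depends on the module you loaded; the fallback message is mine):

```shell
# Try importing TensorFlow and printing its version;
# fall back to a hint if the module is not on PYTHONPATH yet.
python3 -c "import tensorflow as tf; print(tf.__version__)" 2>/dev/null \
  || echo "TensorFlow not found - load the tensorflow-gpu module first"
```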

pip3 install --user module-name # to install your own modules

Or we can use the visualization tool: go to https://vis.tacc.utexas.edu, log in, and use TensorFlow with Jupyter, coding there.

Download a file from Google Drive

!pip3 install --user googledrivedownloader
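After installing it, usage looks roughly like the sketch below. The file ID and destination path are placeholders (take the ID from the Drive share URL); the heredoc simply feeds the Python snippet to the interpreter:

```shell
# Download a shared Google Drive file by its ID via the
# googledrivedownloader package; placeholders must be replaced.
python3 - <<'EOF' || echo "googledrivedownloader not installed - run the pip3 line above first"
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='FILE_ID_FROM_SHARE_LINK',
                                    dest_path='./data/dataset.zip',
                                    unzip=True)
EOF
```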

Run the code:

!python3 main.py

Other deep learning on TACC

The whole process:

  1. Start in the $HOME directory and save your files there.

3. Get on a compute node using idev and run the job:

$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00

Interactive with VNC (good for short tasks within one hour):

cp /share/doc/slurm/job.vnc ./

Then vi job.vnc to change the configuration. The following is the configuration for my task, which is going to run on the GPU.
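For reference, the GPU-related part of my job.vnc looked roughly like the directives below (job name, queue, time, and account are placeholders; adjust them to your allocation):

```shell
#SBATCH -J vnc-train       # job name
#SBATCH -p gpu             # GPU queue on Maverick
#SBATCH -N 1               # one node
#SBATCH -n 1               # one task
#SBATCH -t 01:00:00        # one-hour wall-clock limit
#SBATCH -A PROJECT         # your allocation/project name
```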

Submit the script to allocate the resources:

sbatch job.vnc

Offline (good for long jobs):

module load launcher
cp $TACC_LAUNCHER_DIR/launcher.slurm ./
vi launcher.slurm

Then add a commands file to execute the program
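The commands file is just a plain text file with one command per line; the launcher runs each line as a separate task. The script name and flags below are made-up examples:

```shell
# commands file: one program invocation per line
python3 main.py --lr 0.01
python3 main.py --lr 0.001
```

Point launcher.slurm at this file (in the launcher versions I have seen, via a variable such as CONTROL_FILE or LAUNCHER_JOB_FILE) before submitting.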

Then submit the job:

sbatch launcher.slurm

More explanation is in the TACC Launcher documentation.


Then use the following command to check the status

Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:

login1$ squeue --start -j 1676354
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
1676354 normal hellow janeuser PD 2013-08-21T13:42:03 256 (Resources)

Even more extensive job information can be found using the scontrol command. The output shows quite a bit about the job: job dependencies, submission time, number of nodes, location of the job script and the working directory, etc. See the man page for more details.

login1$ scontrol show job 1676354
JobId=1676354 Name=mpi-helloworld
UserId=slindsey(804387) GroupId=G-40300(40300)
Priority=1397 Account=TG-STA110012S QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=15:30:00 TimeMin=N/A
SubmitTime=2013-09-11T15:12:49 EligibleTime=2013-09-11T15:12:49
StartTime=2013-09-11T17:40:00 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=login4:27520
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=256-256 NumCPUs=4096 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home1/01158/slindsey/mpi/submit.slurm
WorkDir=/home1/01158/slindsey/mpi

Job Deletion with scancel

The scancel command is used to remove pending and running jobs from the queue. Include a space-separated list of job IDs that you want to cancel on the command-line:

login1$ scancel job_id1 job_id2 ...

Example job scripts are available in /share/doc/slurm. They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.

4. Copy the model and data. You may copy the test dir to other places:

$ cp -r /work/00946/zzhang/maverick/test $WORK

5. Enter the directory:

$ cd $WORK/test

6. Load TensorFlow Module

module load tensorflow-gpu

7. Run the program there inside screen, so it keeps running if you disconnect; the data is saved to $WORK.

rsync -avtr save $HOME #use this command to sync save directory to home

8. Use this command to check the submitted queue:

squeue -u liyin

Then we can connect to the nodes in the working node list:

ssh c221-601

The first job, idv15184, was started over ssh and works under $WORK; the second, tvp_ipyt, came from the visualization portal tool and works under $HOME.

c221-501: the HOME directory with the ipython information

If we use top on each of the working nodes, it shows the same PID running. All the tasks submitted from the terminal run under the $WORK directory, and that is where the results are saved.

Thus, we can copy the results from $WORK to $HOME so that we can easily visualize the data with a Jupyter notebook from the visualization portal tool. It would be even nicer if the $WORK directory were also visible to all users in the Jupyter notebook.

We can run a program in the $WORK directory from a Jupyter notebook:

!python3 /work/04035/liyin/maverick/Face-Age/mainnodz.py

If we are running a deep neural network that takes hours or days to finish, we can simply close the visualization portal website, but do not hit log out. The job will keep running until it is done.

scp

localhost% scp filename \   
username@stampede.tacc.utexas.edu:/path/to/project/directory

Tensorboard

ssh -f -g -N -R 6006:c224-202:6006 login2

Notice I used the same port number (6006) that TensorBoard told me to connect to. Also note I used the compute node host name (c224-202) in that command. Your node will likely have a different name; use whatever node you land on.

I then pointed my browser to login2.maverick.tacc.utexas.edu:6006 and was able to connect.

2. An easier solution is to go to the TACC Visualization Portal and start a VNC session.

First, start TensorBoard.

Then go to the application menu to choose browser:

And then navigate to the given address:

Machine Learning for Li

This publication will include all the stories I wrote about neural networks and the machine learning techniques I have learned or am interested in.
