How to use TensorFlow on TACC supercomputer
I tried to figure this out, but I could not find a tutorial for it online. What I did find is that Maverick at TACC has a lot of GPU resources, and it has TensorFlow modules installed.
First, you need an account on Maverick.
- Log in to Maverick: ssh username@maverick.tacc.utexas.edu
- Use idev to get into the computing environment on one node. Use nvidia-smi to see the GPU resources.
- Load the tensorflow-gpu or tensorflow-cpu module.
Use module spider modulename (e.g. tensorflow-gpu) to get information about it. The GPU-enabled module defines the environment variables TACC_TENSORFLOW_DIR, TACC_TENSORFLOW_LIB, and TACC_TENSORFLOW_BIN for the locations of the main TensorFlow directory, the libraries, and the binaries. The library locations are added to your LD_LIBRARY_PATH and your PYTHONPATH. To load the GPU-enabled module:
module reset
module load gcc/4.9.1 cuda/8.0 cudnn/5.1 python3/3.5.2 tensorflow-gpu/1.0.0
Or, for Python 2:
module load gcc/4.9.1 cuda/8.0 cudnn/5.1 python/2.7.12 tensorflow-gpu/1.0.0
TACC Maverick queue

4. Tutorials
https://github.com/aymericdamien/TensorFlow-Examples
5. Now get into TensorFlow:
Note that if you use Python 3, run python3 to get into the interactive interpreter, then try import tensorflow as tf.
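A quick way to sanity-check the setup from the shell is a one-off script that reports whether the loaded module actually put TensorFlow on the Python path. This is just a sketch; on a node without the module loaded it prints a hint instead of a version:

```shell
# Check whether TensorFlow is importable; print its version when found,
# or a reminder to load the module when it is not.
python3 - <<'EOF'
import importlib.util
spec = importlib.util.find_spec("tensorflow")
if spec is None:
    print("tensorflow not found - did you module load tensorflow-gpu?")
else:
    import tensorflow as tf
    print("tensorflow", tf.__version__)
EOF
```

Running this right after the module load commands above tells you immediately whether the environment is usable, before you start a long job.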
Or we can use the visualization tool: go to https://vis.tacc.utexas.edu , log in, and code there with TensorFlow in Jupyter.
pip3 install --user module-name # to install your own module
Download a file from Google Drive:
!pip3 install --user googledrivedownloader
Run code:
!python3 main.py
The whole process:
- Log in with PuTTY
- Start in the $HOME directory and save your files there
3. Get on a compute node using idev and submit a job
$ idev -A PROJECT -q QUEUE -N 1 -n 1 -t 01:00:00
Interactive with VNC (good for a short task within one hour):
However, a better way is to use sbatch. First, copy a job.vnc file to the home directory:
cp /share/doc/slurm/job.vnc ./
Then vi job.vnc to change the configuration. The following is the configuration for my task, which is going to run on the GPU.
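The screenshot of my configuration is not reproduced here; a hypothetical set of #SBATCH directives for a single-node GPU run looks like the following (the job name, queue name, and time limit are assumptions to adjust for your own task):

```shell
#!/bin/bash
#SBATCH -J tf-train          # job name (assumption)
#SBATCH -o tf-train.o%j      # stdout file; %j expands to the job ID
#SBATCH -p gpu               # GPU queue on Maverick (assumption)
#SBATCH -N 1                 # one node
#SBATCH -n 1                 # one task
#SBATCH -t 12:00:00          # max run time, hh:mm:ss
#SBATCH -A PROJECT           # replace with your allocation name
```

The rest of job.vnc (the part that starts the VNC server) should be left as shipped; only these resource directives normally need editing.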

Start allocating the resources:
sbatch job.vnc
Offline (good for long jobs):
Here we need Slurm to help us run a program offline. TACC supplies a sample launcher script which we can modify to queue and execute our job. Here's how:
module load launcher
cp $TACC_LAUNCHER_DIR/launcher.slurm ./
vi launcher.slurm
Then add a commands file to execute the program
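The commands file is just plain text with one command per line; the launcher runs each line as an independent task. A small sketch that generates such a file for a hypothetical parameter sweep (the file name paramlist, the script name main.py, and the --learning_rate flag are all assumptions):

```shell
# Write one python3 invocation per learning rate; the launcher will
# execute each line of the resulting file as a separate task.
for lr in 0.01 0.001 0.0001; do
  echo "python3 main.py --learning_rate $lr"
done > paramlist
cat paramlist
# prints:
# python3 main.py --learning_rate 0.01
# python3 main.py --learning_rate 0.001
# python3 main.py --learning_rate 0.0001
```

Point launcher.slurm at this file and a single sbatch submission will fan the three runs out across the allocated node(s).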

Then submit the job:
sbatch launcher.slurm
More explanation is here.
Then use the following command to check the status:

Using the squeue command with the --start and -j options can provide an estimate of when a particular job will be scheduled:
login1$ squeue --start -j 1676354
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
1676534 normal hellow janeuser PD 2013-08-21T13:42:03 256 (Resources)
Even more extensive job information can be found using the scontrol command. The output shows quite a bit about the job: job dependencies, submission time, number of nodes, location of the job script and the working directory, etc. See the man page for more details.
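If you only want the estimated start time out of that output, a small pipe can pull out the START_TIME column. Sketched here against the sample lines above rather than a live squeue call, so it can be tried anywhere; on Maverick you would pipe squeue --start -j JOBID into the same awk:

```shell
# Simulate the squeue --start output and keep only field 6 (START_TIME)
# of the data row, skipping the header with NR > 1.
printf '%s\n' \
  'JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)' \
  '1676534 normal hellow janeuser PD 2013-08-21T13:42:03 256 (Resources)' \
  | awk 'NR > 1 { print $6 }'
# → 2013-08-21T13:42:03
```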
login1$ scontrol show job 1676354
JobId=1676991 Name=mpi-helloworld
UserId=slindsey(804387) GroupId=G-40300(40300)
Priority=1397 Account=TG-STA110012S QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:00:00 TimeLimit=15:30:00 TimeMin=N/A
SubmitTime=2013-09-11T15:12:49 EligibleTime=2013-09-11T15:12:49
StartTime=2013-09-11T17:40:00 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=login4:27520
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=256-256 NumCPUs=4096 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/home1/01158/slindsey/mpi/submit.slurm
WorkDir=/home1/01158/slindsey/mpi
The scancel command is used to remove pending and running jobs from the queue. Include a space-separated list of the job IDs you want to cancel on the command line:
login1$ scancel job_id1 job_id2 ...
Example job scripts are available online in /share/doc/slurm . They include details for launching large jobs, running multiple executables with different MPI stacks, executing hybrid applications, and other operations.
4. Copy the model and data; you may copy the test dir to other places
$ cp -r /work/00946/zzhang/maverick/test $WORK
5. Enter the dir
$ cd $WORK/test
6. Load the TensorFlow module
module load tensorflow-gpu
7. Run the program there inside screen; the data is saved to $WORK
rsync -avtr save $HOME # use this command to sync the save directory to home
8. We can use this command to check the submitted queue:
squeue -u liyin
Then we try to connect to one of the nodes in the working node list:
ssh c221-601

If we use top on each of the working nodes, it shows the same PID running. All the tasks submitted from the terminal run under the $WORK directory, and that is where the results are saved.
Thus, we can copy the results from $WORK to $HOME so that we can easily visualize the data with a Jupyter notebook from the visualization portal. It would be even nicer if the $WORK directory were visible to all users in the Jupyter notebook too.
We can run a program in the $WORK directory from a Jupyter notebook:
!python3 /work/04035/liyin/maverick/Face-Age/mainnodz.py
If we are running a deep neural network that takes hours or days to finish, we can just close the visualization portal website, but do not hit log out. The job will keep running until it is done.
scp
Data transfer from any Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server with:
localhost% scp filename \
username@stampede.tacc.utexas.edu:/path/to/project/directory
Tensorboard
- We can use an ssh tunnel from either a terminal or PuTTY. The following guidance is from the TACC consulting team: from a compute node I started the tensorboard server. My particular node name was c224-202. I then set up a reverse tunnel to one of the login nodes (login2 in my case) with this command:
ssh -f -g -N -R 6006:c224-202:6006 login2
Notice I used the same port number (6006) that tensorboard told me to connect to. Also note I used the compute node host name (c224-202) in that command. Your node will likely have a different name; use whatever node you land on.
I then pointed my browser to login2.maverick.tacc.utexas.edu:6006 and was able to connect.
2. An easier solution is to go to the TACC Visualization Portal and start a VNC session.
First, start tensorboard.

Then go to the application menu to choose browser:

And then navigate to the given address:


