
Deep Learning on the Supercomputer

Setting Up the Supercomputer

To get started on the supercomputer, follow the instructions at https://marylou.byu.edu/ to request an account. Once your account is set up, you can SSH in with

 ssh <username>@ssh.fsl.byu.edu 

Welcome to our new home directory. Using the supercomputer means we have to remember elementary school and be nice and share. This means we can only use software that is approved and stored in modules. For this class we want to use Python, TensorFlow, and CUDA for the GPU. We set up our environment by creating a file called .modules and telling it what we want to use.

#%Module

module load defaultenv
module add cuda
module add cudnn/4.0_gcc-4.4.7
module add tensorflow/0.9.0_python-2.7.11+cuda

UPDATE: apparently, the following module file works better:

#%Module

module load defaultenv
module load cuda/8.0
module load cudnn/5.1_cuda-8.0
module load python/2.7
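
Once the .modules file is in place, the environment should load automatically on your next login. A quick sanity check might look like this (just a sketch; exact output will vary, and nvcc is only on the PATH if the cuda module provides it):

# list the currently loaded modules
module list

# confirm the Python and CUDA tools are on the PATH
which python
nvcc --version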

The lab grants most of the storage quota to the compute directory (~/compute), so from now on we will make sure to put all data and code in there.
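
For example, a minimal project layout might look like this (the directory name is just a placeholder):

# work out of the compute directory, which has the large storage quota
cd ~/compute
mkdir deep_learning
cd deep_learning
# keep code and data together here, e.g.
# cp ~/hello.py .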

Running Programs

Now that our environment is set up, we can run a Python script. This is done by submitting a job containing whatever deep learning magic you want to run. Jobs are submitted with a SLURM script, which is just a bash script that tells the resource manager how much memory you need, how many CPUs, how much time, and so on.

So, time for a baby example using the TensorFlow program hello.py:

import tensorflow as tf

# build a graph containing a single string constant
hello = tf.constant('Hello')

# run the graph in a session and print the result
sess = tf.Session()
print sess.run(hello)

Now, to create our SLURM script we can use the GUI at https://marylou.byu.edu/documentation/slurm/script-generator. This will give us a file, which I will name slurm.sh, that looks like this:

#!/bin/bash

#SBATCH --time=01:00:00   # walltime - this is one hour
#SBATCH --ntasks=1   # number of processor cores (i.e. tasks)
#SBATCH --nodes=1   # number of nodes
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=4096M   # memory per CPU core

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
python hello.py

Simple enough, right? It is also important to be honest about how much memory and time we expect to need: if we ask for a lot, the job gets lower priority and we have to wait longer for it to start.

To submit your job, use the sbatch command, as in sbatch ./slurm.sh.
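
A typical session might look like this (the job ID below is made up):

$ sbatch ./slurm.sh
Submitted batch job 1234567
$ squeue -u <username>       # check whether the job is pending or running
$ cat slurm-1234567.out      # by default SLURM writes your program's output here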

Pro Tips

  • Make sure your TensorFlow code actually uses the GPU (see the sketch after this list)
  • To see the status of all your jobs, it is helpful to make an alias for the command `watch squeue -u <username> --Format=jobid,numcpus,state,timeused,timeleft`
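
One way to check the GPU point is TensorFlow's log_device_placement option, which logs the device each op runs on. A minimal sketch using the old Session API from above (the matrices are just placeholders to force a real computation):

import tensorflow as tf

# two small constant matrices so there is a real op to place
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
c = tf.matmul(a, b)

# log_device_placement=True prints which device each op was placed on
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(c)  # the log should mention gpu:0 if the GPU is being used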