====== Deep Learning on the Supercomputer ======
==== Setting Up the Supercomputer ====

To get started on the supercomputer you need to follow the instructions and get an account from [[https://marylou.byu.edu/]]. Once that is set up, you can SSH in with your new account credentials.
The computing lab grants most of your storage quota to the **compute** directory, so from now on we will make sure to put all data and code in there.
==== Running Programs ====

Now that our environment is set up we can run a Python script. This is done by submitting a job with whatever deep learning magic you want to run. Jobs are submitted with a SLURM script: a bash script that tells the resource manager how much memory you need, how many CPUs, how much time, and so on.

So, time for a small example using the TensorFlow program hello.py:

<code>
import tensorflow as tf

hello = tf.constant('Hello')

sess = tf.Session()
print(sess.run(hello))
</code>
+ | |||
+ | Now to create our slurm script we can use the GUI at [[https://marylou.byu.edu/documentation/slurm/script-generator]]. This will give us a file that I will name slurm.sh that looks like this | ||
+ | |||

<code>
#!/bin/bash

#SBATCH --time=01:00:00       # walltime
#SBATCH --ntasks=1            # number of processor cores (i.e. tasks)
#SBATCH --nodes=1             # number of nodes
#SBATCH --gres=gpu:1
#SBATCH --mem-per-cpu=4096M   # memory per CPU core

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
python hello.py
</code>
+ | |||
+ | Simple enough, right? Also it is important to make sure we tell it how much memory and time we expect. If we give it a lot we will have less priority and have to weight longer for the job to start. | ||
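If you find yourself tweaking these fields for every job, you can also build the script programmatically. A minimal sketch (the ''write_slurm_script'' helper is hypothetical, not part of any official tooling; the fields mirror the slurm.sh example above):

```python
# Hypothetical helper: builds a SLURM batch script as a string.
# Defaults match the generated slurm.sh example above.
def write_slurm_script(command, walltime="01:00:00", ntasks=1,
                       nodes=1, gpus=1, mem_per_cpu="4096M"):
    lines = [
        "#!/bin/bash",
        "",
        f"#SBATCH --time={walltime}       # walltime",
        f"#SBATCH --ntasks={ntasks}       # number of processor cores",
        f"#SBATCH --nodes={nodes}         # number of nodes",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --mem-per-cpu={mem_per_cpu}  # memory per CPU core",
        "",
        command,  # the program to run inside the job
    ]
    return "\n".join(lines) + "\n"

# Write the script to disk, ready to hand to the scheduler
with open("slurm.sh", "w") as f:
    f.write(write_slurm_script("python hello.py"))
```

This makes it easy to submit the same program with different time or memory requests without editing the file by hand.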
+ | |||
+ | Now we just do execute ./slurm.sh to run it. | ||
+ | |||
==== Pro Tips ====
  * Make sure your TensorFlow code actually uses the GPU
  * To see the status of all your jobs, it's helpful to make an alias for ''watch squeue -u <username> --Format=jobid,numcpus,state,timeused,timeleft''