====== Deep Learning on the Supercomputer ======

==== Setting Up the Supercomputer ====

To get started on the supercomputer you need to follow the instructions and get an account from [[https://marylou.byu.edu/]]. Once you have this set up, you can SSH in.
Once connected, load the TensorFlow module:

<code>
module add tensorflow/0.9.0_python-2.7.11+cuda
</code>

**UPDATE: apparently, the following module file works better:**

<code>
#%Module

module load defaultenv
module load cuda/8.0
module load cudnn/5.1_cuda-8.0
module load python/2.7
</code>
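If you want to load that whole set with a single command, one option is to save it as a personal module file. A minimal sketch, assuming a personal module directory — the directory name ''~/modulefiles'' and the module name ''deeplearning'' are my own choices, not anything the cluster requires:

```shell
# Sketch: save the module file above as a personal module.
# The directory and module name are arbitrary (assumptions, not
# cluster conventions).
MODDIR="${MODDIR:-$HOME/modulefiles}"
mkdir -p "$MODDIR"

cat > "$MODDIR/deeplearning" <<'EOF'
#%Module

module load defaultenv
module load cuda/8.0
module load cudnn/5.1_cuda-8.0
module load python/2.7
EOF

# On the supercomputer you would then run:
#   module use ~/modulefiles
#   module load deeplearning
echo "wrote $MODDIR/deeplearning"
```

The ''module use'' / ''module load'' lines only make sense on the cluster itself, which is why they are left as comments here.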
The computer lab allocates the most storage to the **compute** directory, so from now on we will keep all data and code in there.
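A minimal sketch of setting up a project under the compute directory — I am assuming it lives at ''~/compute'', and the project layout (''myproject'' with ''code'' and ''data'' subdirectories) is just an example:

```shell
# Sketch: keep code and data on the big compute filesystem.
# "~/compute" and the "myproject" layout are assumptions for
# illustration, not requirements.
COMPUTE="${COMPUTE:-$HOME/compute}"
mkdir -p "$COMPUTE/myproject/code" "$COMPUTE/myproject/data"

# Work from there so job scripts and datasets stay off the small
# home quota.
cd "$COMPUTE/myproject"
pwd
```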
  
==== Running Programs ====

Now that our environment is set up we can run a Python script. This is done by submitting a job with whatever deep learning magic you want to run. The mechanism for submitting it is a SLURM script: just a bash script that tells the resource manager how much memory, how many CPU cores, how much time, and so on you need.

So, time for a small example using the TensorFlow program ''hello.py'':

<code>
import tensorflow as tf

hello = tf.constant('Hello')

sess = tf.Session()
print sess.run(hello)
</code>

Now to create our SLURM script we can use the GUI at [[https://marylou.byu.edu/documentation/slurm/script-generator]]. This will give us a file, which I will name ''slurm.sh'', that looks like this:

<code>
#!/bin/bash

#SBATCH --time=01:00:00   # walltime - this is one hour
#SBATCH --ntasks=1   # number of processor cores (i.e. tasks)
#SBATCH --nodes=1   # number of nodes
#SBATCH --gres=gpu:1   # number of GPUs
#SBATCH --mem-per-cpu=4096M   # memory per CPU core

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
python hello.py
</code>

Simple enough, right? It is important to request only as much memory and time as we actually expect to use: if we ask for a lot, our job gets lower priority and we have to wait longer for it to start.

To submit your job, use the ''sbatch'' command, as in ''sbatch ./slurm.sh''.
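''sbatch'' prints a line like ''Submitted batch job 12345''; a small sketch of capturing that job ID so you can check on the job later. The ''sbatch'' output is simulated here so the snippet runs anywhere:

```shell
# Sketch: grab the job ID that sbatch prints.
# On the cluster you would use the real command:
#   OUT=$(sbatch ./slurm.sh)
# Here the output is simulated with a hard-coded example line.
OUT="Submitted batch job 12345"

# The job ID is the last word of the line.
JOBID=${OUT##* }
echo "$JOBID"

# On the cluster you could then run, e.g.:
#   squeue -j "$JOBID"
#   scancel "$JOBID"
```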

==== Pro Tips ====
  * Make sure your TensorFlow code actually uses the GPU.
  * To see the status of all your jobs, it is helpful to make an alias for ''watch squeue -u <username> --Format=jobid,numcpus,state,timeused,timeleft''.
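That alias could go in your ''~/.bashrc''; a sketch, where the alias name ''myjobs'' is my own pick and the snippet writes to a copy of the file so it is safe to run anywhere:

```shell
# Sketch: add the squeue watcher from the tip above as an alias.
# The alias name "myjobs" is arbitrary; writing to ~/.bashrc.example
# instead of ~/.bashrc so running this is harmless.
BASHRC="${BASHRC:-$HOME/.bashrc.example}"

# Quoted heredoc keeps $USER literal so it expands when the alias runs.
cat >> "$BASHRC" <<'EOF'
alias myjobs='watch squeue -u $USER --Format=jobid,numcpus,state,timeused,timeleft'
EOF

grep myjobs "$BASHRC"
```

After appending the line to your real ''~/.bashrc'' (and re-sourcing it), ''myjobs'' refreshes your job list every couple of seconds.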
supercomputer.1503518432.txt.gz · Last modified: 2021/06/30 23:40 (external edit)