GPU Access

Researchers may purchase access to GPUs in the form of Normalized Graphics Processor Units (NGUs), which include all of the infrastructure (memory, network, rack space, cooling) necessary for the GPUs to do useful work.

Research Computing has a significant investment in GPU-enabled servers:

HiPerGator 1.0:
GPU         | Quantity | Host Quantity | Host Architecture | Host Memory | Host Interconnect | Host Features | Notes
Tesla M2090 | 45       | 22            | AMD Opteron 6220  | 32 GB       | QDR IB            | m2090         | -

HiPerGator 2.0:
GPU         | Quantity | Host Quantity | Host Architecture | Host Memory | Host Interconnect | Host Attributes | Notes
Tesla K80   | 32       | 8             | Intel E5-2683     | 128 GB      | FDR IB            | k80             | -

Requesting Access (SLURM)

To request interactive access to a GPU under SLURM, use commands similar to the following.

To request access to one GPU (of any type) for a default 10-minute session:
srun -p hpg1-gpu --gres=gpu:1 --pty -u bash -i
OR:
srun -p hpg2-gpu --gres=gpu:1 --pty -u bash -i
To request access to two Tesla GPUs on a single node for a 1-hour session:
srun -p hpg1-gpu --gres=gpu:tesla:2 --time=01:00:00  --pty -u bash -i
OR:
srun -p hpg2-gpu --gres=gpu:tesla:2 --time=01:00:00  --pty -u bash -i

If no GPUs are currently available, your request will be queued and your session will start as soon as the next GPU becomes free. Alternatively, you may cancel the queued request and try again later. If you requested more time than you need, please end your session as soon as you are done so that the GPU becomes available to other users.
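
To check whether a queued request is still pending, and to release a GPU you no longer need, the standard SLURM tools are sufficient; a minimal sketch (the job ID reported by squeue is the one passed to scancel):

squeue -u $USER     # list your pending and running jobs
exit                # end an interactive session and release its GPU
scancel <jobid>     # cancel a queued request you no longer need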

Batch Jobs

For batch jobs, use lines similar to those that follow in your submission script.

In this example, the user's job will be allocated two Tesla GPUs on a single server (--nodes defaults to "1"):
#SBATCH --partition=hpg1-gpu
#SBATCH --gres=gpu:tesla:2

OR:

#SBATCH --partition=hpg2-gpu
#SBATCH --gres=gpu:tesla:2
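
The Host Features/Attributes tags shown in the tables above (m2090, k80) can be combined with SLURM's --constraint option to target a specific GPU model, assuming those tags are exposed as node features on the GPU partitions. The script below is a minimal sketch only (the job name and file name are arbitrary, and your own account and QOS values must be filled in); it simply reports the allocated GPUs with nvidia-smi, which is assumed to be on the default PATH of the GPU nodes.

#!/bin/bash
#SBATCH --job-name=gpu-check
#SBATCH --output=gpu-check.log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --account=<your_account>
#SBATCH --qos=<your_qos>
#SBATCH --partition=hpg2-gpu
#SBATCH --gres=gpu:tesla:1
#SBATCH --constraint=k80

# Report the GPUs visible to this job
nvidia-smi

Submit the script with sbatch, for example:

sbatch gpu-check.sh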

Examples

gpuMemTest

The following sample job script requests two Tesla GPUs under SLURM and runs a GPU memory test on each allocated device:

#!/bin/bash
#SBATCH --job-name=gpuMemTest
#SBATCH --output=gpuMemTest.out
#SBATCH --error=gpuMemTest.err
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --distribution=cyclic:cyclic
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=some_user@some_domain.com
#SBATCH --account=your_account
#SBATCH --qos=your_qos
#SBATCH --partition=hpg2-gpu
#SBATCH --gres=gpu:tesla:2
 
module load cuda/8.0
 
cudaMemTest=/home/taylor/Cuda/cudaMemTest/cuda_memtest
 
# Convert the comma-separated CUDA_VISIBLE_DEVICES list into a space-separated list
cudaDevs=$(echo $CUDA_VISIBLE_DEVICES | sed -e 's/,/ /g')

# Launch one memory-test pass on each allocated GPU; the tests run in parallel
for cudaDev in $cudaDevs
do
  echo cudaDev = $cudaDev
  #srun --gres=gpu:tesla:1 -n 1 --exclusive ./gpuMemTest.sh > gpuMemTest.out.$cudaDev 2>&1 &
  $cudaMemTest --num_passes 1 --device $cudaDev > gpuMemTest.out.$cudaDev 2>&1 &
done
wait
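
Assuming the script above has been saved as gpuMemTest.sh (any file name will do), it can be submitted and its per-GPU output files inspected as follows:

sbatch gpuMemTest.sh
ls gpuMemTest.out.*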

LAMMPS

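The following sample job script runs the GPU-accelerated LAMMPS rhodo benchmark with 28 MPI ranks and two Tesla GPUs on a single node, enabling the LAMMPS GPU package via the -sf gpu and -pk gpu 2 command-line switches:
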
#!/bin/bash
#SBATCH --job-name=rhodo-gpu
#SBATCH --output=rhodo-gpu.log
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=1
#SBATCH --distribution=cyclic:cyclic
#SBATCH --extra-node-info=2:14:1
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=3600
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<some_user>@<some_domain>
#SBATCH --account=<your_account>
#SBATCH --qos=<your_qos>
#SBATCH --partition=hpg2-gpu
#SBATCH --gres=gpu:tesla:2
#---------------------------------------------
module load intel/2016.0.109
module load openmpi/1.10.2
module load cuda/8.0
module load lammps/30Jul16
#---------------------------------------------
job=rhodo
ompNumThreads=1
coresPerNode=28
socketsPerNode=2
ranksPerNode=$((coresPerNode/ompNumThreads))
ranksPerSocket=$((coresPerNode/socketsPerNode/ompNumThreads))
#
runDir=$(pwd)/latest/${job}-gpu-${SLURM_NTASKS}
echo runDir = $runDir
 
if [ ! -d $runDir ]; then
  mkdir -p $runDir
fi
 
cp in.$job $runDir
cp in.$job.scaled $runDir
cp data/data.$job $runDir
#---------------------------------------------
cd $runDir
export LAMMPS=lmp_ufhpc
export OMP_NUM_THREADS=$ompNumThreads
export MKL_NUM_THREADS=$ompNumThreads
#---------------------------------------------
echo "runDir                     = " $runDir
echo "XHPL                       = " $XHPL 
echo "OMP_NUM_THREADS            = " $OMP_NUM_THREADS
echo "MKL_NUM_THREADS            = " $MKL_NUM_THREADS
#
echo "------------------------------------"
mpiexec -np $SLURM_NTASKS \
        $LAMMPS -sf gpu -pk gpu 2 -var x 2 -var y 4 -var z 4 -in in.$job.scaled
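
As with the previous example, the script is submitted with sbatch; assuming it has been saved as rhodo-gpu.sh, the run can be queued and its progress followed in the log file specified by --output:

sbatch rhodo-gpu.sh
tail -f rhodo-gpu.log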