Difference between revisions of "GPU Access"
Revision as of 14:25, 30 August 2021
Normalized Graphics Processor Units (NGUs) include all of the infrastructure (memory, network, rack space, cooling) necessary for GPU-accelerated computation. Each NGU is currently equivalent to 1 GPU; however, newer GPUs such as the A100s may require more than 1 NGU to access in the future.
Researchers can add NGUs to their allocations by filling out the Purchase Form or requesting a Trial Allocation.
GPU-enabled Services
Types of GPUs are listed below. Two partitions contain GPUs - the hwgui partition for visualization and the gpu partition for general computation.
Hardware Accelerated GUI
GPUs in these servers are used to accelerate rendering for graphical applications. These servers are in the SLURM "hwgui" partition. Refer to the Hardware Accelerated GUI Sessions page for more information on available resources and usage.
GPU Assisted Computation
A number of high performance applications installed on HiPerGator implement GPU-accelerated computing functions via CUDA to achieve significant speed-up over CPU calculations. These servers are in the SLURM "gpu" partition (--partition=gpu).
Hardware Specifications for the GPU Partition
We have the following types of NVIDIA GPU nodes available in the "gpu" partition:
- NVIDIA GeForce GTX 1080Ti, with 2 GPUs per node. See technical specifications for reference.
- NVIDIA GeForce RTX 2080Ti, with 8 GPUs per node. See technical specifications for reference.
- NVIDIA Quadro RTX 6000, with 8 GPUs per node. These GPUs have SLI bridging. See technical specifications for reference.
- AI NVIDIA DGX A100 SuperPod, with 8 GPUs per node. These GPUs have NVSwitch interconnects. See technical specifications for reference.
For a list of additional node features, see the Available Node Features page.
To select a specific type of GPU within a partition, use either a SLURM constraint (e.g. --constraint=rtx6000) or a GPU request that names the needed type (e.g. --gres=gpu:a100:1 or --gpus=a100:1). See more examples below.
Compiling CUDA Enabled Programs
The most direct way to develop a custom GPU-accelerated algorithm is with CUDA programming; please refer to the Nvidia CUDA Toolkit page. The current CUDA environment is cuda/11. C++ or Python packages such as Numba and PyCUDA are other ways to program GPU algorithms.
Slurm and GPU Use
Policy
- After purchase, NGU allocations are included in your group's resources (quality of service).
- To increase the availability of GPU resources, the time limit for the gpu partition is 7 days (at most #SBATCH --time=7-00:00:00). If you have a workload requiring more time, please create a help request.
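As a rough illustration of the 7-day limit, a requested time can be checked before a job is submitted. The sketch below is not a site-provided tool: the function names `slurm_time_to_seconds` and `within_gpu_limit` are hypothetical, and only the `D-HH:MM:SS` and `HH:MM:SS` time forms are handled.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): convert a time string to seconds.
# Only the D-HH:MM:SS and HH:MM:SS forms are handled in this sketch.
slurm_time_to_seconds() {
    local spec="$1" days=0 h m s
    if [[ "$spec" == *-* ]]; then
        days="${spec%%-*}"   # text before the first dash is the day count
        spec="${spec#*-}"    # the remainder is HH:MM:SS
    fi
    IFS=: read -r h m s <<< "$spec"
    # The 10# prefix forces base 10 so that "08" or "09" is not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# The 7-day gpu partition limit stated in the policy above, in seconds.
GPU_PARTITION_LIMIT=$(( 7 * 86400 ))

# Succeeds (exit status 0) when the requested time fits within the limit.
within_gpu_limit() {
    [ "$(slurm_time_to_seconds "$1")" -le "$GPU_PARTITION_LIMIT" ]
}
```

For example, `within_gpu_limit 7-00:00:00` succeeds, while `within_gpu_limit 8-00:00:00` fails, which would be the case for creating a help request.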
Interactive Access
In order to request interactive command line access to a GPU under SLURM, use commands similar to these:
- To request access to one GPU (of any type) for a default 1 hour session:
srun -p gpu --gpus=1 --pty -u bash -i
- To request access to two A100 GPUs on a single node for a 3-hour session with 300gb RAM:
srun -p gpu --nodes=1 --gpus=a100:2 --time=03:00:00 --mem=300gb --pty -u bash -i
- To request access to two GeForce GPUs with multiple CPUs:
srun -p gpu --nodes=1 --gpus=geforce:2 --time=01:00:00 --ntasks=1 --cpus-per-task=8 --mem 300gb --pty -u bash -i
Interactive sessions are limited to 12 hours.
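Once an interactive session starts, it can be useful to confirm how many GPUs were actually allocated. The sketch below assumes that SLURM exports the allocated devices in the `CUDA_VISIBLE_DEVICES` environment variable, which is typical for GPU allocations but depends on the cluster configuration; the function name `count_visible_gpus` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the GPU IDs listed in CUDA_VISIBLE_DEVICES.
# Assumes the variable holds a comma-separated list such as "0,1".
count_visible_gpus() {
    local list="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$list" ]; then
        echo 0          # variable unset or empty: no GPUs visible
        return
    fi
    local ids
    IFS=, read -ra ids <<< "$list"   # split the list on commas
    echo "${#ids[@]}"
}
```

Inside a session requested with `--gpus=a100:2`, calling `count_visible_gpus` would be expected to print 2 under that assumption. Running `nvidia-smi` in the session gives a more detailed view of the devices.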
Open On Demand Access
To access GPUs using Open-On-Demand, please check the form for your application. If your application supports multiple GPU types, choose the GPU partition and specify the number and type of GPUs:
- To request access to one GPU (of any type), use this gres string:
gpu:1
- To request multiple GPUs (of any type), use this gres string, where n is the number of GPUs you need:
gpu:n
- To request a specific type of GPU, use this gres string (requesting geforce GPUs in this example):
gpu:geforce:1
- To request an A100 GPU, use this gres string:
gpu:a100:1
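The gres strings above all follow one pattern: `gpu`, then an optional GPU type, then the count. A minimal sketch of that pattern is shown here; the function name `gres_string` is hypothetical and is not part of SLURM or Open OnDemand.

```shell
#!/bin/bash
# Hypothetical helper: build a gres string from a GPU count and optional type.
# Usage: gres_string COUNT [TYPE]
gres_string() {
    local count="$1" type="${2:-}"
    # Reject anything that is not a positive integer count.
    if ! [[ "$count" =~ ^[1-9][0-9]*$ ]]; then
        echo "count must be a positive integer" >&2
        return 1
    fi
    if [ -n "$type" ]; then
        echo "gpu:${type}:${count}"   # specific GPU type
    else
        echo "gpu:${count}"           # any GPU type
    fi
}
```

With this sketch, `gres_string 1` prints `gpu:1` and `gres_string 1 a100` prints `gpu:a100:1`, matching the strings listed above.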
Batch Jobs
For batch jobs, to request GPU resources, use lines similar to the following:
- In this example, two A100 GPUs on a single server (--nodes defaults to "1") will be allocated to the job:
#SBATCH --partition=gpu
#SBATCH --gpus=a100:2
- In this example, two GeForce GPUs on a single server (--nodes defaults to "1") will be allocated to the job:
#SBATCH --partition=gpu
#SBATCH --gpus=geforce:2
Alternatively, use the '--gres=gpu:1' or '--gres=gpu:geforce:1' format. Note that if the '--gpus=' format is used, SLURM will not provide the data on GPU usage to slurmInfo, and those GPUs will not be shown in slurmInfo output.
If no GPUs are available, your request will be queued and your connection will be established once the next GPU becomes available. Alternatively, you may cancel your job and try again with fewer requested resources. If you have requested more time than you need, please be sure to end your session so that the GPU becomes available for other users.
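For existing job scripts written with `--gpus=`, the directive can be rewritten into the `--gres=gpu:` form mechanically. The sketch below is one possible way to do that with `sed`; the function name `gpus_to_gres` is hypothetical. It assumes single-node jobs, because `--gpus` counts GPUs for the whole job while `--gres` counts GPUs per node, so the two are only interchangeable when the job runs on one node.

```shell
#!/bin/bash
# Hypothetical filter: rewrite "#SBATCH --gpus=..." directives read from
# standard input into the "--gres=gpu:..." form. All other lines pass
# through unchanged. Intended for single-node jobs only (see note above).
gpus_to_gres() {
    sed -E 's/^(#SBATCH[[:space:]]+)--gpus=/\1--gres=gpu:/'
}
```

A possible use is `gpus_to_gres < job.sh > job_gres.sh`, where both file names are placeholders; a line such as `#SBATCH --gpus=a100:2` would become `#SBATCH --gres=gpu:a100:2`.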
SLURM Options for A100 GPUs
To use A100 GPUs for interactive sessions or batch jobs, please use one of the following SLURM parameters:
--partition=gpu --gpus=a100:2
Job Script Examples
MPI Parallel
This is a sample script for an MPI parallel VASP job requesting and using GPUs under SLURM:
#!/bin/bash
#SBATCH --job-name=vasptest
#SBATCH --output=vasp.out
#SBATCH --error=vasp.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ufl.edu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks-per-socket=4
#SBATCH --mem-per-cpu=7000mb
#SBATCH --distribution=cyclic:cyclic
#SBATCH --partition=gpu
#SBATCH --gres=gpu:geforce:4
#SBATCH --time=00:30:00

echo "Date = $(date)"
echo "host = $(hostname -s)"
echo "Directory = $(pwd)"

module purge
module load cuda/10.0.130 intel/2018 openmpi/4.0.0 vasp/5.4.4

T1=$(date +%s)
srun --mpi=pmix_v3 vasp_gpu
T2=$(date +%s)

ELAPSED=$((T2 - T1))
echo "Elapsed Time = $ELAPSED"