HPG Scheduling
Back to Getting Started
You may want to run multiple jobs or the same job repeatedly. SLURM lets you schedule these jobs and allocate resources to them while staying compliant with HiPerGator's use policies.
Using SLURM
UFRC uses the Simple Linux Utility for Resource Management, or SLURM, to allocate resources and schedule jobs. Users can create SLURM job scripts to submit jobs to the system. These scripts should be modified to control several aspects of your job, such as resource allocation, email notifications, and the output destination.
Submitting a SLURM Job
- Start interactive sessions. See Development and Testing and Open OnDemand.
- Submit SLURM job scripts. See the Annotated SLURM Script for a walk-through of the basic components of a SLURM job script. For a list of sample Slurm scripts, please see Sample SLURM scripts.
To submit a job script from one of the login nodes accessed via hpg.rc.ufl.edu, use the following command:

$ sbatch <your_job_script.sh>

To check the status of submitted jobs, use the following command:

$ squeue -u <username>
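For orientation, a minimal job script might look like the sketch below. The job name, email address, resource amounts, module, and program are placeholder values for illustration only; see the Annotated SLURM Script and Sample SLURM Scripts pages for authoritative templates.

#!/bin/bash
#SBATCH --job-name=my_job              # name shown in squeue output
#SBATCH --mail-type=END,FAIL           # email notifications (optional)
#SBATCH --mail-user=<your_email>       # address for notifications
#SBATCH --ntasks=1                     # number of tasks
#SBATCH --cpus-per-task=1              # cores per task
#SBATCH --mem=4gb                      # total memory for the job
#SBATCH --time=01:00:00                # wall-time limit (HH:MM:SS)
#SBATCH --output=my_job_%j.log         # standard output and error log (%j expands to the job ID)

module load python                     # load whatever software your job needs
python my_script.py                    # run your program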
Managing Cores and Memory
See Account and QOS limits under SLURM for the main documentation on efficient management of computational resources, and an extensive explanation of QOS and SLURM account use.
The amount of resources within an investment is calculated in NCU (Normalized Computing Units), which correspond to 1 CPU core and about 8 GB of memory for each NCU purchased. CPUs (cores) and RAM are allocated to jobs independently as requested by your job script.
Your group's investment can run out of cores (SLURM may show QOSGrpCpuLimit as the reason a job is pending) or memory (SLURM may show QOSGrpMemLimit as the reason a job is pending), depending on current use by running jobs.
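As a purely illustrative example with hypothetical numbers: a group with a 50-NCU investment has roughly 50 cores and about 400 GB of memory available to its jobs. Four running jobs that each request 4 cores and 100 GB of RAM would occupy only 16 cores but essentially all of the group's memory, so additional jobs would pend with QOSGrpMemLimit even though most of the group's cores are idle.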
The majority of HiPerGator nodes share roughly the same RAM-to-core ratio; after accounting for the operating system and system services, about 8 GB of memory per core is left usable for jobs, hence the ratio of 1 core and 8 GB of RAM per NCU.
Most HiPerGator nodes have 128 CPU cores and 1000 GB of RAM, while the bigmem nodes offer up to 4 TB of memory. See Available_Node_Features for exact specifications of every node type on HiPerGator.
You must specify both the number of cores and the amount of RAM needed in the job script, using the --mem (total job memory) or --mem-per-cpu (per-core memory) option for memory; otherwise, the job will be assigned the default of 4 GB of memory.
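For example, the two request styles sketched below ask for the same total amount of memory for a single 4-core task; the specific numbers are for illustration only:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8gb              # 8 GB total for the whole job

or, equivalently,

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2gb      # 2 GB per core x 4 cores = 8 GB total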
Monitoring Your Workloads
You can see currently running workloads with the squeue command, e.g.
$ squeuemine
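A couple of other commonly useful squeue invocations (standard SLURM options, not UFRC-specific tools; shown here as a sketch) are:

$ squeue -u <username> -t PENDING    # show only your pending jobs
$ squeue -u <username> --start       # show estimated start times for pending jobs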
Open OnDemand also lets you monitor jobs from the Jobs menu in the upper toolbar of your dashboard: select Jobs -> Active Jobs to see your current running, pending, and recently completed jobs.
We provide a number of helpful commands in the UFRC module. The ufrc module is loaded by default at login, but you can also load it manually with the following command:
$ module load ufrc
Examples of commands from the HiPerGator-specific UFRC environment module:
$ slurmInfo           - displays resource usage for your group and overall cluster utilization
$ slurmInfo -p        - displays resource usage per partition
$ showQos             - displays your available QoS
$ home_quota          - displays your /home quota
$ blue_quota          - displays your /blue quota
$ orange_quota        - displays your /orange quota
$ sacct               - displays job id and state of your recent workloads
$ nodeInfo            - displays partitions by node types, showing total RAM and other features
$ sinfo -p partition  - displays the status of nodes in a partition
$ jobhtop             - displays resource usage for jobs
$ jobnvtop            - displays resource usage for GPU jobs
$ which python        - displays path to the Python install of the environment modules you have loaded
See also SLURM Commands.
You can also see a graphical representation of total cluster use via Grafana: https://help.rc.ufl.edu/doc/HiPerGator_Metrics