Account and QOS limits under SLURM
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Latest revision as of 17:37, 26 July 2024
Every group on HiPerGator (HPG) must have an investment with a corresponding hardware allocation to be able to do any work on HPG. Each allocation is associated with a scheduler account. Each account has two quality of service (QOS) levels: a high-priority investment QOS and a low-priority burst QOS. The latter allows short-term borrowing of unused resources from other groups' accounts. In turn, each user in a group has a scheduler account association. In the end, it is this association which determines which QOSes are available to a particular user. Users with secondary Linux group membership will have associations with QOSes from their secondary groups.
In summary, each HPG user has scheduler associations with group-account-based QOSes that determine what resources are available to the user's jobs. These QOSes can be thought of as pools of computational (CPU cores), memory (RAM), and maximum run time (time limit) resources with associated starting priority levels, which jobs consume to run applications according to the QOS levels reviewed below.
Account and QOS
Using/Finding the resources from a secondary group
For instructions on using SLURM resources from one of your secondary groups, or to find out what those associations are, see Checking and Using Secondary Resources.
CPU cores and Memory (RAM) Resource Use
CPU cores and RAM are allocated to jobs independently, as requested in job scripts. When deciding how many CPU cores and how much memory to request for a job, take into account the QOS limits based on the group investment, the limitations of the hardware (compute nodes), and the need to be a good neighbor on a shared resource like HiPerGator, so that system resources are allocated efficiently and used fairly, and everyone has a chance to get their work done without negatively affecting the work of other researchers.
HiPerGator consists of many interconnected servers (compute nodes). The hardware resources of each compute node, including CPU cores, memory, memory bandwidth, network bandwidth, and local storage, are limited. If any single one of these resources is fully consumed, the remaining unused resources can become effectively wasted, which makes it progressively harder or even impossible to achieve the shared goals of Research Computing and the UF Researcher Community stated above. See Available Node Features for details on the hardware on compute nodes. Nodes with similar hardware are generally separated into partitions. If the job requires larger nodes or particular hardware, make sure to explicitly specify a partition. Example:
--partition=bigmem
When a job is submitted without a resource request, the scheduler applies the default limits of 1 CPU core, 4 GB of memory, and a 10-minute time limit. Check the resource request if it's not clear why the job ended before the analysis was done. A premature exit can be caused by the job exceeding the time limit or by the application using more memory than was requested.
Run test jobs to find out what resources a particular analysis needs. To make sure that the analysis is performed successfully without wasting valuable resources, you must specify both the number of CPU cores and the amount of memory needed for the analysis in the job script. See Sample SLURM Scripts for examples of specifying CPU core requests depending on the nature of the application running in a job. Use the --mem (total job memory on a node) or --mem-per-cpu (per-core memory) option to request memory. Use --time to set a time limit to an appropriate value within the QOS limit.
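As an illustration of the options above, here is a minimal job script sketch. The job name and the specific values (4 cores, 8gb, 2 hours) are placeholders chosen for this example, not recommendations; the right numbers come from test runs of the actual analysis. The `report_allocation` helper is hypothetical and only echoes what the scheduler exports to the job.

```shell
#!/bin/bash
#SBATCH --job-name=test_job        # hypothetical job name
#SBATCH --ntasks=1                 # one task
#SBATCH --cpus-per-task=4          # CPU cores for that task (example value)
#SBATCH --mem=8gb                  # total job memory on the node (example value)
#SBATCH --time=02:00:00            # time limit as hours:minutes:seconds (example value)

# Print what the scheduler granted. Outside of a SLURM job these
# variables are not set, so the fallback text "unset" is printed instead.
report_allocation() {
    echo "cores: ${SLURM_CPUS_PER_TASK:-unset}"
    echo "memory per node (MB): ${SLURM_MEM_PER_NODE:-unset}"
}

report_allocation
```

Logging the granted resources at the top of a job makes it easier to compare the request against actual usage afterwards.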
As jobs are submitted and the resources under a particular account are consumed, the group may reach either the CPU or the memory group limit. The group has consumed all cores in a QOS if the scheduler shows QOSGrpCpuLimit, or all memory if the scheduler shows QOSGrpMemLimit, as the reason a job is pending (the NODELIST(REASON) column of the squeue command output). Example:
JOBID   PARTITION  NAME      USER  ST  TIME  NODES  NODELIST(REASON)
123456  bigmem     test_job  jdoe  PD  0:00  1      (QOSGrpMemLimit)
Reaching a resource limit of a QOS does not interfere with job submission. However, the jobs with this reason will not run and will remain in the pending state until the QOS use falls below the limit.
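To pick out only the jobs held back by a group QOS limit, the squeue output can be filtered on its last column. The sketch below assumes the default squeue column layout shown above (header line first, reason last); the function name `qos_limited_jobs` and the sample rows are made up for illustration.

```shell
# Print "JOBID (REASON)" for each pending job whose reason is a group QOS limit.
# Reads squeue-style text on stdin and skips the header line.
qos_limited_jobs() {
    awk 'NR > 1 && $NF ~ /^\(QOSGrp(Cpu|Mem)Limit\)$/ { print $1, $NF }'
}

# Sample input standing in for real squeue output.
qos_limited_jobs <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 bigmem test_job jdoe PD 0:00 1 (QOSGrpMemLimit)
123457 bigmem test_job jdoe PD 0:00 1 (Priority)
EOF
```

On the cluster the same filter could be fed live data, for example `squeue -u "$USER" | qos_limited_jobs`.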
If the resource request for a submitted job is impossible to satisfy within either the QOS limits or the HiPerGator compute node hardware for a particular partition, the scheduler will refuse the job submission altogether and return the following error message:
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Time and Resource Limits
See SLURM Partition Limits for partition time limits.
For details on the limits placed on time and resources like GPUs on SLURM, view QOS Limits.
To view a summary of currently active jobs for a group, use the slurmInfo command from the ufrc module.
slurmInfo -pu -g groupname
Choosing QOS for a Job
For advice on choosing QOS, go to Choosing QOS for a Job.
Example
A hypothetical group ($GROUP in the examples below) has an investment of 42 CPU cores and 148GB of memory. That's the group's so-called soft limit for HiPerGator jobs in the investment QOS, which runs at high priority with a time limit of up to 744 hours. The hard limit, accessible through the so-called burst QOS, is 9 times that, potentially giving the group a total of 10x the invested resources, i.e. 420 total CPU cores and 1480GB of total memory, with the burst QOS providing 378 CPU cores and 1332GB of memory for up to 96 hours at low base priority.
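The arithmetic behind these limits can be reproduced with a few lines of shell. This is only a sketch of the 9x burst multiplier described above; the function name `burst_limits` is hypothetical, and real limits should always be read from the scheduler rather than computed.

```shell
# Derive burst and combined limits from an investment.
# $1 = invested CPU cores, $2 = invested memory in GB.
burst_limits() {
    echo "burst: $(( $1 * 9 )) cores, $(( $2 * 9 )) GB"
    echo "total: $(( $1 * 10 )) cores, $(( $2 * 10 )) GB"
}

# The hypothetical group from the example: 42 cores and 148 GB invested.
burst_limits 42 148
```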
Let's test:
[marvin@gator ~]$ srun --mem=126gb --pty bash -i
srun: job 123456 queued and waiting for resources
#Looks good, let's terminate the request with Ctrl+C
^C
srun: Job allocation 123456 has been revoked
srun: Force Terminated job 123456
On the other hand, going even 1gb over that limit results in the job limit error already encountered above:
[marvin@gator ~]$ srun --mem=127gb --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
At this point the group can try using the burst QOS with:
#SBATCH --qos=$GROUP-b
Let's test:
[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
srun: job 123457 queued and waiting for resources
#Looks good, let's terminate with Ctrl+C
^C
srun: Job allocation 123457 has been revoked
srun: Force Terminated job 123457
However, now there's the burst QOS time limit to consider.
[marvin@gator ~]$ srun --mem=400gb --time=300:00:00 --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Let's reduce the time limit to what the burst QOS supports and try again.
[marvin@gator ~]$ srun --mem=400gb --time=96:00:00 --pty bash -i
srun: job 123458 queued and waiting for resources
#Looks good, let's terminate with Ctrl+C
^C
srun: Job allocation 123458 has been revoked
srun: Force Terminated job
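A time request can be checked against a QOS limit before submitting. The sketch below compares a --time value with a limit in whole hours, using the 96-hour burst limit from this example. The function name `fits_time_limit` is hypothetical, and it only handles the hours:minutes:seconds form; SLURM also accepts other forms such as days-hours, which this sketch rejects with exit status 2.

```shell
# Succeed (status 0) if a --time value fits within a limit.
# $1 = time as hours:minutes:seconds, $2 = limit in whole hours.
# Status 1 means the request is over the limit, 2 means unsupported format.
fits_time_limit() {
    echo "$1" | awk -F: -v limit_hours="$2" '
        NF != 3 { exit 2 }
        { exit ($1 * 3600 + $2 * 60 + $3 > limit_hours * 3600) ? 1 : 0 }'
}

# Check the two requests from the example against the burst QOS limit.
for t in 300:00:00 96:00:00; do
    if fits_time_limit "$t" 96; then
        echo "$t fits within the burst QOS limit"
    else
        echo "$t exceeds the burst QOS limit"
    fi
done
```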
Pending Job Reasons
To reiterate, the following Reasons can be seen in the NODELIST(REASON) column of the squeue command output when the group reaches the resource limit for a QOS:
- QOSGrpCpuLimit
All CPU cores available for the listed account within the respective QOS are in use.
- QOSGrpMemLimit
All memory available for the listed account within the respective QOS as described in the previous section is in use.
Note: Once it has marked any jobs in the group's list of pending jobs with a reason of QOSGrpCpuLimit or QOSGrpMemLimit, SLURM may not evaluate other jobs, and they may simply be listed with the Priority reason code. See FAQ at the bottom of the page for a list of reasons.