[[Category:SLURM]]
 
 
Here's a common scenario: you try to submit a job and get the following error:

<pre>
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
</pre>

What could be the reason, and what can you do about it to continue your work?
 
 
 
Let's consider the scenario of a hypothetical lab looking for the answer to the "Ultimate Question of Life, The Universe, and Everything". Incidentally, that group ($GROUP in the examples below) has an investment of 42 NCUs and can run jobs within that allocation for up to 744 hours. That is the group's so-called ''soft limit'' for HiPerGator jobs. The hard limit, accessible through the so-called ''burst QOS'', adds 9 times that amount, giving the group access to a total of 10x the invested resources, i.e. 420 NCUs, with the burst QOS providing 378 NCUs of that capacity for up to '''96 hours'''.

With the advent of HiPerGator2, one NCU has been re-defined to equal a single core and/or up to 3gb of memory. That gives our hypothetical group 42 cores with up to 3gb of memory per core, or up to 42 * 3 = 126gb of total memory, that all of the group's jobs combined can consume before running into the soft (investment) limit. By the same arithmetic, the burst QOS provides an additional 378 cores or up to 378 * 3 = 1134gb of total requested memory. To summarize, the group's investment QOS limit is up to 42 cores or up to 126gb of total requested memory for ''#SBATCH --qos=$GROUP''.
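
As a concrete illustration, a job script that fits comfortably within this investment QOS might look like the following sketch. The job name, program, and resource values are made up for the example; replace $GROUP with your actual group name:

<pre>
#!/bin/bash
#SBATCH --job-name=deep_thought     # illustrative job name
#SBATCH --qos=$GROUP                # investment (soft-limit) QOS
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8           # 8 of the group's 42 invested cores
#SBATCH --mem=24gb                  # 8 cores * 3gb per NCU
#SBATCH --time=24:00:00             # well under the 744-hour investment limit

srun ./my_program                   # hypothetical executable
</pre>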
 
 
 
Let's test:

<pre>
[marvin@gator ~]$ srun --mem=126gb --pty bash -i
srun: job 123456 queued and waiting for resources
</pre>

That looks good, so let's terminate the request with Ctrl+C:

<pre>
^C
srun: Job allocation 123456 has been revoked
srun: Force Terminated job 123456
</pre>
 
 
 
 
 
On the other hand, requesting even 1gb over that limit results in the job limit error we already encountered:

<pre>
[marvin@gator ~]$ srun --mem=127gb --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
</pre>
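
If you are unsure what your group's limits actually are, you can ask SLURM for the QOS definitions themselves. This is only a sketch; the available format fields vary between SLURM versions, so adjust as needed:

<pre>
[marvin@gator ~]$ sacctmgr show qos $GROUP format=Name,GrpTRES,MaxWall
[marvin@gator ~]$ sacctmgr show qos $GROUP-b format=Name,GrpTRES,MaxWall
</pre>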
 
 
 
 
 
 
 
At this point the group can try using the 'burst' QOS with:

<pre>
#SBATCH --qos=$GROUP-b
</pre>
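
For a batch job, the full burst-QOS request might look like the sketch below, which mirrors the interactive test that follows; the job name and program are placeholders:

<pre>
#!/bin/bash
#SBATCH --job-name=burst_example    # illustrative job name
#SBATCH --qos=$GROUP-b              # burst QOS
#SBATCH --partition=bigmem          # large-memory partition, as in the srun test below
#SBATCH --ntasks=1
#SBATCH --mem=400gb
#SBATCH --time=96:00:00             # burst jobs can run for up to 96 hours

srun ./my_program                   # hypothetical executable
</pre>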
 
 
 
 
 
Let's test:

<pre>
[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
srun: job 123457 queued and waiting for resources
</pre>

That looks good, so let's terminate with Ctrl+C:

<pre>
^C
srun: Job allocation 123457 has been revoked
srun: Force Terminated job 123457
</pre>
 
 
 
 
 
However, now there's the burst QOS time limit to consider:

<pre>
[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
</pre>
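
Note that SLURM also accepts a days-hours:minutes:seconds time format, so the 96-hour burst ceiling can be written either way:

<pre>
#SBATCH --time=96:00:00     # 96 hours
#SBATCH --time=4-00:00:00   # equivalent: 4 days
</pre>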
 
 
 
 
 
Let's reduce the time limit to what the burst QOS supports and try again:

<pre>
[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
srun: job 123458 queued and waiting for resources
</pre>

That looks good, so let's terminate with Ctrl+C:

<pre>
^C
srun: Job allocation 123458 has been revoked
srun: Force Terminated job 123458
</pre>
 
 
 
Please note that as long as a group stays under its total memory limit (3gb times the invested NCUs), it can use all of the processor cores within the group's allocation. However, once the total memory limit for that QOS is reached, further jobs will show '(QOSGrpMemLimit)' in the 'NODELIST(REASON)' column of the squeue output.
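
To see which of your group's jobs are blocked on that limit, you can include the reason column when listing the group's jobs with squeue, for example:

<pre>
[marvin@gator ~]$ squeue -A $GROUP -o "%.10i %.9q %.6C %.8m %.8T %R"
</pre>

Here %q prints the QOS, %C the requested cores, %m the requested memory, %T the job state, and %R the reason a pending job is not running (such as QOSGrpMemLimit).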
 
