NCU and QOS limits under SLURM
Here's a common scenario. You try to run a job and get the following error:
sbatch <your job script>
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
What could be the reason and what can you do about it to continue your work?
Let's consider a scenario of a hypothetical lab looking for the answer to the "Ultimate Question of Life, The Universe, and Everything". Incidentally, that group ($GROUP in the examples below) has an investment of 42 NCUs and can run jobs within that allocation for up to 744 hours. That is the group's so-called soft limit for HiPerGator jobs. The hard limit, accessible through the so-called burst QOS, adds 9 times that amount, giving the group potentially a total of 10x the invested resources, i.e. 420 NCUs, with the burst QOS providing 378 NCUs of that capacity for up to 96 hours.
With the advent of HiPerGator2, one NCU has been re-defined to equal a single core and/or up to 3gb of memory. That gives our hypothetical group 42 cores with <=3gb of memory per core, or up to 42 * 3 = 126gb of total memory that all of the group's jobs combined can consume before running into the soft (investment) limit. To summarize, the group's investment QOS limit is up to 42 cores or up to 126gb of total requested memory for jobs submitted with #SBATCH --qos=$GROUP.
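In batch form, a job running against the investment QOS would request resources roughly as in the sketch below. The job name, core count, memory request, time limit, and program name are hypothetical placeholders; only the --qos value follows the scenario above, and $GROUP stands for your actual group name:

#!/bin/bash
#SBATCH --job-name=deep_thought      # hypothetical job name
#SBATCH --qos=$GROUP                 # the group's investment (soft-limit) QOS
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4            # 4 of the group's 42 cores
#SBATCH --mem=12gb                   # 4 cores * 3gb, well under the 126gb group total
#SBATCH --time=24:00:00              # far below the 744-hour investment time limit

srun my_program                      # placeholder for the actual command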
Let's test:
[marvin@gator ~]$ srun --mem=126gb --pty bash -i
srun: job 123456 queued and waiting for resources
<Looks good, let's terminate the request with Ctrl+C>
^C
srun: Job allocation 123456 has been revoked
srun: Force Terminated job 123456
On the other hand, going even 1gb over that limit results in the QOS policy error we already encountered:
[marvin@gator ~]$ srun --mem=127gb --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
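If you are not sure what limits are actually configured for your group's QOSes, you should be able to query them with sacctmgr, for example as below. The choice of format fields is just illustrative, and older SLURM versions report the group limits as separate GrpCPUs/GrpMem fields rather than GrpTRES:

sacctmgr show qos $GROUP,$GROUP-b format=Name,GrpTRES,MaxWall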
At this point the group can try using the 'burst' QOS with
#SBATCH --qos=$GROUP-b
Let's test:
[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
srun: job 123457 queued and waiting for resources
<Looks good, let's terminate with Ctrl+C>
^C
srun: Job allocation 123457 has been revoked
srun: Force Terminated job 123457
However, now there's the burst qos time limit to consider:
[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Let's reduce the time limit to what burst qos supports and try again:
[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
srun: job 123458 queued and waiting for resources
<Looks good, let's terminate with Ctrl+C>
^C
srun: Job allocation 123458 has been revoked
srun: Force Terminated job 123458
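In batch form, the burst request above would correspond to a script along the following lines; the job name and program are placeholders, while the QOS, partition, memory, and time values match the srun test:

#!/bin/bash
#SBATCH --job-name=deep_thought_burst   # hypothetical job name
#SBATCH --qos=$GROUP-b                  # burst QOS
#SBATCH --partition=bigmem              # needed here only because of the large memory request
#SBATCH --ntasks=1
#SBATCH --mem=400gb
#SBATCH --time=96:00:00                 # must stay at or under the 96-hour burst limit

srun my_program                         # placeholder for the actual command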
Please note that as long as a group stays under its total memory limit (3gb times the invested NCUs) it can use all of the processor cores within the group's allocation. However, once the total memory limit for that QOS is reached, pending jobs will show '(QOSGrpMemLimit)' in the 'NODELIST(REASON)' column of the squeue output.
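To see which of your group's pending jobs are being held back by that limit, you can ask squeue to print the reason column for the group's pending jobs, for example as below; the format string is just one reasonable choice:

squeue -A $GROUP -t PD -o "%.10i %.9P %.12j %.8u %.2t %.10M %.6D %R"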