NCU and QOS limits under SLURM

Here's a common scenario. You try to run a job and get the following error:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

What could be the reason, and what can you do about it to continue your work?

Let's consider the scenario of a hypothetical lab looking for the answer to the "Ultimate Question of Life, The Universe, and Everything". Incidentally, that group ($GROUP in the examples below) has an investment of 42 NCUs and can run jobs within that allocation for up to 744 hours. That is the group's so-called soft limit for HiPerGator jobs. The hard limit, accessible through the so-called burst QOS, adds 9 times the investment on top of it, giving the group potentially a total of 10x the invested resources, i.e. 420 NCUs, with the burst QOS providing 378 NCUs of that capacity for up to 96 hours.

With the advent of HiPerGator2, one NCU has been re-defined to equal a single core and/or up to 3gb of memory. This gives our hypothetical group 42 cores with <=3gb of memory per core, or up to 42 * 3 = 126gb of total memory, that all of the group's jobs together can consume before running up against the soft (investment) limit. To summarize, the group's investment QOS limit is up to 42 cores or up to 126gb of total requested memory for #SBATCH --qos=$GROUP.
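
If you are not sure what your own group's limits are, Slurm records them on the QOS itself. The query below is a minimal sketch, assuming the investment and burst QOS names follow the $GROUP / $GROUP-b pattern used in this article; the exact format field names can vary between Slurm versions:

[marvin@gator ~]$ sacctmgr show qos format=name%20,grptres%30,maxwall | grep $GROUP

In a batch script, a request that stays within this hypothetical group's investment QOS could look like the header below. The job name and the specific core/memory numbers are invented for illustration; note that environment variables such as $GROUP are not expanded on #SBATCH lines, so the actual group name has to be written out:

#!/bin/bash
# hypothetical job name
#SBATCH --job-name=deep_thought
# investment QOS; substitute your own group name for "ultimate_lab"
#SBATCH --qos=ultimate_lab
#SBATCH --ntasks=1
# 8 cores, well under the group's 42-core limit
#SBATCH --cpus-per-task=8
# 8 cores * 3gb per NCU = 24gb, within the group's 126gb total
#SBATCH --mem=24gb
# within the 744-hour investment time limit
#SBATCH --time=24:00:00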

Let's test:

[marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

<Looks good, let's terminate the request with Ctrl+C>

^C

srun: Job allocation 123456 has been revoked

srun: Force Terminated job 123456


On the other hand, going even 1gb over that limit results in the job limit error we already encountered:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
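
Keep in mind that these limits apply to the sum of all of the group's running jobs, not just your own, so a request that looks well within the limit can still be rejected if other group members are already consuming most of the allocation. One way to see what the group is currently using is to list its jobs with squeue, assuming the Slurm account name matches the group name; the output format string is just one possible choice:

[marvin@gator ~]$ squeue -A $GROUP -o "%.10i %.9u %.8T %.11M %.5C %.8m"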


At this point the group can try using the 'burst' QOS with

 #SBATCH --qos=$GROUP-b
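
In a batch script, the burst request could look like the sketch below. The job name is invented, "ultimate_lab-b" stands in for your group's actual burst QOS (again, $GROUP is not expanded on #SBATCH lines), and the two key constraints are that the request must fit within the burst allocation and that the time limit cannot exceed 96 hours:

#!/bin/bash
# hypothetical job name
#SBATCH --job-name=deep_thought_burst
# burst QOS: group name plus the "-b" suffix
#SBATCH --qos=ultimate_lab-b
# large-memory partition and 400gb request, as in the interactive test below
#SBATCH --partition=bigmem
#SBATCH --mem=400gb
# burst jobs are limited to 96 hours of wall time
#SBATCH --time=96:00:00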


Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job 123457 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123457 has been revoked

srun: Force Terminated job 123457


However, now there's the time limit to consider:

[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
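
The burst QOS caps wall time at 96 hours, so this request is rejected because of the 300-hour time limit rather than the memory. To get such a job through, either bring the wall time back within 96 hours, as in the successful burst test above, or fall back to the investment QOS, which allows up to 744 hours but only up to 42 cores or 126gb in total for the group. A sketch of the second option, using this hypothetical group's numbers:

[marvin@gator ~]$ srun --mem=126gb --time=300:00:00 --qos=$GROUP --pty bash -i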