NCU and QOS limits under SLURM

Here's a common scenario. You try to run a job and get the following error:

sbatch
error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

What could be the reason and what can you do about it to continue your work?

Let's consider a hypothetical lab looking for the answer to the "Ultimate Question of Life, The Universe, and Everything". Incidentally, that group ($GROUP in the examples below) has an investment of 42 NCUs and can run jobs within that allocation for up to 744 hours. This is the group's so-called soft limit for HiPerGator jobs. The hard limit, accessible through the so-called burst QOS, adds 9 times the investment, for a potential total of 10x the invested resources, i.e. 420 NCUs, with the burst QOS providing the extra 378 NCUs of that capacity for up to 96 hours.

With the advent of HiPerGator2, one NCU has been re-defined to equal a single core and/or up to 3gb of memory. That gives our hypothetical group 42 cores with up to 3gb of memory per core, or up to 42 * 3 = 126gb of total memory, that all of the group's jobs combined can consume before hitting the soft (investment) limit. To summarize, the group's investment QOS limit is up to 42 cores or up to 126gb of total requested memory for jobs submitted with #SBATCH --qos=$GROUP.
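
For example, a job script header that stays within this investment allocation might look like the following sketch (the job name and the specific core, memory, and time numbers are illustrative placeholders; substitute your actual group name for $GROUP):

#!/bin/bash
# Request resources under the group's investment (soft-limit) QOS
#SBATCH --job-name=deep_thought
#SBATCH --qos=$GROUP
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=12gb
#SBATCH --time=72:00:00

Here 4 cores at 3gb per core (12gb) and a 72-hour run fit comfortably under the group's 42-core / 126gb / 744-hour investment limits. Interactive requests made with srun count against the same limits, so the boundary can be probed directly from the command line.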

Let's test:

[marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

<Looks good, let's terminate the request with Ctrl+C>

^C

srun: Job allocation 123456 has been revoked

srun: Force Terminated job 123456


On the other hand, going even 1gb over that limit results in the same job limit error we encountered above:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
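
If you are not sure what your group's limits actually are, the QOS settings can be inspected with SLURM's accounting tools, for example (a sketch; the exact fields and output depend on the cluster's SLURM configuration):

[marvin@gator ~]$ sacctmgr show qos $GROUP format=Name%20,GrpTRES%40,MaxWall

The GrpTRES column shows the total cpu and memory the QOS allows across all of the group's running jobs, and MaxWall shows the maximum wall time per job.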


At this point the group can try using the 'burst' QOS with:

#SBATCH --qos=$GROUP-b
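
Whether your account has access to the burst QOS at all can also be checked with sacctmgr, for example (illustrative; the account and QOS names will differ from group to group):

[marvin@gator ~]$ sacctmgr show assoc user=$USER format=Account%20,QOS%40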


Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job 123457 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123457 has been revoked

srun: Force Terminated job 123457


However, now there's the burst QOS time limit to consider:

[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
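
The 300-hour request exceeds the maximum wall time the burst QOS allows. If in doubt, that limit can be confirmed the same way the investment QOS was inspected above (illustrative output fields):

[marvin@gator ~]$ sacctmgr show qos $GROUP-b format=Name%20,MaxWall,GrpTRES%40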


Let's reduce the time limit to what the burst QOS supports and try again:


[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job 123458 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123458 has been revoked

srun: Force Terminated job 123458
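
Putting it all together, a non-interactive version of this large-memory job could be submitted with a script along the following lines (a sketch; the script name, job name, and final command are hypothetical placeholders):

#!/bin/bash
# Burst QOS job: large memory on the bigmem partition, up to 96 hours
#SBATCH --job-name=ultimate_question
#SBATCH --qos=$GROUP-b
#SBATCH --partition=bigmem
#SBATCH --ntasks=1
#SBATCH --mem=400gb
#SBATCH --time=96:00:00

# Replace with your actual workload
./deep_thought --input life_universe_everything.dat

It would then be submitted with:

[marvin@gator ~]$ sbatch ultimate_question.sh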