NCU and QOS limits under SLURM

From UFRC
Revision as of 15:22, 29 June 2016 by Moskalenko


Here's a common scenario. You try to run a job and get the following error:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

What could be the reason and what can you do about it to continue your work?

Let's consider a hypothetical lab looking for the answer to the "Ultimate Question of Life, the Universe, and Everything". Incidentally, that group ($GROUP in the examples below) has an investment of 42 NCUs and can run jobs within that allocation for up to 744 hours. That is the group's so-called soft limit for HiPerGator jobs. The hard limit, accessible through the so-called burst QOS, adds 9 times the investment on top of that, giving the group a potential total of 10x the invested resources, i.e. 420 NCUs, with the burst QOS providing 378 NCUs of that capacity for up to 96 hours. With the advent of HiPerGator2, one NCU has been redefined to equal a single core and/or up to 3gb of memory. This gives our hypothetical group 42 cores with <=3gb of memory per core, or up to 42 * 3 = 126gb of total memory that all of the group's jobs can consume before running up against the soft (investment) limit. To summarize, the group's investment QOS limit is up to 42 cores or up to 126gb of total requested memory for #SBATCH --qos=$GROUP.
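The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name ncu_limits is hypothetical, and it assumes the figures given in this article (1 NCU = 1 core and up to 3gb of memory, burst QOS = 9x the investment).

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a group's QOS limits from its NCU investment.
# Assumptions taken from the text above: 1 NCU = 1 core and up to 3gb of
# memory; the burst QOS adds 9x the investment on top of it.

ncu_limits() {
    local ncus=$1
    local gb_per_ncu=3
    local burst_factor=9

    local burst_ncus=$(( ncus * burst_factor ))
    local total_ncus=$(( ncus + burst_ncus ))

    # One line per limit: cores and total memory in gb.
    echo "investment: ${ncus} cores, $(( ncus * gb_per_ncu ))gb"
    echo "burst: ${burst_ncus} cores, $(( burst_ncus * gb_per_ncu ))gb"
    echo "total: ${total_ncus} cores, $(( total_ncus * gb_per_ncu ))gb"
}

ncu_limits 42
```

For the 42 NCU group this reports 42 cores / 126gb for the investment QOS, 378 cores / 1134gb for the burst QOS, and 420 cores / 1260gb in total.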

Let's test:

[marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

<Looks good, let's terminate the request with Ctrl+C>

^C

srun: Job allocation 123456 has been revoked

srun: Force Terminated job 123456


On the other hand, going even 1gb over that limit results in the job limit error we have already encountered:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


At this point the group can try using the 'burst' QOS with

#SBATCH --qos=$GROUP-b
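In a batch script, the directive might be used as in the following sketch. The group name deepthought, the job name, and the report_qos function are hypothetical; the partition, memory, and 96 hour wall time simply mirror the values discussed in this article and should be adjusted to the actual job.

```shell
#!/bin/bash
# Sketch of a job script that runs under the burst QOS.
# "deepthought" is a hypothetical group name standing in for $GROUP.
# sbatch does not expand shell variables inside #SBATCH lines, so the
# QOS name has to be written out literally.
#SBATCH --job-name=burst-test
#SBATCH --qos=deepthought-b
#SBATCH --partition=bigmem
#SBATCH --ntasks=1
#SBATCH --mem=400gb
#SBATCH --time=96:00:00

# SLURM_JOB_QOS is set by SLURM inside a running job; the fallback value
# keeps the script usable when it is run outside of SLURM.
report_qos() {
    echo "QOS: ${SLURM_JOB_QOS:-unknown}"
}

report_qos
```

Printing the QOS at the start of the job makes it easy to confirm from the job's output file which QOS it actually ran under.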


Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --qos=$GROUP-b --pty bash -i

srun: job 123457 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123457 has been revoked

srun: Force Terminated job 123457
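The two tests above suggest a simple rule of thumb for choosing a QOS before submitting. The sketch below is not part of SLURM: pick_qos and its arguments are hypothetical, and it looks at a single request in isolation, whereas the real limits apply to the sum of all of the group's jobs.

```shell
#!/usr/bin/env bash
# Hypothetical pre-submission check: which QOS could a request fit under?
# Assumes 1 NCU = 1 core and up to 3gb of memory, a burst QOS worth 9x the
# investment, and the "<group>-b" naming used for the burst QOS above.
# It ignores whatever the group's other jobs are already using.

pick_qos() {
    local group=$1 ncus=$2 cores=$3 mem_gb=$4
    local gb_per_ncu=3 burst_factor=9

    local soft_mem=$(( ncus * gb_per_ncu ))
    local burst_cores=$(( ncus * burst_factor ))
    local burst_mem=$(( burst_cores * gb_per_ncu ))

    if (( cores <= ncus && mem_gb <= soft_mem )); then
        echo "$group"            # fits under the investment QOS
    elif (( cores <= burst_cores && mem_gb <= burst_mem )); then
        echo "${group}-b"        # needs the burst QOS
    else
        echo "request exceeds both QOS limits" >&2
        return 1
    fi
}

# Example: a 42 NCU group requesting 1 core and 127gb of memory.
pick_qos deepthought 42 1 127
```

With a 42 NCU investment, a 126gb request maps to the investment QOS, while the 127gb and 400gb requests map to the burst QOS.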