NCU and QOS limits under SLURM

Account and QOS Use

In the HiPerGator 2.0 SLURM implementation, a job's priority and resource limits are determined by its QOS (Quality of Service). Every group with an HPG2 investment has a corresponding SLURM account. Each SLURM account has two associated QOSs: the normal, high-priority QOS and a low-priority burst QOS. In turn, each user who is a member of an invested group has a SLURM association. It is the association that determines which QOSs are available to a user. In addition to the primary group association, users who have secondary group memberships (in the Unix/Linux sense) will also have secondary group associations, which afford them access to the normal and burst QOSs of those groups' accounts.

To see your SLURM associations, you can run the showAssoc command. For example,

[root@slurm1 ~]# showAssoc magitz
              User    Account   Def Acct   Def QOS                                      QOS 
------------------ ---------- ---------- --------- ---------------------------------------- 
magitz                zoo6927      ufhpc     ufhpc zoo6927,zoo6927-b                        
magitz                  ufhpc      ufhpc     ufhpc ufhpc,ufhpc-b                            
magitz                 soltis      ufhpc    soltis soltis,soltis-b                          
magitz                  borum      ufhpc     borum borum,borum-b

The output above shows that the user magitz has four SLURM associations and thus access to eight different QOSs. By convention, a user's default account is always the account of their primary group, and their default QOS is the normal (high-priority) QOS of their default account. If a user does not explicitly request a specific account and QOS, the defaults will be assigned to the job.

However, if the user magitz wanted to use the borum group's burst QOS (to which he has access by virtue of the borum account association), he would specify both the account and the QOS in his batch script as follows.

    #SBATCH  --account=borum
    #SBATCH  --qos=borum-b

Note that both must be specified. Otherwise, SLURM will assume the default ufhpc account is intended and the borum-b QOS will not be available to the job. Consequently, SLURM would deny the submission.
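
Putting the pieces together, a complete batch script that targets the borum burst QOS might look like the sketch below. The job name, resource requests, and the program being run (my_program) are illustrative placeholders; only the --account and --qos lines are specific to this example.

    #!/bin/bash
    #SBATCH --job-name=burst_example   # placeholder job name
    #SBATCH --account=borum            # account to charge
    #SBATCH --qos=borum-b              # burst QOS of that account
    #SBATCH --ntasks=1                 # example resource request
    #SBATCH --mem=4gb                  # example memory request
    #SBATCH --time=24:00:00            # must fit within the QOS time limit
    # Replace with the actual work the job should perform.
    srun my_program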

These sbatch directives can also be given as command line arguments to "srun" as in,

    srun --account=borum --qos=borum-b some_command
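
After a job has been submitted, you can confirm which account and QOS it actually ended up under by adding the account (%a) and QOS (%q) fields to squeue's output format. The format string below is just one reasonable choice:

    squeue -u magitz -o "%.10i %.10P %.10a %.10q %.8u %.2t %.10M"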

QOS Limits

SLURM refers to resources such as NCUs (cores), memory (RAM), accelerators, and software licenses as Trackable Resources (TRES). The TRES available to a given group are determined by the group's investments and are limited by parameters assigned to the QOS.

Continuing with the example above, we can see the trackable resource limits placed on the borum group's primary QOS by running,

# showQos borum
                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum                borum qos                      cpu=41,mem=125952,gres/gpu=0,gres/mic=0             41

from which we can see that jobs submitted under the borum group's primary QOS have access to a total of 41 cores and 125952 MB (roughly 123 GB) of memory, with no access to accelerators (GPUs, MICs). These resources are shared among all members of the borum group running jobs under the primary QOS. Similarly, we can see the trackable resources available under the burst QOS with,

[root@slurm1 bin]# showQos borum-b
                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum-b              borum burst qos                cpu=369,mem=1133568,gres/gpu=0,gres/mic=0          369
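
Because these limits are shared by everyone in the group, it is often useful to see which jobs are currently counting against them. One simple way (shown here for the borum account; substitute your own group) is to list the account's jobs with squeue:

    squeue -A borum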

Beyond the TRES limits, there are additional limits and parameters associated with each QOS. Among them are the maximum wall time available under the QOS and the base priority assigned to the job. We can see these with,

# sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b
                Name                          Descr   Priority     MaxWall 
-------------------- ------------------------------ ---------- ----------- 
borum                borum qos                           36000 31-00:00:00 
borum-b              borum burst qos                       900  4-00:00:00

From this we see that normal and burst QOS jobs are limited to a maximum duration of 31 and 4 days, respectively. Additionally, the base priority of a burst QOS job is 1/40th that of a normal QOS job. It is important to remember that the base priority is only one component of a job's overall priority and that the priority will change over time as the job waits in the queue.
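
To see how those components combine for a particular pending job, the sprio command reports the individual priority factors (age, fair-share, QOS, and so on). The job ID below is a placeholder; note that sprio only reports on jobs that are still pending:

    sprio -j 123456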

By policy, the burst QOS CPU and memory limits are always nine times those of the normal QOS. They are intended to allow groups, for short periods of time, to take advantage of unused resources beyond those they have purchased.

There is another limit under SLURM, which applies to both QOS choices. A group can use all of the processor cores available to it under a particular QOS only as long as it stays under the total memory limit for that QOS. The total memory limit is calculated as 'QOS NCU * 3gb'. For example, a main QOS of 30 NCUs will have a group memory limit of 90gb, while the burst QOS for that group will have a limit of '30 * 3gb * 9 = 810gb'. If the group memory limit is reached, you will see a '(QOSGrpMemLimit)' status in the 'NODELIST(REASON)' column of the squeue output. For example,

squeue | grep MemLimit | head -n 1
           123456    bigmem test_job   jdoe PD       0:00      1 (QOSGrpMemLimit)

The above message appears only in the output of the 'squeue' command and does not interfere with job submission, but the job will stay queued until the group's usage drops back below its memory limit.
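
The arithmetic behind these limits is simple enough to check in the shell. The sketch below merely restates the formula above for the 30 NCU example; the NCU count is the only input:

    # derive group memory limits from an NCU count (example value)
    ncu=30
    echo "main QOS memory limit:  $(( ncu * 3 ))gb"
    echo "burst QOS memory limit: $(( ncu * 3 * 9 ))gb"

This prints 90gb and 810gb, matching the worked example above.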

If a submitted job is so large that its resource request falls outside the total resource limit of the requested QOS, SLURM will refuse the submission altogether and produce the following error:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Examples

A hypothetical group ($GROUP in the examples below) has an investment of 42 NCUs. That is the group's so-called soft limit: the resources available under the main QOS for up to 744 hours at high priority. The hard limit, accessible through the so-called burst QOS, is nine times that amount, giving the group access to a potential total of 10x its invested resources, i.e. 420 NCUs, with the burst QOS providing 378 NCUs of that capacity for up to 96 hours at low priority.

Let's test:

[marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

<Looks good, let's terminate the request with Ctrl+C>

^C

srun: Job allocation 123456 has been revoked

srun: Force Terminated job 123456


On the other hand, going even 1gb over that limit results in the job limit error we have already encountered:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


At this point the group can try using the burst QOS by adding the following directive to the job script (along with the matching --account setting):

    #SBATCH --qos=$GROUP-b


Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job 123457 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123457 has been revoked

srun: Force Terminated job 123457


However, now there's the burst qos time limit to consider:

[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)


Let's reduce the time limit to what burst qos supports and try again:


[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job 123458 queued and waiting for resources

<Looks good, let's terminate with Ctrl+C>

^C

srun: Job allocation 123458 has been revoked

srun: Force Terminated job 123458

Group Limit Errors

The following Reasons can be seen in the 'NODELIST(REASON)' column of the job queue listing for a group when the group reaches the resource limit for the respective account/qos combination:

QOSGrpCpuLimit
means that all CPU cores available for the listed account within the respective QOS are in use.
QOSGrpMemLimit
means that all memory available for the listed account within the respective QOS, as described in the previous section, is in use.
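
To check whether any of your group's pending jobs are currently being held back by these limits, you can filter the group's queue listing for them. The account name below is a placeholder for your own group:

    squeue -A $GROUP | grep QOSGrp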