Account and QOS limits under SLURM

In the HiPerGator 2.0 SLURM implementation, a job's priority and resource limits are determined by its QOS (Quality of Service). Every group with an HPG2 investment has a corresponding SLURM account. Each SLURM account has two associated QOSs: the normal (high-priority) investment QOS and a low-priority burst QOS. In turn, each user who is a member of a vested group (i.e. a group with an investment and a corresponding computational allocation) has a SLURM association. It is this association that determines which QOSs are available to a user. In addition to the primary group association, users who have secondary group memberships (in the Unix/Linux sense) also have secondary group associations, which give them access to the investment and burst QOSs of those groups' accounts.

Account and QOS Use

Before running the commands below, load the ufrc environment module with:

$ module load ufrc

To see your SLURM associations, use the following command:

$ showAssoc <username>

For example, the command $ showAssoc magitz returns the following output:

              User    Account   Def Acct   Def QOS                                      QOS 
------------------ ---------- ---------- --------- ---------------------------------------- 
magitz                zoo6927      ufhpc     ufhpc zoo6927,zoo6927-b                        
magitz                  ufhpc      ufhpc     ufhpc ufhpc,ufhpc-b                            
magitz                 soltis      ufhpc    soltis soltis,soltis-b                          
magitz                  borum      ufhpc     borum borum,borum-b

The output shows that the user magitz has four SLURM associations and thus access to eight different QOSs. By convention, a user's default account is always the account of their primary group. Additionally, their default QOS is the investment (high-priority) QOS. If a user does not explicitly request an account and QOS, the job is assigned the user's default account and QOS.

However, if the user magitz wanted to use the borum group's account, to which he has access by virtue of the borum account association, he would specify the account and the chosen QOS in his batch script as follows:

 #SBATCH  --account=borum
 #SBATCH  --qos=borum

Or, for the burst QOS:

 #SBATCH  --account=borum
 #SBATCH  --qos=borum-b

Note that both must be specified. Otherwise, SLURM will assume that the default ufhpc account is intended, neither the borum nor the borum-b QOS will be available to the job, and SLURM will deny the submission.

These sbatch directives can also be given as command line arguments to srun. For example:

$ srun --account=borum --qos=borum-b <example_command>
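
The account and QOS pairing described above can be generated with a small shell function. This is only a sketch: qos_directives is a hypothetical helper, not part of the ufrc module, and it assumes the "<account>" and "<account>-b" QOS naming seen in the showAssoc output.

```shell
# Hypothetical helper (not part of the ufrc module): prints the two batch
# directives for an account, assuming the investment QOS is named after the
# account and the burst QOS adds a "-b" suffix.
qos_directives() {
    account=$1
    kind=${2:-investment}   # "investment" (default) or "burst"
    if [ "$kind" = "burst" ]; then
        qos="${account}-b"
    else
        qos="$account"
    fi
    printf '#SBATCH --account=%s\n#SBATCH --qos=%s\n' "$account" "$qos"
}

qos_directives borum burst
```

The printed lines can be pasted into the header of a batch script.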

QOS Limits

SLURM refers to resources - NCUs (cores), Memory (RAM), accelerators, software licenses, etc. - as Trackable Resources (TRES). The TRES available to a given group are determined by the group's investments and are limited by parameters assigned to the QOS.

To view a group's trackable resource limits for a specific QOS, use the following command from the ufrc environment module:

$ showQos <specified_qos>

Continuing the example above, the command $ showQos borum returns the following output:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum                borum qos                      cpu=41,mem=125952,gres/gpu=0,gres/mic=0             41

From the output, we can see that jobs submitted under the borum group's investment QOS have access to a total of 41 cores and 123 GB of RAM (125952 MB), and no access to accelerators (GPUs, MICs). These resources are shared among all members of the borum group running jobs under the investment QOS.

Similarly, to check the burst QOS resource limits, the command $ showQos borum-b returns the following output:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum-b              borum burst qos                cpu=369,mem=1133568,gres/gpu=0,gres/mic=0          369
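
To read a GrpTRES string at a glance, it can help to convert the memory value, which SLURM reports in MB, to GB. The following is a minimal sketch; tres_summary is a hypothetical helper name, and only the cpu and mem fields are handled.

```shell
# Hypothetical helper: summarize a GrpTRES string as cores and memory in GB.
# Assumes the mem= value is in MB (1 GB = 1024 MB).
tres_summary() {
    echo "$1" | tr ',' '\n' | awk -F= '
        $1 == "cpu" { cpu = $2 }
        $1 == "mem" { mem_mb = $2 }
        END { printf "%d cores, %d GB\n", cpu, mem_mb / 1024 }'
}

tres_summary "cpu=41,mem=125952,gres/gpu=0,gres/mic=0"
tres_summary "cpu=369,mem=1133568,gres/gpu=0,gres/mic=0"
```

For the two QOSs shown here, this prints "41 cores, 123 GB" and "369 cores, 1107 GB".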

Beyond the TRES limits, each QOS has additional limits and parameters. Among them are the maximum wall time available under the QOS and the base priority assigned to the job. Use the following command to view these parameters:

 $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" <specified_qos>

To continue our example, the command $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b returns the following output:

                Name                          Descr   Priority     MaxWall 
-------------------- ------------------------------ ---------- ----------- 
borum                borum qos                           36000 31-00:00:00 
borum-b              borum burst qos                       900  4-00:00:00

We see that investment and burst QOS jobs are limited to a maximum duration of 31 and 4 days, respectively. Additionally, the base priority of a burst QOS job is 1/40th that of an investment QOS job. It is important to remember that the base priority is only one component of the job's overall priority and that the priority will change over time as the job waits in the queue.
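
Since the --time option is often given in hours, it can be useful to convert the MaxWall values to hours. A minimal sketch, assuming the value always has the D-HH:MM:SS form shown in the output above (walltime_hours is a hypothetical helper name):

```shell
# Convert a time limit of the form D-HH:MM:SS to whole hours.
# Minutes and seconds are ignored; other time formats are not handled.
walltime_hours() {
    echo "$1" | awk -F'[-:]' '{ print $1 * 24 + $2 }'
}

walltime_hours 31-00:00:00   # investment QOS
walltime_hours 4-00:00:00    # burst QOS
```

For the borum QOSs this prints 744 and 96.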

By policy, the burst QOS cpu and memory limits are always nine times (9x) those of the investment QOS and are intended to allow groups to take advantage of unused resources beyond those that they have purchased for short periods of time.
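
The 9x relationship can be checked against the showQos values from the example. This is just an illustrative sketch with a hypothetical function name:

```shell
# Hypothetical check: is the burst limit exactly nine times the investment limit?
is_nine_times() {
    investment=$1
    burst=$2
    [ "$burst" -eq $((investment * 9)) ]
}

# Values taken from the borum and borum-b GrpTRES output.
is_nine_times 41 369 && echo "cpu: 9x"
is_nine_times 125952 1133568 && echo "mem: 9x"
```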

Basis for 31 Day QOS Limit

Long-running jobs tend to clog the queue. The longer the wall time of a job, the harder it is for other users to get their jobs onto the system.
Long wall times make maintenance windows hard to schedule. We have to perform maintenance on the systems (OS updates, etc.) periodically, and longer allowable wall times would make these windows unwieldy. Of particular importance is the ability to install security updates on the systems quickly and efficiently. If we cannot install them because user jobs are running for months at a time, we must either kill the user jobs or accept security risks that affect all users.
The longer a job runs, the more likely it is to end prematurely due to hardware failure.

Thus, it is recommended that instead of running your jobs for extremely long times, you utilize checkpointing of your jobs so that you can restart them on a shorter periodic basis.
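
How checkpointing is done depends on the application, so the following is only a generic sketch of the idea, not a recipe for any particular program. It assumes the work can be split into numbered steps; the function name, the state file, and the per-run step budget (a stand-in for the wall time limit) are all hypothetical.

```shell
# Hypothetical checkpoint/restart loop: progress is saved after every step,
# so a resubmitted job resumes where the previous run stopped.
run_with_checkpoint() {
    state_file=$1    # file holding the number of the last completed step
    total_steps=$2   # total number of steps in the whole computation
    max_steps=$3     # steps allowed in this run (stand-in for the time limit)

    step=0
    if [ -f "$state_file" ]; then
        step=$(cat "$state_file")
    fi

    done_now=0
    while [ "$step" -lt "$total_steps" ] && [ "$done_now" -lt "$max_steps" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        echo "$step" > "$state_file"
        done_now=$((done_now + 1))
    done
    echo "$step"     # report how far the computation has progressed
}
```

Each run picks up from the step recorded in the state file, so several shorter jobs can complete work that would otherwise need one very long job.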

As a note, if you look at the allowed runtimes for jobs at other institutions, you will find that our 31-day QOS limit is extremely generous. Examples from other institutions:

Institution	Maximum Runtime
New York University	4 days
University of Southern California	2 weeks for 1 node, otherwise 1 day
PennState	2 weeks for up to 32 cores (contributors), 4 days for up to 256 cores otherwise
UMBC	5 days
TACC: Stampede	2 days
TACC: Lonestar	1 day
Princeton: Della	6 days
Princeton: Hecate	15 days
University of Maryland	14 days

Cores (CPUs) and Memory (RAM)

The resources of any computing device are limited. This is to say that the number of cores, the amount of memory, the memory bandwidth, the I/O bandwidth, etc. - all are limited. Once you have used up all of the available cores on a machine, it is fully consumed and unavailable to other users. This is true whether you use 1 byte of RAM or all of the RAM on the machine. Likewise, if your application uses all of the memory available to the machine, whether it uses 1 core or all the cores, the machine is consumed and unavailable to others. Because of this, we place limits on both the number of cores available to a group (based on the NCU investments) and the amount of memory available to a group (NCUs x 3GB).

Limiting the amount of usable memory is necessary in order to be fair to all investors. For example, consider a system where a PI has invested in 10 NCUs. The group's TRES cpu limit will be "10". With no memory limits, the PI could submit ten jobs requesting 1 cpu and 120GB each. Because each machine only has about 120GB available for applications, each job would be started on a separate machine, leaving no memory available for any other jobs on that machine. As a result, the group has invested in 10 NCUs but is effectively consuming ten whole machines, or 320 cores at 32 cores per machine. Such a scenario is not tenable and would quickly result in a grossly unfair allocation of resources. Thus, we place limits on both CPUs and memory.

The total memory limit is calculated as 'QOS NCU * 3 GB'. For example, an investment QOS of 30 NCUs will have a group memory limit of 90gb while the burst qos for that group will be equal to '30 * 3gb * 9 = 810gb'. If the group memory limit is reached you will see a '(QOSGrpMemLimit)' status in the 'NODELIST(REASON)' column of the squeue output, similar to the following:

 $ squeue | grep MemLimit | head -n 1
            123456    bigmem test_job   jdoe PD       0:00      1 (QOSGrpMemLimit)

The above message can only be seen in the output of the squeue command and does not interfere with job submission. However, the job will not run and will remain in the pending state until the group falls below its memory limit.
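
The memory limit arithmetic described above can be sketched as a small shell function. The function name is hypothetical; the 3 GB per NCU and 9x burst factors are the ones stated in this section.

```shell
# Group memory limit in GB for a given number of NCUs:
# 3 GB per NCU for the investment QOS, nine times that for the burst QOS.
mem_limit_gb() {
    ncus=$1
    kind=${2:-investment}   # "investment" (default) or "burst"
    limit=$((ncus * 3))
    if [ "$kind" = "burst" ]; then
        limit=$((limit * 9))
    fi
    echo "$limit"
}

mem_limit_gb 30          # prints 90
mem_limit_gb 30 burst    # prints 810
```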

If the submitted job requests so much memory or so many cores that either or both fall outside the total resource limit of the specified QOS, SLURM will refuse the job submission altogether and return the following error message:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy 
               (job submit limit, user's size and/or time limits)
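
A request can be compared against the QOS totals before submitting. The sketch below is a hypothetical helper, not a SLURM or ufrc command; the limits are passed in by hand here and would in practice be read from the showQos output for the chosen QOS.

```shell
# Hypothetical pre-submission check: succeeds only if both the requested
# cores and the requested memory (in MB) fit within the QOS totals.
fits_qos() {
    req_cpus=$1
    req_mem_mb=$2
    lim_cpus=$3
    lim_mem_mb=$4
    [ "$req_cpus" -le "$lim_cpus" ] && [ "$req_mem_mb" -le "$lim_mem_mb" ]
}

# Limits from the borum investment QOS: cpu=41, mem=125952
if fits_qos 8 32768 41 125952; then echo "within limit"; else echo "would be refused"; fi
if fits_qos 8 131072 41 125952; then echo "within limit"; else echo "would be refused"; fi
```

A request that passes this check can still wait in the queue if other members of the group are currently using part of the limit.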

Choosing QOS for a Job

When choosing between the high-priority investment QOS and the 9x larger low-priority burst QOS, start by considering the overall resource requirements of the job. For groups with smaller allocations, the investment QOS may not be large enough for some jobs, whereas for other, smaller jobs the wait time in the burst QOS could be too long. In addition, consider the current state of the account you are planning to use for your job.

To show the status of any SLURM account as well as the overall usage of HiPerGator resources, use the following command from the ufrc environment module:

$ slurmInfo <account>

As an example, consider the following output, returned for the command $ slurmInfo ufgi:

  Allocation summary for 'ufgi' account:

QOS     Time Limit      Allocations (cpus, mem(MB),GPU,MIC)
ufgi    31-00:00:00     cpu=100,mem=307200,gpu=0,mic=0
ufgi-b  4-00:00:00      cpu=900,mem=2764800,gpu=0,mic=0

  Current use:

    Main QOS ('ufgi'):
        81% or 81 out of 100 cores. 
        30% or 96GB out of 300GB memory limit.

    Burst QOS ('ufgi-b'):
        CPU Cores: None
        Memory: None

  Total HiPerGator usage:
        76% or 23728 out of 31080 cores

The output shows that the investment QOS for the ufgi account is actively used, and only 19 cores out of 100 and 204GB out of the maximum 300GB are currently available. On the other hand, the burst QOS is unused. Furthermore, the total HiPerGator use is at 76%, which means that there is still available capacity from which burst resources can be drawn. In this case, a job submitted to the ufgi-b QOS should be able to start within a reasonable amount of time and will enjoy access to a much larger amount of computational and memory resources. If the HiPerGator load were higher, or if the burst QOS were actively used, the investment QOS would be more appropriate for a smaller job.
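
The remaining capacity can be computed from the usage lines in the slurmInfo output. The following is a sketch only: free_from_usage_line is a hypothetical helper, and it assumes the "N out of M" wording shown in the example output.

```shell
# Hypothetical helper: for each input line containing "<used> out of <limit>",
# print limit minus used. Unit suffixes such as "GB" are stripped.
free_from_usage_line() {
    awk '{
        found = 0
        for (i = 2; i + 2 <= NF; i++) {
            if ($i == "out" && $(i + 1) == "of") {
                used = $(i - 1); limit = $(i + 2); found = 1
            }
        }
        if (found) {
            gsub(/[^0-9]/, "", used)
            gsub(/[^0-9]/, "", limit)
            print limit - used
        }
    }'
}

echo "        81% or 81 out of 100 cores." | free_from_usage_line
echo "        30% or 96GB out of 300GB memory limit." | free_from_usage_line
```

For the ufgi example this prints 19 and 204.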

Examples

A hypothetical group ($GROUP in the examples below) has an investment of 42 NCUs. That is the group's so-called soft limit: HiPerGator jobs in the investment qos can use up to 42 NCUs for up to 744 hours at high priority. The hard limit, accessible through the so-called burst qos, adds nine times that amount (378 NCUs) for up to 96 hours at low priority, giving the group a potential total of 10x the invested resources, i.e. 420 NCUs.
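
The numbers for this hypothetical group follow directly from the investment size, as this short sketch of the arithmetic shows:

```shell
# Core limits implied by a 42 NCU investment, using the 9x burst policy
# and the 31-day and 4-day time limits.
ncus=42
burst=$((ncus * 9))
total=$((ncus + burst))

echo "investment: $ncus cores for up to $((31 * 24)) hours"
echo "burst:      $burst cores for up to $((4 * 24)) hours"
echo "combined:   $total cores"
```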

Let's test:

[marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

#Looks good, let's terminate the request with Ctrl+C

^C
srun: Job allocation 123456 has been revoked
srun: Force Terminated job 123456

On the other hand, going even 1gb over that limit results in the job limit error encountered earlier:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

At this point the group can try using the 'burst' QOS with

#SBATCH --qos=$GROUP-b

Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job  123457 queued and waiting for resources

#Looks good, let's terminate with Ctrl+C

^C
srun: Job allocation 123457 has been revoked
srun: Force Terminated job 123457

However, now there's the burst qos time limit to consider.

[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=300:00:00 --qos=$GROUP-b --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Let's reduce the time limit to what burst qos supports and try again.

[marvin@gator ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job  123458 queued and waiting for resources

#Looks good, let's terminate with Ctrl+C

^C
srun: Job allocation 123458 has been revoked
srun: Force Terminated job

Pending Job Reasons

The following Reasons can be seen in the NODELIST(REASON) column of the squeue command when the group reaches the resource limit for the respective account/qos combination:

QOSGrpCpuLimit

All CPU cores available for the listed account within the respective QOS are in use.

QOSGrpMemLimit

All memory available for the listed account within the respective QOS as described in the previous section is in use.
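
To pick out the jobs held by one of these two limits, the squeue output can be filtered on its last column. A minimal sketch, assuming the default squeue layout in which NODELIST(REASON) is the final column (qos_limited_jobs is a hypothetical helper name):

```shell
# Hypothetical filter: reads squeue output on stdin and prints the job ID and
# reason for jobs held by a group CPU or memory limit.
qos_limited_jobs() {
    awk '$NF ~ /^\(QOSGrp(Cpu|Mem)Limit\)$/ { print $1, $NF }'
}

# Typical use on a login node:  squeue -u "$USER" | qos_limited_jobs
echo "123456    bigmem test_job   jdoe PD       0:00      1 (QOSGrpMemLimit)" | qos_limited_jobs
```

For the sample line this prints "123456 (QOSGrpMemLimit)".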