QOS Limits
QOS Resource Limits
CPU cores, Memory (RAM), GPU accelerators, software licenses, etc. are referred to as Trackable Resources (TRES) by the scheduler. The TRES available in a given QOS are determined by the group's investments and the QOS configuration.
View the trackable resource limits for a QOS:
$ showQos <specified_qos>
Example: $ showQos borum
output:
Name                 Descr                          GrpTRES                                        GrpCPUs
-------------------- ------------------------------ --------------------------------------------- --------
borum                borum qos                      cpu=9,gres/gpu=0,mem=32400M                           9
We can see that the borum investment QOS has a pool of 9 CPU cores, 32 GB of RAM, and no GPUs. This pool of resources is shared among all members of the borum group.
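For example, a job can be submitted against this investment QOS by naming the account and QOS explicitly in the submission script. This is only an illustrative sketch: the resource requests are arbitrary values within the borum pool shown above, and my_program is a placeholder for your own executable.

#!/bin/bash
#SBATCH --job-name=invest_example
#SBATCH --account=borum          # group account from the example above
#SBATCH --qos=borum              # investment QOS
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # arbitrary request within the 9-core pool
#SBATCH --mem=16G                # arbitrary request within the 32 GB pool
#SBATCH --time=24:00:00          # requested wall time (hh:mm:ss)

srun ./my_program                # placeholder executable

The script is then submitted with sbatch as usual.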
Similarly, the borum-b burst QOS resource limits shown by $ showQos borum-b are:
Name                 Descr                          GrpTRES                                        GrpCPUs
-------------------- ------------------------------ --------------------------------------------- --------
borum-b              borum burst qos                cpu=81,gres/gpu=0,mem=291600M                         81
There are additional base priority and run time limits associated with QOSes. To display them, run:
$ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" <specified_qos>
Example: $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b
output:
Name                 Descr                            Priority     MaxWall
-------------------- ------------------------------ ---------- -----------
borum                borum qos                           36000 31-00:00:00
borum-b              borum burst qos                       900  4-00:00:00
The investment and burst QOS jobs are limited to 31-day and 4-day run times, respectively. Additionally, the base priority of a burst QOS job is 1/40th that of an investment QOS job. It is important to remember that the base priority is only one component of a job's overall priority and that the priority will change over time as the job waits in the queue.
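To see how the base priority actually contributes to a specific pending job, SLURM's sprio utility breaks a job's priority down by factor; the job ID below is a placeholder:

$ sprio -j <jobid>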
The burst QOS CPU and memory limits are nine times (9x) those of the investment QOS, up to a certain limit, and are intended to allow groups to take advantage of unused resources for short periods of time by borrowing resources from other groups.
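As an illustrative sketch only, switching the example script above from the investment QOS to the burst QOS is a matter of changing the QOS directive and, if desired, requesting more resources within the burst pool; the values below are arbitrary:

#SBATCH --account=borum
#SBATCH --qos=borum-b            # burst QOS instead of the investment QOS
#SBATCH --cpus-per-task=32       # arbitrary request within the 81-core burst pool
#SBATCH --mem=120G               # arbitrary request within the ~285 GB burst pool
#SBATCH --time=4-00:00:00        # must not exceed the 4-day burst MaxWall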
QOS Time Limits
- Jobs with longer time limits are more difficult to schedule.
- Long time limits make system maintenance harder. We have to perform maintenance on the systems (OS updates, security patches, etc.). If the allowable time limits were longer, it could make important maintenance tasks virtually impossible. Of particular importance is the ability to install security updates on the systems quickly and efficiently. If we cannot install them because user jobs are running for months at a time, we have to choose to either kill the user jobs or risk security issues on the system, which could affect all users.
- The longer a job runs, the more likely it is to end prematurely due to random hardware failure.
Thus, if the application allows saving and resuming the analysis, it is recommended that you use checkpointing rather than running a single extremely long job, so that you can restart your analysis and run a series of shorter jobs instead.
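A minimal sketch of this pattern, assuming an application that can write and resume from a checkpoint file (my_program, --restart, and checkpoint.dat below are placeholders for your application's own mechanism):

#!/bin/bash
#SBATCH --account=borum
#SBATCH --qos=borum
#SBATCH --time=24:00:00          # a short wall time; the full analysis spans several such jobs

# Resume from the latest checkpoint if one exists, otherwise start fresh.
if [ -f checkpoint.dat ]; then
    srun ./my_program --restart checkpoint.dat
else
    srun ./my_program
fi

When each job ends, the script is resubmitted (manually or from a workflow tool) until the analysis completes.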
The 31-day investment QOS time limit on HiPerGator is generous compared to other major institutions. Here are examples we were able to find:
| Institution | Maximum Runtime |
| --- | --- |
| New York University | 4 days |
| University of Southern California | 2 weeks for 1 node, otherwise 1 day |
| PennState | 2 weeks for up to 32 cores (contributors), 4 days for up to 256 cores otherwise |
| UMBC | 5 days |
| TACC: Lonestar | 1 day |
| Princeton: Della | 6 days |
| University of Maryland | 14 days |
GPU Resource Limits
As per the Scheduler/Job Policy, there is no burst QOS for GPU jobs.
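GPU jobs therefore run under the group's investment QOS. A hedged sketch of the relevant directives is below; the partition name is an assumption, and the <group> placeholder must be replaced with a group whose QOS actually includes GPUs (a non-zero gres/gpu value in its GrpTRES, unlike the borum example above):

#SBATCH --account=<group>        # placeholder group with GPUs in its investment QOS
#SBATCH --qos=<group>            # investment QOS only; there is no <group>-b burst QOS for GPU jobs
#SBATCH --partition=gpu          # assumed name of the GPU partition
#SBATCH --gres=gpu:1             # request one GPU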