QOS Limits

From UFRC

QOS Resource Limits

CPU cores, Memory (RAM), GPU accelerators, software licenses, etc. are referred to as Trackable Resources (TRES) by the scheduler. The TRES available in a given QOS are determined by the group's investments and the QOS configuration.

View the trackable resource limits for a QOS:

$ showQos <specified_qos>

Example output of $ showQos borum:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum                borum qos                      cpu=9,gres/gpu=0,mem=32400M                          9 

We can see that the borum investment QOS has a pool of 9 CPU cores, 32400 MB (about 32 GB) of RAM, and no GPUs. This pool of resources is shared among all members of the borum group.

Similarly, the resource limits of the borum-b burst QOS, shown by $ showQos borum-b, are:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum-b              borum burst qos                cpu=81,gres/gpu=0,mem=291600M                       81 
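
The GrpTRES column in both outputs is a comma-separated list of resource=value pairs. As a minimal sketch, assuming you have copied a GrpTRES value into a shell variable, individual limits can be pulled out with standard tools. The helper name tres_value is hypothetical and is not part of showQos or SLURM.

```shell
# Print the value of one resource from a GrpTRES-style string.
# tres_value is a hypothetical helper written for this example.
tres_value() {
    # $1 = TRES string, e.g. "cpu=9,gres/gpu=0,mem=32400M"
    # $2 = resource name, e.g. "cpu", "gres/gpu" or "mem"
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}

grp_tres="cpu=9,gres/gpu=0,mem=32400M"   # GrpTRES value from the borum example
tres_value "$grp_tres" cpu               # prints 9
tres_value "$grp_tres" gres/gpu          # prints 0
tres_value "$grp_tres" mem               # prints 32400M
```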

There are additional base priority and run time limits associated with QOSes. To display them, run:

 $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" <specified_qos> 

Example output of $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b:

                Name                          Descr   Priority     MaxWall 
-------------------- ------------------------------ ---------- ----------- 
borum                borum qos                           36000 31-00:00:00 
borum-b              borum burst qos                       900  4-00:00:00 

Jobs in the investment and burst QOSes are limited to run times of 31 and 4 days, respectively. Additionally, the base priority of a burst QOS job (900) is 1/40th that of an investment QOS job (36000). Remember that the base priority is only one component of a job's overall priority, and that the overall priority changes over time as the job waits in the queue.
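
The figures in this paragraph can be checked against the sacctmgr output with a little shell arithmetic. This is only a sketch: walltime_to_seconds is a hypothetical helper that assumes the D-HH:MM:SS form shown in the MaxWall column and does not handle other time formats.

```shell
# Convert a MaxWall value of the form D-HH:MM:SS into seconds.
# walltime_to_seconds is a hypothetical helper written for this example.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

walltime_to_seconds "31-00:00:00"   # investment QOS limit, prints 2678400
walltime_to_seconds "4-00:00:00"    # burst QOS limit, prints 345600

# Ratio of investment to burst base priority, from the Priority column.
echo $((36000 / 900))               # prints 40
```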

The burst QOS CPU and memory limits are nine times (9x) those of the investment QOS, up to a certain limit. They are intended to let groups take advantage of unused resources for short periods of time by borrowing resources from other groups.
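
As a quick check of the 9x relationship, the values below are copied from the two showQos examples above. The upper cap mentioned in the text is not modelled here because its value is not given.

```shell
# Values copied from the example output above.
investment_cpus=9        # showQos borum:   cpu=9
investment_mem_mb=32400  # showQos borum:   mem=32400M
burst_cpus=81            # showQos borum-b: cpu=81
burst_mem_mb=291600      # showQos borum-b: mem=291600M

echo "cpu ratio: $((burst_cpus / investment_cpus))"       # prints cpu ratio: 9
echo "mem ratio: $((burst_mem_mb / investment_mem_mb))"   # prints mem ratio: 9
```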

QOS Time Limits

Run time limits are capped for several reasons:

  • Jobs with longer time limits are more difficult to schedule.
  • Long time limits make system maintenance harder. We have to perform maintenance on the systems (OS updates, security patches, etc.). If the allowable time limits were longer, it could make important maintenance tasks virtually impossible. Of particular importance is the ability to install security updates on the systems quickly and efficiently. If we cannot install them because user jobs are running for months at a time, we have to choose to either kill the user jobs or risk security issues on the system, which could affect all users.
  • The longer a job runs, the more likely it is to end prematurely due to random hardware failure.

Thus, if your application can save and resume its state, we recommend checkpointing your jobs: instead of running one extremely long job, run a series of shorter jobs, each restarting from the last checkpoint.
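
The idea can be sketched in a few lines of shell. This is an illustration only, not a HiPerGator-specific recipe: run_with_checkpoint, the checkpoint file name, and the step count are all hypothetical, and a real application would save its own state rather than a step counter.

```shell
# run_with_checkpoint FILE TOTAL_STEPS
# Hypothetical helper: runs TOTAL_STEPS units of work and records the last
# completed step in FILE, so a later run can pick up where this one stopped.
run_with_checkpoint() {
    ckpt_file=$1
    total_steps=$2
    step=0
    # Resume from the last completed step if a checkpoint already exists.
    if [ -f "$ckpt_file" ]; then
        step=$(cat "$ckpt_file")
    fi
    while [ "$step" -lt "$total_steps" ]; do
        step=$((step + 1))
        # ... one unit of real work for this step would go here ...
        echo "$step" > "$ckpt_file"   # record progress after each step
    done
    echo "finished at step $step"
}

demo_dir=$(mktemp -d)                             # scratch directory for the demo
run_with_checkpoint "$demo_dir/analysis.ckpt" 5   # fresh run: steps 1 through 5
```

If a job is killed partway through, resubmitting it with the same checkpoint file continues from the recorded step instead of starting over.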

The 31 day investment QOS time limit on HiPerGator is generous compared to other major institutions. Here are examples we were able to find.

Institution                        Maximum Runtime
---------------------------------- -----------------------------------
New York University                4 days
University of Southern California  2 weeks for 1 node, otherwise 1 day
TACC: Lonestar                     1 day
Princeton: Della                   6 days
University of Maryland             14 days

GPU Resource Limits

As per the Scheduler/Job Policy, there is no burst QOS for GPU jobs.