Account and QOS limits under SLURM
[[Category:SLURM]]
[[Category:Scheduler]]
{|align=right
  |__TOC__
  |}
Every group on HiPerGator (HPG) must have an '''investment''' with a corresponding hardware allocation to be able to do any work on HPG. Each allocation is associated with a scheduler '''account'''. Each account has two quality of service (QOS) levels: a high-priority '''investment QOS''' and a low-priority '''burst QOS'''. The latter allows short-term borrowing of unused resources from other groups' accounts. In turn, each user in a group has a scheduler account association. In the end, it is this association that determines which QOSes are available to a particular user. Users with secondary Linux group membership will also have associations with the QOSes of their secondary groups.
In summary, each HPG user has scheduler associations with group account based QOSes that determine what resources are available to the user's jobs. A QOS can be thought of as a pool of computational (CPU cores), memory (RAM), and run time (time limit) resources with an associated starting priority level; jobs consume resources from that pool according to the QOS limits, which we review below.
==Account and QOS==
===Using the resources from a secondary group===
By default, when you submit a job on HiPerGator, it will use the resources from your primary group. You can easily see your primary and secondary groups with the <code>id</code> command:
<pre>
[agator@login4 ~]$ id
uid=12345(agator) gid=12345(gator-group) groups=12345(gator-group),12346(second-group),12347(third-group)
[agator@login4 ~]$
</pre>

As shown above, our fictional user <code>agator</code>'s primary group is <code>gator-group</code>, and they also have the secondary groups <code>second-group</code> and <code>third-group</code>.

To use the resources of one of their secondary groups rather than their primary group, <code>agator</code> can set the <code>--account</code> and <code>--qos</code> options in the submit script, on the <code>sbatch</code> command line, or in the corresponding fields of the Open on Demand interface.

For example, to use the <code>second-group</code> resources they could:

[[File:OOD.account qos.screenshot.png|thumb|right|How to set account and qos options to use resources from a secondary group in Open on Demand]]

# In a submit script, add these lines:<code><br>#SBATCH --account=second-group<br>#SBATCH --qos=second-group</code>
# In the <code>sbatch</code> command: <code>sbatch --account=second-group --qos=second-group my_script.sh</code>
# Using Open on Demand: set the account and QOS fields as shown in the screenshot.
# '''Note:''' Jupyterhub can only use your primary group's resources and cannot be used for accessing secondary group resources.
# '''Note:''' To use a secondary group's burst QOS, the <code>--account</code> parameter is still <code>second-group</code>, while the <code>--qos</code> parameter is <code>second-group-b</code>. The QOS is different, but the account remains the same. This may make more sense after viewing the output of the <code>showAssoc</code> command in the ''See your associations'' section immediately below.
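For instance, a submit-script header that borrows the secondary group's burst QOS would carry both options together. This is only a sketch; <code>second-group</code> is the placeholder group name used above, and the resource values are arbitrary:

```shell
#!/bin/bash
# Sketch: run a job under a secondary group's burst QOS.
# 'second-group' is a placeholder account name from the examples above.
#SBATCH --job-name=burst_example
#SBATCH --account=second-group    # the account stays the same
#SBATCH --qos=second-group-b      # the burst QOS adds the '-b' suffix
#SBATCH --cpus-per-task=1
#SBATCH --mem=2gb
#SBATCH --time=24:00:00

hostname
```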
===See your associations===
On the command line, you can view your SLURM associations with the <code>showAssoc</code> command:
 
  $ showAssoc <username>
'''Example:''' <code>$ showAssoc magitz</code> output:  
 
<pre style="color:black; background:WhiteSmoke; border:1px solid gray;">
               User    Account  Def Acct  Def QOS                                      QOS  
magitz                  borum      ufhpc    borum borum,borum-b   
</pre>
The output shows that the user <code>magitz</code> has four account associations and thus access to 8 different QOSes. By convention, a user's default account is always the account of their primary group, and their default QOS is that account's investment (high-priority) QOS. If a user does not explicitly request a specific account and QOS, the defaults will be assigned to the job.
  
 
If the user <code>magitz</code> wanted to use the <code>borum</code> group's account - to which he has access by virtue of the <code>borum</code> account association - he would specify the account and the chosen QOS in his batch script as follows:
 
 
<pre>
  #SBATCH  --account=borum
 
 
  #SBATCH  --qos=borum-b
 
 
</pre>
 
Note that both <code>--account</code> and <code>--qos</code> must be specified. Otherwise the scheduler will assume the default <code>ufhpc</code> account is intended, and neither the <code>borum</code> nor <code>borum-b</code> QOSes will be available to the job, so the submission will be denied. In addition, you cannot mix and match resources from different allocations.
  
 
These sbatch directives can also be given as command line arguments to <code>srun</code>. For example:  
 
 
  $ srun --account=borum --qos=borum-b <example_command>
 
  
===QOS Resource Limits===
  
CPU cores, memory (RAM), GPU accelerators, software licenses, etc. are referred to as '''Trackable Resources''' (TRES) by the scheduler. The TRES available in a given QOS are determined by the group's investment and the QOS configuration.
 
 
View the trackable resource limits for a QOS:
 
  $ showQos <specified_qos>
 
'''Example:''' <code>$ showQos borum</code> output:
 
<pre style="color:black; background:WhiteSmoke; border:1px solid gray;">
                 Name                          Descr                                      GrpTRES  GrpCPUs  
-------------------- ------------------------------ --------------------------------------------- -------- 
borum                borum qos                      cpu=9,gres/gpu=0,mem=32400M                          9
</pre>
  
We can see that the <code>borum</code> investment QOS has a pool of 9 CPU cores, about 32GB (32400MB) of RAM, and no GPUs. This pool of resources is shared among all members of the <code>borum</code> group running jobs in that QOS.
  
  
Similarly, the <code>borum-b</code> burst QOS resource limits shown by <code>$ showQos borum-b</code> are:
  
 
<pre style="color:black; background:WhiteSmoke; border:1px solid gray;">
                 Name                          Descr                                      GrpTRES  GrpCPUs  
-------------------- ------------------------------ --------------------------------------------- -------- 
borum-b              borum burst qos                cpu=81,gres/gpu=0,mem=291600M                      81
</pre>
  
  
There are additional limits associated with QOSes, such as the base priority and the maximum wall time. To display them, run:
 
<pre>
  $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" <specified_qos> 
</pre>
'''Example:''' <code>$ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b</code>
  
The investment and burst QOS jobs are limited to 31-day and 4-day run times, respectively. Additionally, the base priority of a burst QOS job is 1/40th that of an investment QOS job. It is important to remember that the base priority is only one component of the job's overall priority, and that the priority will change over time as the job waits in the queue.
The burst QOS CPU and memory limits are nine times (9x) those of the investment QOS, up to a certain limit, and are intended to allow groups to take advantage of unused resources for short periods of time by borrowing resources from other groups.
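Using the <code>borum</code> numbers from the <code>showQos</code> output above, the 9x relationship works out as a simple arithmetic sketch:

```shell
# Burst QOS limits are 9x the investment QOS limits (borum example above).
inv_cpu=9
inv_mem_mb=32400
echo "burst: cpu=$((inv_cpu * 9)),mem=$((inv_mem_mb * 9))M"
```

This reproduces the <code>cpu=81</code> and <code>mem=291600M</code> values shown for <code>borum-b</code>.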
  
=== QOS Time Limits ===
* Jobs with longer time limits are more difficult to schedule.
* Long time limits make system maintenance harder. We have to perform maintenance on the systems (OS updates, security patches, etc.). If the allowable time limits were longer, it could make important maintenance tasks virtually impossible. Of particular importance is the ability to install security updates on the systems quickly and efficiently. If we cannot install them because user jobs are running for months at a time, we have to choose to either kill the user jobs or risk security issues on the system, which could affect all users.
* The longer a job runs, the more likely it is to end prematurely due to random hardware failure.
 
  
Thus, if the application allows saving and resuming the analysis, it is recommended that instead of running jobs for extremely long times you utilize checkpointing, so that the work can be restarted and run as a series of shorter jobs instead.
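One common pattern is a job script that resumes from a checkpoint file and resubmits itself until the analysis is done. This is only a sketch; <code>my_app</code>, its flags, and the file names are hypothetical placeholders for whatever checkpoint/restart mechanism your application provides:

```shell
#!/bin/bash
#SBATCH --job-name=checkpointed_run
#SBATCH --time=4-00:00:00
#SBATCH --mem=4gb

# Hypothetical application: resume from a checkpoint if one exists.
if [ -f checkpoint.dat ]; then
    my_app --resume checkpoint.dat
else
    my_app --input data.in
fi

# If the analysis is not finished, submit a follow-up job to continue it.
if [ ! -f analysis.done ]; then
    sbatch "$0"
fi
```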
  
The 31-day investment QOS time limit on HiPerGator is generous compared to those at other major institutions. Here are examples we were able to find:
 
{| class="wikitable"
|-
|UMBC|| 5 days
|-
|TACC: Stampede||2 days
|-
|TACC: Lonestar|| 1 day
|-
|Princeton: Della|| 6 days
|-
|Princeton: Hecate|| 15 days
|-
|University of Maryland || 14 days
|}
  
=== CPU cores and Memory (RAM) Resource Use ===
CPU cores and RAM are allocated to jobs independently, as requested in job scripts. When selecting how many CPU cores and how much memory to request for a job, take into account the QOS limits based on the group investment, the limitations of the hardware (compute nodes), and the desire to be a good neighbor on a shared resource like HiPerGator. This helps ensure that system resources are allocated efficiently, used fairly, and everyone has a chance to get their work done without negatively impacting work performed by other researchers.
  
HiPerGator consists of many interconnected servers (compute nodes). The hardware resources of each compute node, including CPU cores, memory, memory bandwidth, network bandwidth, and [[Temporary_Directories|local storage]], are limited. If any single one of these resources is fully consumed, the remaining unused resources can become effectively wasted, which makes it progressively harder or even impossible to achieve the shared goals of Research Computing and the UF researcher community described above. See [[Available Node Features]] for details on compute node hardware. Nodes with similar hardware are generally grouped into partitions. If a job requires larger nodes or particular hardware, make sure to explicitly specify a partition.
'''Example:'''
  --partition=bigmem
  
When a job is submitted, if no resource request is provided, the scheduler sets default limits of 1 CPU core, 600MB of memory, and a 10 minute time limit. Check the resource request if it is not clear why a job ended before the analysis was done; a premature exit can be due to the job exceeding the time limit or the application using more memory than was requested.
  
Run test jobs to find out what resources a particular analysis needs. To make sure that the analysis completes successfully without wasting valuable resources, specify both the number of CPU cores and the amount of memory needed in the job script. See [[Sample SLURM Scripts]] for examples of CPU core requests depending on the nature of the application running in a job. Use <code>--mem</code> (total job memory on a node) or <code>--mem-per-cpu</code> (per-core memory) to request memory, and <code>--time</code> to set a time limit to an appropriate value within the QOS limit.
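For example, a minimal single-core job script with explicit CPU, memory, and time requests might look like this sketch (the values are placeholders to adjust for your analysis):

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis    # placeholder job name
#SBATCH --cpus-per-task=1         # number of CPU cores
#SBATCH --mem=4gb                 # total memory for the job on a node
#SBATCH --time=24:00:00           # time limit, within the QOS limit

# Replace with the actual analysis command.
echo "Running on $(hostname)"
```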
As jobs are submitted and the resources under a particular account are consumed, the group may reach either the CPU or the memory group limit. The group has consumed all cores in a QOS if the scheduler shows <code>QOSGrpCpuLimit</code>, or all memory if it shows <code>QOSGrpMemLimit</code>, as the reason a job is pending (the 'NODELIST(REASON)' column of the <code>squeue</code> command output).
'''Example:'''
<pre>
            JOBID PARTITION    NAME    USER ST      TIME  NODES NODELIST(REASON)
            123456    bigmem test_job    jdoe PD      0:00      1 (QOSGrpMemLimit)
</pre>
  
Reaching a resource limit of a QOS does not interfere with job submission. However, jobs with this reason will not run and will remain in the pending state until the QOS use falls below the limit.
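To check whether a group's pending jobs are being held at a QOS limit, the standard <code>squeue</code> options can be combined; for example, using the <code>borum</code> account from the examples above (this is a sketch and requires access to a running SLURM cluster):

```shell
# List pending jobs for one account, with the reason each is waiting.
squeue --account=borum --states=PENDING --format="%.10i %.9P %.20j %.8u %.2t %.10M %R"
```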
  
If the resource request of a submitted job is impossible to satisfy within either the QOS limits or the compute node hardware of a particular partition, the scheduler will refuse the job submission altogether and return the following error message:
<pre>
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy 
               (job submit limit, user's size and/or time limits)
</pre>
=== GPU Resource Limits ===

As per the [https://www.rc.ufl.edu/documentation/policies/scheduler/ Scheduler/Job Policy], there is no burst QOS for GPU jobs.
  
 
==Choosing QOS for a Job==
 
When choosing between the high-priority investment QOS and the 9x larger low-priority burst QOS, you should start by considering the overall resource requirements for the job. For smaller allocations the investment QOS may not be large enough for some jobs, whereas for other smaller jobs the wait time in the burst QOS could be too long. In addition, consider the current state of the account you are planning to use for your job.
  
{{Note|For any individual jobs submitted to the '''Burst QOS''' we do not guarantee that they will ever start, although historical data shows that burst jobs do start and provide significant additional throughput to groups that use them correctly as 'long queues', i.e.

* Submit only non-time-critical jobs to the Burst QOS.
* Parallelize analyses to make sure they can run within the 4-day window.
* Let the scheduler take its time to find unused resources to run burst jobs.

In summary, the Burst QOS is best handled in a "hands-off" fashion. If any of your analyses are time-critical then you should be submitting them to the appropriately sized investment QOS.|reminder}}

To show the status of any SLURM account as well as the overall usage of HiPerGator resources, use the following command from the [[UFRC_environment_module|UFRC module]]:
  $ slurmInfo
for the primary account, or
  $ slurmInfo <account>
for another account.

'''Example:''' <code>$ slurmInfo ufgi</code>:
<pre style="color:black; background:WhiteSmoke; border:1px solid gray;">
----------------------------------------------------------------------
Allocation summary:    Time Limit            Hardware Resources
  Investment QOS          Hours          CPU     MEM(GB)    GPU
----------------------------------------------------------------------
            ufgi            744          150        527      0
----------------------------------------------------------------------
CPU/MEM Usage:                Running        Pending        Total
                      CPU  MEM(GB)    CPU  MEM(GB)    CPU  MEM(GB)
----------------------------------------------------------------------
    Investment (ufgi):   100      280    0        0  100      280
----------------------------------------------------------------------
HiPerGator Utilization
              CPU: Used (%) / Total        MEM(GB): Used (%) / Total
----------------------------------------------------------------------
         Total :  43643 (92%) / 47300    113295500 (57%) /  196328830
----------------------------------------------------------------------
* Burst QOS uses idle cores at low priority with a 4-day time limit

Run 'slurmInfo -h' to see all available options
</pre>
The output shows that the investment QOS for the <code>ufgi</code> account is actively used. Since 100 CPU cores out of the 150 available are in use, only 50 cores remain available. In the same vein, since 280GB out of the 527GB in the investment QOS are in use, 247GB are still available. The <code>ufgi-b</code> burst QOS is unused. The total HiPerGator use is 92% of all CPU cores and 57% of all memory on compute nodes, which means that there is little available capacity from which burst resources can be drawn. In this case a job submitted to the <code>ufgi-b</code> QOS would likely take a long time to start. If the overall utilization were below 80% it would be easier to start a burst job within a reasonable amount of time. When the HiPerGator load is high, or when the burst QOS is actively used, the investment QOS is more appropriate for a smaller job.
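The available-resource arithmetic above can be sketched as:

```shell
# Remaining investment QOS resources, using the ufgi example numbers above.
limit_cpu=150; used_cpu=100
limit_mem_gb=527; used_mem_gb=280
echo "available: $((limit_cpu - used_cpu)) cores, $((limit_mem_gb - used_mem_gb))GB"
```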
 
 
  
 
==Examples==
 
  
A hypothetical group ($GROUP in the examples below) has an investment of 42 CPU cores and 148GB of memory. That is the group's so-called ''soft limit'' for HiPerGator jobs in the investment QOS, with up to a 744-hour time limit at high priority. The hard limit, accessible through the so-called ''burst QOS'', is 9 times that, giving the group potentially a total of 10x the invested resources, i.e. 420 total CPU cores and 1480GB of total memory, with the burst QOS providing 378 CPU cores and 1330GB of that capacity for up to 96 hours at low base priority.
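In CPU-core terms, the soft/hard limit arithmetic above works out as:

```shell
# Investment (soft) limit and burst (hard) limit for the hypothetical group.
invest_cpu=42
burst_cpu=$((invest_cpu * 9))     # burst QOS is 9x the investment
total_cpu=$((invest_cpu + burst_cpu))
echo "burst=${burst_cpu} total=${total_cpu}"
```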
  
 
Let's test:
 
<pre>
[marvin@gator ~]$ srun --mem=126gb --pty bash -i
  
 
srun: job 123456 queued and waiting for resources
 
 
srun: Job allocation 123456 has been revoked
 
 
srun: Force Terminated job 123456
 
</pre>
  
 
On the other hand, going even 1gb over that limit results in the already encountered job limit error
 
  
<pre>
 
[marvin@gator ~]$ srun --mem=127gb --pty bash -i
 
 
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits
 
</pre>
  
 
At this point the group can try using the 'burst' QOS with
 
 
  #SBATCH --qos=$GROUP-b
 
  
 
Let's test:
 
 
<pre>
[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i
 
  
 
srun: Job allocation 123457 has been revoked
 
 
srun: Force Terminated job 123457
 
</pre>
  
 
However, now there's the burst qos time limit to consider.
 
  
<pre>
 
[marvin@gator ~]$ srun --mem=400gb --time=300:00:00 --pty bash -i
 
[marvin@gator ~]$ srun --mem=400gb --time=300:00:00 --pty bash -i
  
 
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits
 
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits
</source>
+
</pre>
  
 
Let's reduce the time limit to what burst qos supports and try again.
 
Let's reduce the time limit to what burst qos supports and try again.
  
<source lang=bash>
+
<pre>
 
[marvin@gator ~]$ srun --mem=400gb --time=96:00:00 --pty bash -i
 
[marvin@gator ~]$ srun --mem=400gb --time=96:00:00 --pty bash -i
  
Line 226: Line 261:
 
srun: Job allocation 123458 has been revoked
 
srun: Job allocation 123458 has been revoked
 
srun: Force Terminated job
 
srun: Force Terminated job
</source>
+
</pre>
 
 
  
 
==Pending Job Reasons==
 
==Pending Job Reasons==
The following ''Reasons'' can be seen in the <code>NODELIST(REASON)</code> column of the <code>squeue</code> command when the group reaches the resource limit for the respective account/qos combination:
+
To reiterate, the following ''Reasons'' can be seen in the <code>NODELIST(REASON)</code> column of the <code>squeue</code> command when the group reaches the resource limit for a QOS:
  
 
;QOSGrpCpuLimit
 
;QOSGrpCpuLimit
Line 237: Line 271:
 
;QOSGrpMemLimit
 
;QOSGrpMemLimit
 
All memory available for the listed account within the respective QOS as described in the previous section is in use.
 
All memory available for the listed account within the respective QOS as described in the previous section is in use.
 +
 +
{{Note|Once it has marked any jobs in the group's list of pending jobs with a reason of <code>QOSGrpCpuLimit</code> or <code>QOSGrpMemLimit</code>, SLURM may not evaluate other jobs and they may simply be listed with the <code>Priority</code> reason code. See [https://help.rc.ufl.edu/doc/Why_is_my_job_not_running Why is my job not running] for a list of reasons.|reminder}}

Latest revision as of 14:00, 29 November 2022


Account and QOS

Using the resources from a secondary group

By default, when you submit a job on HiPerGator, it will use the resources from your primary group. You can easily see your primary and secondary groups with the id command:

[agator@login4 ~]$ id
uid=12345(agator) gid=12345(gator-group) groups=12345(gator-group),12346(second-group),12347(third-group)
[agator@login4 ~]$ 

As shown above, our fictional user agator's primary group is gator-group and they also have secondary groups of second-group and third-group.

To use the resources of one of their secondary groups rather than their primary group, agator can use the --account and --qos flags in the submit script, in the sbatch command, or in the fields of the Open on Demand interface. For example, to use second-group they could:

  1. In a submit script, add these lines:
    #SBATCH --account=second-group
    #SBATCH --qos=second-group
  2. In the sbatch command: sbatch --account=second-group --qos=second-group my_script.sh
  3. In Open on Demand: set the account and QOS options in the corresponding fields of the job request form.
  Note: JupyterHub can only use your primary group's resources and cannot be used for accessing secondary group resources.
  Note: To use a secondary group's burst QOS, the --account= parameter is still 'second-group', while the --qos= parameter is 'second-group-b'. The QOS is different, but the account remains the same. This may make more sense after viewing the output of the showAssoc command in the "See your associations" section immediately below.
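The submit-script variant above can be collected into a minimal, hypothetical job script. The group name, job name, and resource values below are placeholders from this example, not real settings:

```shell
# Write a minimal job script that charges a secondary group's account.
# 'second-group' and the resource values are illustrative placeholders.
cat > use_secondary_group.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=secondary-group-demo
#SBATCH --account=second-group   # account of the secondary group
#SBATCH --qos=second-group       # its investment QOS; use second-group-b to burst
#SBATCH --ntasks=1
#SBATCH --mem=2gb
#SBATCH --time=01:00:00
hostname
EOF

# On a real cluster the script would then be submitted with:
#   sbatch use_secondary_group.sh
```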

See your associations

On the command line, you can view your SLURM associations with the showAssoc command:

$ showAssoc <username>

Example: $ showAssoc magitz output:

              User    Account   Def Acct   Def QOS                                      QOS 
------------------ ---------- ---------- --------- ---------------------------------------- 
magitz                zoo6927      ufhpc     ufhpc zoo6927,zoo6927-b                        
magitz                  ufhpc      ufhpc     ufhpc ufhpc,ufhpc-b                            
magitz                 soltis      ufhpc    soltis soltis,soltis-b                          
magitz                  borum      ufhpc     borum borum,borum-b   

The output shows that the user magitz has four account associations and eight different QOSes. By convention, a user's default account is always the account of their primary group. Additionally, their default QOS is the investment (high-priority) QOS. If a user does not explicitly request a specific account and QOS, the user's default account and QOS will be assigned to the job.

If the user magitz wanted to use the borum group's account - to which he has access by virtue of the borum account association - he would specify the account and the chosen QOS in his batch script as follows:

 #SBATCH  --account=borum
 #SBATCH  --qos=borum

Or, for the burst QOS:

 #SBATCH  --account=borum
 #SBATCH  --qos=borum-b

Note that both --account and --qos must be specified. Otherwise the scheduler will assume the default ufhpc account is intended, and neither the borum nor borum-b QOS will be available to the job; consequently, the scheduler will deny the submission. In addition, you cannot mix and match resources from different allocations.

These sbatch directives can also be given as command line arguments to srun. For example:

$ srun --account=borum --qos=borum-b <example_command>

QOS Resource Limits

CPU cores, Memory (RAM), GPU accelerators, software licenses, etc. are referred to as Trackable Resources (TRES) by the scheduler. The TRES available in a given QOS are determined by the group's investments and the QOS configuration.

View the trackable resource limits for a QOS:

$ showQos <specified_qos>

Example: $ showQos borum output:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum                borum qos                      cpu=9,gres/gpu=0,mem=32400M                          9 

We can see that the borum investment QOS has a pool of 9 CPU cores, 32GB of RAM, and no GPUs. This pool of resources is shared among all members of the borum group.


Similarly, the borum-b burst QOS resource limits shown by $ showQos borum-b are:

                Name                          Descr                                       GrpTRES  GrpCPUs 
-------------------- ------------------------------ --------------------------------------------- -------- 
borum-b              borum burst qos                cpu=81,gres/gpu=0,mem=291600M                       81 
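As a quick sanity check, the burst limits are exactly nine times the investment limits; the values below are copied from the two showQos outputs above:

```shell
# Investment QOS limits for 'borum' (from showQos above).
inv_cpu=9
inv_mem_mb=32400

# Burst limits are 9x the investment limits.
burst_cpu=$((inv_cpu * 9))        # 81, matching showQos borum-b
burst_mem_mb=$((inv_mem_mb * 9))  # 291600, matching showQos borum-b
echo "burst: cpu=${burst_cpu}, mem=${burst_mem_mb}M"
```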


There are additional base priority and run time limits associated with QOSes. To display them, run:

 $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" <specified_qos> 

Example: $ sacctmgr show qos format="name%-20,Description%-30,priority,maxwall" borum borum-b output:

                Name                          Descr   Priority     MaxWall 
-------------------- ------------------------------ ---------- ----------- 
borum                borum qos                           36000 31-00:00:00 
borum-b              borum burst qos                       900  4-00:00:00 

The investment and burst QOS jobs are limited to 31-day and 4-day run times, respectively. Additionally, the base priority of a burst QOS job is 1/40th that of an investment QOS job. It is important to remember that the base priority is only one component of the job's overall priority, which will change over time as the job waits in the queue.
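The 1/40th figure follows directly from the Priority column of the sacctmgr output above:

```shell
# Base priorities from the sacctmgr output above.
inv_priority=36000
burst_priority=900
echo "ratio: $((inv_priority / burst_priority))"  # prints 40
```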

The burst QOS CPU and memory limits are nine times (9x) those of the investment QOS, up to a certain limit, and are intended to allow groups to take advantage of unused resources for short periods of time by borrowing resources from other groups' accounts.

QOS Time Limits

Time limits are imposed for several reasons:

  • Jobs with longer time limits are more difficult to schedule.
  • Long time limits make system maintenance harder. We have to perform maintenance on the systems (OS updates, security patches, etc.). If the allowable time limits were longer, it could make important maintenance tasks virtually impossible. Of particular importance is the ability to install security updates on the systems quickly and efficiently. If we cannot install them because user jobs are running for months at a time, we have to choose to either kill the user jobs or risk security issues on the system, which could affect all users.
  • The longer a job runs, the more likely it is to end prematurely due to random hardware failure.

Thus, if the application supports saving and resuming an analysis, it is recommended that you checkpoint your jobs so that, instead of running for extremely long times, they can be restarted as a series of shorter jobs.

The 31 day investment QOS time limit on HiPerGator is generous compared to other major institutions. Here are examples we were able to find.

Institution Maximum Runtime
New York University 4 days
University of Southern California 2 weeks for 1 node, otherwise 1 day
PennState 2 weeks for up to 32 cores (contributors), 4 days for up to 256 cores otherwise
UMBC 5 days
TACC: Lonestar 1 day
Princeton: Della 6 days
University of Maryland 14 days

CPU cores and Memory (RAM) Resource Use

CPU cores and RAM are allocated to jobs independently as requested in job scripts. When selecting how many CPU cores and how much memory to request for a job, take into account the QOS limits based on the group's investment, the limitations of the compute node hardware, and the need to be a good neighbor on a shared resource like HiPerGator, so that system resources are allocated efficiently, used fairly, and everyone has a chance to get their work done without negatively impacting the work of other researchers.

HiPerGator consists of many interconnected servers (compute nodes). The hardware resources of each compute node, including CPU cores, memory, memory bandwidth, network bandwidth, and local storage, are limited. If any single one of these resources is fully consumed, the remaining unused resources can become effectively wasted, which makes it progressively harder or even impossible to achieve the shared goals of Research Computing and the UF Researcher Community stated above. See the Available Node Features for details on compute node hardware. Nodes with similar hardware are generally grouped into partitions. If the job requires larger nodes or particular hardware, make sure to explicitly specify a partition. Example:

--partition=bigmem

When a job is submitted without a resource request, the scheduler sets default limits of 1 CPU core, 600MB of memory, and a 10-minute time limit. Check the resource request if it is not clear why the job ended before the analysis was done; a premature exit can be due to the job exceeding the time limit or the application using more memory than was requested.

Run test jobs to find out what resources a particular analysis needs. To make sure that the analysis is performed successfully without wasting valuable resources, you must specify both the number of CPU cores and the amount of memory needed for the analysis in the job script. See Sample SLURM Scripts for examples of specifying CPU core requests depending on the nature of the application running in a job. Use --mem (total job memory on a node) or --mem-per-cpu (per-core memory) to request memory, and --time to set a time limit within the QOS limit.
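For instance, a job script header that makes the CPU, memory, and time requests explicit might look like the following sketch; the values and the command name are illustrative placeholders, not recommendations:

```shell
#!/bin/bash
#SBATCH --job-name=explicit-request-demo
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   # CPU cores; see Sample SLURM Scripts for parallel layouts
#SBATCH --mem=8gb           # total memory for the job on a node (or use --mem-per-cpu)
#SBATCH --time=04:00:00     # must fall within the QOS time limit

my_analysis_command         # placeholder for the actual application
```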

As jobs are submitted and the resources under a particular account are consumed, the group may reach either the CPU or the memory group limit. The scheduler shows QOSGrpCpuLimit when the group has consumed all cores in a QOS, or QOSGrpMemLimit when it has consumed all memory, in the reason a job is pending (the 'NODELIST(REASON)' column of the squeue command output).

Example:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456    bigmem test_job     jdoe PD       0:00      1 (QOSGrpMemLimit)

Reaching a resource limit of a QOS does not interfere with job submission. However, the jobs with this reason will not run and will remain in the pending state until the QOS use falls below the limit.
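As a sketch, jobs held by a QOS limit can be picked out of squeue-style output like the example above; the sample lines below are made up for illustration:

```shell
# Made-up squeue output: one job blocked by a QOS memory limit, one merely
# waiting on priority.
sample_squeue='123456 bigmem test_job  jdoe PD 0:00 1 (QOSGrpMemLimit)
123457 bigmem test_job2 jdoe PD 0:00 1 (Priority)'

# Keep only lines whose last field is a QOS group limit reason.
blocked=$(echo "$sample_squeue" | awk '$NF ~ /QOSGrp(Cpu|Mem)Limit/')
echo "$blocked"
echo "$blocked" | wc -l   # prints 1
```

On a live system the same filter could be applied to real squeue output, e.g. `squeue -t PD | awk '$NF ~ /QOSGrp(Cpu|Mem)Limit/'`; the flags are standard Slurm, but check the local column layout first.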

If the resource request for a submitted job is impossible to satisfy within either the QOS limits or the compute node hardware of a particular partition, the scheduler will refuse the job submission altogether and return the following error message:

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy 
               (job submit limit, user's size and/or time limits)

GPU Resource Limits

As per the Scheduler/Job Policy, there is no burst QOS for GPU jobs.

Choosing QOS for a Job

When choosing between the high-priority investment QOS and the nine times larger low-priority burst QOS, start by considering the overall resource requirements of the job. For smaller allocations the investment QOS may not be large enough for some jobs, while for other, smaller jobs the wait time in the burst QOS could be too long. In addition, consider the current state of the account you are planning to use for your job.

We do not guarantee that any individual job submitted to the burst QOS will ever start. However, historical data show that burst jobs do start and provide significant additional throughput to groups that use them correctly as 'long queues', i.e.:
  • Submit only non-time-critical jobs to the Burst QOS.
  • Parallelize analyses to make sure they can run within the 4-day window.
  • Let the scheduler take its time to find unused resources to run burst jobs.
In summary, the burst QOS is best handled in a "hands-off" fashion. If any of your analyses are time-critical, submit them to the appropriately sized investment QOS.

To show the status of any SLURM account as well as the overall usage of HiPerGator resources, use the following command from the UFRC module:

$ slurmInfo

for the primary account or

$ slurmInfo <account>

for another account

Example: $ slurmInfo ufgi:

----------------------------------------------------------------------
Allocation summary:    Time Limit             Hardware Resources
   Investment QOS           Hours          CPU     MEM(GB)     GPU
----------------------------------------------------------------------
             ufgi             744          150         527       0
----------------------------------------------------------------------
CPU/MEM Usage:                Running        Pending        Total
                       CPU   MEM(GB)    CPU   MEM(GB)    CPU   MEM(GB)
----------------------------------------------------------------------
     Investment (ufgi):   100      280     0        0   100      280
----------------------------------------------------------------------
HiPerGator Utilization
               CPU: Used (%) / Total        MEM(GB): Used (%) / Total
----------------------------------------------------------------------
        Total :  43643 (92%) / 47300    113295500 (57%) /   196328830
----------------------------------------------------------------------
* Burst QOS uses idle cores at low priority with a 4-day time limit

Run 'slurmInfo -h' to see all available options

The output shows that the investment QOS for the ufgi account is actively used. Since 100 of the 150 available CPU cores are in use, only 50 cores remain available. In the same vein, since 280GB of the 527GB in the investment QOS are used, 247GB are still available. The ufgi-b burst QOS is unused. The total HiPerGator use is 92% of all CPU cores and 57% of all memory on compute nodes, which means there is little spare capacity from which burst resources can be drawn. In this case a job submitted to the ufgi-b QOS would likely take a long time to start. If the overall utilization were below 80%, it would be easier to start a burst job within a reasonable amount of time. When the HiPerGator load is high, or if the burst QOS is actively used, the investment QOS is more appropriate for a smaller job.
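The free-resource figures are simple subtraction of used resources from the QOS limits, with the numbers copied from the slurmInfo output above:

```shell
# Investment QOS limits and current usage for ufgi (from slurmInfo above).
cpu_limit=150; cpu_used=100
mem_limit_gb=527; mem_used_gb=280
echo "free CPUs:   $((cpu_limit - cpu_used))"          # prints 50
echo "free memory: $((mem_limit_gb - mem_used_gb))GB"  # prints 247GB
```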

Examples

A hypothetical group ($GROUP in the examples below) has an investment of 42 CPU cores and 148GB of memory. That is the group's so-called soft limit for HiPerGator jobs in the investment QOS, with a time limit of up to 744 hours at high priority. The hard limit, accessible through the so-called burst QOS, is nine times that, giving the group potentially a total of 10x the invested resources: 420 total CPU cores and 1480GB of total memory, with the burst QOS providing 378 CPU cores and about 1330GB of memory for up to 96 hours at low base priority.
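The burst and total figures follow from multiplying the investment by 9 and 10, respectively (note 148GB x 9 is 1332GB; the text above rounds this to 1330GB):

```shell
# Hypothetical investment from the example above.
inv_cpu=42; inv_mem_gb=148
echo "burst: $((inv_cpu * 9)) cores, $((inv_mem_gb * 9))GB"    # 378 cores, 1332GB
echo "total: $((inv_cpu * 10)) cores, $((inv_mem_gb * 10))GB"  # 420 cores, 1480GB
```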

Let's test:

 [marvin@gator ~]$ srun --mem=126gb --pty bash -i

srun: job 123456 queued and waiting for resources

#Looks good, let's terminate the request with Ctrl+C

^C
srun: Job allocation 123456 has been revoked
srun: Force Terminated job 123456

On the other hand, going even 1GB over that limit results in the job limit error we have already encountered:

[marvin@gator ~]$ srun --mem=127gb --pty bash -i
srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits

At this point the group can try using the burst QOS with:

#SBATCH --qos=$GROUP-b

Let's test:

[marvin@gator3 ~]$ srun -p bigmem --mem=400gb --time=96:00:00 --qos=$GROUP-b --pty bash -i

srun: job  123457 queued and waiting for resources

#Looks good, let's terminate with Ctrl+C

^C
srun: Job allocation 123457 has been revoked
srun: Force Terminated job 123457

However, now there's the burst qos time limit to consider.

[marvin@gator ~]$ srun --mem=400gb --time=300:00:00 --pty bash -i

srun: error: Unable to allocate resources: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits

Let's reduce the time limit to what the burst QOS supports and try again.

[marvin@gator ~]$ srun --mem=400gb --time=96:00:00 --pty bash -i

srun: job  123458 queued and waiting for resources

#Looks good, let's terminate with Ctrl+C

^C
srun: Job allocation 123458 has been revoked
srun: Force Terminated job

Pending Job Reasons

To reiterate, the following Reasons can be seen in the NODELIST(REASON) column of the squeue command when the group reaches the resource limit for a QOS:

QOSGrpCpuLimit

All CPU cores available for the listed account within the respective QOS are in use.

QOSGrpMemLimit

All memory available for the listed account within the respective QOS as described in the previous section is in use.

Once it has marked any jobs in the group's list of pending jobs with a reason of QOSGrpCpuLimit or QOSGrpMemLimit, SLURM may not evaluate other jobs and they may simply be listed with the Priority reason code. See Why is my job not running for a list of reasons.