SLURM Job States
Latest revision as of 14:50, 30 October 2024
This page is obsolete. It is being retained for archival purposes. The current version of this page can be found at https://docs.rc.ufl.edu/scheduler/job_states
See the SLURM squeue documentation for the full list of job states/reason codes. Here we list the most frequently encountered job states on HiPerGator for quick reference.
Squeue
- BadConstraints: The job's resource request or constraints cannot be satisfied. The resource request includes parameters such as nodes, ntasks, cpus-per-task, gpus (or gres=gpu...), partition, feature, and constraint. In some cases this state is temporary, e.g., if a job requests a full node, or a large enough portion of one, that the scheduler cannot find appropriate resources right away.
- Resources: The job is waiting for resources. This typically happens when many jobs are queued and the scheduler cannot provide the requested resources immediately.
- QOSGrpCpuLimit: All CPU cores available for the listed account within the respective QOS are in use. As the status of a pending job, it indicates that the group is currently using all of its allocated CPU resources; if it appears at submission time, it means the job requests more resources than the group's allocation provides.
- QOSGrpMemLimit: All memory available for the listed account within the respective QOS is in use, analogous to QOSGrpCpuLimit above.
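To see which of these reason codes applies to a pending job, squeue can print the reason column directly. A minimal sketch, using the standard squeue options --me, --states, --noheader, and --format (%i job ID, %T state, %r reason); the command -v guard simply makes the snippet a no-op on machines without SLURM installed:

```shell
# Print job ID, state, and reason code for the current user's pending jobs.
# Guarded so the sketch exits cleanly on machines without SLURM installed.
if command -v squeue >/dev/null 2>&1; then
    reasons=$(squeue --me --states=PENDING --noheader --format="%i %T %r")
else
    reasons="squeue not found; run this on a cluster login node"
fi
printf '%s\n' "$reasons"
```

The last column of the output is the reason code (BadConstraints, Resources, QOSGrpCpuLimit, ...) described above.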
Sacct
- COMPLETED: As far as the scheduler knows, the job finished without problems and exited with a zero exit code.
- OUT_OF_MEMORY (may appear truncated as OUT_OF_ME+ in the default sacct output): The job tried to use more memory than it requested. Resubmit the job with a larger memory request.
- TIMEOUT: The job was terminated because its duration reached the time limit set by the '--time' argument. Note that the default time limit is 10 minutes. Resubmit the job with a longer time limit, but also verify that it is using the requested CPU/GPU resources correctly.
- FAILED: The process(es) running within the job exited with a non-zero exit code. The scheduler does not know the details; see the job log for clues.
- CANCELLED: The job was cancelled with 'scancel' or an equivalent mechanism.
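These states come from sacct output, which can be filtered to spot problem jobs quickly. The sketch below uses a hardcoded sample in place of real sacct output so it runs anywhere; on the cluster, the same pipe-delimited format comes from sacct's -n (no header) and -P (parsable) flags:

```shell
# Sample pipe-delimited accounting records, shaped like the output of
#   sacct -n -P --format=JobID,State,ExitCode
# (hardcoded here so the sketch runs without a SLURM installation)
sample='12345|COMPLETED|0:0
12346|OUT_OF_MEMORY|0:125
12347|TIMEOUT|0:1'
# Keep only jobs whose state is not COMPLETED
not_ok=$(printf '%s\n' "$sample" | awk -F'|' '$2 != "COMPLETED" {print $1, $2}')
printf '%s\n' "$not_ok"
```

Widening the State field (e.g. --format=JobID,State%20,ExitCode) avoids the truncated OUT_OF_ME+ display mentioned above.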