Revision as of 17:15, 9 February 2023

See the SLURM squeue documentation (https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES) for the full list of job states and reason codes. For quick reference, the most frequently encountered job states on HiPerGator are listed here.

Squeue

BadConstraints: The job's resource request or constraints cannot be satisfied. The resource request includes such parameters as nodes, ntasks, cpus-per-task, gpus (or gres=gpu...), partition, feature, and constraint. In some cases this state is temporary, e.g. when a job requests a full node, or a portion of a node large enough that the scheduler cannot find appropriate resources right away.
Resources: The job is waiting for resources to become available. This usually happens when many jobs are queued and the scheduler cannot provide the requested resources immediately.
QOSGrpCpuLimit: All CPU cores available for the listed account within the respective QOS are in use.
QOSGrpMemLimit: All memory available for the listed account within the respective QOS, as described in the previous section, is in use.
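
The reason codes above can be looked up programmatically. The following is a minimal Python sketch, not an official HiPerGator tool. It assumes squeue is called with the options `-h -o "%i|%r"` (no header, job ID and reason separated by `|`); the names `REASON_HINTS`, `explain_reasons`, and `fetch_reasons` are hypothetical, and the hint texts are paraphrased from the list above.

```python
import getpass
import subprocess

# Short hints paraphrased from the list above (hypothetical helper table).
REASON_HINTS = {
    "BadConstraints": "request or constraints cannot be satisfied (may be temporary)",
    "Resources": "waiting for resources to become available",
    "QOSGrpCpuLimit": "all CPU cores for the account within the QOS are in use",
    "QOSGrpMemLimit": "all memory for the account within the QOS is in use",
}


def explain_reasons(squeue_output):
    """Parse 'jobid|reason' lines and return (job_id, reason, hint) tuples."""
    rows = []
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, reason = line.partition("|")
        hint = REASON_HINTS.get(reason, "see the squeue documentation")
        rows.append((job_id, reason, hint))
    return rows


def fetch_reasons():
    """Run squeue for the current user; only works where squeue is installed."""
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i|%r"],
        capture_output=True, text=True, check=True,
    )
    return explain_reasons(result.stdout)
```

On a login node, `fetch_reasons()` would return one tuple per job of the current user. Reasons not covered by the list above fall back to a pointer to the squeue documentation.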

Sacct

COMPLETED: As far as the scheduler knows, the job finished without problems and exited with a 'zero' exit code.
OUT_OF_MEMORY (can also look like OUT_OF_ME+ in the default sacct output): The job tried to use more memory than it requested. Resubmit the job with a larger memory request.
TIMEOUT: The job was terminated because its duration reached the time limit set by the '--time' argument. Note that the default time limit is 10 minutes. Resubmit the job with a longer time limit, but also verify that it uses the requested CPU/GPU resources correctly.
FAILED: The process(es) running within the job exited with an error exit code. The scheduler does not know the details. See the job log for clues.
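
As an illustration of how the sacct states above map to a next step, here is a minimal Python sketch. It is an assumption-laden example rather than a site-provided script: the names `ADVICE`, `normalize_state`, and `advise` are hypothetical, and the advice strings are paraphrased from the list above. It also expands a truncated state such as OUT_OF_ME+ back to its full name when the match is unambiguous.

```python
# Advice paraphrased from the list above (hypothetical helper table).
ADVICE = {
    "COMPLETED": "finished with a zero exit code; no action needed",
    "OUT_OF_MEMORY": "resubmit with a larger memory request",
    "TIMEOUT": "resubmit with a longer --time limit and check resource usage",
    "FAILED": "check the job log for the error",
}


def normalize_state(state):
    """Expand a state truncated by sacct (trailing '+') when unambiguous."""
    state = state.strip()
    if state.endswith("+"):
        prefix = state[:-1]
        matches = [full for full in ADVICE if full.startswith(prefix)]
        if len(matches) == 1:
            return matches[0]
    return state


def advise(state):
    """Return a short suggestion for a job state reported by sacct."""
    return ADVICE.get(normalize_state(state), "see the sacct documentation")
```

The state string could come, for example, from `sacct -X -n -P -j <jobid> --format=State`, where `<jobid>` is a placeholder for a real job ID. Passing a wider field such as `--format=State%20` to the default output is another way to avoid the truncated form.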


Difference between revisions of "SLURM Job States"
