Difference between revisions of "SLURM Job States"

From UFRC

Revision as of 17:16, 9 February 2023


See the SLURM squeue documentation for the full list of job states/reason codes. Here we list the most frequently encountered job states on HiPerGator for quick reference.

Squeue

  • BadConstraints: The job's resource request or constraints cannot be satisfied. The resource request includes parameters such as nodes, ntasks, cpus-per-task, gpus (or gres=gpu...), partition, feature, and constraint. In some cases this state is temporary, e.g. when a job requests a full node, or a large enough portion of a node, and the scheduler cannot find appropriate resources right away.
  • Resources: The job is waiting for resources. This usually means there are many job requests in the queue and the scheduler cannot provide the requested resources immediately.
  • QOSGrpCpuLimit: All CPU cores available for the listed account within the respective QOS are in use.
  • QOSGrpMemLimit: All memory available for the listed account within the respective QOS, as described in the previous section, is in use.
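
As a quick illustration of how the reason codes above can be looked up for pending jobs, here is a minimal Python sketch. It assumes the queue listing was produced with something like `squeue -h -o "%i|%T|%r"` (job ID, state, reason, pipe-delimited, no header). The helper name `explain_pending`, the `REASON_HINTS` table, and the sample job IDs are hypothetical and only paraphrase the list above; they are not part of SLURM or of HiPerGator tooling.

```python
# Hypothetical lookup table paraphrasing the reason codes listed above.
REASON_HINTS = {
    "BadConstraints": "resource request or constraints cannot be satisfied",
    "Resources": "waiting for resources to become available",
    "QOSGrpCpuLimit": "all CPU cores for the account in this QOS are in use",
    "QOSGrpMemLimit": "all memory for the account in this QOS is in use",
}


def explain_pending(squeue_text):
    """Return (job_id, reason, hint) for each PENDING job.

    Expects pipe-delimited lines of the form "jobid|state|reason".
    """
    results = []
    for line in squeue_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state, reason = line.split("|", 2)
        if state != "PENDING":
            continue  # reason codes matter mainly for waiting jobs
        hint = REASON_HINTS.get(reason, "see the squeue documentation")
        results.append((job_id, reason, hint))
    return results


# Made-up sample output, for illustration only.
SAMPLE = """\
1001|PENDING|Resources
1002|RUNNING|None
1003|PENDING|QOSGrpCpuLimit
"""

for job_id, reason, hint in explain_pending(SAMPLE):
    print(f"{job_id}: {reason} -> {hint}")
```

Reasons that are not in the table fall back to a pointer to the squeue documentation, since the list here covers only the most frequent codes.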

Sacct

  • COMPLETED: As far as the scheduler knows the job finished without problems and exited with a 'zero' exit code.
  • OUT_OF_MEMORY (can also look like OUT_OF_ME+ in the default sacct output): The job tried to use more memory than it requested. Resubmit the job with a larger memory request.
  • TIMEOUT: Job was terminated because its duration reached the time limit set by the '--time' argument. Note that the default time limit is 10 minutes. Resubmit the job with a longer time limit, but also verify that it uses the requested CPU/GPU resources correctly.
  • FAILED: The process(es) running within the job exited with an error exit code. The scheduler does not know the details. See the job log for clues.
  • CANCELLED: Job was cancelled with 'scancel' or an equivalent mechanism.
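
The states above can be turned into a small triage helper. The Python sketch below assumes accounting output produced with something like `sacct -X -n -P -o JobID,State,ExitCode` (one pipe-delimited line per job allocation, no header). The function names, the `NEXT_STEP` table, and the sample records (including the exit codes) are hypothetical. The handling of "CANCELLED by <uid>" and of truncated names such as OUT_OF_ME+ reflects how sacct commonly prints states, but treat it as an assumption to check against your own output.

```python
# Hypothetical table paraphrasing the sacct states listed above.
NEXT_STEP = {
    "COMPLETED": "nothing to do; the job exited with a zero exit code",
    "OUT_OF_MEMORY": "resubmit with a larger memory request",
    "TIMEOUT": "resubmit with a longer --time limit and check CPU/GPU usage",
    "FAILED": "check the job log for the error",
    "CANCELLED": "job was cancelled with scancel or an equivalent mechanism",
}


def normalize_state(raw):
    """Map a raw sacct State field to one of the names in NEXT_STEP."""
    parts = raw.split()
    if not parts:
        return raw
    word = parts[0]  # "CANCELLED by 12345" -> "CANCELLED"
    if word.endswith("+"):
        # Truncated column, e.g. "OUT_OF_ME+": expand only if unambiguous.
        prefix = word[:-1]
        matches = [name for name in NEXT_STEP if name.startswith(prefix)]
        if len(matches) == 1:
            return matches[0]
    return word


def triage(sacct_text):
    """Return (job_id, state, suggestion) for each "jobid|state|exitcode" line."""
    rows = []
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, raw_state, _exit_code = line.split("|", 2)
        state = normalize_state(raw_state)
        rows.append((job_id, state, NEXT_STEP.get(state, "see the sacct documentation")))
    return rows


# Made-up sample records, for illustration only.
SAMPLE = """\
2001|COMPLETED|0:0
2002|OUT_OF_ME+|0:125
2003|CANCELLED by 12345|0:0
2004|TIMEOUT|0:0
"""

for job_id, state, suggestion in triage(SAMPLE):
    print(f"{job_id}: {state} -> {suggestion}")
```

With `-P` the state column is not truncated, so the `+` handling only matters when the text comes from the default fixed-width sacct output.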