SLURM Job States


This page is obsolete. It is being retained for archival purposes. The current version of this page can be found at https://docs.rc.ufl.edu/scheduler/job_states

See the SLURM squeue documentation for the full list of job states and reason codes. For quick reference, this page lists the ones most frequently encountered on HiPerGator.

Squeue

  • BadConstraints: The job's resource request or constraints cannot be satisfied. The resource request includes parameters such as nodes, ntasks, cpus-per-task, gpus (or gres=gpu...), partition, feature, and constraint. In some cases this state is temporary, e.g. when a job requests a full node, or a large enough portion of a node, and the scheduler cannot find appropriate resources right away.
  • Resources: The job is waiting for resources to become available. This usually happens when many jobs are in the queue and the scheduler cannot provide the requested resources immediately.
  • QOSGrpCpuLimit: All CPU cores available to the listed account within the respective QOS are in use. As a job status, it indicates that the group is currently using all of its resources. If it appears when a job is submitted, the job is requesting more resources than the group has been allocated.
  • QOSGrpMemLimit: All memory available to the listed account within the respective QOS is in use (the memory counterpart of QOSGrpCpuLimit above).
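
As a rough illustration of how the reason codes above might be looked up in practice, the following Python sketch parses text shaped like the output of squeue -h -o "%i|%T|%r" (job ID, state, reason) and attaches the short explanation from the list. The names REASONS and explain_pending, the sample text, and the job IDs are hypothetical. The sketch works on captured text and does not run squeue itself.

```python
# Short explanations paraphrased from the list above, keyed by reason code.
REASONS = {
    "BadConstraints": "resource request or constraints cannot be satisfied",
    "Resources": "waiting for resources to become available",
    "QOSGrpCpuLimit": "group is using all CPU cores in this QOS",
    "QOSGrpMemLimit": "group is using all memory in this QOS",
}


def explain_pending(squeue_text):
    """Return (job_id, reason, explanation) for each PENDING job.

    Expects lines shaped like 'jobid|state|reason', e.g. captured from
    squeue -h -o "%i|%T|%r" (an assumed invocation, not run here).
    """
    results = []
    for line in squeue_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state, reason = line.split("|", 2)
        if state != "PENDING":
            continue
        # Reasons not covered on this page fall back to a generic pointer.
        explanation = REASONS.get(reason, "see the squeue documentation")
        results.append((job_id, reason, explanation))
    return results


# Hypothetical sample output; the job IDs are made up.
sample = """\
1001|RUNNING|None
1002|PENDING|QOSGrpCpuLimit
1003|PENDING|Priority
"""

for job_id, reason, explanation in explain_pending(sample):
    print(f"{job_id}: {reason} ({explanation})")
```

Running jobs are skipped because a reason code is only informative while a job is still pending.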

Sacct

  • COMPLETED: As far as the scheduler knows, the job finished without problems and exited with a zero exit code.
  • OUT_OF_MEMORY (can also appear as OUT_OF_ME+ in the default sacct output): The job tried to use more memory than it requested. Resubmit the job with a larger memory request.
  • TIMEOUT: The job was terminated because its duration reached the time limit set by the '--time' argument. Note that the default time limit is 10 minutes. Resubmit the job with a longer time limit, but also verify that it uses the requested CPU/GPU resources correctly.
  • FAILED: The process(es) running within the job exited with an error exit code. The scheduler does not know the details. See the job log for clues.
  • CANCELLED: Job was cancelled with 'scancel' or an equivalent mechanism.
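
The states above could be turned into a small triage helper along the following lines. This Python sketch parses text shaped like the output of sacct -X -n -P --format=JobID,State (an assumed invocation; the sketch works on captured text and does not run sacct). The names ADVICE, full_state, and triage are hypothetical, as are the sample job IDs. Truncated states such as OUT_OF_ME+ are expanded by prefix match, and the sketch assumes a cancelled job may be reported as "CANCELLED by <uid>".

```python
# Suggested next step for each sacct state, paraphrasing the list above.
ADVICE = {
    "COMPLETED": "nothing to do",
    "OUT_OF_MEMORY": "resubmit with a larger memory request",
    "TIMEOUT": "resubmit with a longer --time and check CPU/GPU usage",
    "FAILED": "check the job log for the error",
    "CANCELLED": "job was cancelled, e.g. with scancel",
}


def full_state(raw):
    """Expand a possibly truncated state such as 'OUT_OF_ME+'."""
    parts = raw.split()
    if not parts:
        return ""
    word = parts[0]  # keeps 'CANCELLED' from 'CANCELLED by <uid>'
    if word.endswith("+"):
        prefix = word[:-1]
        matches = [s for s in ADVICE if s.startswith(prefix)]
        if len(matches) == 1:
            return matches[0]
    return word


def triage(sacct_text):
    """Return (job_id, state, advice) for each 'jobid|state' line."""
    rows = []
    for line in sacct_text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state = line.split("|", 1)
        state = full_state(state)
        # States not covered on this page fall back to a generic pointer.
        rows.append((job_id, state, ADVICE.get(state, "see the sacct documentation")))
    return rows


# Hypothetical sample output; the job IDs and uid are made up.
sample = """\
2001|COMPLETED
2002|OUT_OF_ME+
2003|CANCELLED by 1234
2004|TIMEOUT
"""

for job_id, state, advice in triage(sample):
    print(f"{job_id} {state}: {advice}")
```

The prefix match only expands a truncated state when exactly one known state fits, so an ambiguous or unknown abbreviation is passed through unchanged.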