SLURM Commands

From UFRC

See also: Sample SLURM Scripts

While there is a lot of documentation available on the SLURM web page, we provide these commands to give users examples and handy references. Have a favorite SLURM command? Users can edit the wiki pages, so please add your examples.

Submit a Job

Submit a job script to the SLURM scheduler with

  • sbatch script
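As a minimal sketch of what such a script might contain, the example below requests the same resources used elsewhere on this page. The file name example_job.sh, the job name, and the log file name are illustrative assumptions, not site requirements.

```shell
#!/bin/bash
# Name shown in squeue (illustrative)
#SBATCH --job-name=example
# One task with two CPU cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
# Total memory for the job
#SBATCH --mem=2gb
# Wall time limit in hours:minutes:seconds
#SBATCH --time=01:30:00
# Output file; %j expands to the job ID
#SBATCH --output=example_%j.log

# SLURM sets these variables inside a job. The fallback values only
# matter when the script is run outside the scheduler.
ncpus=${SLURM_CPUS_PER_TASK:-1}
jobid=${SLURM_JOB_ID:-none}

echo "Job $jobid running on $(uname -n) with $ncpus CPU core(s)"
```

Saved as example_job.sh, it would be submitted with sbatch example_job.sh.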

Interactive Session

An interactive SLURM session, i.e. a shell prompt within a running job, can be started with

  • srun <resources> --pty bash -i

For example, a single-node job with 2 CPU cores and 2 GB of RAM for 90 minutes can be started with

  • srun --ntasks=1 --cpus-per-task=2 --mem=2gb -t 90 --pty bash -i
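A bare number given to -t is read as minutes; SLURM also accepts hours:minutes:seconds and days-hours:minutes:seconds. The helper below is a small sketch for converting between the two forms. The function name minutes_to_slurm_time is hypothetical.

```shell
# Convert a number of minutes into SLURM's [days-]hours:minutes:seconds form.
# The parenthesized body runs in a subshell, so variables stay local.
minutes_to_slurm_time() (
  total=$1
  days=$((total / 1440))
  hours=$((total % 1440 / 60))
  mins=$((total % 60))
  if [ "$days" -gt 0 ]; then
    printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
  else
    printf '%02d:%02d:00\n' "$hours" "$mins"
  fi
)

minutes_to_slurm_time 90     # prints 01:30:00
minutes_to_slurm_time 4320   # prints 3-00:00:00
```

So -t 90 and -t 01:30:00 request the same time limit.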

Checking on the queue

The basic command is squeue. The full documentation for squeue is available on the SLURM web page, but we hope these examples are useful as they are and as templates for further customization.

  • For a list of jobs running under a particular group, use the -A flag (for Account) with the group name.
    • squeue -A group_name
  • For a summary that is similar to the MOAB/Torque showq command (-u user or -A group can be added to limit the output):
    • squeue -o "%.10A %.18u %.4t %.8C %.20L %.30S"
  • To include qos and limit to a group:
    • squeue -O jobarrayid,qos,name,username,timelimit,numcpus,reasonlist -A group_name
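As one possible customization, squeue output can be piped through standard text tools. The sketch below counts a group's jobs by state; the function name count_job_states is hypothetical, and group_name is a placeholder as above.

```shell
# Read one job state per line on stdin and print "STATE COUNT" pairs.
count_job_states() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
  # -h drops the header line; %T prints the full state name (RUNNING, PENDING, ...)
  squeue -h -A group_name -o "%T" | count_job_states
fi
```

The counting step is kept separate from the squeue call so it can be reused with -u user or any other filter.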

Checking job information

The basic command is sacct. The full documentation for sacct is available on the SLURM web page, but we hope these examples are useful as they are and as templates for further customization.

By default, sacct will only show your jobs that have been in the queue or running since midnight of the current day. To view jobs from a particular date, you can specify a start time (-S or --starttime) in one of a number of formats, for example since May 1st (0501):

sacct -S 0501

The default columns displayed are:

JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 

Other information can be pulled either from the -l view, which has a long list of columns, or by specifying the fields you want to view. For example, to see the number of CPUs, total memory use, and walltime of all jobs since May 1st (0501), you could use:

sacct -S 0501 -o JobIDRaw,JobName,NCPUS,MaxRSS,Elapsed

To do the same for a whole group:

sacct -S 0501 -o JobIDRaw,JobName,User,NCPUS,MaxRSS,Elapsed -a -A group_name

To view memory use of jobs:

sacct --format=User,JobID,ReqMem,MaxRSS

The above operations get information about completed jobs from the SLURM database. To look at currently running jobs, use the sstat command. For example,

 sstat -j 123456.batch -o maxrss
    MaxRSS
----------
 16111996K

See the sstat manual page on the cluster (man sstat) for more details, or go to https://slurm.schedmd.com/sstat.html.
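MaxRSS values such as 16111996K can be hard to read at a glance. The sketch below converts one value to GiB, assuming the K, M, and G suffixes are powers of 1024 and that a value without a suffix is in bytes. The function name rss_to_gib is hypothetical.

```shell
# Convert a MaxRSS value such as 16111996K into GiB with two decimals.
# Assumption: K, M and G are powers of 1024; no suffix means bytes.
rss_to_gib() {
  awk -v v="$1" 'BEGIN {
    unit = substr(v, length(v))   # last character, e.g. "K"
    num = v + 0                   # leading numeric part of the value
    if (unit == "K") num = num / 1024 / 1024
    else if (unit == "M") num = num / 1024
    else if (unit == "G") num = num
    else num = num / 1024 / 1024 / 1024
    printf "%.2f\n", num
  }'
}

rss_to_gib 16111996K   # prints 15.37
```

In this example the running job has used a little over 15 GiB at its peak.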

Canceling Jobs

scancel jobID

or, to cancel multiple jobs whose names follow a wildcard pattern

scancel pattern
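If scancel on your system matches job names exactly rather than as patterns (worth checking in man scancel), the matching can be done in the shell instead. The sketch below selects job IDs by name pattern from squeue output; the function name match_job_ids and the pattern align_* are hypothetical.

```shell
# Read "jobid jobname" lines on stdin and print the IDs whose name
# matches the shell glob pattern given as the first argument.
match_job_ids() {
  while read -r jobid jobname; do
    case $jobname in
      $1) echo "$jobid" ;;
    esac
  done
}

# Example use, left commented out because it cancels jobs.
# %i is the job ID and %j the job name; -r (GNU xargs) skips scancel
# when nothing matches.
# squeue -h -u "$USER" -o "%i %j" | match_job_ids 'align_*' | xargs -r scancel
```

Running the squeue and match_job_ids steps without the final xargs first shows which jobs would be cancelled.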


Using sreport to view group summaries

The basic command is sreport. The full documentation for sreport is available on the SLURM web page, but we hope these examples are useful as they are and as templates for further customization.

To view a summary of group usage since a given date (May 1st in this example):

sreport cluster AccountUtilizationByUser Start=0501 Accounts=group_name

Or for a particular month (the month of May):

sreport cluster AccountUtilizationByUser Start=0501 End=0531 Accounts=group_name

Or, to report usage in hours over an explicit start and end time:

sreport -t Hours cluster AccountUtilizationByUser Start=2022-01-01T00:00:00 End=2022-01-31T23:59:59 Accounts=group_name
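Typing full timestamps for each month gets tedious. The sketch below builds the Start= and End= arguments for a whole calendar month in the same timestamp form as the command above. The function names days_in_month and sreport_month_range are hypothetical, and the month is assumed to be given as two digits.

```shell
# Print the number of days in a month (Gregorian leap-year rule).
days_in_month() (
  year=$1
  month=${2#0}   # strip a leading zero so 08 matches the same case as 8
  case $month in
    4|6|9|11) echo 30 ;;
    2)
      if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
        echo 29
      else
        echo 28
      fi ;;
    *) echo 31 ;;
  esac
)

# Print Start=/End= arguments covering one whole calendar month.
sreport_month_range() (
  last=$(days_in_month "$1" "$2")
  printf 'Start=%s-%s-01T00:00:00 End=%s-%s-%sT23:59:59\n' "$1" "$2" "$1" "$2" "$last"
)

sreport_month_range 2022 01   # prints Start=2022-01-01T00:00:00 End=2022-01-31T23:59:59

# Example use (the unquoted substitution splits into two arguments):
# sreport -t Hours cluster AccountUtilizationByUser $(sreport_month_range 2022 01) Accounts=group_name
```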

Viewing Resources Available to a Group

To check the resources available to a group for running jobs, you can use the sacctmgr command (substitute your group's name for group_name):

sacctmgr show qos group_name format="Name%-16,GrpSubmit,MaxWall,GrpTres%-45"

or for the burst allocation:

sacctmgr show qos group_name-b format="Name%-16,GrpSubmit,MaxWall,GrpTres%-45"
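The GrpTres column is a comma-separated list of resource=value pairs. The sketch below pulls one value out of such a string. The function name tres_value and the sample string are hypothetical, not real allocation data.

```shell
# Print the value of one resource from a GrpTRES-style string such as
# "cpu=100,gres/gpu=2,mem=360000M". Prints nothing if the key is absent.
tres_value() {
  printf '%s\n' "$2" | tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

tres_value cpu "cpu=100,gres/gpu=2,mem=360000M"   # prints 100

# Example use with real output (-n drops the header, -P gives
# parsable output without column padding):
# tres_value cpu "$(sacctmgr -n -P show qos group_name format=GrpTRES)"
```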

Using sinfo to view partition information and node features

sinfo is one command that users can use to learn about the resources managed by SLURM. sinfo provides information on the configuration of partitions and the details of nodes within each partition. Using sinfo, users can view the features attributed to the nodes, and then use those features as constraints when submitting jobs to, for example, request only nodes with Intel processors.

sinfo -s

Provides a summary of the partitions and the nodes within each, including the numbers of nodes that are Allocated, Idle, Other (for example, down or drained), and Total.

sinfo -o "%P,%D,%c,%X,%m,%f"

or

module load ufrc
nodeInfo

Shows the partitions, number of nodes, number of cores per node, number of sockets per node, amount of RAM per node, and the features associated with the nodes. These features can be used to request constraints in sbatch. For example:

#SBATCH --partition=hpg2-compute
#SBATCH --constraint='hpg2'

Would constrain a job to run on one of the 32-core AMD nodes from HiPerGator 2.

While constraints can be used to target particular resources, users should realize that using constraints also limits where a job can run and may delay scheduling a job.
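Before adding a constraint, it can help to see which partitions actually offer a feature. The sketch below filters sinfo output for one feature, using "|" as the column separator because the feature list itself contains commas. The function name partitions_with_feature is hypothetical, and feature_name is a placeholder.

```shell
# Read "partition|features" lines on stdin and print the partitions whose
# comma-separated feature list contains the requested feature exactly.
partitions_with_feature() {
  awk -F'|' -v want="$1" '{
    n = split($2, feats, ",")
    for (i = 1; i <= n; i++)
      if (feats[i] == want) { print $1; break }
  }'
}

# Only query the scheduler when sinfo is actually available.
if command -v sinfo >/dev/null 2>&1; then
  # -h drops the header; %P is the partition and %f the node features
  sinfo -h -o "%P|%f" | partitions_with_feature feature_name
fi
```

A partition may be listed more than once when its nodes fall into several feature groups, and sinfo marks the default partition with a trailing asterisk.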