Sample SLURM Scripts

Below are a number of sample scripts that can be used as a template for building your own SLURM submission scripts for use on HiPerGator 2.0. These scripts are also located at /data/training/SLURM/ and can be copied from there. If you choose to copy one of these sample scripts, please make sure you understand what each #SBATCH directive does before using the script to submit your jobs. Otherwise, you may not get the result you want and may waste valuable computing resources.

Note: There is a maximum limit of 3000 jobs per user.

See Annotated SLURM Script for a step-by-step explanation of all options.

Memory requests

A large number of users request far more memory than their jobs actually use (100 to 10,000 times more!). As an example, since August 1st, among groups that have run over 1,000 jobs, there are 28 groups whose users requested 100x the memory actually used in over half of those jobs. Groups often find themselves with jobs pending because they have reached their group memory limit (QOSGrpMemLimit).

While it is important to request more memory than will be used (a margin of 10-20% is usually sufficient), requesting 100x, or even 10,000x, more memory only reduces the number of jobs that a group can run, as well as overall throughput on the cluster. Many groups, and our overall user community, will be able to run far more jobs if they request more reasonable amounts of memory.

The email sent when a job finishes shows how much memory the job actually used and can be used to adjust memory requests for future jobs. The SLURM directives for memory requests are --mem and --mem-per-cpu. It is in the user's best interest to adjust the memory request to a more realistic value.
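
You can also check the actual usage of a completed job from the command line with sacct, which is part of SLURM. A minimal example, where <jobid> is a placeholder for the ID of one of your finished jobs:

sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State   # replace <jobid> with a completed job's ID

MaxRSS shows the peak memory used by each job step and can be compared against ReqMem when choosing a value for --mem or --mem-per-cpu.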

Requesting more memory than needed will not speed up analyses. Because adding memory to a personal computer often makes it run faster, users sometimes assume that requesting more memory will make their analyses run faster. This is not the case. An application running on the cluster has access to all of the memory it requests, and we never swap RAM to disk. If an application can use more memory, it will get more memory; only when the job's memory use crosses the requested limit does SLURM kill the job.

Basic, Single-Threaded Job

This script can serve as the template for many single-processor applications. The --mem flag (used below) or --mem-per-cpu can be used to request the appropriate amount of memory for your job. Please make sure to test your application and set this value to a reasonable number based on actual memory use. The %j in the --output line tells SLURM to substitute the job ID into the name of the output file. You can also add a -e or --error line with an error file name to separate output and error logs.

#!/bin/bash
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --mail-type=END,FAIL          # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu     # Where to send mail	
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --mem=1gb                     # Job memory request
#SBATCH --time=00:05:00               # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log   # Standard output and error log
pwd; hostname; date

module load python

echo "Running plot script on a single CPU core"

python /data/training/SLURM/plot_template.py

date
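
Once the script is saved (the filename below, single_job.sh, is arbitrary), it is submitted with sbatch and can be monitored with squeue:

sbatch single_job.sh     # the script filename is arbitrary
squeue -u $USER

Because of the %j substitution, the output will be written to a file named serial_test_<jobid>.log, where <jobid> is the job ID assigned by SLURM.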

Multi-Threaded SMP Job

This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.

These applications require shared memory and can only run on one node; as such it is important to remember the following:

  • You must set --ntasks=1, and then set --cpus-per-task to the number of OpenMP threads you wish to use.
  • You must make the application aware of how many processors to use. How that is done depends on the application:
    • For some applications, set OMP_NUM_THREADS to a value less than or equal to the number of cpus-per-task you set.
    • For some applications, use a command line option when calling that application.

#!/bin/bash
#SBATCH --job-name=parallel_job      # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=1                   # Run a single task		
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=1gb                    # Job memory request
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log     # Standard output and error log
pwd; hostname; date

echo "Running prime number generator program on $SLURM_CPUS_ON_NODE CPU cores"

/data/training/SLURM/prime/prime

date


Another example, setting OMP_NUM_THREADS:

#!/bin/bash
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=1                   # Run a single task	
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=600mb                  # Total memory limit
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log     # Standard output and error log
date;hostname;pwd

export OMP_NUM_THREADS=4

module load intel

./YOURPROGRAM INPUT

date

If you run multi-processing code, for example using Python's multiprocessing module, make sure to specify a single node and the number of tasks that your code will use.

#!/bin/bash
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=4                   # Number of processes
#SBATCH --mem=1gb                    # Total memory limit
#SBATCH --time=01:00:00              # Time limit hrs:min:sec
#SBATCH --output=multiprocess_%j.log # Standard output and error log
date;hostname;pwd

module load python/3

python script.py

date
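
The script above assumes that script.py decides how many worker processes to start. If your code instead takes the number of processes as an argument, pass SLURM's allocation rather than hard-coding it; the --workers option below is hypothetical and should be replaced with whatever option your program actually accepts:

python script.py --workers $SLURM_NTASKS   # --workers is a hypothetical option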

Message Passing Interface (MPI) Jobs

PMIx Versions

When launching applications linked against our OpenMPI libraries via srun, you must specify the correct version of PMIx using the "--mpi" srun option. Generally speaking, you can determine the appropriate PMIx version to use by running the ompi_info command after loading the desired OpenMPI environment module. For example:

$ module load intel/2018 openmpi/3.1.2
$ ompi_info --param pmix all
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v3.1.2)
$ ml purge
$ ml intel/2019 openmpi/4.0.1
$ ompi_info --param pmix all
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.0.1)

In the examples above, you would specify pmix_v2 (i.e. ext2x) for the combination of intel/2018 and openmpi/3.1.2, and pmix_v3 (i.e. ext3x) for the second set of modules, intel/2019 and openmpi/4.0.1.
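
The exact plugin names that srun accepts depend on how SLURM was built on the cluster, and can be listed directly:

$ srun --mpi=list

The names reported there (for example pmix_v2 or pmix_v3) are the values to pass to the --mpi option.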

Important srun/sbatch/salloc Options

The example script below can serve as a template for MPI (Message Passing Interface) applications. These are applications that can use multiple processors that may, or may not, be on multiple compute nodes.

Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, but the following are the main directives to pay attention to:

  • -c, --cpus-per-task=<ncpus>
    • Request ncpus cores per task.
  • -m, --distribution=arbitrary|<block|cyclic|plane=<options>[:block|cyclic|fcyclic]>
    • Specify alternate distribution methods for remote processes.
    • We recommend -m cyclic:cyclic, which tells SLURM to distribute tasks cyclically over nodes and sockets.
  • -N, --nodes=<minnodes[-maxnodes]>
    • Request that a minimum of minnodes nodes be allocated to this job.
  • -n, --ntasks=<number>
    • Number of tasks (MPI ranks)
  • --ntasks-per-node=<ntasks>
    • Request that ntasks be invoked on each node
  • --ntasks-per-socket=<ntasks>
    • Request the maximum ntasks be invoked on each socket
    • Notes on socket layout:
      • hpg3-compute nodes have 2 sockets, each with 64 cores.
      • hpg2-compute nodes have 2 sockets, each with 16 cores.
      • hpg1-compute nodes have 4 sockets, each with 16 cores.

Example

The following example requests 24 tasks, each with a single core. It further specifies that these should be split evenly on 2 nodes, and within the nodes, the 12 tasks should be evenly split on the two sockets. So each CPU on the two nodes will have 6 tasks, each with its own dedicated core. The --distribution option will ensure that tasks are assigned cyclically among the allocated nodes and sockets. Please see the SchedMD sbatch documentation for more detailed explanations of each of the sbatch options below.

SLURM is very flexible and allows users to be very specific about their resource requests. Thinking about your application and doing some testing will be important to determine the best set of resources for your specific job.

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail.  Set this to your email address
#SBATCH --ntasks=24                  # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1            # Number of cores per MPI task 
#SBATCH --nodes=2                    # Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12         # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6        # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb          # Memory (i.e. RAM) per processor
#SBATCH --time=00:05:00              # Wall time limit (days-hrs:min:sec)
#SBATCH --output=mpi_test_%j.log     # Path to the standard output and error files relative to the working directory

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load intel/2018.1.163 openmpi/3.0.0
srun --mpi=pmix_v1 /data/training/SLURM/prime/prime_mpi
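
To confirm that tasks were placed as intended, it can be useful to add a quick check to the script before the real application. hostname is a standard utility and needs no --mpi option; the output shows how many tasks ran on each node:

srun hostname | sort | uniq -c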

Hybrid MPI/Threaded job

This script can serve as a template for hybrid MPI/SMP applications. These are MPI applications where each MPI process is multi-threaded (usually via either OpenMP or POSIX Threads) and can use multiple processors.

Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, as well as the information in the MPI example above.

The following example requests 8 tasks, each with 4 cores. It further specifies that these should be split evenly on 2 nodes, and within the nodes, the 4 tasks should be evenly split on the two sockets. So each CPU on the two nodes will have 2 tasks, each with 4 cores. The distribution option will ensure that MPI ranks are distributed cyclically on nodes and sockets.

#!/bin/bash
#SBATCH --job-name=hybrid_job_test      # Job name
#SBATCH --mail-type=END,FAIL            # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu       # Where to send mail	
#SBATCH --ntasks=8                      # Number of MPI ranks
#SBATCH --cpus-per-task=4               # Number of cores per MPI rank 
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=4             # How many tasks on each node
#SBATCH --ntasks-per-socket=2           # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=100mb             # Memory per core
#SBATCH --time=00:05:00                 # Time limit hrs:min:sec
#SBATCH --output=hybrid_test_%j.log     # Standard output and error log
pwd; hostname; date
 
module load  intel/2018.1.163  openmpi/3.0.0 raxml/8.2.12
 
srun --mpi=pmix_v1 raxmlHPC-HYBRID-SSE3 -T $SLURM_CPUS_PER_TASK \
      -f a -m GTRGAMMA -s /data/training/SLURM/dna.phy -p $RANDOM \
      -x $RANDOM -N 500 -n dna
 
date

The following example requests 8 tasks, each with 8 cores. It further specifies that these should be split evenly on 4 nodes, and within each node, the 2 tasks should be split, one on each of the two sockets. So each CPU (socket) on the four nodes will have 1 task, each with 8 cores. The distribution option will ensure that MPI ranks are distributed cyclically on nodes and sockets.

Also note setting OMP_NUM_THREADS so that OpenMP knows how many threads to use per task.

#!/bin/bash
#SBATCH --job-name=LAMMPS
#SBATCH --output=LAMMPS_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=4              # Number of nodes
#SBATCH --ntasks=8             # Number of MPI ranks
#SBATCH --ntasks-per-node=2    # Number of MPI ranks per node
#SBATCH --ntasks-per-socket=1  # Number of tasks per processor socket on the node
#SBATCH --cpus-per-task=8      # Number of OpenMP threads for each MPI process/rank
#SBATCH --mem-per-cpu=2000mb   # Per processor memory request
#SBATCH --time=4-00:00:00      # Walltime in hh:mm:ss or d-hh:mm:ss
date;hostname;pwd

module load intel/2018 openmpi/3.1.0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmix_v2 /path/to/app/lmp_gator2 < in.Cu.v.24nm.eq_xrd

date
  • Note that MPI gets -np from SLURM automatically.
  • Note there are many directives available to control processor layout.
    • Some to pay particular attention to are:
      • --nodes if you care exactly how many nodes are used
      • --ntasks-per-node to limit number of tasks on a node
      • --distribution one of several directives (see also --contiguous, --cores-per-socket, --mem_bind, --ntasks-per-socket, --sockets-per-node) to control how tasks, cores and memory are distributed among nodes, sockets and cores. While SLURM will generally make appropriate decisions for setting up jobs, careful use of these directives can significantly enhance job performance and users are encouraged to profile application performance under different conditions.

Array job

Please see the SLURM Job Arrays page for information on job arrays. Note that we use the simplest 'single-threaded' process example from above and extend it to an array of jobs. Modify the following script using the parallel, MPI, or hybrid job layout as needed.

#!/bin/bash
#SBATCH --job-name=array_job_test   # Job name
#SBATCH --mail-type=FAIL            # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu   # Where to send mail	
#SBATCH --ntasks=1                  # Run a single task
#SBATCH --mem=1gb                   # Job Memory
#SBATCH --time=00:05:00             # Time limit hrs:min:sec
#SBATCH --output=array_%A-%a.log    # Standard output and error log
#SBATCH --array=1-5                 # Array range
pwd; hostname; date

echo This is task $SLURM_ARRAY_TASK_ID

date

Note the use of %A for the master job ID of the array, and the %a for the task ID in the output filename.
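
A common pattern is to use the task ID to pick an input for each array element. The sketch below assumes a hypothetical file named input_files.txt containing one input filename per line; sed prints the line whose number matches the current task ID:

INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_files.txt)   # input_files.txt is a hypothetical list of inputs
echo "This is task $SLURM_ARRAY_TASK_ID, processing $INPUT"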

GPU job

Please see GPU Access for more information regarding the use of HiPerGator GPUs. Note that the order in which the environment modules are loaded is important.
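
Before adapting one of the full examples below, it can be helpful to verify a GPU allocation with a minimal test job. The job name, log file, and resource values in this sketch are arbitrary placeholders; nvidia-smi simply reports the GPUs visible to the job:

#!/bin/bash
#SBATCH --job-name=gpu_check         # Arbitrary job name
#SBATCH --output=gpu_check_%j.log    # Arbitrary log file name
#SBATCH --partition=gpu
#SBATCH --gpus=1                     # Request a single GPU of any type
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2gb
#SBATCH --time=00:05:00

echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
nvidia-smi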

VASP

#!/bin/bash
#SBATCH --job-name=vasptest
#SBATCH --output=vasp.out
#SBATCH --error=vasp.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ufl.edu
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=8
#SBATCH --distribution=cyclic:cyclic
#SBATCH --mem-per-cpu=7000mb
#SBATCH --partition=gpu
#SBATCH --gpus=a100:4
#SBATCH --time=00:30:00

module purge
module load cuda/10.0.130  intel/2018  openmpi/4.0.0 vasp/5.4.4

srun --mpi=pmix_v3 vasp_gpu

NAMD

#!/bin/bash
#SBATCH --job-name=stmv
#SBATCH --output=std.out
#SBATCH --error=std.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=4
#SBATCH --distribution=block:block
#SBATCH --time=30:00:00
#SBATCH --mem-per-cpu=1gb
#SBATCH --mail-type=NONE
#SBATCH --mail-user=some_user@ufl.edu
#SBATCH --partition=gpu
#SBATCH --gpus=a100:2

module load cuda/11.0.207 intel/2020.0.166 namd/2.14b2

echo "NAMD2                = $(which namd2)"
echo "SBATCH_CPU_BIND_LIST = $SBATCH_CPU_BIND_LIST"
echo "SBATCH_CPU_BIND      = $SBATCH_CPU_BIND     "
echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"
echo "SLURM_CPUS_PER_TASK  = $SLURM_CPUS_PER_TASK "

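# Build the device index list (0,1,...) for NAMD's +devices option from the
# GPUs that SLURM assigned to this job (listed in CUDA_VISIBLE_DEVICES).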
gpuList=$(echo $CUDA_VISIBLE_DEVICES | sed -e 's/,/ /g')
N=0
devList=""
for gpu in $gpuList
do
    devList="$devList $N"
    N=$(($N + 1))
done
devList=$(echo $devList | sed -e 's/ /,/g')
echo "devList = $devList"

namd2 +p$SLURM_CPUS_PER_TASK +idlepoll +devices $devList stmv.namd