Multi-Threaded & Message Passing Job Scripts

From UFRC
Jump to navigation Jump to search

Back to Sample SLURM Scripts

Multi-Threaded SMP Job

This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.

These applications required shared memory and can only run on one node; as such it is important to remember the following:

  • You must set --ntasks=1, and then set --cpus-per-task to the number of OpenMP threads you wish to use.
  • You must make the application aware of how many processors to use. How that is done depends on the application:
    • For some applications, set OMP_NUM_THREADS to a value less than or equal to the number of cpus-per-task you set.
    • For some applications, use a command line option when calling that application.

Expand to view example

#!/bin/bash
#SBATCH --job-name=parallel_job      # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=1                   # Run a single task		
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=1gb                    # Job memory request
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log     # Standard output and error log
pwd; hostname; date

echo "Running prime number generator program on $SLURM_CPUS_ON_NODE CPU cores"

/data/training/SLURM/prime/prime

date

Expand to view another example, setting OMP_NUM_THREADS:

#!/bin/bash
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=1                   # Run a single task	
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --mem=600mb                  # Total memory limit
#SBATCH --time=00:05:00              # Time limit hrs:min:sec
#SBATCH --output=parallel_%j.log     # Standard output and error log
date;hostname;pwd

export OMP_NUM_THREADS=4

module load intel

./YOURPROGRAM INPUT

date

If you run multi-processing code, for example using python multiprocess module, make sure to specify a single node and the number of tasks that your code will use.

Expand to view example

#!/bin/bash
#SBATCH --job-name=parallel_job_test # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail	
#SBATCH --nodes=1                    # Run all processes on a single node	
#SBATCH --ntasks=4                   # Number of processes
#SBATCH --mem=1gb                    # Total memory limit
#SBATCH --time=01:00:00              # Time limit hrs:min:sec
#SBATCH --output=multiprocess_%j.log # Standard output and error log
date;hostname;pwd

module load python/3

python script.py

date

Message Passing Interface (MPI) Jobs

PMIx Versions

When launching applications linked against our OpenMPI libraries via srun, you must specify the correct version of PMIx using the "--mpi" srun option. Generally speaking you can determine the appropriate PMIx version to use by running the ompi_info command after loading the desired OpenMPI environment module. However, the correct PMIx version is version is already set as an environment variable "HPC_PMIX" in the openmpi modules. E.g.: --mpi=${HPC_PMIX}

Expand to see example.

For example,

$ module load intel/2018 openmpi/3.1.2
$ ompi_info --param pmix all
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: ext2x (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v3.1.2)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v3.1.2)
$ ml purge
$ ml intel/2019 openmpi/4.0.1
$ ompi_info --param pmix all
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v4.0.1)
                MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.0.1)


In the examples above, you would specify pmix_v2 (i.e. ext2x) for the combination of intel/2018 and openmpi/3.1.2 and pmix_v3 (ext3x) for the second set of modules, intel/2019 and openmpi/4.0.1.

Important srun/sbatch/salloc Options

This script can serve as a template for MPI, or message passing interface, applications. These are applications that can use multiple processors that may, or may not, be on multiple compute nodes.

Our testing has found that it is best to be very specific about how you want your MPI ranks laid out across nodes and even sockets (multi-core CPUs). SLURM and OpenMPI have some conflicting behavior if you leave too much to chance. Please refer to the full SLURM sbatch documentation, but the following directives are the main directives to pay attention to:

  • -c, --cpus-per-task=<ncpus>
    • Request ncpus cores per task.
  • -m, --distribution=arbitrary|<block|cyclic|plane=<options>[:block|cyclic|fcyclic]>
    • Specify alternate distribution methods for remote processes.
    • We recommend -m cyclic:cyclic, which tells SLURM to distribute tasks cyclically over nodes and sockets.
  • -N, --nodes=<minnodes[-maxnodes]>
    • Request that a minimum of minnodes nodes be allocated to this job.
  • -n, --ntasks=<number>
    • Number of tasks (MPI ranks)
  • --ntasks-per-node=<ntasks>
    • Request that ntasks be invoked on each node
  • --ntasks-per-socket=<ntasks>
    • Request the maximum ntasks be invoked on each socket
    • Notes on socket layout:
      • hpg3-compute nodes have 2 sockets, each with 64 cores.
      • hpg2-compute nodes have 2 sockets, each with 16 cores.
      • hpg1-compute nodes have 4 sockets, each with 16 cores.


Example

The following example requests 24 tasks, each with a single core. It further specifies that these should be split evenly on 2 nodes, and within the nodes, the 12 tasks should be evenly split on the two sockets. So each CPU on the two nodes will have 6 tasks, each with its own dedicated core. The --distribution option will ensure that tasks are assigned cyclically among the allocated nodes and sockets. Please see the SchedMD sbatch documentation for more detailed explanations of each of the sbatch options below.

SLURM is very flexible and allows users to be very specific about their resource requests. Thinking about your application and doing some testing will be important to determine the best set of resources for your specific job.

Expand to see example.

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --mail-type=END,FAIL         # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@ufl.edu    # Where to send mail.  Set this to your email address
#SBATCH --ntasks=24                  # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1            # Number of cores per MPI task 
#SBATCH --nodes=2                    # Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12         # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6        # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb          # Memory (i.e. RAM) per processor
#SBATCH --time=00:05:00              # Wall time limit (days-hrs:min:sec)
#SBATCH --output=mpi_test_%j.log     # Path to the standard output and error files relative to the working directory

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load intel/2018.1.163 openmpi/3.0.0
srun --mpi=${HPC_PMIX} /data/training/SLURM/prime/prime_mpi