Array ID Indexes

Back to SLURM Job Arrays

Using the array ID Index

SLURM will provide a $SLURM_ARRAY_TASK_ID variable to each task. It can be used inside the job script to handle input and output files for that task.

For instance, for a 100-task job array the input files can be named seq_1.fa, seq_2.fa and so on through seq_100.fa. In a job script for a blastn job they can be referenced as blastn -query seq_${SLURM_ARRAY_TASK_ID}.fa. The output files can be handled in the same way.
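For instance, a minimal job-script sketch for such a blastn array might look like the following (the module name, database, and resource requests are illustrative assumptions to adapt to your own setup):

#!/bin/bash
#SBATCH --job-name=blastn_array
#SBATCH --output=blastn_%A_%a.log   # %A = array job ID, %a = array task ID
#SBATCH --array=1-100               # one task per seq_N.fa input file
#SBATCH --time=01:00:00
#SBATCH --mem=2gb

module load ncbi_blast   # assumed module name; use the BLAST module available on your system

blastn -query seq_${SLURM_ARRAY_TASK_ID}.fa \
       -db nt \
       -out seq_${SLURM_ARRAY_TASK_ID}.blast.out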

One common application of array jobs is to process many input files. This is easy if the files are numbered as in the example above, but sequential numbering is not required. If, for example, you have a folder of 100 files that end in .txt, you can use the following approach to get the name of the file for each task automatically:

file=$(ls *.txt | sed -n ${SLURM_ARRAY_TASK_ID}p)
myscript -in $file

If, alternatively, you use an input file (e.g. 'input.list') with a list of samples/datasets to process (one per line), you can pick an item from the list as follows:

SAMPLE_LIST=($(<input.list))
SAMPLE=${SAMPLE_LIST[${SLURM_ARRAY_TASK_ID}]}
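Note that bash arrays are zero-indexed, while array task IDs commonly start at 1. Either submit the job with an index range starting at 0 (e.g. --array=0-99 for 100 samples) or shift the index when picking the sample:

SAMPLE=${SAMPLE_LIST[$((SLURM_ARRAY_TASK_ID - 1))]}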

Array ID use in Scripts

When running custom code written in Python or R, use the language's facility for reading environment variables to get the SLURM array task ID of the current task, then use it to select the correct input file or data column/row. For example:

Python
import os
task_id = os.getenv('SLURM_ARRAY_TASK_ID')
R
task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID")

Extended Example

This shell portion of a SLURM job script sets the input and output directories as variables. It then sets a RUN variable to the SLURM array task ID when running as a job, or to '1' when you are doing a quick test before submitting. The RUN value is used as an index to pick a full dataset path from the input directory; the file name is extracted and the specified extension removed to produce a sample name, which is then used to form the output path. Finally, the assembled command is printed to stdout, so it is recorded in the job log, and then executed.

Below is an explicit example of using SLURM arrays and automating the handling of input and output datasets.

INPUT_DIR=/blue/group/user/project/run/input
OUTPUT_DIR=/blue/group/user/project/run/output

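# Use the SLURM array task ID as the run index; default to 1 for a quick test outside of a job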
RUN=${SLURM_ARRAY_TASK_ID:-1}

echo "Run: ${RUN}"

module load plink/1.90b3.39

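# Pick the RUN-th .vcf.gz file in the input directory and derive the sample name from its file name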
INPUT_PATH=$(ls ${INPUT_DIR}/*.vcf.gz | sed -n ${RUN}p)
INPUT_FILE=$(basename ${INPUT_PATH})
SAMPLE=$(basename ${INPUT_FILE} .vcf.gz)

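# Assemble the plink command as a single string so it can be both logged and executed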
read -d '' cmd << EOF
plink \
--vcf ${INPUT_PATH} \
--out ${OUTPUT_DIR}/${SAMPLE} \
--allow-no-sex \
--keep-allele-order
EOF
echo ${cmd}

eval ${cmd}
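
With the default of '1', the same script can be tested on the first dataset without submitting a job. To run it over all inputs, submit it as an array job with one task per .vcf.gz file, for example (the script name and range here are placeholders):

sbatch --array=1-25 process_vcf.sh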