RAxML
Revision as of 12:40, 6 February 2014
Description
RAxML (Randomized Axelerated Maximum Likelihood), written by Alexandros Stamatakis and others, is a program for sequential and parallel Maximum Likelihood-based inference of large phylogenetic trees. It was originally derived from fastDNAml, which in turn was derived from Joe Felsenstein's dnaml, part of the PHYLIP package.
Required Modules
Serial
- raxml
Parallel (MPI)
- intel
- openmpi
- raxml
System Variables
- HPC_RAXML_DIR - installation directory
How To Run
Please see the discussion below on performance characteristics of the different implementations of RAxML. In general, there are four different RAxML executables installed on the Research Computing systems. They are
- Single threaded (i.e. serial)
- Multi-threaded (i.e. parallel)
- MPI (distributed memory, parallel)
- Hybrid (MPI for coarse-grained parallelism with threading for fine-grained parallelism)
Each of these is described below, along with notes on running them.
SSE3 vs AVX
SSE3 and AVX are vector instruction sets that can improve performance. For each of the four classes of RAxML versions, we have both an SSE3 and an AVX version compiled. HiPerGator compute servers do support the AVX instruction set, and in theory the AVX version should be slightly faster (~10%) than the SSE3 version. However, our testing indicates that this is not the case, so users should experiment with their particular dataset or default to the SSE3 version. To run the AVX version of RAxML, replace SSE3 in the executable name with AVX.
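One way to script that choice, as a sketch (the variable name is illustrative; the executable names follow the SSE3/AVX naming described above), is to check whether the compute server's CPU advertises AVX:

```shell
#!/bin/bash
# Pick the AVX build if this server's CPU reports AVX support in /proc/cpuinfo;
# otherwise fall back to the SSE3 build.
if grep -q avx /proc/cpuinfo; then
    RAXML=raxmlHPC-AVX
else
    RAXML=raxmlHPC-SSE3
fi
echo "Using $RAXML"
```

As noted above, the AVX advantage did not show up in our testing, so benchmarking both builds on your own dataset is the safer approach.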
Serial version
The serial version of RAxML is called raxmlHPC-SSE3, and is a single-threaded application.
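A minimal PBS script for the serial version might look like the following; the alignment file, model, seed, and run name are hypothetical placeholders for your own analysis options:

```shell
#!/bin/bash
#PBS -N raxml-serial
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00
#PBS -l pmem=2gb

cd $PBS_O_WORKDIR
module load raxml

# Hypothetical inputs: -s alignment, -m model, -p parsimony seed, -n run name
raxmlHPC-SSE3 -s myalignment.phy -m GTRGAMMA -p 12345 -n serial_run
```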
Multithreaded
The multi-threaded version of RAxML is called raxmlHPC-PTHREADS-SSE3 and can use multiple processors on a single compute server (or node). This version implements the fine-grained parallelism discussed below. Resource requests for this version should always be in the form of nodes=1:ppn=x, where x is the number of processors to use. Please see the information below when selecting the number of processors; in our testing, values over 8 do not significantly speed up analyses and should be avoided. It is important to use the -T flag, which tells RAxML how many threads to use. You can either give the same number used in the resource request, or use the PBS environment variable $PBS_NUM_PPN, which is set for you by the scheduler, e.g. -T $PBS_NUM_PPN.
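A sketch of a multi-threaded PBS script, using $PBS_NUM_PPN so the thread count always matches the resource request (input file names and options are hypothetical):

```shell
#!/bin/bash
#PBS -N raxml-pthreads
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
#PBS -l pmem=2gb

cd $PBS_O_WORKDIR
module load raxml

# -T picks up the ppn value from the scheduler via $PBS_NUM_PPN,
# so the script need not be edited when the resource request changes.
raxmlHPC-PTHREADS-SSE3 -T $PBS_NUM_PPN -s myalignment.phy -m GTRGAMMA -p 12345 -n pthreads_run
```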
MPI
The distributed-memory version of RAxML utilizes the MPI API. The executable is called raxmlHPC-MPI-SSE3 and can use multiple processors that may or may not be on the same compute server (node). This version implements the coarse-grained parallelism discussed below. Resource requests for this version should generally be in the form of nodes=1:ppn=x, where x is the number of processors to use, as long as x is 32 or less. If you want to use more than 32 processors, you should generally ask for more nodes.
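A sketch of an MPI job script, shown here running bootstrap replicates (which the coarse-grained parallelism splits across MPI processes); the alignment, seeds, replicate count, and run name are hypothetical:

```shell
#!/bin/bash
#PBS -N raxml-mpi
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
#PBS -l pmem=2gb

cd $PBS_O_WORKDIR
module load intel openmpi raxml

# One MPI process per requested processor; -b sets the bootstrap seed
# and -# the number of replicates to distribute among the processes.
mpiexec raxmlHPC-MPI-SSE3 -s myalignment.phy -m GTRGAMMA -p 12345 -b 12345 -# 100 -n mpi_run
```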
Hybrid
The hybrid (MPI plus multi-threading) executable of RAxML is called raxmlHPC-HYBRID-SSE3 and uses multiple processors on one or more compute servers. It implements both the coarse-grained and fine-grained parallelism discussed below. Resource requests for this version should be in the form of nodes=x:ppn=y. As with the MPI executable, if the total number of processors desired is 32 or less, the resource request should be nodes=1:ppn=y, and you should choose "mpiexec -np <number of coarse-grained processes>" and "-T <number of fine-grained threads>" such that the product of the two equals y.
For example, to run 5 coarse-grained processes, each using 4 fine-grained threads, the following resource request and command line are suggested.
#PBS -l nodes=1:ppn=20
...
mpiexec -bynode -np 5 raxmlHPC-HYBRID-SSE3 -T 4 ...
If you require more than 32 cores total, it is best to use multiple nodes. In this case, the number of nodes and the number of processors per node should correspond to the numbers of coarse-grained processes and fine-grained threads requested. For example,
#PBS -l nodes=10:ppn=4
...
mpiexec -bynode -np 10 raxmlHPC-HYBRID-SSE3 -T 4 ...
PBS Script Examples
See the RAxML_PBS page for raxml PBS script examples.
Performance
We highly recommend that users read the paper by Pfeiffer and Stamatakis (2010) before running parallel versions of RAxML. This paper provides a good overview of the different types of parallelism implemented in RAxML and how to best leverage them for analyses. The discussion below is largely based on this paper.
Coarse- and Fine-Grained Parallelism in RAxML
RAxML implements two different types of parallelism, referred to as coarse-grained and fine-grained. Coarse-grained parallelism can be split across multiple compute servers: each coarse-grained process works on one tree optimization, which may be a bootstrap replicate or an ML search. Fine-grained parallelism allows multiple processors on the SAME server to split up a single tree optimization. A single optimization cannot be split across servers.
If the user runs the -f option (bootstrap search and ML search in one analysis) using the MPI or Hybrid executables, the bootstrap replicates are split among the MPI processes, and once those are complete, each MPI process performs an independent ML search. This is slightly different from the other methods, since multiple ML searches are performed. While this is likely a good thing in terms of finding the ML tree and a thorough analysis, users should understand that this stage will not see a reduction in run time, because each MPI task performs an independent search rather than the tasks working together on a single search.
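As an illustration of that workflow (the alignment file, seeds, replicate count, process count, and run name below are hypothetical), a combined rapid bootstrap and ML search is requested with RAxML's -f a option:

```shell
# 100 rapid bootstrap replicates followed by an ML search, in one run.
# With the MPI executable, the replicates are split among the 8 processes;
# each process then performs its own independent ML search, so that final
# stage does not get faster as more processes are added.
mpiexec -np 8 raxmlHPC-MPI-SSE3 -f a -x 12345 -p 12345 -# 100 \
    -m GTRGAMMA -s myalignment.phy -n combined_run
```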