R

From UFRC
Revision as of 18:20, 13 June 2012 by Taylor (talk | contribs)

Description

R is a free software environment for statistical computing and graphics.

Available versions

Note: File a support ticket to request installation of additional libraries.

  • 2.14.1-mpi - R base package MPI-enabled via the Rmpi library.
  • 2.14.2
  • 2.15.0 (default)

Running the application using modules

To use R with the environment modules system at HPC the following commands are available:

Get module information for R:

$ module spider R

Load the default application module:

$ module load R

The modulefile for this software adds the directory with executable files to the shell execution PATH and sets the following environment variables:

  • HPC_R_DIR - directory where R is located.
  • HPC_R_BIN - executable directory
  • HPC_R_LIB - library directory
  • HPC_R_INCLUDE - includes directory
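
As a quick check after loading the module, the variables listed above can be printed from the shell. This is only a minimal sketch: `show_r_env` is a hypothetical helper name, not something the modulefile provides, and any variable that is not set is reported as `(unset)`.

```shell
# Hypothetical helper: print the HPC_R_* variables set by the R modulefile.
show_r_env() {
    for name in HPC_R_DIR HPC_R_BIN HPC_R_LIB HPC_R_INCLUDE; do
        # Indirect variable lookup that also works in plain POSIX sh.
        eval "value=\${$name:-}"
        printf '%s=%s\n' "$name" "${value:-(unset)}"
    done
}

show_r_env
```

Run it after `module load R`; if every line shows `(unset)`, the module is probably not loaded in the current shell.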

To use the version of R built for parallel execution with MPI via the Rmpi library, load the following modules:

$ module load intel/11.1 openmpi/1.4.3 R

Installed Packages

Note: Many of the packages in the R library shown below are installed as part of the Bioconductor meta-library. The list is generated from the default R version.

affy                    Methods for Affymetrix Oligonucleotide Arrays
affydata                Affymetrix Data for Demonstration Purpose
affyio                  Tools for parsing Affymetrix data files
affyPLM                 Methods for fitting probe-level models
affyQCReport            QC Report Generation for affyBatch objects
akima                   Interpolation of irregularly spaced data
annaffy                 Annotation tools for Affymetrix biological
                        metadata
annotate                Annotation for microarrays
AnnotationDbi           Annotation Database Interface
ape                     Analyses of Phylogenetics and Evolution
base                    The R Base Package
baySeq                  Empirical Bayesian analysis of patterns of
                        differential expression in count data
Biobase                 Biobase: Base functions for Bioconductor
BiocGenerics            Generic functions for Bioconductor
BiocInstaller           Install/Update Bioconductor and CRAN Packages
Biostrings              String objects representing biological
                        sequences, and matching algorithms
bitops                  Functions for Bitwise operations
boot                    Bootstrap Functions (originally by Angelo Canty
                        for S)
caTools                 Tools: moving window statistics, GIF, Base64,
                        ROC AUC, etc.
class                   Functions for Classification
cluster                 Cluster Analysis Extended Rousseeuw et al.
CNVtools                A package to test genetic association with CNV
                        data
codetools               Code Analysis Tools for R
colorspace              Color Space Manipulation
compiler                The R Compiler Package
datasets                The R Datasets Package
DBI                     R Database Interface
DESeq                   Differential gene expression analysis based on
                        the negative binomial distribution
dichromat               Color schemes for dichromats
digest                  Create cryptographic hash digests of R objects
doMC                    Foreach parallel adaptor for the multicore
                        package
doSNOW                  Foreach parallel adaptor for the snow package
DynDoc                  Dynamic document tools
edgeR                   Empirical analysis of digital gene expression
                        data in R
foreach                 Foreach looping construct for R
foreign                 Read Data Stored by Minitab, S, SAS, SPSS,
                        Stata, Systat, dBase, ...
gcrma                   Background Adjustment Using Sequence
                        Information
gdata                   Various R programming tools for data
                        manipulation
gee                     Generalized Estimation Equation solver
geiger                  Analysis of evolutionary diversification
genefilter              genefilter: methods for filtering genes from
                        microarray experiments
geneplotter             Graphics related functions for Bioconductor
GenomicRanges           Representation and manipulation of genomic
                        intervals
ggplot2                 An implementation of the Grammar of Graphics
glmmADMB                Generalized Linear Mixed Models using AD Model
                        Builder
GO.db                   A set of annotation maps describing the entire
                        Gene Ontology
gplots                  Various R programming tools for plotting data
graphics                The R Graphics Package
grDevices               The R Graphics Devices and Support for Colours
                        and Fonts
grid                    The Grid Graphics Package
gtools                  Various R programming tools
hacks                   Convenient R Functions
hgu95av2.db             Affymetrix Human Genome U95 Set annotation data
                        (chip hgu95av2)
HilbertVis              Hilbert curve visualization
Hmisc                   Harrell Miscellaneous
IRanges                 Infrastructure for manipulating intervals on
                        sequences
iterators               Iterator construct for R
itertools               Iterator Tools
KEGG.db                 A set of annotation maps for KEGG
KernSmooth              Functions for kernel smoothing for Wand & Jones
                        (1995)
labeling                Axis Labeling
laser                   Likelihood Analysis of Speciation/Extinction
                        Rates from Phylogenies
lattice                 Lattice Graphics
leaps                   regression subset selection
limma                   Linear Models for Microarray Data
locfit                  Local Regression, Likelihood and Density
                        Estimation.
maanova                 Tools for analyzing Micro Array experiments
marray                  Exploratory analysis for two-color spotted
                        microarray data
MASS                    Support Functions and Datasets for Venables and
                        Ripley's MASS
Matrix                  Sparse and Dense Matrix Classes and Methods
memoise                 Memoise functions
methods                 Formal Methods and Classes
mgcv                    Mixed GAM Computation Vehicle with GCV/AIC/REML
                        smoothness estimation
msm                     Multi-state Markov and hidden Markov models in
                        continuous time
multicore               Parallel processing of R code on machines with
                        multiple cores or CPUs
multtest                Resampling-based multiple hypothesis testing
munsell                 Munsell colour system
mvtnorm                 Multivariate Normal and t Distributions
nlme                    Linear and Nonlinear Mixed Effects Models
nnet                    Feed-forward Neural Networks and Multinomial
                        Log-Linear Models
org.Hs.eg.db            Genome wide annotation for Human
ouch                    Ornstein-Uhlenbeck models for phylogenetic
                        comparative hypotheses
parallel                Support for Parallel computation in R
permute                 Functions for generating restricted
                        permutations of data
pheatmap                Pretty Heatmaps
phylobase               Base package for phylogenetic structures and
                        comparative data
picante                 R tools for integrating phylogenies and ecology
plyr                    Tools for splitting, applying and combining
                        data
preprocessCore          A collection of pre-processing functions
prettyR                 Pretty descriptive stats.
proto                   Prototype object-based programming
qvalue                  Q-value estimation for false discovery rate
                        control
R2admb                  ADMB to R interface functions
RColorBrewer            ColorBrewer palettes
Rcpp                    Seamless R and C++ Integration
reshape                 Flexibly reshape data.
reshape2                Flexibly reshape data: a reboot of the reshape
                        package.
rpart                   Recursive Partitioning
RSQLite                 SQLite interface for R
Rwave                   Time-Frequency analysis of 1-D signals
scales                  Scale functions for graphics.
simpleaffy              Very simple high level analysis of Affymetrix
                        data
snow                    Simple Network of Workstations
spatial                 Functions for Kriging and Point Pattern
                        Analysis
splines                 Regression Spline Functions and Classes
statmod                 Statistical Modeling
stats                   The R Stats Package
stats4                  Statistical Functions using S4 Classes
stringr                 Make it easier to work with strings.
subplex                 Subplex optimization algorithm
survival                Survival analysis, including penalised
                        likelihood.
tcltk                   Tcl/Tk Interface
tools                   Tools for Package Development
utils                   The R Utils Package
vegan                   Community Ecology Package
vsn                     Variance stabilization and calibration for
                        microarray data
waveslim                Basic wavelet routines for one-, two- and
                        three-dimensional signal processing
wavethresh              Wavelets statistics and transforms.
XML                     Tools for parsing and generating XML within R
                        and S-Plus.
xtable                  Export tables to LaTeX or HTML
zlibbioc                An R packaged zlib-1.2.5




FAQ

Q: When I submit the job with N=1 and M=1 it runs and R allocates the 10 slaves that I want. Is this OK?

A: In short, no. This is bad since you are lying to the scheduler about the resources you intend to use. We have scripts that will kill your job if they catch it, and we tend to suspend accounts of users who make a practice of it. :)

Q: The actual job I want to run is much larger. Anywhere from 31 to 93 processors are desired. Is it OK to request this many processors?

A: That depends on the level of investment from your PI. If you ask for more processors than your group's core allocation, which depends on the investment level, you will essentially be borrowing cores from other groups and may wait an extended period of time in the queue before your job runs. Groups are allowed to run on up to 10x their core allocation provided the resources are available. If you ask for more than 10x your group's core allocation, the job will be blocked indefinitely.
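
The allocation rule in this answer can be sketched as a small shell check. The function name `classify_request` and the 8-core allocation used in the example are hypothetical; only the 10x limit comes from the answer itself.

```shell
# Hypothetical helper: classify a core request against a group's allocation.
# Usage: classify_request REQUESTED_CORES GROUP_ALLOCATION
classify_request() {
    requested=$1
    allocation=$2
    limit=$((allocation * 10))   # groups may run on up to 10x their allocation
    if [ "$requested" -le "$allocation" ]; then
        echo "within allocation"
    elif [ "$requested" -le "$limit" ]; then
        echo "borrowing cores: may wait in the queue"
    else
        echo "blocked: exceeds 10x allocation"
    fi
}

# Example with made-up numbers: a group with an 8-core allocation.
classify_request 8 8
classify_request 31 8
classify_request 93 8
```

With these made-up numbers, the 31-core request falls in the borrowing range and the 93-core request is over the 10x limit.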

Q: Do I need the number of nodes requested to be correct or can I just have R go grab slaves after the job is submitted with N=1 and M=1?

A: Your resource request must be consistent with what you actually intend to use as noted above.

Q: Is it better to request a large number of nodes for a shorter period of time or fewer nodes for a longer period of time (concretely, say 8 nodes for 40 hours versus 16 nodes for 20 hours) in terms of getting through the queue?

A: Do not confuse "nodes" with "cores/processors". Each "node" is a physical machine with between 4 and 48 cores. Your MPI threads will run on "cores" which may all be in the same "node" or spread among multiple nodes. You should ask for the number of cores you need and spread them among as few nodes as possible unless you have a good reason to do otherwise. Thus you should generally ask for things like

nodes=1:ppn=8 (we have lots of 8p nodes)
nodes=1:ppn=12 (we have a number of 12p also)

Multiples of the above work as well so you might ask for nodes=3:ppn=8 if you want to run 24 threads on 24 different cores.
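
The arithmetic behind a request (node count times ppn) can be sketched in shell. `total_cores` is a hypothetical helper, not a scheduler command; it assumes the spec has the form `nodes=N:ppn=P`, sums any parts joined with `+`, and treats a missing `ppn` as 1.

```shell
# Hypothetical helper: count the cores requested by a "nodes=..." spec.
# Usage: total_cores 'nodes=3:ppn=8'
total_cores() {
    spec=${1#nodes=}              # drop the leading "nodes="
    total=0
    old_ifs=$IFS
    IFS=+                         # parts of the spec are joined with "+"
    for chunk in $spec; do
        count=${chunk%%:*}        # node count is the text before the first ":"
        ppn=1                     # assumed default when ppn is not given
        case $chunk in
            *ppn=*)
                ppn=${chunk#*ppn=}
                ppn=${ppn%%:*}    # strip trailing properties after ppn
                ;;
        esac
        total=$((total + count * ppn))
    done
    IFS=$old_ifs
    echo "$total"
}

total_cores 'nodes=3:ppn=8'
total_cores 'nodes=1:ppn=12'
```

The first call prints 24, matching the 24 threads on 24 cores described above, and the second prints 12.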

R appears to use a master/slave model, so you need one master thread to manage the "slave" threads. The master thread is likely to accumulate little CPU time, so you can usually neglect it. In other words, tell the scheduler that you want nodes=3:ppn=8 and tell R to spawn 24 children. This is a white lie which will do little harm. However, if it turns out that the master accumulates significant CPU time and your job gets killed by our rogue process killer, you can ask for the resources as follows:

#PBS -l nodes=1:ppn=1:infiniband+3:ppn=8:infiniband

This will allocate 1 thread on a separate node (the master thread) and then the slave threads will be allocated on 3 additional nodes with at least 8 cores each.


Example of using the parallel module to run MPI jobs under R 2.14.1+

Contents of the rmpi_test.R file:

# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}

# Spawn as many slaves as possible
mpi.spawn.Rslaves()
                                                                                
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
    if (is.loaded("mpi_initialize")){
        if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))

# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
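
To run rmpi_test.R as a batch job, a submission script along the following lines could be used. This is a sketch under stated assumptions, not a tested site recipe: `print_rmpi_job` is a hypothetical helper that prints the script, and the job name, the walltime, and the `mpiexec -np 1 Rscript` launch line are all assumptions. Only the `module load` line and the `nodes=...:ppn=...` request format are taken from this page.

```shell
# Hypothetical helper: print a PBS job script that runs rmpi_test.R.
# Usage: print_rmpi_job NODES PPN
print_rmpi_job() {
    nodes=$1
    ppn=$2
    cat <<EOF
#!/bin/bash
#PBS -N rmpi_test
#PBS -l nodes=${nodes}:ppn=${ppn}
#PBS -l walltime=00:10:00

# Run from the directory the job was submitted from.
cd "\$PBS_O_WORKDIR"

# Load the modules for the MPI-enabled R build.
module load intel/11.1 openmpi/1.4.3 R

# Assumed launch line: start one master R process, which spawns the slaves.
mpiexec -np 1 Rscript rmpi_test.R
EOF
}

print_rmpi_job 3 8
```

Redirect the output to a file (for example `print_rmpi_job 3 8 > rmpi_test.pbs`) and submit that file with `qsub`. The resource request must match what R will actually use.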