# Difference between revisions of "R"

## Revision as of 18:20, 13 June 2012

## Description

R is a free software environment for statistical computing and graphics.

## Available versions

**Note: File a support ticket to request installation of additional libraries.**

- 2.14.1-mpi - R base package MPI-enabled via the Rmpi library.
- 2.14.2
- 2.15.0 (default)

## Running the application using modules

To use R with the environment modules system at HPC, the following commands are available:

Get module information for R:

$ module spider R

Load the default application module:

$ module load R

The modulefile for this software adds the directory with executable files to the shell execution PATH and sets the following environment variables:

- HPC_R_DIR - directory where R is located.
- HPC_R_BIN - executable directory
- HPC_R_LIB - library directory
- HPC_R_INCLUDE - includes directory
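
As a quick check after loading the module, the variables listed above can be printed from the shell. This is only a sketch: the helper name show_r_module_env is made up for illustration, and it assumes the modulefile exports the variables so that printenv can see them.

```shell
# Hypothetical helper: report each variable the R modulefile is described as setting.
# Run it after "module load R"; unset variables are reported instead of causing an error.
show_r_module_env() {
    local name value
    for name in HPC_R_DIR HPC_R_BIN HPC_R_LIB HPC_R_INCLUDE; do
        # printenv exits non-zero when the variable is not in the environment.
        if value=$(printenv "$name"); then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is not set\n' "$name"
        fi
    done
}

show_r_module_env
```

If every line reads "is not set", the module is probably not loaded in the current shell.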

To use the version of R built for parallel execution with MPI via the Rmpi library, load the following modules:

$ module load intel/11.1 openmpi/1.4.3 R

## Installed Packages

**Note:** Many of the packages in the R library shown below are installed as part of the Bioconductor meta-library. The list is generated from the default R version.

| Package | Description |
| --- | --- |
| affy | Methods for Affymetrix Oligonucleotide Arrays |
| affydata | Affymetrix Data for Demonstration Purpose |
| affyio | Tools for parsing Affymetrix data files |
| affyPLM | Methods for fitting probe-level models |
| affyQCReport | QC Report Generation for affyBatch objects |
| akima | Interpolation of irregularly spaced data |
| annaffy | Annotation tools for Affymetrix biological metadata |
| annotate | Annotation for microarrays |
| AnnotationDbi | Annotation Database Interface |
| ape | Analyses of Phylogenetics and Evolution |
| base | The R Base Package |
| baySeq | Empirical Bayesian analysis of patterns of differential expression in count data |
| Biobase | Biobase: Base functions for Bioconductor |
| BiocGenerics | Generic functions for Bioconductor |
| BiocInstaller | Install/Update Bioconductor and CRAN Packages |
| Biostrings | String objects representing biological sequences, and matching algorithms |
| bitops | Functions for Bitwise operations |
| boot | Bootstrap Functions (originally by Angelo Canty for S) |
| caTools | Tools: moving window statistics, GIF, Base64, ROC AUC, etc. |
| class | Functions for Classification |
| cluster | Cluster Analysis Extended Rousseeuw et al. |
| CNVtools | A package to test genetic association with CNV data |
| codetools | Code Analysis Tools for R |
| colorspace | Color Space Manipulation |
| compiler | The R Compiler Package |
| datasets | The R Datasets Package |
| DBI | R Database Interface |
| DESeq | Differential gene expression analysis based on the negative binomial distribution |
| dichromat | Color schemes for dichromats |
| digest | Create cryptographic hash digests of R objects |
| doMC | Foreach parallel adaptor for the multicore package |
| doSNOW | Foreach parallel adaptor for the snow package |
| DynDoc | Dynamic document tools |
| edgeR | Empirical analysis of digital gene expression data in R |
| foreach | Foreach looping construct for R |
| foreign | Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ... |
| gcrma | Background Adjustment Using Sequence Information |
| gdata | Various R programming tools for data manipulation |
| gee | Generalized Estimation Equation solver |
| geiger | Analysis of evolutionary diversification |
| genefilter | genefilter: methods for filtering genes from microarray experiments |
| geneplotter | Graphics related functions for Bioconductor |
| GenomicRanges | Representation and manipulation of genomic intervals |
| ggplot2 | An implementation of the Grammar of Graphics |
| glmmADMB | Generalized Linear Mixed Models using AD Model Builder |
| GO.db | A set of annotation maps describing the entire Gene Ontology |
| gplots | Various R programming tools for plotting data |
| graphics | The R Graphics Package |
| grDevices | The R Graphics Devices and Support for Colours and Fonts |
| grid | The Grid Graphics Package |
| gtools | Various R programming tools |
| hacks | Convenient R Functions |
| hgu95av2.db | Affymetrix Human Genome U95 Set annotation data (chip hgu95av2) |
| HilbertVis | Hilbert curve visualization |
| Hmisc | Harrell Miscellaneous |
| IRanges | Infrastructure for manipulating intervals on sequences |
| iterators | Iterator construct for R |
| itertools | Iterator Tools |
| KEGG.db | A set of annotation maps for KEGG |
| KernSmooth | Functions for kernel smoothing for Wand & Jones (1995) |
| labeling | Axis Labeling |
| laser | Likelihood Analysis of Speciation/Extinction Rates from Phylogenies |
| lattice | Lattice Graphics |
| leaps | regression subset selection |
| limma | Linear Models for Microarray Data |
| locfit | Local Regression, Likelihood and Density Estimation. |
| maanova | Tools for analyzing Micro Array experiments |
| marray | Exploratory analysis for two-color spotted microarray data |
| MASS | Support Functions and Datasets for Venables and Ripley's MASS |
| Matrix | Sparse and Dense Matrix Classes and Methods |
| memoise | Memoise functions |
| methods | Formal Methods and Classes |
| mgcv | Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation |
| msm | Multi-state Markov and hidden Markov models in continuous time |
| multicore | Parallel processing of R code on machines with multiple cores or CPUs |
| multtest | Resampling-based multiple hypothesis testing |
| munsell | Munsell colour system |
| mvtnorm | Multivariate Normal and t Distributions |
| nlme | Linear and Nonlinear Mixed Effects Models |
| nnet | Feed-forward Neural Networks and Multinomial Log-Linear Models |
| org.Hs.eg.db | Genome wide annotation for Human |
| ouch | Ornstein-Uhlenbeck models for phylogenetic comparative hypotheses |
| parallel | Support for Parallel computation in R |
| permute | Functions for generating restricted permutations of data |
| pheatmap | Pretty Heatmaps |
| phylobase | Base package for phylogenetic structures and comparative data |
| picante | R tools for integrating phylogenies and ecology |
| plyr | Tools for splitting, applying and combining data |
| preprocessCore | A collection of pre-processing functions |
| prettyR | Pretty descriptive stats. |
| proto | Prototype object-based programming |
| qvalue | Q-value estimation for false discovery rate control |
| R2admb | ADMB to R interface functions |
| RColorBrewer | ColorBrewer palettes |
| Rcpp | Seamless R and C++ Integration |
| reshape | Flexibly reshape data. |
| reshape2 | Flexibly reshape data: a reboot of the reshape package. |
| rpart | Recursive Partitioning |
| RSQLite | SQLite interface for R |
| Rwave | Time-Frequency analysis of 1-D signals |
| scales | Scale functions for graphics. |
| simpleaffy | Very simple high level analysis of Affymetrix data |
| snow | Simple Network of Workstations |
| spatial | Functions for Kriging and Point Pattern Analysis |
| splines | Regression Spline Functions and Classes |
| statmod | Statistical Modeling |
| stats | The R Stats Package |
| stats4 | Statistical Functions using S4 Classes |
| stringr | Make it easier to work with strings. |
| subplex | Subplex optimization algorithm |
| survival | Survival analysis, including penalised likelihood. |
| tcltk | Tcl/Tk Interface |
| tools | Tools for Package Development |
| utils | The R Utils Package |
| vegan | Community Ecology Package |
| vsn | Variance stabilization and calibration for microarray data |
| waveslim | Basic wavelet routines for one-, two- and three-dimensional signal processing |
| wavethresh | Wavelets statistics and transforms. |
| XML | Tools for parsing and generating XML within R and S-Plus. |
| xtable | Export tables to LaTeX or HTML |
| zlibbioc | An R packaged zlib-1.2.5 |

## FAQ

**Q:** When I submit the job with N=1 and M=1, it runs and R allocates the 10 slaves that I want. Is this OK?

**A:** In short, no. This is bad since you are lying to the scheduler about the resources you intend to use. We have scripts that will kill your job if they catch it, and we tend to suspend accounts of users who make a practice of it. :)

**Q:** The actual job I want to run is much larger. Anywhere from 31 to 93 processors are desired. Is it OK to request this many processors?

**A:** That depends on the level of investment from your PI. If you ask for more processors than your group's core allocation, which depends on the investment level, you will essentially be borrowing cores from other groups and may wait an extended period of time in the queue before your job runs. Groups are allowed to run on up to 10x their core allocation provided the resources are available. If you ask for more than 10x your group's core allocation, the job will be blocked indefinitely.
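
The rule in this answer can be sketched as a small shell function. The name classify_core_request and the allocation of 8 cores in the examples are hypothetical, and the sketch looks at a single request in isolation, ignoring whatever else the group already has running.

```shell
# Hypothetical helper: classify a core request against a group's core allocation.
#   $1 = cores requested, $2 = the group's core allocation
classify_core_request() {
    local requested=$1 allocation=$2
    if [ "$requested" -le "$allocation" ]; then
        echo "within allocation"
    elif [ "$requested" -le $((allocation * 10)) ]; then
        # Above the allocation but within 10x: cores are borrowed from other
        # groups, so the job may wait in the queue for a long time.
        echo "borrowing"
    else
        # More than 10x the allocation: the job is blocked indefinitely.
        echo "blocked"
    fi
}

# Example with an assumed allocation of 8 cores:
classify_core_request 31 8   # prints "borrowing"
classify_core_request 93 8   # prints "blocked"
```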

**Q:** Do I need the number of nodes requested to be correct or can I just have R go grab slaves after the job is submitted with N=1 and M=1?

**A:** Your resource request must be consistent with what you actually intend to use as noted above.

**Q:** Is it better to request a large number of nodes for a shorter period of time or fewer nodes for a longer period of time (concretely, say 8 nodes for 40 hours versus 16 nodes for 20 hours) in terms of getting through the queue?

**A:** Do not confuse "nodes" with "cores/processors". Each "node" is a physical machine with between 4 and 48 cores. Your MPI threads will run on "cores" which may all be in the same "node" or spread among multiple nodes. You should ask for the number of cores you need and spread them among as few nodes as possible unless you have a good reason to do otherwise. Thus you should generally ask for things like

nodes=1:ppn=8 (we have lots of 8p nodes)

nodes=1:ppn=12 (we have a number of 12p also)

Multiples of the above work as well so you might ask for nodes=3:ppn=8 if you want to run 24 threads on 24 different cores.
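
To make the arithmetic concrete, here is a rough sketch that turns a core count into a request of this form. The function name pbs_nodes_spec is hypothetical. It rounds up to whole nodes, so a core count that is not a multiple of ppn results in a slightly larger request than strictly needed.

```shell
# Hypothetical helper: build a "nodes=N:ppn=P" request for a number of cores.
#   $1 = cores needed, $2 = cores per node to ask for (e.g. 8 or 12)
pbs_nodes_spec() {
    local cores=$1 ppn=$2
    # A job smaller than one node only needs that many cores on a single node.
    if [ "$cores" -lt "$ppn" ]; then
        ppn=$cores
    fi
    # Integer ceiling of cores / ppn, so the cores fit on as few nodes as possible.
    local nodes=$(( (cores + ppn - 1) / ppn ))
    echo "nodes=${nodes}:ppn=${ppn}"
}

pbs_nodes_spec 24 8    # prints "nodes=3:ppn=8"
pbs_nodes_spec 12 12   # prints "nodes=1:ppn=12"
```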

It looks like the R model uses a master/slave paradigm, so you really need one master thread to manage the "slave" threads. It is likely that the master thread accumulates little CPU time, so you *could* neglect it. In other words, tell the scheduler that you want nodes=3:ppn=8 and tell R to spawn 24 children. This is a white lie which will do little harm. However, if it turns out that the master accumulates significant CPU time and your job gets killed by our rogue process killer, you can ask for the resources as follows:

#PBS -l nodes=1:ppn=1:infiniband+3:ppn=8:infiniband

This will allocate 1 thread on a separate node (the master thread) and then the slave threads will be allocated on 3 additional nodes with at least 8 cores each.
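
When deciding how many slaves to tell R to spawn, it can help to count the cores in such a request. The following is a sketch only: the function name count_requested_cores is hypothetical, and it assumes each "+"-separated chunk begins with a node count and that a chunk without ppn means one core per node.

```shell
# Hypothetical helper: sum the cores in a nodes specification such as
# "1:ppn=1:infiniband+3:ppn=8:infiniband" (1*1 + 3*8 = 25).
count_requested_cores() {
    local rest=$1 total=0 chunk nodes ppn
    while [ -n "$rest" ]; do
        chunk=${rest%%+*}            # text before the first "+"
        nodes=${chunk%%:*}           # leading node count
        ppn=1                        # assumed default when ppn is omitted
        case $chunk in
            *ppn=*)
                ppn=${chunk#*ppn=}   # drop everything through "ppn="
                ppn=${ppn%%:*}       # drop trailing properties such as ":infiniband"
                ;;
        esac
        total=$((total + nodes * ppn))
        # Move on to the next chunk, or stop when none is left.
        case $rest in
            *+*) rest=${rest#*+} ;;
            *)   rest= ;;
        esac
    done
    echo "$total"
}

count_requested_cores "1:ppn=1:infiniband+3:ppn=8:infiniband"   # prints 25
```

With 25 cores requested, one is for the master and the remaining 24 are for the slaves.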

Example of using the parallel module to run MPI jobs under R 2.14.1+

Source file: rmpi_test.R

```
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}

# Spawn as many slaves as possible
mpi.spawn.Rslaves()

# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function() {
    if (is.loaded("mpi_initialize")) {
        if (mpi.comm.size(1) > 0) {
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}

# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size()))

# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
```