Difference between revisions of "R"

From UFRC
 
(35 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
__NOEDITSECTION__
 
__NOEDITSECTION__
[[Category:Software]][[Category:Statistics]]
+
{|align=right
 +
  |__TOC__
 +
  |}
 +
[[Category:Software]][[Category:Statistics]][[Category:Programming]]
 
{|<!--Main settings - REQUIRED-->
 
{|<!--Main settings - REQUIRED-->
 
|{{#vardefine:app|R}}
 
|{{#vardefine:app|R}}
Line 6: Line 9:
 
|{{#vardefine:exe|1}} <!--Present manual instructions for running the software -->
 
|{{#vardefine:exe|1}} <!--Present manual instructions for running the software -->
 
|{{#vardefine:conf|}} <!--Enable config wiki page link - {{#vardefine:conf|1}} = ON/conf|}} = OFF-->
 
|{{#vardefine:conf|}} <!--Enable config wiki page link - {{#vardefine:conf|1}} = ON/conf|}} = OFF-->
|{{#vardefine:pbs|1}} <!--Enable PBS script wiki page link-->
+
|{{#vardefine:job|1}} <!--Enable job script wiki page link-->
 
|{{#vardefine:policy|}} <!--Enable policy section -->
 
|{{#vardefine:policy|}} <!--Enable policy section -->
 
|{{#vardefine:testing|1}} <!--Enable performance testing/profiling section -->
 
|{{#vardefine:testing|1}} <!--Enable performance testing/profiling section -->
Line 21: Line 24:
 
'''Note: File a [http://support.rc.ufl.edu support ticket] to request installation of additional libraries.'''
 
'''Note: File a [http://support.rc.ufl.edu support ticket] to request installation of additional libraries.'''
 
<!--Modules-->
 
<!--Modules-->
==Required Modules==
+
==Environment Modules==
[[Modules|modules documentation]]
+
Run <code>module spider {{#var:app}}</code> to find out what environment modules are available for this application.
===Serial===
 
*{{#var:app}}
 
===Parallel (MPI)===
 
*Rmpi
 
The "Rmpi" module enables access to the version of R that provides the Rmpi library for large-scale multi-node parallel computations.
 
 
==System Variables==
 
==System Variables==
* HPC_{{#uppercase:{{#var:app}}}}_DIR - installation directory
+
* HPC_{{uc:{{#var:app}}}}_DIR - installation directory
 
* HPC_R_BIN - executable directory
 
* HPC_R_BIN - executable directory
 
* HPC_R_LIB - library directory
 
* HPC_R_LIB - library directory
 
* HPC_R_INCLUDE - includes directory
 
* HPC_R_INCLUDE - includes directory
 
{{#if: {{#var: exe}}|==How To Run==
 
{{#if: {{#var: exe}}|==How To Run==
R can be run on the command-line (or the batch system) using the '<code>Rscript myscript.R</code>' or '<code>R CMD BATCH myscript.R</code>' command or, for script development or visualization, via RStudio ('rstudio' environment module and command) on gui.rc.ufl.edu or gui1.rc.ufl.edu servers.
+
R can be run on the command line (or through the batch system) with the '<code>Rscript myscript.R</code>' or '<code>R CMD BATCH myscript.R</code>' command. For script development or visualization, the RStudio GUI application can be used; see the [[GUI_Programs|respective documentation]] for details. Alternatively, an instance of [[RStudio_Server|RStudio Server]] can be started in a job, and you can then connect to it through an SSH tunnel from a web browser on your local computer.
|}}
+
;Notes and Warnings:
{{#if: {{#var: conf}}|==Configuration==
 
See the [[{{PAGENAME}}_Configuration]] page for {{#var: app}} configuration details.|}}
 
{{#if: {{#var: pbs}}|==PBS Script Examples==
 
See the [[{{PAGENAME}}_PBS]] page for {{#var: app}} PBS script examples.|}}
 
{{#if: {{#var: policy}}|==Usage Policy==
 
WRITE USAGE POLICY HERE (perhaps templates for a couple of main licensing schemes can be used)|}}
 
{{#if: {{#var: testing}}|==Performance==
 
We have benchmarked our most recent installed R version (3.0.2) built with the included blas/lapack libraries versus the newest (as of April 2015) release 3.2.0 built with Intel MKL libraries on the HiPerGator1 hardware (AMD Abu Dhabi 2.4GHz CPUs) and the Intel Haswell 2.3GHz CPUs we're testing for possible usage in HiPerGator2. The results are presented below:
 
===R Benchmark 2.5===
 
  
Number of times each test is run: 3
+
* The parallel::detectCores() function will return the total number of cores on a compute node and not the number of cores assigned to your job by the scheduler. Instead, use something like
{| class="wikitable"
+
numCores = as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))
! One
+
to find out the number of CPU cores 'X' requested in your job script by:
! colspan="3" | Two
+
#SBATCH --cpus-per-task=X
! Three
 
|-
 
|
 
|
 
|
 
|}
 
  
  border="1"  cellspacing="0" cellpadding="2" align="center"  class="wikitable" style="border-collapse: collapse; margin: 1em 1em 1em 0; border-top: none; border-right:none; "
+
* Default RData format
{| class="wikitable"
+
In R-3.6.0 the default serialization format used to save RData files was changed to version 3 (RDX3), so R versions prior to 3.5.0 will not be able to open such files. Keep this in mind if you copy RData files from HiPerGator to an external system with an old R installed.
! Benchmark Name
 
! colspan="3" | Time, sec
 
! Notes
 
|-
 
| I. Matrix calculation || || ||
 
|-
 
| One || Two || Three || Four
 
|-
 
|}
 
  
<pre>
+
* Java
Creation, transp., deformation of a 2500x2500 matrix (sec):  1.06766666666667
+
rJava users need to load the java module manually with '<code>module load java/1.7.0_79</code>'
2400x2400 normal distributed random matrix ^1000____ (sec):  1.32266666666667
 
Sorting of 7,000,000 random values__________________ (sec):  0.867333333333332
 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.785000000000001
 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.407333333333334
 
                      --------------------------------------------
 
                Trimmed geom. mean (2 extremes eliminated):  0.899146502701189
 
  
  II. Matrix functions
+
* TMPDIR
  --------------------
+
If temporary files are produced, they may fill up the memory disks on HPG2 nodes and cause node and job failures. Use something like
FFT over 2,400,000 random values____________________ (sec):  0.288999999999999
+
  mkdir -p tmp
Eigenvalues of a 640x640 random matrix______________ (sec): 0.349000000000001
+
  export TMPDIR=$(pwd)/tmp
Determinant of a 2500x2500 random matrix____________ (sec):  0.388333333333333
+
in your job script to prevent this and launch your job from the respective directory and not from your home directory.
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.404000000000001
 
Inverse of a 1600x1600 random matrix________________ (sec):  0.314666666666668
 
                      --------------------------------------------
 
                Trimmed geom. mean (2 extremes eliminated):  0.34937643771409
 
  
  III. Programmation
+
{{Note|'''For users of PHI and FERPA:''' It is particularly important to set your working and TMPDIR directories to be in your project's PHI/FERPA configured directory in <code>/blue</code> when working with R. Writing files to <code>/home</code> or <code>$TMPDIR</code> could expose restricted data to unauthorized users.|warn}}
  ------------------
 
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  1.346
 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 0.322333333333335
 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  0.492
 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.336666666666666
 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.402000000000001
 
                      --------------------------------------------
 
                Trimmed geom. mean (2 extremes eliminated):  0.405319120529265
 
  
 +
* Tasks vs Cores for parallel runs
 +
Parallel threads in an R job will be bound to the same CPU core even if multiple tasks (ntasks) are specified in the job script. Request cores with cpus-per-task instead so that the R 'parallel' package can use them. For example, for an 8-thread parallel job use the following resource request in your job script:
 +
#SBATCH --nodes=1
 +
#SBATCH --ntasks=1
 +
#SBATCH --cpus-per-task=8
  
Total time for all 15 tests_________________________ (sec):  9.09400000000001
+
See the single-threaded and multi-threaded examples on the [[Sample SLURM Scripts]] page for more details.
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.503083863882055
 
</pre>
 
 
|}}
 
|}}
<!--Faq-->
+
{{#if: {{#var: conf}}|==Configuration==
{{#if: {{#var: faq}}|==FAQ==
+
See the [[{{PAGENAME}}_Configuration]] page for {{#var: app}} configuration details.|}}
*'''Q:''' When I submit the job with N=1 and M=1 it runs and R allocates the 10 slaves that I want. Is this OK?
+
{{#if: {{#var: job}}|==Job Script Examples==
**'''A:''' In short, no. This is bad since you are lying to the scheduler about the resources you intend to run.  We have scripts that will kill your job if they catch it and we tend to suspend accounts of users who make a practice of it. :)
+
<div class="mw-collapsible mw-collapsed" style="width:70%; padding: 5px; border: 1px solid gray;">
*'''Q:''' The actual job I want to run is much larger.  Anywhere from 31 to 93 processors are desired.  Is it ok to request this many processors.
+
''Expand this section to view example R script.''
**'''A:''' That depends on the level of investment from your PI.  If you ask for more processors than your group's core allocation, which depends on the investment level, you will essentially be borrowing cores from other groups and may wait an extended period of time in the queue before your job runs.  Groups are allowed to run on up to 10x their core allocation provided the resources are available.  If you ask for more than 10x your group's core allocation, the job will be blocked indefinitely.
+
<div class="mw-collapsible-content" style="padding: 5px;">
*'''Q:'''  Do I need the number of nodes requested to be correct or can I just have R go grab slaves after the job is submitted with N=1 and M=1?
+
<source lang=bash>
**'''A:''' Your resource request must be consistent with what you actually intend to use as noted above.
+
#!/bin/bash
<ul>
+
#SBATCH --job-name=R_test  #Job name
<li>'''Q:''' Is it better to request a large number of nodes for a shorter period of time or fewer nodes for a longer period of time (concretely, say 8 nodes for 40 hours versus 16 nodes for 20 hours) in terms of getting through the queue?
+
#SBATCH --mail-type=END,FAIL  # Mail events (NONE, BEGIN, END, FAIL, ALL)
<ul><li>'''A:''' Do not confuse "nodes" with "cores/processors".  Each "node" is a physical machine with between 4 and 48 cores.  Your MPI threads will run on "cores" which may all be in the same "node" or spread among multiple nodes.  You should ask for the number of cores you need and spread them among as few nodes as possible unless you have a good reason to do otherwise.  Thus you should generally ask for things like<pre>
+
#SBATCH --mail-user=ENTER_YOUR_EMAIL_HERE  # Where to send mail
#PBS -l nodes=1:ppn=8    (we have lots of 8p nodes)
+
#SBATCH --ntasks=1
#PBS -l nodes=1:ppn=12  (we have a number of 12p also)</pre>
+
#SBATCH --mem=1gb  # Memory per node
Multiples of the above work as well so you might ask for nodes=3:ppn=8 if you want to run 24 threads on 24 different cores.
+
#SBATCH --time=00:05:00  # Walltime
It looks like in the R model there is a master/slave paradigm so you really need one master thread to manage the "slave" threads.  It is likely that the master thread accumulates little CPU time so you ''could'' neglect it.  In other words tell the scheduler that you want nodes=3:ppn=8 and tell R to spawn 24 children.
+
#SBATCH --output=r_job.%j.out   # Name output file
This is a white lie which will do little harm. However, if it turns out that the master accumulates significant CPU time and your job gets killed by our rogue process killer, you can ask for the resources as follows
+
#Record the time and compute node the job ran on
#PBS -l nodes=1:ppn=1:infiniband+3:ppn=8:infiniband
+
date; hostname; pwd
This will allocate 1 thread on a separate node (the master thread) and then the slave threads will be allocated on 3 additional nodes with at least 8 cores each.</li>
+
#Use modules to load the environment for R
</ul>
+
module load R
</ul>
+
 
 +
#Run R script
 +
Rscript myRscript.R
 +
 
 +
date
 +
</source></div></div>
 
|}}
 
|}}
 +
{{#if: {{#var: policy}}|==Usage Policy==
 +
WRITE USAGE POLICY HERE (perhaps templates for a couple of main licensing schemes can be used)|}}
 +
{{#if: {{#var: testing}}|==Performance==
 +
We have benchmarked our most recent installed R version (3.0.2) built with the included blas/lapack libraries versus the newest (as of April 2015) release 3.2.0 built with Intel MKL libraries on the HiPerGator1 hardware (AMD Abu Dhabi 2.4GHz CPUs) and the Intel Haswell 2.3GHz CPUs we're testing for possible usage in HiPerGator2. The results are presented in the [[R Benchmark 2.5]] table |}}
 
{{#if: {{#var: citation}}|==Citation==
 
{{#if: {{#var: citation}}|==Citation==
 
If you publish research that uses {{{app}}} you have to cite it as follows:
 
If you publish research that uses {{{app}}} you have to cite it as follows:
Line 129: Line 97:
 
|}}
 
|}}
 
==Rmpi Example==
 
==Rmpi Example==
Example of using the parallel module to run MPI jobs under R 2.14.1+
+
See [[R MPI Example]] page for an example of using Rmpi code.
 +
 
 +
==Installed Libraries==
 +
You can install your own libraries to use with R. These are stored in your /home/ environment. For details visit our [[Applications FAQ]] and see the section "How do I install R packages?".
  
{{#fileAnchor: rmpi_test.R}}
+
Make sure the directory for that version of R is created or R will try to install to a system path and fail. E.g. for R/4.3 run the following command before attempting to install a package:
Download raw source of the [{{#fileLink: rmpi_test.R}} rmpi_test.R] file.
+
mkdir -p ~/R/x86_64-pc-linux-gnu-library/4.3
<source lang=bash>
 
# Load the R MPI package if it is not already loaded.
 
if (!is.loaded("mpi_initialize")) {
 
    library("Rmpi")
 
    }
 
                                                                               
 
# Spawn as many slaves as possible
 
mpi.spawn.Rslaves()
 
                                                                               
 
# In case R exits unexpectedly, have it automatically clean up
 
# resources taken up by Rmpi (slaves, memory, etc...)
 
.Last <- function(){
 
    if (is.loaded("mpi_initialize")){
 
        if (mpi.comm.size(1) > 0){
 
            print("Please use mpi.close.Rslaves() to close slaves.")
 
            mpi.close.Rslaves()
 
        }
 
        print("Please use mpi.quit() to quit R")
 
        .Call("mpi_finalize")
 
    }
 
}
 
  
# Tell all slaves to return a message identifying themselves
+
You can set a custom library path with the R_LIBS_USER environment variable.
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
+
From [https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html]:
  
# Tell all slaves to close down, and exit the program
+
"R_LIBS_USER - user's library path, e.g. R_LIBS_USER=~/R/%p-library/%v is the folder specification used by default on all platforms and and R version. The folder must exist, otherwise it is ignored by R. The %p (platform) and %v (version) parts are R-specific conversion specifiers."
mpi.close.Rslaves()
 
mpi.quit()
 
</source>
 
  
==Installed Libraries==
+
To see a list of installed libraries in the currently loaded version of R:
 +
<pre>
 +
$ R
 +
> installed.packages()
 +
</pre>
 
'''Note: ''' Many of the packages in the R library shown below are installed as a part of Bioconductor meta-library. The list is generated from the default R version.
 
'''Note: ''' Many of the packages in the R library shown below are installed as a part of Bioconductor meta-library. The list is generated from the default R version.
<!-- Note to HPC Staff: paste the list generated by the "library()" command between the <pre> </pre> tags in the http://wiki.hpc.ufl.edu/index.php/R_libraries wiki page for the inclusion below to work. -->
+
<!-- Note to HPC Staff: paste the list generated by the "library()" command between the <pre> </pre> tags in the http://wiki.rc.ufl.edu/index.php/R_libraries wiki page for the inclusion below to work. -->
 +
<div class="mw-collapsible mw-collapsed" style="width:70%; padding: 5px; border: 1px solid gray;">
 +
''Expand this section to view installed library list.''
 +
<div class="mw-collapsible-content" style="padding: 5px;">
 
{{:R_libraries}}
 
{{:R_libraries}}
 +
</div>
 +
</div>

Latest revision as of 14:41, 20 September 2024

Description

R website  

R is a free software environment for statistical computing and graphics.

Note: File a support ticket to request installation of additional libraries.

Environment Modules

Run module spider R to find out what environment modules are available for this application.

System Variables

  • HPC_R_DIR - installation directory
  • HPC_R_BIN - executable directory
  • HPC_R_LIB - library directory
  • HPC_R_INCLUDE - includes directory
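
To check which of these variables are defined in the current session, a small shell sketch such as the following can be used. The function name show_r_module_vars is hypothetical; the variables are only expected to be set after an R environment module has been loaded.

```shell
#!/bin/bash
# show_r_module_vars (hypothetical helper): print each variable listed above,
# or a placeholder when it is not set in the current environment.
show_r_module_vars() {
    for name in HPC_R_DIR HPC_R_BIN HPC_R_LIB HPC_R_INCLUDE; do
        # printenv exits non-zero for an unset variable, hence "|| true".
        value="$(printenv "$name" || true)"
        echo "${name}=${value:-<not set>}"
    done
}

show_r_module_vars
```

If every line reads "<not set>", the R module has most likely not been loaded in this shell.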

How To Run

R can be run on the command line (or through the batch system) with the 'Rscript myscript.R' or 'R CMD BATCH myscript.R' command. For script development or visualization, the RStudio GUI application can be used; see the respective documentation for details. Alternatively, an instance of RStudio Server can be started in a job, and you can then connect to it through an SSH tunnel from a web browser on your local computer.
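
The two commands differ in where output goes: Rscript prints to standard output, while R CMD BATCH writes to myscript.Rout by default. The sketch below wraps the Rscript form with basic checks; the function name run_r_script and the .log naming are illustrative assumptions, not a site convention.

```shell
#!/bin/bash
# run_r_script (hypothetical helper): run an R script with Rscript and keep
# its output in a log file next to the script.
run_r_script() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "R script not found: $script" >&2
        return 2
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript is not on PATH; load the R module first" >&2
        return 3
    fi
    # Rscript prints to standard output; redirect it to keep a log.
    Rscript "$script" > "${script%.R}.log" 2>&1
}

# Usage: run_r_script myscript.R
```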

Notes and Warnings
  • The parallel::detectCores() function will return the total number of cores on a compute node and not the number of cores assigned to your job by the scheduler. Instead, use something like
numCores = as.integer(Sys.getenv("SLURM_CPUS_ON_NODE"))

to find out the number of CPU cores 'X' requested in your job script by:

#SBATCH --cpus-per-task=X
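
The same value can also be read on the shell side of a job script. This is a minimal sketch that assumes the job runs under Slurm, where SLURM_CPUS_ON_NODE is set, and falls back to 1 elsewhere; the names r_job_cores and NUM_CORES are illustrative and not part of R or Slurm.

```shell
#!/bin/bash
# r_job_cores (hypothetical helper): print the number of cores the scheduler
# assigned to this job, or 1 when not running inside a Slurm allocation.
r_job_cores() {
    echo "${SLURM_CPUS_ON_NODE:-1}"
}

# NUM_CORES is an illustrative name; an R script could read it with
# Sys.getenv("NUM_CORES") instead of calling parallel::detectCores().
NUM_CORES="$(r_job_cores)"
export NUM_CORES
echo "Cores available to R: ${NUM_CORES}"
```
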
  • Default RData format

In R-3.6.0 the default serialization format used to save RData files was changed to version 3 (RDX3), so R versions prior to 3.5.0 will not be able to open such files. Keep this in mind if you copy RData files from HiPerGator to an external system with an old R installed.

  • Java

rJava users need to load the java module manually with 'module load java/1.7.0_79'

  • TMPDIR

If temporary files are produced, they may fill up the memory disks on HPG2 nodes and cause node and job failures. Use something like

mkdir -p tmp
export TMPDIR=$(pwd)/tmp

in your job script to prevent this, and launch the job from the intended working directory rather than from your home directory.
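
One way to keep temporary files from concurrent jobs apart is a per-job subdirectory. The sketch below is an assumption layered on the advice above, not a site requirement: it uses SLURM_JOB_ID when that variable is set and the placeholder name "manual" otherwise. R reads TMPDIR when it starts, so the export has to come before Rscript is called.

```shell
#!/bin/bash
# Build a per-job scratch directory under the launch directory.
# "manual" is a placeholder used when the script runs outside Slurm.
job_tmp="$(pwd)/tmp/${SLURM_JOB_ID:-manual}"
mkdir -p "$job_tmp"

# R reads TMPDIR at startup and places tempdir() beneath it.
export TMPDIR="$job_tmp"
echo "Temporary files will go to: ${TMPDIR}"
```

Removing "$job_tmp" at the end of the job script keeps the tmp directory from accumulating files across runs.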

For users of PHI and FERPA: It is particularly important to set your working and TMPDIR directories to be in your project's PHI/FERPA configured directory in /blue when working with R. Writing files to /home or $TMPDIR could expose restricted data to unauthorized users.
  • Tasks vs Cores for parallel runs

Parallel threads in an R job will be bound to the same CPU core even if multiple tasks (ntasks) are specified in the job script. Request cores with cpus-per-task instead so that the R 'parallel' package can use them. For example, for an 8-thread parallel job use the following resource request in your job script:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

See the single-threaded and multi-threaded examples on the Sample SLURM Scripts page for more details.
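
A job script can check its own request before starting R. The following sketch assumes the standard Slurm variables SLURM_NTASKS and SLURM_CPUS_PER_TASK (the latter is only set when --cpus-per-task is given); the function name check_r_parallel_request is hypothetical.

```shell
#!/bin/bash
# check_r_parallel_request (hypothetical helper): warn when the job asked for
# several tasks but only one core per task, the layout that leaves R's
# parallel threads sharing a single core.
check_r_parallel_request() {
    ntasks="${SLURM_NTASKS:-1}"
    cpus="${SLURM_CPUS_PER_TASK:-1}"
    if [ "$ntasks" -gt 1 ] && [ "$cpus" -le 1 ]; then
        echo "Warning: ntasks=${ntasks} with cpus-per-task=${cpus};" \
             "use --ntasks=1 and --cpus-per-task=N for threaded R code" >&2
        return 1
    fi
    return 0
}

# Stop early instead of running a mis-sized job.
check_r_parallel_request || exit 1
```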

Job Script Examples


#!/bin/bash
#SBATCH --job-name=R_test   # Job name
#SBATCH --mail-type=END,FAIL   # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=ENTER_YOUR_EMAIL_HERE   # Where to send mail
#SBATCH --ntasks=1   # Run a single task
#SBATCH --mem=1gb   # Memory per node
#SBATCH --time=00:05:00   # Walltime (hh:mm:ss)
#SBATCH --output=r_job.%j.out   # Output file (%j expands to the job ID)

# Record the time and compute node the job ran on
date; hostname; pwd

# Use modules to load the environment for R
module load R

# Run the R script
Rscript myRscript.R

date

Performance

We have benchmarked our most recent installed R version (3.0.2) built with the included blas/lapack libraries versus the newest (as of April 2015) release 3.2.0 built with Intel MKL libraries on the HiPerGator1 hardware (AMD Abu Dhabi 2.4GHz CPUs) and the Intel Haswell 2.3GHz CPUs we're testing for possible usage in HiPerGator2. The results are presented in the R Benchmark 2.5 table.

Rmpi Example

See the R MPI Example page for an example of Rmpi code.

Installed Libraries

You can install your own libraries to use with R. These are stored in your /home/ environment. For details visit our Applications FAQ and see the section "How do I install R packages?".

Make sure the directory for that version of R is created or R will try to install to a system path and fail. E.g. for R/4.3 run the following command before attempting to install a package:

mkdir -p ~/R/x86_64-pc-linux-gnu-library/4.3

You can set a custom library path with the R_LIBS_USER environment variable. From https://cran.r-project.org/web/packages/startup/vignettes/startup-intro.html:

"R_LIBS_USER - user's library path, e.g. R_LIBS_USER=~/R/%p-library/%v is the folder specification used by default on all platforms and and R version. The folder must exist, otherwise it is ignored by R. The %p (platform) and %v (version) parts are R-specific conversion specifiers."
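
Because R ignores R_LIBS_USER when the folder is missing, it helps to create the folder and set the variable in one step. This is a sketch only; the function name use_r_library and the example path are assumptions, not part of R or of the site configuration.

```shell
#!/bin/bash
# use_r_library (hypothetical helper): create DIR if needed and point
# R_LIBS_USER at it. R ignores R_LIBS_USER when the folder does not exist.
use_r_library() {
    lib_dir="$1"
    mkdir -p "$lib_dir" || return 1
    export R_LIBS_USER="$lib_dir"
}

# Usage (example path only; adjust the location and R version):
#   use_r_library "$HOME/R/custom-library/4.3"
```

Call the function before starting R or Rscript in the same shell so that the exported value is inherited.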

To see a list of installed libraries in the currently loaded version of R:

$ R
> installed.packages()

Note: Many of the packages in the R library shown below are installed as part of the Bioconductor meta-library. The list is generated from the default R version.
