HPG Computation: Difference between revisions

Latest revision as of 16:40, 16 September 2024

Back to Getting Started

HiPerGator Etiquette

Only run workloads on compute nodes. Do not run scripts or applications on the login nodes beyond a small quick test. Use sbatch, srundev, salloc, or srun to start a session on a compute node instead.

Only run workloads from blue storage. Blue is a fast storage system that can handle the I/O involved in research workloads. Before using sbatch or launching a workload interactively, make sure your working directory is a blue file path, e.g. /blue/<group>/<user>, and not your /orange or /home directory (~ or /home/<user>). Use pwd to print your working directory.

Home directory is only for storing user-readable files. Your 40GB home directory is the only storage on HiPerGator for which file recovery may be possible, so keep copies of scripts, configurations, or other important files here. If you need to recover a file from your home directory, versions of files may be available for the previous 7 days (see information on home directory snapshots and recovering files). If you need backups of important data, you will need to purchase backup services.

Do not install new software when using existing modules. This will cause errors when using our software environments, because the new installation is in your local folder. If you need to install software, create a Conda virtual environment, or open a support ticket to request RC to install the software in one of our software environments if it will be widely used.

Do not run workloads from orange storage. Orange is intended as long-term, archival storage of data you currently do not use. It cannot handle the high-throughput requirements of high-performance computing workloads.

Do not request excessive resources. This includes CPUs, GPUs, and memory. Job emails include summary estimates of memory use; however, active monitoring will help you understand resource requirements. Applications often require special commands, arguments, or configurations to run in parallel. Therefore, you will likely need to do more than request multiple CPUs or GPUs for a workload to put those resources to use.
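The working-directory check recommended in the blue storage item above can be scripted. The following is a minimal sketch, not part of HiPerGator's tooling: the function name on_blue is hypothetical, and it assumes that every blue path starts with /blue/, as in the example path /blue/<group>/<user>.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when the given directory (default: the
# current working directory) is on blue storage.
# Assumption: blue storage paths always begin with /blue/.
on_blue() {
    case "${1:-$PWD}" in
        /blue/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Report where we are before submitting or launching anything.
if on_blue; then
    echo "OK: $PWD is on blue storage"
else
    echo "Warning: $PWD is not on blue storage; cd to /blue/<group>/<user> first" >&2
fi
```

A check like this could be placed at the top of a job script or run by hand before sbatch. It only inspects the path string, so it says nothing about quota or permissions.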

Using installed software

Avoid using pip install outside of a conda environment, as it can cause package conflicts. Please look at Managing Python environments and Jupyter kernels for more details and alternatives.

The full list of software available for use can be viewed on the Installed Software page. Access to installed software is provided through Environment Modules.

The following command can be used to browse the full list of available modules, along with short descriptions of the applications they make available:

module spider

To load a module, use the following command:

module load <module_name>

In Jupyter Notebooks, kernels are available with our most popular software stacks. If you are unable to find what you need or would like software installed, please fill out a support request.

For more information on loading modules to access software, view the page on the basic usage of environment modules.

There are some useful commands and utilities in a 'ufrc' environment module in addition to installed applications.
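The search-then-load workflow described above can be wrapped in a small shell function. This is only a sketch: the function name find_and_load is hypothetical, and it assumes the module command is available in the shell where it runs, as it is in a HiPerGator login session. module list, which prints the currently loaded modules, is used as a final check.

```shell
#!/bin/bash
# Hypothetical helper: search for a module, load it, then show what is loaded.
# Assumes the "module" command is defined in the current shell.
find_and_load() {
    local name="$1"
    module spider "$name"   # list matching modules and their versions
    module load "$name"     # add the module to the current environment
    module list             # confirm which modules are now loaded
}
```

For example, find_and_load python would search for, load, and list a module named python; that module name is used here purely as an illustration.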

Interactive Testing or Development

You don't always need to use scripts to run code in the SLURM scheduler. When all you need is a quick shell session to run a command or two, write and/or test a job script, or compile some code, use SLURM Dev Sessions.
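As a rough illustration of what requesting a short interactive shell looks like, the sketch below prints an srun command line built from standard SLURM options. It is not the srundev wrapper mentioned above, which may apply different defaults; the function name dev_session_cmd and the resource values are assumptions chosen for the example.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an srun command for an interactive shell.
# Arguments: CPU cores, memory, time limit. The defaults are illustrative only.
dev_session_cmd() {
    local cpus="${1:-1}" mem="${2:-2G}" time="${3:-01:00:00}"
    echo "srun --ntasks=1 --cpus-per-task=${cpus} --mem=${mem} --time=${time} --pty bash -i"
}

# Example: 2 cores, 4 GB of memory, 30 minutes.
dev_session_cmd 2 4G 00:30:00
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; on a login node, the printed line can be copied and run to start the session on a compute node.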

Running Graphical Programs

It is possible to run programs that use a graphical user interface (GUI) on the system. However, doing so requires installing and configuring additional software on the client computer.

Please see the GUI Programs page for information on running graphical user interface applications at UFRC.

@@ Line 1: / Line 1: @@
+[[Category:Scheduler]]
 Back to [[Getting Started]] __NOTOC__
+See also [[HPG Scheduling]]
+{{Note|'''Warning:'''
+'''Do not run full-scale (normal) analyses on login nodes'''. [[Development and Testing]] is required reading. The main approach to running computational analyses is to write [[Sample SLURM Scripts|job scripts]] and send them to the [[SLURM_Commands|scheduler]] to run. Some interfaces, like [[Open OnDemand]], [[Jupyter#JupyterHub|JupyterHub]], and [[Galaxy]], can manage job scheduling behind the scenes and may be more convenient than job submission from the command line when appropriate.
+'''Only run workloads from blue storage.''' Blue is a fast storage system that can handle the I/O involved in research workloads. Before using <code>sbatch</code> or launching a workload interactively, make sure your working directory is a blue file path, e.g. <code>/blue/<group>/<user></code>, and not your /orange or /home directory (<code>~</code> or <code>/home/<user></code>). Use <code>pwd</code> to print your working directory.
+|warn}}
+For more information on how to get started using HiPerGator, visit our wiki category [https://help.rc.ufl.edu/doc/Category:Essentials Essentials], where you can find additional instructions, training, and tutorial videos.
+[[File:Video preview.png|frameless|link=https://mediasite.video.ufl.edu/Mediasite/Play/bc7ded7444ac4ad6b3b91cd42dc88ede1d]]
+''Working with SLURM on HPG''
 ==HiPerGator Etiquette==
 * '''Only run workloads on compute nodes.''' Do not run scripts or applications on the login nodes beyond a [[Development_and_Testing#Login_Nodes|small quick test]]. Use <code>sbatch</code>, <code>srundev</code>, <code>salloc</code>, or <code>srun</code> to start a session on a compute node instead.
-* '''Only run workloads from blue storage.''' This is a fast storage systems that can handle the I/O involved in research workloads. Before using <code>sbatch</code> or launching a workload interactively, make sure your working directory is a blue file path, e.g. <code>/blue/<group>/<user></code>, and not your home directory (<code>~</code> or <code>/home/<user></code>). Use <code>pwd</code> to print working directory.
+* '''Only run workloads from blue storage.''' Blue is a fast storage system that can handle the I/O involved in research workloads. Before using <code>sbatch</code> or launching a workload interactively, make sure your working directory is a blue file path, e.g. <code>/blue/<group>/<user></code>, and not your /orange or /home directory (<code>~</code> or <code>/home/<user></code>). Use <code>pwd</code> to print your working directory.
 * '''Home directory is only for storing user-readable files.''' Your 40GB home directory is the only storage on HiPerGator for which file recovery may be possible, so keep copies of scripts, configurations, or other important files here. If you need to recover a file from your home directory, versions of files '''may''' be available for the previous 7 days ([[Snapshots|see information on home directory snapshots and recovering files]]). If you need backups of important data, you will need to purchase backup services.
@@ Line 14: / Line 32: @@
 ==Using installed software==
-{{Note|Try to avoid pip install as it installs locally with can cause issues. Please look at  [[Managing Python environments and Jupyter kernels]] for more details and alternatives.|warn}}
+{{Note|Avoid using pip install outside of a conda environment, as it can cause package conflicts. Please look at  [[Managing Python environments and Jupyter kernels]] for more details and alternatives.|warn}}
 The full list of software available for use can be viewed on the [[Installed_Software|Installed Software]] page. Access to installed software is provided through [[Modules|Environment Modules]].
@@ Line 27: / Line 45: @@
   module load <module_name>
 |}
-In Jupyter Notebooks, kernels are available with our most popular software stacks. If you are unable to find what you need or would like software installed, please fill out a [https://www.rc.ufl.edu/help/support-requests/ help request].
+In Jupyter Notebooks, kernels are available with our most popular software stacks. If you are unable to find what you need or would like software installed, please fill out a [https://support.rc.ufl.edu/ support request].
@@ Line 35: / Line 53: @@
 ==Interactive Testing or Development==
-You don't always have to use the SLURM scheduler. When all you need is a quick shell session to run a command or two, write and/or test a job script, or compile some code use [[Development_and_Testing|SLURM Dev Sessions]].
+You don't always need to use scripts to run code in the SLURM scheduler. When all you need is a quick shell session to run a command or two, write and/or test a job script, or compile some code, use [[Development_and_Testing|SLURM Dev Sessions]].
 ==Running Graphical Programs==
@@ Line 41: / Line 59: @@
 Please see the [[GUI_Programs|GUI Programs]] page for information on running graphical user interface applications at UFRC.
-==Scheduling computational jobs==
-UFRC uses the Simple Linux Utility for Resource Management, or '''SLURM''', to allocate resources and schedule jobs. Users can create SLURM job scripts to submit jobs to the system. These scripts can, and should, be modified in order to control several aspects of your job, like resource allocation, email notifications, or an output destination.
-* See the [[Annotated SLURM Script]] for a walk-through of the basic components of a SLURM job script
-* See the [[Sample SLURM Scripts]] for several SLURM job script examples
-{| cellpadding = "20"
-|-
-|
-To submit a job script from one of the login nodes accessed via hpg.rc.ufl.edu, use the following command:
- $ sbatch <your_job_script>
-||
-To check the status of submitted jobs, use the following command:
- $ squeue -u <username>
-|}
-View [[SLURM Commands]] for more useful SLURM commands.
-===Managing Cores and Memory===
-See [[Account and QOS limits under SLURM]] for the main documentation on efficient management of computational resources, and an extensive explanation of QOS and SLURM account use.
-The amount of resources within an investment is calculated in NCU (Normalized Computing Units), which correspond to 1 CPU core and about 3.5GB of memory for each NCU purchased. CPUs (cores) and RAM are allocated to jobs independently as requested by your job script.
-Your group's investment can run out of **cores** (SLURM may show <code>QOSGrpCpuLimit</code> in the reason a job is pending) OR **memory** (SLURM may show <code>QOSGrpMemLimit</code> in the reason a job is pending) depending on current use by running jobs.
-The majority of HiPerGator nodes have the same ratio of about 4 GB of RAM per core, which, after accounting for the operating system and system services, leaves about 3.5 GB usable for jobs; hence the ratio of 1 core and 3.5GB RAM per NCU.
-Most HiPerGator nodes have 128 CPU cores and 1000GB of RAM. The '''bigmem''' nodes go up to 4TB of available memory. See [[Available_Node_Features]] for the exact data on resources available on all types of nodes on HiPerGator.
-You must specify both the number of cores and the amount of RAM needed in the job script for SLURM with the <code>--mem</code> (total job memory) or <code>--mem-per-cpu</code> (per-core memory) options. Otherwise, the job will be assigned the default 600mb of memory.
-If you need more than 128 GB of RAM, you can only run on the older nodes, which have 256 GB of RAM, or on the bigmem nodes, which have up to 1.5 TB of RAM.
-==Monitoring Your Workloads==
-You can see presently running workloads with the squeue command e.g.
- $ squeuemine
-OpenOnDemand offers a method to monitor jobs using the Jobs menu in the upper toolbar on your dashboard. This will show your current running, pending, and recently completed jobs.  Select: Jobs -> Active Jobs from the upper dashboard menu.
-We provide a number of helpful commands in the [[UFRC_environment_module|UFRC module]]. The <code>ufrc</code> module is loaded by default at login, but you can also load the <code>ufrc</code> module with the following command:
- $ module load ufrc
-Examples of commands for SLURM or HiPerGator specific UFRC environment module
-<div class="mw-collapsible mw-collapsed" style="width:70%; padding: 5px; border: 1px solid gray;">
-''Expand this section to view Examples of commands''
-<div class="mw-collapsible-content" style="padding: 5px;">
-<pre>
-$ slurmInfo           - displays resource usage for your group
-$ slurmInfo -p        - displays resource usage per partition
-$ showQos             - displays your available QoS
-$ home_quota          - displays your /home quota
-$ blue_quota          - displays your /blue quota
-$ orange_quota        - displays your /orange quota
-$ sacct               - displays job id and state of your recent workloads
-$ nodeInfo            - displays partitions by node types, showing total RAM and other features
-$ sinfo -p partition  - displays the status of nodes in a partition
-$ jobhtop             - displays resource usage for jobs
-$ jobnvtop            - displays resource usage for GPU jobs
-$ which python        - displays path to the Python install of the environment modules you have loaded
-</pre>
-</div>
-</div>