New user training

This page mirrors and expands upon the content provided in the New User Training module in myTraining. The New User Training module is required for all new account holders within two weeks of obtaining a new account. Users who do not complete the training will have their account deactivated until the training is completed.

Training Objectives

Recognize the role of Research Computing, utilize HiPerGator as a research tool and select appropriate resource allocations for analyses
Log into HiPerGatos using an ssh client
Describe appropriate use of the login servers and how to request resources for work beyond those limits
Describe HiPerGator's three main storage systems and the appropriate use for each
Use the module system for loading application environments
Locate where to receive user support
Identify common user mistakes and how to avoid them.

Module 1: Introduction to Research Computing and HiPerGator

HiPerGator

46,000 cores
Hundreds of GPUs
10 Petabytes of storage
New HiPerGator AI cluster will add
- 1,120 NVIDIA A100 GPUs
- 17,000 AMD Rome Epyc Cores

For additional information visit our website: https://www.rc.ufl.edu/

Summary: HiPerGator is a large, high-performance compute cluster capable of tackling some of the largest computational challenges, but users need to understand how to responsibly and efficiently use the resources.

Investor Supported

HiPerGator is heavily subsidized by the university, but we do require faculty researchers to make investments for access. Research Computing sell three main products:

Compute: NCUs (Normalized Compute Units)
- 1 CPU core and 3.5 GB of RAM
Storage:
- Blue: High-performance
- Orange: Intended for archival use
GPUs
- Sold in units of GPU cards
- NCU investment also required to make use of GPU

Investments can either be hardware investments, lasting for 5-years or service investments lasting 3-months or longer.

Price sheets are located here: https://www.rc.ufl.edu/services/rates/
Submit a purchase request here: https://www.rc.ufl.edu/services/purchase-request/

Module 2: How to Access and Run Jobs

Cluster Components

Accessing HiPerGator

ssh to host hpg.rc.ufl.edu
jhub.rc.ufl.edu (requires UF Network)
- See also overview video.
galaxy.rc.ufl.edu
Open on Demand: ood.rc.ufl.edu (requires UF Network)
- See also overview video.

Proper use of Login Nodes

Generally speaking, interactive work other than managing jobs and data is discouraged on the login nodes. Login nodes are intended for file and job management, and short-duration testing and development.

See more information here.

Acceptable use limits:

No more than 16-cores
No longer than 10 minutes (wall time)
No more than 64 GB of RAM

Resources for Scheduling a Job

For use beyond what is acceptable on the login servers, you can request resources on development servers, GPUs servers, through JupyterHub, Galaxy, Graphical User Interface servers via open on demand or submit batch jobs. All of these services work with the scheduler to allocate your requested resources so that your computations run efficiently and do not impact other users.

Scheduling a Job

Understand the resources that your analysis will use:
- CPUs: Can your job use multiple CPU cores? Does it scale?
- Memory: How much RAM will it use? Requesting more will not make your job run faster!
- GPUs: Does your application use GPUs?
- Time: How long will it run?
Request those resources:
- See sample job scripts
- Watch the HiPerGator: SLURM Submission Scripts training video. This video is approximately 30 minutes and includes a demonstration
- Watch the HiPerGator: SLURM Submission Scripts for MPI Jobs training video. This video is approximately 26 minutes and includes a demonstration
- Open on Demand, JupyterHub and Galaxy all have other mechanisms to request resources as SLURM needs this information to schedule your job.
Submit the Job
- Either using sbatch on the command line or through on of the interfaces
- Once your job is submitted, SLURM will check that there are resources available in your group and schedule the job to run
Run
- SLURM will work through the queue and run your job

Module 3: HiPerGator Storage

Three Locations for Storage

The storage systems are reviewed on this page.

Home Storage: /home/<user>
- Each user has 40GB of space
- Good for scripts, code and compiled applications
- Do not use for job input/output
- Snapshots are available for seven days
Blue Storage: /blue/</group>
- Our highest-performance filesystem
- All input/output from jobs should go here
Orange Storage: /orange/<group>
- Slower than /blue
- Not intended for large I/O for jobs
- Primarily for archival purposes

Backup and Quotas

Unless purchased separately, nothing is backed up on HiPerGator!!
- This page describes the tape backup options and costs.
Orange and Blue storage quotas are at the group level and based on investment.
- See price sheets
- Submit a purchase request

Directories

Directories on the Orange and Blue filesystems are automounted.
Your group's directory may not appear until you access it.
Remember
- Directories may not show up in an ls of /blue or /orange
- You may need to type the path in SFTP clients or Glubus for it to appear
- You cannot always use tab completion in the shell

Module System

HiPerGator uses the lmod environment module system
For applications, compilers or interpreters, load the corresponding module

Module 4: Common User Mistakes and How to Avoid Them

Running resource intensive applications on the login nodes
- Submit a batch job: see Sample Slurm Scripts
- Request a development session for testing and development
Writing to /home during batch job execution
- Use /blue/<group> for job input/output. See Storage.
Wasting resources
- Understand CPU and memory needs of your application
- Over-requesting resources generally does not make your application run faster--it prevents other users from accessing resources.
Blindly copying lab mate scripts
- Make sure you understand what borrowed scripts do
- Many users copy previous lab mate scripts, but do not understand the details
- This often leads to wasted resources
Misunderstanding the group investment limits and the burst QOS
- Each group has specific limits
- Burst jobs are run as idle resources are available
  - The slurmInfo command in the ufrc module can show your group's investment and current use.

Module 5: How to get Help

Submit support requests at: https://support.rc.ufl.edu/
For problems with running jobs, provide:
- JobID number(s)
- Path(s) to job scripts and output
- As much detailed information as you can about your problem
For requests to install an application, provide:
- Name of application
- URL to download the application