FAQ

Latest revision as of 15:49, 26 August 2024

  |__TOC__

  |}
==Applications==

For questions about specific software such as Python, Open OnDemand, or Custom Installations, visit [[Applications FAQ]].

==Accounts and Investments==

'''Q:''' How do I get a HiPerGator account?

:'''A:''' HPG accounts must be requested via the [https://www.rc.ufl.edu/get-started/hipergator/request-hipergator-account/ account request form] and receive a valid sponsor's approval.
 
'''Q:''' How do I purchase HiPerGator resources or reinvest in expired allocations?

:'''A:''' If you're a sponsor or account manager, please fill out a purchase form at https://www.rc.ufl.edu/get-started/purchase-allocation/
  
 
'''Q:''' How do I add users to a group?

:'''A:''' All users must submit a ticket via the [https://support.rc.ufl.edu/enter_bug.cgi RC Support Ticketing System] with a subject line in a format similar to '''"Add (username) to (groupname) group"''' in order to gain access to a given group.
  
 
'''Q:''' I can't log in to my HPG account.

:'''A:''' Visit our [https://help.rc.ufl.edu/doc/Blocked_Accounts Blocked Accounts] wiki page.
'''Q:''' How can I find out which allocations have expired or are about to expire?

:'''A:''' Please use the "showAllocation" tool in the 'ufrc' environment module. See [[UFRC_environment_module]] for a reference on all HPG tools.


'''Q:''' How many CPUs/GPUs can I use?

:'''A:''' Load the "ufrc" module to run the "slurmInfo" command, which shows the resources available to your groups. You can use even more resources by choosing a burst QOS. Learn more at [https://help.rc.ufl.edu/doc/Account_and_QOS_limits_under_SLURM Account and QOS limits under SLURM]. For more information about HiPerGator's SLURM scheduler, please visit the [https://help.rc.ufl.edu/doc/Category:Scheduler Scheduler category] in our Help Wiki.
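As a quick sketch (the group name is a placeholder), you can check your allocations and available resources from a terminal session:

<pre>
$ module load ufrc
$ showAllocation
$ slurmInfo -g mygroup
</pre>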
==Scheduler==

===Memory Use===

'''Q:''' What do 'OOM', 'oom-kill event(s)', or 'out of memory' errors in the job log mean?

:'''A:''' Short answer: request more memory when you resubmit the job. Long answer: each HiPerGator job/session runs with CPU core, memory, and time limits set by the job resource request. Exceeding either the memory or the time limit will result in the termination of the job, whereas the CPU core limit can severely affect the performance of the job in some cases but will not terminate it. See [[Account and QOS limits under SLURM]] for a thorough explanation of resource limits. Read [[Out Of Memory]] for additional considerations.
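For example, to increase a job's memory request in the submission script before resubmitting (the value is illustrative):

<pre>
#SBATCH --mem=16gb
</pre>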
  
===GPU Use===

'''Q:''' Why has my GPU job been pending in the SLURM queue for a long time?

:'''A:'''
* All of your group's allocated GPUs may be in use.
* You are requesting more than the 8 GPUs that are available on a single node in a single-node job.
* Your job is requesting one or more A100 GPUs. The A100 GPUs on HiPerGator are in extremely high demand, so jobs requesting A100 GPUs should expect long pending times. However, there are typically a large number of available GeForce 2080Ti GPUs, so jobs requesting 2080Ti GPUs are expected to start promptly. For reference, the 2080Ti GPUs have 11GB of onboard memory compared to 80GB in A100 cards.

See [[GPU Access]] and [[Slurm and GPU Use]] for more information on the hardware and selecting a GPU for a job. Use the 'slurmInfo' command to see your group's current GPU usage.
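As an illustrative sketch (the GPU type token and count are examples, not a complete job script — see [[Slurm and GPU Use]] for the exact syntax), a 2080Ti request in a submission script might look like:

<pre>
#SBATCH --partition=gpu
#SBATCH --gpus=geforce:1
</pre>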
  
'''Q:''' What's the difference between the GPU and HWGUI partitions?

:'''A:''' HWGUI partitions are technically GPU partitions, but they are dedicated to visualization for software whose GUI requires hardware acceleration, rather than to high-performance computing the way the GPU partitions are.
==Storage==

'''Q:''' I can't see my (or my group's) /blue or /orange folders!

:'''A:''' If you list /blue or /orange, you won't see your group's directory tree. It is automatically connected (mounted) when you try to access it in any way, e.g. by using an 'ls' or 'cd' command. For example, if your group name is 'mygroup', you should list or cd into /blue/mygroup or /orange/mygroup. If you are using Jupyter Notebook or other GUI or web applications that make it difficult to browse to a specific path, you can [[Jupyter_Notebooks#Create_the_Link | create a symlink (shortcut)]]. Example: <code>ln -s path_to_link_to name_of_link</code>

See also this short video: https://web.microsoftstream.com/video/87698fe6-84df-40dc-9d22-c3a6c63820fa
  
'''Q:''' Why do I see "No Space Left" in job output or an application error?

:'''A:''' If you see a 'No Space Left' or similar message (no quota remaining, etc.), check the path(s) in the error message closely for 'home', 'orange', 'blue', or 'red', and check the respective quota for that filesystem. All quota commands are in the [[UFRC_environment_module|'ufrc' environment module]] and include 'home_quota', 'blue_quota', and 'orange_quota'. See [[Getting Started]] and [[Storage]] for more help.
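As a quick sketch, you can check each filesystem's quota from a terminal (the commands are provided by the 'ufrc' module):

<pre>
$ module load ufrc
$ home_quota
$ blue_quota
$ orange_quota
</pre>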
  
A convenient interactive tool to see what's taking up the storage quota is the '''ncdu''' command in a bash terminal (also available in the "ufrc" module). You can run that command and then delete data, or move it to a different storage tier, to free up space.

If the data taking up most of the space is related to application environments and packages such as conda, pip, or singularity, you can modify your configuration files to change the default directories for custom installs. You can find more information about the .condarc setup here: [https://help.rc.ufl.edu/doc/Conda Conda]. You can also select a directory outside your $HOME to store python packages and modules when running "pip install": <code>pip install --install-option="--prefix=/some/path/" package_name</code>. For more info see https://help.rc.ufl.edu/doc/Python
 
  
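For example, a minimal ~/.condarc that redirects conda's package cache and environments to blue storage might look like this (the paths are illustrative; see the [https://help.rc.ufl.edu/doc/Conda Conda] page for the recommended setup):

<pre>
pkgs_dirs:
  - /blue/mygroup/myuser/conda/pkgs
envs_dirs:
  - /blue/mygroup/myuser/conda/envs
</pre>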
 
==Performance==

'''Q:''' Why is HiPerGator running so slow?

:'''A:''' There are many reasons why users may experience unusually low performance while using HPG. First, users should ensure that the performance issues do not originate from their Internet service provider, home network, or personal devices.
  
 
Once the possible causes above are ruled out, users should report the issue as soon as possible via the [https://support.rc.ufl.edu/enter_bug.cgi RC Support Ticketing System]. When reporting the issue, please include detailed information such as:

* Time when the issue occurred
* JobID
* Nodes being used, i.e. username@hpg-node$. '''Note:''' Login nodes are not considered high-performance nodes, and intensive jobs should not be executed from them.
* Paths, file names, etc.
* Operating system
* Method for accessing HPG: JupyterHub, Open OnDemand, or the terminal interface used
'''Q:''' How long can I run my job for?

:'''A:''' There are default time limits set in the different partitions. However, users can set their own time limit using the "--time=" flag, e.g. <code>#SBATCH --time=4-00:00:00</code>. '''Note:''' walltime is specified as hh:mm:ss or d-hh:mm:ss.
:For more details visit [https://help.rc.ufl.edu/doc/SLURM_Partition_Limits SLURM Partition Limits]
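Putting the resource requests together, a minimal submission script might look like this (the values are illustrative, not recommendations):

<pre>
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8gb
#SBATCH --time=04:00:00

module load ufrc
echo "Running on $(hostname)"
</pre>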
  
'''Q:''' Are there profiling tools installed on HiPerGator that help identify performance bottlenecks?

:'''A:''' [[REMORA]] is the most generic profiling tool we have on the cluster. More specific tools depend on the application stack or the language, e.g. cProfile for Python code, [[Nsight]] Compute for CUDA apps, or VTune for C/C++ and MPI code.
  
'''Q:''' Why is my job still pending?

:'''A:''' According to the SLURM documentation, when a job cannot be started, a reason is immediately found and recorded in the job's "reason" field in the squeue output, and the scheduler moves on to consider the next job. Run <code>$ slurmInfo -g groupname</code> to see the current utilization for your group and cluster-wide.
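To check the recorded reason for a specific pending job (the username and job ID are placeholders), you can query squeue directly:

<pre>
$ squeue -u myuser
$ squeue -j 12345678 -o "%.18i %.9P %.8T %r"
</pre>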
  
 
Related article: [https://help.rc.ufl.edu/doc/Account_and_QOS_limits_under_SLURM Account and QOS limits under SLURM]
  
 
* Common reasons why jobs are pending

{|cellpadding="20"
|-
! Reason !! Explanation
|-
| Priority || Resources are being reserved for a higher-priority job. This is particularly common for Burst QOS jobs.
|-
| Resources || Required resources are in use
|-
| Dependency || Job dependencies not yet satisfied
|-
| Reservation || Waiting for an advanced reservation
|-
| AssociationJobLimit || User or account job limit reached
|-
| AssociationResourceLimit || User or account resource limit reached
|-
| AssociationTimeLimit || User or account time limit reached
|-
| QOSJobLimit || Quality Of Service (QOS) job limit reached
|-
| QOSResourceLimit || Quality Of Service (QOS) resource limit reached
|-
| QOSTimeLimit || Quality Of Service (QOS) time limit reached
|}