HiPerGator Metrics: Difference between revisions
m Israel.herrera moved page HiPerGator Status to HiPerGator Metrics without leaving a redirect
[[Category:Infrastructure]][[Category:Documentation]]
==Accessing the HiPerGator Status Dashboard==
#You must have a valid HiPerGator account. If you need to request an account, see the [https://www.rc.ufl.edu/access/account-request/ Account Request] page.
#Use your browser to access https://metrics.rc.ufl.edu
#You will be directed to the UF GatorLink login page. (This step may be skipped if you have already authenticated to other UF resources.)
#Once authenticated, you will be shown a Grafana login page.
#Enter your GatorLink credentials and click Log In.
#You will be directed to the HiPerGator Status dashboard, which should show charts like the ones below.
If you do '''not''' land on this page, please contact [mailto:support@rc.ufl.edu Support] or file a [https://support.rc.ufl.edu Bugzilla] ticket.
==Dashboard Panels Explained==
{|cellpadding="20"
|-style="vertical-align:top;"
| style="width: 50%"|
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2 Number of Users]===
[[Image:Num users.png|right | frameless | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2]]
This panel shows the number of users per login node.<br>
There is a visual threshold set at 75 users. Values over this threshold do not necessarily indicate a problem.<br>
This panel can be used to identify an unbalanced load across the login nodes.<br>
If a node has zero or very few users, it is likely being drained for maintenance and/or has been removed from the pool.<br>
||
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2 5 Minute Load Average]===
[[Image:5 min load ave.png|frameless|right | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2]]
This panel shows the 5-minute load average of each login node.<br>
A red threshold line marks a higher-than-expected load average.<br>
Being over the threshold does not necessarily imply a problem with that node.<br>
As with all panels, you can hover your mouse over a point on the line to display the value at that point.<br>
If a value is not displayed, move your cursor slightly left or right.<br>
|-
|
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&viewPanel=14 CPU Status]===
[[Image:Cpu status.png|frameless|right | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&viewPanel=14]]
This panel shows the number of CPUs that are allocated, idle, and reserved.<br>
It also displays both the current value and the average value. The average is calculated over the chosen time range.<br>
Idle CPUs are available to be scheduled.<br>
Reserved CPUs are generally unavailable for scheduling and most often are held for system maintenance.<br>
||
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=4 1 Minute Load Average]===
[[Image:1 min load ave.png|frameless|right | x175px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=4]]
This panel shows the 1-minute load average of the login nodes.<br>
It is similar to the 5-minute average, but it can spike higher at times.<br>
If the 1-minute load average is high but the 5-minute average is not, that indicates a short spike of activity.<br>
If both the 1-minute and 5-minute averages are high, it is possible that someone has started a CPU-intensive process.<br>
A threshold value is also set to indicate a higher-than-expected load.<br>
|-
|
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=12 GPU Allocation]===
[[Image:Gpu status.png |right | frameless | x125px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=12]]
This panel shows the allocated percentage of GPUs per product family and overall.<br>
It also displays both the current and average allocation, calculated over the chosen time range.<br>
The overall allocation includes all GPU families.<br>
||
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=8 Slurm Jobs Started per Minute]===
[[Image:Slurm job starts.png|frameless|right | x125px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=8]]
This panel shows the number of jobs started by the Slurm scheduler every minute.<br>
To view the value of a specific point, you can hover your mouse over a bar.<br>
Each bar should display a value.<br>
|}
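The load-average guidance in the panel descriptions above (a high 1-minute value alone suggests a short spike; high 1-minute and 5-minute values together suggest a sustained, possibly CPU-intensive process) can be sketched in a few lines of Python. This is only an illustration, not how the dashboard itself is implemented: the function name <code>classify_load</code>, the labels it returns, and the threshold of 8.0 are hypothetical, since this page does not state the actual threshold value. The "subsiding" case is likewise an assumption that the panel text does not describe.

```python
import os

# Hypothetical threshold: the dashboard's real threshold is not stated on this page.
LOAD_THRESHOLD = 8.0


def classify_load(one_min, five_min, threshold=LOAD_THRESHOLD):
    """Label a (1-minute, 5-minute) load-average pair.

    Follows the reasoning in the panel descriptions:
    - only the 1-minute average is high -> short spike of activity
    - both averages are high            -> sustained load
    """
    high_one = one_min > threshold
    high_five = five_min > threshold
    if high_one and high_five:
        return "sustained"
    if high_one:
        return "short spike"
    if high_five:
        # Assumption, not from the panel text: load was high recently but is dropping.
        return "subsiding"
    return "normal"


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    one_min, five_min, _fifteen_min = os.getloadavg()
    print(classify_load(one_min, five_min))
```

As the panel descriptions note, a value over the threshold does not necessarily indicate a problem, so a label such as "sustained" is a prompt to look closer, not a diagnosis.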
Latest revision as of 15:53, 10 January 2023
Dashboard Panels Explained
The dashboard has several restrictions, but there are some areas that can be changed.