Difference between revisions of "HiPerGator Metrics"

From UFRC
m (fix typo)
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[Category:Infrastructure]][[Category:Docs]]
+
{|align=right
 +
  |__TOC__
 +
  |}
 +
[[Category:Infrastructure]][[Category:Documentation‏]]
 
==Accessing the HiPerGator Status Dashboard==
 
==Accessing the HiPerGator Status Dashboard==
1. You must either be on the Campus VPN or a Campus Network to reach the site.<br>
+
 
2. You must also have a valid HiPerGator account. If you need to request an account, see the [https://www.rc.ufl.edu/access/account-request/ Account Request] page.<br>
+
#You must have a valid HiPerGator account. If you need to request an account, see the [https://www.rc.ufl.edu/access/account-request/ Account Request] page.
3. Use your browser to access https://metrics.rc.ufl.edu<br>
+
#Use your browser to access https://metrics.rc.ufl.edu
4. You will be directed to the UF GatorLink login page (it's possible this step will be skipped if you have already authenticated to other UF resources)<br>
+
#You will be directed to the UF GatorLink login page (it's possible this step will be skipped if you have already authenticated to other UF resources)
5. Once authenticated, you will be shown a Grafana login page<br>
+
#Once authenticated, you will be shown a Grafana login page
[[File:Grafana Login.png|left|thumb]]
+
#Enter your GatorLink credentials and click Log In
<br><br><br><br><br><br><br><br><br><br>
+
#You will be directed to the HiPerGator Status dashboard which should have charts like the ones below.
6. Enter your GatorLink credentials and click Log In<br>
+
If you do '''not''' land on this page, please contact [mailto:support@rc.ufl.edu Support] or file a [https://support.rc.ufl.edu Bugzilla] ticket.
7. You will be directed to the HiPerGator Status dashboard which should look like this:
 
[[File:Grafana Landing.png|left|thumb]]
 
<br><br><br><br><br><br><br><br><br><br>
 
If you do '''not''' land on this page, please contact [mailto:support@rc.ufl.edu Support] or file a [https://support.rc.ufl.edu Bugzilla] ticket.<br>
 
  
 
==Dashboard Panels Explained==
 
==Dashboard Panels Explained==
===5 Minute Load Average===
+
{|cellpadding="20"
[[File:5 min load ave.png|thumb|left]]
+
|-style="vertical-align:top;"
 +
| style="width: 50%"|
 +
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2 Number of Users]===
 +
[[Image:Num users.png|right | frameless | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2]]
 +
This panel shows the number of users per login node.<br>
 +
There is a visual threshold set at 75 users. Values over this threshold do not necessarily indicate a problem.<br>
 +
This panel can be used to identify an unbalanced load across the login nodes.<br>
 +
Often, if a node has zero or very few users, it's likely the node is being drained for maintenance purposes and/or has been removed from the pool.<br>
 +
||
 +
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2 5 Minute Load Average]===
 +
[[Image:5 min load ave.png|frameless|right | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=2]]
 
This panel shows the 5 minute load average of each login node.<br>
 
This panel shows the 5 minute load average of each login node.<br>
 
A red threshold line has been added to indicate a higher than expected load average.<br>
 
A red threshold line has been added to indicate a higher than expected load average.<br>
Line 22: Line 31:
 
As with all panels, you can hover your mouse over a point on the line and it will display the value at that point.<br>
 
As with all panels, you can hover your mouse over a point on the line and it will display the value at that point.<br>
 
If a value is not displayed, move your cursor slightly left or right.<br>
 
If a value is not displayed, move your cursor slightly left or right.<br>
<br><br><br>
+
|-
===Slurm Jobs Started per Minute===
+
|
[[File:Slurm job starts.png|thumb|left]]
+
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&viewPanel=14 CPU Status]===
This panel shows the number of jobs started by the Slurm scheduler every minute.<br>
+
[[Image:Cpu status.png|frameless|right | x150px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&viewPanel=14]]
To view the value of a specific point, you can hover your mouse over a bar.<br>
 
Each bar should display a value.<br>
 
<br><br><br><br><br><br>
 
===GPU Availability===
 
[[File:Gpu availability.png|thumb|left]]
 
This panel shows how many GPUs of each product family are available for scheduling.<br>
 
Please note this does not factor in GPUs that have been reserved, so the actual number available could be less.<br>
 
<br><br><br><br><br><br><br><br><br>
 
===1 Minute Load Average===
 
[[File:1 min load ave.png|thumb|left]]
 
This panel shows the 1 minute load average of the login nodes.<br>
 
It is similar to the 5 minute average but this one can spike higher at times.<br>
 
If the 1 minute load average is high, but the 5 minute average is not, then it's an indication of a short spike of activity.<br>
 
If both the 1 and 5 minute averages are high, then it's possible someone has started a CPU intensive process.<br>
 
There is also a threshold value set to indicate a higher than expected load.<br>
 
<br><br><br><br><br><br><br>
 
===CPU Status===
 
[[File:Cpu status.png|thumb|left]]
 
 
This panel shows the number of CPUs that are allocated, idle and reserved.<br>
 
This panel shows the number of CPUs that are allocated, idle and reserved.<br>
 
It also displays both the current value and average value. Average is calculated based on chosen time range.<br>
 
It also displays both the current value and average value. Average is calculated based on chosen time range.<br>
Line 51: Line 42:
 
Idle CPUs are available to be scheduled.<br>
 
Idle CPUs are available to be scheduled.<br>
 
Reserved CPUs are generally unavailable for scheduling and most often are held for system maintenance.<br>
 
Reserved CPUs are generally unavailable for scheduling and most often are held for system maintenance.<br>
<br>
+
||
===GPU Allocation===
+
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=4 1 Minute Load Average]===
[[File:Gpu status.png|thumb|left]]
+
[[Image:1 min load ave.png|frameless|right | x175px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=4]]
 +
This panel shows the 1 minute load average of the login nodes.<br>
 +
It is similar to the 5 minute average but this one can spike higher at times.<br>
 +
If the 1 minute load average is high, but the 5 minute average is not, then it's an indication of a short spike of activity.<br>
 +
If both the 1 and 5 minute averages are high, then it's possible someone has started a CPU intensive process.<br>
 +
There is also a threshold value set to indicate a higher than expected load.<br>
 +
|-
 +
|
 +
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=12 GPU Allocation]===
 +
[[Image:Gpu status.png |right | frameless | x125px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=12]]
 
This panel shows the allocated percentage of GPUs per product family and overall.<br>
 
This panel shows the allocated percentage of GPUs per product family and overall.<br>
 
It also displays both the current and average allocation, calculated over the chosen time range.<br>
 
It also displays both the current and average allocation, calculated over the chosen time range.<br>
 
The overall allocation includes all GPU families.<br>
 
The overall allocation includes all GPU families.<br>
<br><br><br><br>
+
||
===Number of Users===
+
===[https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=8 Slurm Jobs Started per Minute]===
[[File:Num users.png|thumb|left]]
+
[[Image:Slurm job starts.png|frameless|right | x125px| link=https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status?orgId=2&refresh=5m&var-cluster=hipergator&var-partition=All&theme=dark&viewPanel=8]]
This panel shows the number of users per login node.<br>
+
This panel shows the number of jobs started by the Slurm scheduler every minute.<br>
There is a visual threshold set at 75 users. Values over this threshold do not necessarily indicate a problem.<br>
+
To view the value of a specific point, you can hover your mouse over a bar.<br>
This panel can be used to identify an unbalanced load across the login nodes.<br>
+
Each bar should display a value.<br>
Often, if a node has zero or very few users, it's likely the node is being drained for maintenance purposes and/or has been removed from the pool.<br>
+
|}
<br><br><br><br><br>
+
 
==General Dashboard Usage==
+
==Dashboard Panels Explained==
 
The dashboard has several restrictions, but there are some areas that can be changed.<br>
 
The dashboard has several restrictions, but there are some areas that can be changed.<br>
 +
{|cellpadding="20"
 +
|-style="vertical-align:top;"
 +
|
 
===Changing the Time Range===
 
===Changing the Time Range===
[[File:Time range.png|thumb|left]]
+
[[Image:Time range.png|framesless|right | 250px]]
 
You are able to change the time range of the dashboard by clicking on the box in the top right corner with the clock icon.<br>
 
You are able to change the time range of the dashboard by clicking on the box in the top right corner with the clock icon.<br>
 
It is set to a default of the "Last 3 hours". When you click this, you will be presented with several preset options.<br>
 
It is set to a default of the "Last 3 hours". When you click this, you will be presented with several preset options.<br>
 
It's best to use one of the presets. Be advised, when using a longer range, the dashboard panels may be more difficult to view.<br>
 
It's best to use one of the presets. Be advised, when using a longer range, the dashboard panels may be more difficult to view.<br>
<br><br><br><br><br>
+
<br><br><br>
 
===Changing the Refresh Frequency===
 
===Changing the Refresh Frequency===
[[File:Refresh.png|thumb|left]]
+
[[Image:Refresh.png|framesless|right | 250px]]
 
You are able to adjust how often the dashboard panels refresh.<br>
 
You are able to adjust how often the dashboard panels refresh.<br>
 
Click the icon in the top right with the refresh icon. It is set to a default of 5 minutes.<br>
 
Click the icon in the top right with the refresh icon. It is set to a default of 5 minutes.<br>
 
Be advised, making the refresh interval less than 5 minutes will increase the load on backend servers and may result in poor performance of the dashboard.<br>
 
Be advised, making the refresh interval less than 5 minutes will increase the load on backend servers and may result in poor performance of the dashboard.<br>
 
Also, some of the panels are only designed to collect data every 15 minutes, or longer, so shortening the refresh may have little to no effect.<br>
 
Also, some of the panels are only designed to collect data every 15 minutes, or longer, so shortening the refresh may have little to no effect.<br>
<br><br><br><br>
+
 
 +
||
 
===Changing the Partition===
 
===Changing the Partition===
[[File:Partition box.png|thumb|left]]
+
[[Image:Partition box.png|framesless|right | 250px]]
 
You can select partitions of interest from the drop down menu at the top left of the dashboard.<br>
 
You can select partitions of interest from the drop down menu at the top left of the dashboard.<br>
 
Currently, this will only have an effect on the CPU Status panel.<br>
 
Currently, this will only have an effect on the CPU Status panel.<br>
 
You may select multiple entries from the selector, or choose All (the default).<br>
 
You may select multiple entries from the selector, or choose All (the default).<br>
<br><br><br><br><br><br>
+
<br><br><br>
 
===Maximize a Panel===
 
===Maximize a Panel===
[[File:Full screen panel.png|thumb|left]]
+
[[Image:Full screen panel.png|framesless|right | 250px]]
 
If you want to view only a single panel, simply click on the panel title then click View.<br>
 
If you want to view only a single panel, simply click on the panel title then click View.<br>
 
This will present only that panel and fill the entire dashboard making it easier to read.<br>
 
This will present only that panel and fill the entire dashboard making it easier to read.<br>
 +
|}

Latest revision as of 15:53, 10 January 2023

Accessing the HiPerGator Status Dashboard

  1. You must have a valid HiPerGator account. If you need to request an account, see the Account Request page.
  2. Use your browser to access https://metrics.rc.ufl.edu
  3. You will be directed to the UF GatorLink login page (this step may be skipped if you have already authenticated to other UF resources).
  4. Once authenticated, you will be shown a Grafana login page.
  5. Enter your GatorLink credentials and click Log In.
  6. You will be directed to the HiPerGator Status dashboard, which should show charts like the ones below.

If you do not land on this page, please contact Support or file a Bugzilla ticket.

Dashboard Panels Explained

Number of Users

Num users.png

This panel shows the number of users per login node.
A visual threshold is set at 75 users. Values over this threshold do not necessarily indicate a problem.
This panel can be used to identify an unbalanced load across the login nodes.
A node with zero or very few users is likely being drained for maintenance or has been removed from the pool.
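
The reading described above can be sketched in a few lines of Python. This is only an illustration: the node names, user counts, and the `few` cutoff for "very few users" are hypothetical, and only the threshold of 75 comes from the panel description.

```python
USER_THRESHOLD = 75  # visual threshold named in the panel description


def review_login_nodes(users_by_node: dict[str, int], few: int = 2) -> dict[str, list[str]]:
    """Group nodes into those over the threshold and those nearly empty.

    `few` is an assumed cutoff for "zero or very few users".
    """
    return {
        "over_threshold": sorted(n for n, u in users_by_node.items() if u > USER_THRESHOLD),
        "nearly_empty": sorted(n for n, u in users_by_node.items() if u <= few),
    }


# Hypothetical node names and counts, for illustration only.
sample = {"login1": 82, "login2": 40, "login3": 0, "login4": 37}
print(review_login_nodes(sample))
```

A node listed under `nearly_empty` is a candidate for "drained or removed from the pool"; a node under `over_threshold` is worth a look but, as noted above, is not necessarily a problem.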

5 Minute Load Average

5 min load ave.png

This panel shows the 5 minute load average of each login node.
A red threshold line indicates a higher-than-expected load average.
Being over the threshold does not necessarily imply a problem with that node.
As with all panels, you can hover your mouse over a point on the line and it will display the value at that point.
If a value is not displayed, move your cursor slightly left or right.

CPU Status

Cpu status.png

This panel shows the number of CPUs that are allocated, idle and reserved.
It also displays both the current value and the average value. The average is calculated over the chosen time range.
Data is collected every 15 minutes, and each point is a snapshot of that moment in time.
You can hover over the points in the line to get the value for that time.
Allocated CPUs are actively performing job functions.
Idle CPUs are available to be scheduled.
Reserved CPUs are generally unavailable for scheduling and most often are held for system maintenance.
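
For comparison from a shell, Slurm's `sinfo -h -o %C` prints CPU counts as `allocated/idle/other/total`. The sketch below parses a string in that format. It is an assumption that this is a useful cross-check; how the dashboard collects its numbers is not stated here, and Slurm's "other" count is not necessarily the same as the panel's "reserved" category. The sample value is made up.

```python
from typing import NamedTuple


class CpuCounts(NamedTuple):
    allocated: int
    idle: int
    other: int
    total: int


def parse_cpu_counts(field: str) -> CpuCounts:
    """Parse an 'allocated/idle/other/total' string as printed by `sinfo -h -o %C`."""
    parts = field.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"expected four '/'-separated counts, got {field!r}")
    return CpuCounts(*(int(p) for p in parts))


# Made-up sample value, not real HiPerGator data.
counts = parse_cpu_counts("6400/1200/400/8000")
idle_share = counts.idle / counts.total
print(f"{counts.idle} of {counts.total} CPUs idle ({idle_share:.0%})")
```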

1 Minute Load Average

1 min load ave.png

This panel shows the 1 minute load average of the login nodes.
It is similar to the 5 minute average but this one can spike higher at times.
If the 1 minute load average is high, but the 5 minute average is not, then it's an indication of a short spike of activity.
If both the 1 and 5 minute averages are high, then it's possible someone has started a CPU-intensive process.
There is also a threshold value set to indicate a higher than expected load.
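
The comparison of the two averages can be written out as a small Python sketch. The threshold value is hypothetical (the dashboard's actual threshold is not given in this text), and the third branch covers a case the description above does not mention.

```python
import os


def describe_load(one_min: float, five_min: float, threshold: float) -> str:
    """Apply the reading described above to a pair of load averages."""
    if one_min > threshold and five_min > threshold:
        return "sustained load: possibly a CPU-intensive process"
    if one_min > threshold:
        return "short spike of activity"
    if five_min > threshold:
        # Not covered by the panel description: the 1 minute value has
        # already dropped while the 5 minute value is still elevated.
        return "earlier load is easing"
    return "within expected range"


# Hypothetical threshold; the dashboard's actual value is not given here.
THRESHOLD = 16.0

print(describe_load(20.0, 6.0, THRESHOLD))
print(describe_load(22.0, 19.0, THRESHOLD))

# On Unix-like systems the local machine's own averages can be read directly.
if hasattr(os, "getloadavg"):
    one, five, _fifteen = os.getloadavg()
    print(describe_load(one, five, THRESHOLD))
```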

GPU Allocation

Gpu status.png

This panel shows the allocated percentage of GPUs per product family and overall.
It also displays both the current and average allocation, calculated over the chosen time range.
The overall allocation includes all GPU families.

Slurm Jobs Started per Minute

Slurm job starts.png

This panel shows the number of jobs started by the Slurm scheduler every minute.
To view the value of a specific point, you can hover your mouse over a bar.
Each bar should display a value.

General Dashboard Usage

The dashboard is restricted in several ways, but some settings can be changed.

Changing the Time Range

Time range.png

To change the dashboard's time range, click the box with the clock icon in the top right corner.
The default is "Last 3 hours". Clicking the box presents several preset options.
It's best to use one of the presets. With a longer range, the dashboard panels may be harder to read.



Changing the Refresh Frequency

Refresh.png

You can adjust how often the dashboard panels refresh.
Click the refresh icon in the top right. The default is 5 minutes.
Setting the refresh interval below 5 minutes increases the load on the backend servers and may degrade dashboard performance.
Also, some panels only collect data every 15 minutes or longer, so shortening the refresh interval may have little to no effect.

Changing the Partition

Partition box.png

You can select partitions of interest from the drop-down menu at the top left of the dashboard.
Currently, this affects only the CPU Status panel.
You may select multiple entries from the selector, or choose All (the default).



Maximize a Panel

Full screen panel.png

To view only a single panel, click the panel title, then click View.
This presents only that panel, filling the entire dashboard and making it easier to read.
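
The same settings also appear as query parameters in the panel links used on this page (`refresh`, `var-partition`, `viewPanel`). The Python sketch below assembles such a URL. The base URL, `orgId`, `var-cluster`, and panel number 14 (CPU Status) are taken from those links; the `from`/`to` time-range parameters and the repeated `var-partition` for multiple selections follow common Grafana URL conventions and are assumed, not confirmed, to work on this deployment.

```python
from urllib.parse import urlencode

BASE = "https://metrics.rc.ufl.edu/d/e6ECgorZz/hipergator-status"


def dashboard_url(partitions=("All",), refresh="5m", time_from="now-3h",
                  time_to="now", view_panel=None):
    """Build a dashboard URL from the settings described above."""
    params = [("orgId", "2"), ("refresh", refresh),
              ("from", time_from), ("to", time_to),
              ("var-cluster", "hipergator")]
    # A multi-value selection is expressed by repeating the variable.
    params += [("var-partition", p) for p in partitions]
    if view_panel is not None:
        # Show a single panel maximized, like clicking the title then View.
        params.append(("viewPanel", str(view_panel)))
    return f"{BASE}?{urlencode(params)}"


# Default time range and refresh, CPU Status panel maximized.
print(dashboard_url(view_panel=14))
```

Keeping `refresh` at `5m` or longer matches the guidance above about backend load.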