Infrastructure Monitoring¶
NOMAD monitors more than compute nodes: it also tracks research workstations and storage systems, so you can see the state of your whole computing environment from one dashboard.
Dashboard Views¶
Access these views from the web dashboard tabs:
Workstation Monitoring¶
Track departmental workstations across your institution.
What's Monitored¶
| Metric | Description |
|---|---|
| Status | Online/offline/unreachable |
| CPU Load | Current utilization percentage |
| Memory | Used/total RAM |
| Disk | Root filesystem usage |
| Users | Currently logged-in users |
| Uptime | Time since last reboot |
Dashboard View¶
The Workstations tab shows machines grouped by department:
----------------------------------------------------------------------
Workstations 14 online
----------------------------------------------------------------------
Biology (4)
+---------+--------+--------+--------+---------+---------+
| Machine | Status | CPU    | Memory | Disk    | Users   |
+---------+--------+--------+--------+---------+---------+
| bio-ws1 | UP     | 23%    | 8/32GB | 45%     | alice   |
| bio-ws2 | UP     | 67%    | 24/32GB| 52%     | bob,chen|
| bio-ws3 | UP     | 5%     | 4/32GB | 38%     | -       |
| bio-ws4 | DOWN   | -      | -      | -       | -       |
+---------+--------+--------+--------+---------+---------+
Chemistry (3)
...
----------------------------------------------------------------------
Configuration¶
Add workstations to your config file:
# ~/.config/nomad/nomad.toml
[[workstations]]
name = "bio-ws1"
host = "bio-ws1.dept.edu"
department = "Biology"
[[workstations]]
name = "bio-ws2"
host = "bio-ws2.dept.edu"
department = "Biology"
[[workstations]]
name = "chem-ws1"
host = "chem-ws1.dept.edu"
department = "Chemistry"
Or use the collector:
nomad collect workstations --discover # Auto-discover via DNS
nomad collect workstations --add bio-ws1.dept.edu
Storage Monitoring¶
Monitor NFS servers, ZFS pools, and storage capacity.
What's Monitored¶
| Metric | Description |
|---|---|
| Capacity | Total/used/available space |
| Utilization | Percentage used |
| ZFS Health | Pool status (online/degraded/faulted) |
| IOPS | Read/write operations per second |
| Throughput | MB/s read/write |
| Latency | Average I/O response time |
| NFS Clients | Connected client count |
Dashboard View¶
The Storage tab displays server status and pool health:
----------------------------------------------------------------------
Storage Servers 3 healthy
----------------------------------------------------------------------
storage01 - Primary Home Directories
+-----------+---------+---------+--------+----------------+
| Pool      | Status  | Used    | IOPS   | Clients        |
+-----------+---------+---------+--------+----------------+
| tank/home | ONLINE  | 45/100TB| 2.3K   | 47 connected   |
| tank/apps | ONLINE  | 2/10TB  | 450    | 47 connected   |
+-----------+---------+---------+--------+----------------+
storage02 - Scratch Space
+-----------+---------+---------+--------+----------------+
| Pool      | Status  | Used    | IOPS   | Clients        |
+-----------+---------+---------+--------+----------------+
| scratch   | DEGRADED| 82/100TB| 5.1K   | 89 connected   |
+-----------+---------+---------+--------+----------------+
[!] storage02/scratch: 1 drive faulted, resilver in progress
----------------------------------------------------------------------
Configuration¶
Add storage servers to your config file:
# ~/.config/nomad/nomad.toml
[[storage]]
name = "storage01"
host = "storage01.cluster.edu"
type = "nfs"
pools = ["tank/home", "tank/apps"]
[[storage]]
name = "storage02"
host = "storage02.cluster.edu"
type = "nfs"
pools = ["scratch"]
Alerts¶
Set up alerts for storage conditions:
[alerts.storage]
capacity_warning = 80 # Warn at 80% full
capacity_critical = 95 # Critical at 95% full
inode_warning = 80 # Warn at 80% of inodes used
inode_critical = 95 # Critical at 95% of inodes used
latency_warning_ms = 50 # Warn if latency > 50ms
pool_degraded = true # Alert on ZFS degraded state
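As a rough illustration of how these thresholds might be applied to one pool, consider the Python sketch below. It is not NOMAD's implementation. The `StorageThresholds` class and `evaluate_pool` function are hypothetical, the severity labels are invented for the example, and the choice of `>=` for capacity and `>` for latency is an assumption based on the comments in the config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageThresholds:
    """Mirrors a subset of the [alerts.storage] keys shown above."""
    capacity_warning: float = 80.0
    capacity_critical: float = 95.0
    latency_warning_ms: float = 50.0
    pool_degraded: bool = True


def evaluate_pool(name, used_tb, total_tb, latency_ms, status,
                  thresholds=StorageThresholds()):
    """Return a list of (severity, message) tuples for one pool.

    Hypothetical helper; severity names are assumptions for illustration.
    """
    alerts = []
    used_pct = 100.0 * used_tb / total_tb

    # Check critical first so a pool never reports both capacity levels.
    if used_pct >= thresholds.capacity_critical:
        alerts.append(("critical", f"{name}: capacity at {used_pct:.0f}%"))
    elif used_pct >= thresholds.capacity_warning:
        alerts.append(("warning", f"{name}: capacity at {used_pct:.0f}%"))

    if latency_ms > thresholds.latency_warning_ms:
        alerts.append(("warning", f"{name}: latency {latency_ms:.0f} ms"))

    if thresholds.pool_degraded and status.upper() == "DEGRADED":
        alerts.append(("alert", f"{name}: pool is DEGRADED"))

    return alerts


if __name__ == "__main__":
    # Pool figures taken from the dashboard example; the 12 ms latency is made up.
    for severity, message in evaluate_pool("scratch", 82, 100, 12.0, "DEGRADED"):
        print(f"[{severity}] {message}")
```

With the dashboard figures for `scratch` (82 of 100 TB used, DEGRADED), this sketch reports a capacity warning and a degraded-pool alert, while `tank/home` at 45% produces nothing.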
Use Cases¶
Correlating Job Failures¶
When jobs fail, check the infrastructure they depend on:
- Job failed with I/O errors: check the Storage tab for NFS issues
- Job timed out: check whether storage latency spiked during the run
- Multiple failures from one department: check that department's workstations
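The latency check can be made concrete with a short Python sketch. It assumes you have exported latency samples as `(timestamp, latency_ms)` pairs and know the job's start and end times. The sample data and the `latency_spikes_during` helper are hypothetical, and the 50 ms default simply reuses the `latency_warning_ms` value from the alerts example.

```python
from datetime import datetime, timedelta


def latency_spikes_during(job_start, job_end, samples, threshold_ms=50.0):
    """Return the (timestamp, latency_ms) samples that fall inside the job's
    run window and exceed threshold_ms.

    Hypothetical helper for illustration; not a NOMAD API.
    """
    return [
        (ts, ms)
        for ts, ms in samples
        if job_start <= ts <= job_end and ms > threshold_ms
    ]


# Made-up latency samples, one every 5 minutes starting at 10:00.
t0 = datetime(2024, 1, 15, 10, 0)
samples = [
    (t0 + timedelta(minutes=5 * i), ms)
    for i, ms in enumerate([8.0, 12.0, 140.0, 95.0, 9.0])
]

# Made-up job window: 10:08 to 10:22.
job_start = t0 + timedelta(minutes=8)
job_end = t0 + timedelta(minutes=22)

spikes = latency_spikes_during(job_start, job_end, samples)

if __name__ == "__main__":
    for ts, ms in spikes:
        print(f"{ts:%H:%M} latency {ms:.0f} ms")
```

If the list is non-empty, storage latency is a plausible contributor to the timeout and the Storage tab is the next place to look.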
Capacity Planning¶
Track trends over time:
nomad report storage --days 30 # 30-day storage trend
nomad report workstations --idle # Find underutilized machines
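One way to turn a 30-day trend into a planning number is a linear projection of when a pool fills up. The Python sketch below assumes you have one used-capacity reading per day in TB. The output format of `nomad report storage` is not shown here, so the input list is hypothetical, and the straight-line fit is a simplification rather than NOMAD's method.

```python
def days_until_full(daily_used_tb, total_tb):
    """Fit a least-squares line to daily usage readings and return the
    estimated number of days until usage reaches total_tb.

    Returns None when usage is flat or shrinking. Hypothetical helper.
    """
    n = len(daily_used_tb)
    if n < 2:
        raise ValueError("need at least two daily readings")

    # Ordinary least squares with x = day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_tb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_tb))
    slope = sxy / sxx  # TB per day

    if slope <= 0:
        return None

    # Project forward from the fitted value on the most recent day.
    intercept = mean_y - slope * mean_x
    fitted_latest = intercept + slope * (n - 1)
    return max(0.0, (total_tb - fitted_latest) / slope)


if __name__ == "__main__":
    # Made-up readings: a 100 TB pool growing by 0.4 TB per day from 70 TB.
    usage = [70 + 0.4 * day for day in range(30)]
    estimate = days_until_full(usage, 100.0)
    print(f"Estimated days until full: {estimate:.0f}")
```

A straight line is a reasonable first estimate for steadily growing home directories, but scratch space with bursty usage may need a different model.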
Proactive Maintenance¶
Get notified before problems affect users:
- Storage at 80%: plan cleanup or expansion
- ZFS pool degraded: replace the failing drive
- Workstation offline: investigate before users complain