Infrastructure Monitoring¶

NOMAD monitors research workstations and storage systems in addition to compute nodes, so the health of compute, desktop, and storage resources is visible in one place.

Dashboard Views¶

Open the web dashboard and select the Workstations or Storage tab:

nomad dashboard
# Then click: Workstations | Storage

Workstation Monitoring¶

Track departmental workstations across your institution.

What's Monitored¶

Metric	Description
Status	Online/offline/unreachable
CPU Load	Current utilization percentage
Memory	Used/total RAM
Disk	Root filesystem usage
Users	Currently logged-in users
Uptime	Time since last reboot
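
As a rough illustration of where values like these can come from, the sketch below samples a few of them on a single Linux host using only the Python standard library. It is not NOMAD's collector: the function names are hypothetical, and treating the 1-minute load average divided by the core count as "CPU load" is an assumption made for the example.

```python
import os
import shutil


def parse_meminfo(text):
    """Return (used_gib, total_gib) from the contents of /proc/meminfo."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])  # sizes are reported in kiB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    kib_per_gib = 1024 * 1024
    return (total - available) / kib_per_gib, total / kib_per_gib


def parse_uptime(text):
    """Return seconds since boot from the contents of /proc/uptime."""
    return float(text.split()[0])


def sample_local():
    """Sample the local machine (Linux only, because it reads /proc)."""
    with open("/proc/meminfo") as f:
        used_gib, total_gib = parse_meminfo(f.read())
    with open("/proc/uptime") as f:
        uptime_s = parse_uptime(f.read())
    disk = shutil.disk_usage("/")  # root filesystem, as in the table above
    load_1min = os.getloadavg()[0]
    return {
        # Assumption: load average per core as a stand-in for utilization.
        "cpu_load_pct": 100 * load_1min / (os.cpu_count() or 1),
        "memory_gib": (used_gib, total_gib),
        "disk_pct": 100 * disk.used / disk.total,
        "uptime_s": uptime_s,
    }
```

The parsing functions take text rather than file paths so they can be exercised without a live /proc filesystem; the list of logged-in users is left out because the standard library has no portable way to read it.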

Dashboard View¶

The Workstations tab shows machines grouped by department:

----------------------------------------------------------------------
 Workstations                                             14 online
----------------------------------------------------------------------

 Biology (4)
 +---------+--------+--------+--------+---------+---------+
 | Machine | Status | CPU    | Memory | Disk    | Users   |
 +---------+--------+--------+--------+---------+---------+
 | bio-ws1 | UP     | 23%    | 8/32GB | 45%     | alice   |
 | bio-ws2 | UP     | 67%    | 24/32GB| 52%     | bob,chen|
 | bio-ws3 | UP     | 5%     | 4/32GB | 38%     | -       |
 | bio-ws4 | DOWN   | -      | -      | -       | -       |
 +---------+--------+--------+--------+---------+---------+

 Chemistry (3)
 ...
----------------------------------------------------------------------

Configuration¶

Add workstations to your config file:

# ~/.config/nomad/nomad.toml

[[workstations]]
name = "bio-ws1"
host = "bio-ws1.dept.edu"
department = "Biology"

[[workstations]]
name = "bio-ws2"
host = "bio-ws2.dept.edu"
department = "Biology"

[[workstations]]
name = "chem-ws1"
host = "chem-ws1.dept.edu"
department = "Chemistry"

Or add them with the collector:

nomad collect workstations --discover               # Auto-discover via DNS
nomad collect workstations --add bio-ws1.dept.edu   # Add a single host

Storage Monitoring¶

Monitor NFS servers, ZFS pools, and storage capacity.

What's Monitored¶

Metric	Description
Capacity	Total/used/available space
Utilization	Percentage used
ZFS Health	Pool status (online/degraded/faulted)
IOPS	Read/write operations per second
Throughput	MB/s read/write
Latency	Average I/O response time
NFS Clients	Connected client count

Dashboard View¶

The Storage tab displays server status and pool health:

----------------------------------------------------------------------
 Storage Servers                                          3 healthy
----------------------------------------------------------------------

 storage01 - Primary Home Directories
 +-----------+---------+---------+--------+----------------+
 | Pool      | Status  | Used    | IOPS   | Clients        |
 +-----------+---------+---------+--------+----------------+
 | tank/home | ONLINE  | 45/100TB| 2.3K   | 47 connected   |
 | tank/apps | ONLINE  | 2/10TB  | 450    | 47 connected   |
 +-----------+---------+---------+--------+----------------+

 storage02 - Scratch Space
 +-----------+---------+---------+--------+----------------+
 | Pool      | Status  | Used    | IOPS   | Clients        |
 +-----------+---------+---------+--------+----------------+
 | scratch   | DEGRADED| 82/100TB| 5.1K   | 89 connected   |
 +-----------+---------+---------+--------+----------------+
 [!] storage02/scratch: 1 drive faulted, resilver in progress

----------------------------------------------------------------------

Configuration¶

Add storage servers to your config file:

# ~/.config/nomad/nomad.toml

[[storage]]
name = "storage01"
host = "storage01.cluster.edu"
type = "nfs"
pools = ["tank/home", "tank/apps"]

[[storage]]
name = "storage02"
host = "storage02.cluster.edu"
type = "nfs"
pools = ["scratch"]

Alerts¶

Set up alerts for storage conditions:

[alerts.storage]
capacity_warning = 80      # Warn at 80% full
capacity_critical = 95     # Critical at 95% full
inode_warning = 80         # Warn at 80% of inodes used
inode_critical = 95        # Critical at 95% of inodes used
latency_warning_ms = 50    # Warn if latency > 50ms
pool_degraded = true       # Alert on ZFS degraded state
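
The sketch below shows one way thresholds like these could be applied to a single pool sample. It is an illustration, not NOMAD's alerting code: the PoolSample record and the evaluate function are hypothetical, the comparison operators follow the comments in the config (at or above for percentages, strictly above for latency), and treating a non-ONLINE pool as critical is an assumption.

```python
from dataclasses import dataclass

# Thresholds copied from the [alerts.storage] example above.
THRESHOLDS = {
    "capacity_warning": 80,
    "capacity_critical": 95,
    "inode_warning": 80,
    "inode_critical": 95,
    "latency_warning_ms": 50,
    "pool_degraded": True,
}


@dataclass
class PoolSample:  # hypothetical record, not a NOMAD type
    name: str
    capacity_pct: float
    inode_pct: float
    latency_ms: float
    state: str  # "ONLINE", "DEGRADED", or "FAULTED"


def level(value, warning, critical):
    """Return "critical", "warning", or None for a percentage value."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return None


def evaluate(sample, t=THRESHOLDS):
    """Return (severity, message) pairs for one pool sample."""
    alerts = []
    cap = level(sample.capacity_pct, t["capacity_warning"], t["capacity_critical"])
    if cap:
        alerts.append((cap, f"{sample.name}: {sample.capacity_pct:.0f}% full"))
    ino = level(sample.inode_pct, t["inode_warning"], t["inode_critical"])
    if ino:
        alerts.append((ino, f"{sample.name}: {sample.inode_pct:.0f}% of inodes used"))
    if sample.latency_ms > t["latency_warning_ms"]:
        alerts.append(("warning", f"{sample.name}: latency {sample.latency_ms:.0f} ms"))
    # Assumption: any state other than ONLINE is reported as critical.
    if t["pool_degraded"] and sample.state != "ONLINE":
        alerts.append(("critical", f"{sample.name}: pool is {sample.state}"))
    return alerts


# Capacity and state match the scratch pool in the dashboard example;
# the inode and latency figures are made up for illustration.
scratch = PoolSample("scratch", capacity_pct=82, inode_pct=10, latency_ms=12, state="DEGRADED")
for severity, message in evaluate(scratch):
    print(severity, message)
```

With these inputs the scratch pool produces a capacity warning (82% is at or above 80%) and a degraded-pool alert, while inode usage and latency stay below their thresholds.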

Use Cases¶

Correlating Job Failures¶

When jobs fail, check whether the underlying infrastructure is the cause:

Job failed with I/O errors - Check Storage tab for NFS issues
Job timed out - Check if storage latency spiked
Multiple failures from one department - Check their workstations
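
The timeout case above amounts to asking whether storage latency was high shortly before the job failed. A minimal sketch of that check follows; the function name, the 10-minute window, and the sample data are all hypothetical, and the 50 ms threshold simply mirrors the latency_warning_ms example.

```python
from datetime import datetime, timedelta


def latency_spikes_before(failure_time, samples, window_minutes=10, threshold_ms=50):
    """Return (timestamp, latency_ms) samples above threshold_ms that fall
    inside the window ending at failure_time."""
    start = failure_time - timedelta(minutes=window_minutes)
    return [
        (ts, ms)
        for ts, ms in samples
        if start <= ts <= failure_time and ms > threshold_ms
    ]


# Hypothetical data for illustration only.
failure = datetime(2024, 1, 15, 14, 30)
samples = [
    (datetime(2024, 1, 15, 14, 5), 8.0),     # before the window
    (datetime(2024, 1, 15, 14, 22), 120.0),  # spike inside the window
    (datetime(2024, 1, 15, 14, 27), 95.0),   # spike inside the window
    (datetime(2024, 1, 15, 14, 40), 200.0),  # after the failure
]

spikes = latency_spikes_before(failure, samples)
print(len(spikes))  # 2
```

A non-empty result suggests the storage side is worth a look on the Storage tab before debugging the job itself.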

Capacity Planning¶

Track trends over time:

nomad report storage --days 30    # 30-day storage trend
nomad report workstations --idle  # Find underutilized machines

Proactive Maintenance¶

Get notified early enough to act before problems reach users:

Storage at 80% - Plan cleanup or expansion
ZFS pool degraded - Replace failing drive
Workstation offline - Check before users complain