Diagnostics¶

The nomad diag commands provide targeted analysis for common HPC issues, helping administrators quickly identify root causes of performance problems.

Commands¶

nomad diag network    # Network performance analysis
nomad diag storage    # Storage health and I/O patterns
nomad diag node       # Node-level resource bottlenecks

Network Diagnostics¶

Analyze network performance between nodes.

nomad diag network
nomad diag network --src node01 --dst node02
nomad diag network --threshold 100   # Flag if <100 MB/s

What It Checks¶

Bandwidth: Measures throughput between node pairs
Latency: Round-trip time for packets
Packet loss: Percentage of dropped packets
Path analysis: Identifies bottleneck links

Output Example¶

Network Diagnostics Report
======================================================================

Path Analysis: node01 -> node02
  Hops: 2 (node01 -> switch01 -> node02)
  Latency: 0.3ms (excellent)
  Bandwidth: 9.2 Gbps (92% of 10G link)
  Packet Loss: 0.00%

Path Analysis: node01 -> storage01
  Hops: 3 (node01 -> switch01 -> switch02 -> storage01)
  Latency: 0.8ms (good)
  Bandwidth: 4.1 Gbps (41% of 10G link)  [!] Below expected
  Packet Loss: 0.01%

Summary:
  Total paths tested: 12
  Degraded paths: 1
  Recommendation: Check switch02 port utilization

Storage Diagnostics¶

Examine storage health and I/O patterns.

nomad diag storage
nomad diag storage --path /home
nomad diag storage --detailed

What It Checks¶

NFS performance: Read/write throughput, latency
Disk utilization: Space usage, inode counts
I/O patterns: Sequential vs random, read/write ratio
Queue depth: Outstanding I/O requests
Error rates: Failed operations, retries

Output Example¶

Storage Diagnostics Report
======================================================================

/home (NFS: storage01:/export/home)
  Capacity: 45.2 TB / 100 TB (45%)
  Inodes: 12M / 50M (24%)
  Read throughput: 850 MB/s (good)
  Write throughput: 420 MB/s (good)
  Latency: 2.1ms avg, 15ms p99
  Active clients: 47

/scratch (NFS: storage02:/export/scratch)
  Capacity: 82.1 TB / 100 TB (82%)  [!] High utilization
  Inodes: 45M / 50M (90%)  [!] Critical
  Read throughput: 1.2 GB/s (excellent)
  Write throughput: 180 MB/s  [!] Degraded
  Latency: 8.3ms avg, 250ms p99  [!] High latency spikes
  Active clients: 89

Recommendations:
  - /scratch approaching inode limit - clean old files
  - /scratch write performance degraded - check ZFS pool status

Node Diagnostics¶

Drill into node-level resource bottlenecks.

nomad diag node
nomad diag node gpu-01
nomad diag node --all

What It Checks¶

CPU: Utilization, load average, steal time
Memory: Usage, swap activity, OOM events
GPU: Utilization, temperature, memory, errors
I/O: Wait percentage, disk throughput
Network: Interface utilization, errors

Output Example¶

Node Diagnostics: gpu-03
======================================================================

CPU
  Utilization: 78% (32/40 cores active)
  Load Average: 28.5, 27.2, 25.8
  Steal Time: 0.0%
  Status: [OK] Normal

Memory
  Used: 187 GB / 256 GB (73%)
  Swap: 2.1 GB / 32 GB (6%)  [!] Swap activity detected
  OOM Events (24h): 0
  Status: [!] Monitor swap usage

GPU (4x NVIDIA A100)
  GPU 0: 92% util, 68C, 38/40 GB mem
  GPU 1: 88% util, 65C, 35/40 GB mem
  GPU 2: 0% util, 42C, 0/40 GB mem  [!] Idle
  GPU 3: 95% util, 71C, 39/40 GB mem
  ECC Errors: 0
  Status: [!] GPU 2 underutilized

I/O
  Wait: 3.2%
  Read: 450 MB/s
  Write: 120 MB/s
  Status: [OK] Normal

Recommendations:
  - Swap activity on memory-bound workload - consider job memory limits
  - GPU 2 idle while others near capacity - check job GPU allocation

Common Options¶

All nomad diag commands support:

Option	Description
`--db PATH`	Specify database path
`--json`	Output as JSON
`--quiet`	Minimal output (exit code only)
`-v, --verbose`	Detailed output

Exit Codes¶

Code	Meaning
0	All checks passed
1	Warnings detected
2	Critical issues found
3	Error running diagnostics

Useful for scripting and automated monitoring.