Diagnostics

The nomad diag commands provide targeted analysis for common HPC issues, helping administrators quickly identify root causes of performance problems.

Commands

nomad diag network    # Network performance analysis
nomad diag storage    # Storage health and I/O patterns
nomad diag node       # Node-level resource bottlenecks

Network Diagnostics

Analyze network performance between nodes.

nomad diag network
nomad diag network --src node01 --dst node02
nomad diag network --threshold 100   # Flag paths below 100 MB/s

What It Checks

  • Bandwidth: Measures throughput between node pairs
  • Latency: Round-trip time for packets
  • Packet loss: Percentage of dropped packets
  • Path analysis: Identifies bottleneck links

Output Example

Network Diagnostics Report
======================================================================

Path Analysis: node01 -> node02
  Hops: 2 (node01 -> switch01 -> node02)
  Latency: 0.3ms (excellent)
  Bandwidth: 9.2 Gbps (92% of 10G link)
  Packet Loss: 0.00%

Path Analysis: node01 -> storage01
  Hops: 3 (node01 -> switch01 -> switch02 -> storage01)
  Latency: 0.8ms (good)
  Bandwidth: 4.1 Gbps (41% of 10G link)  [!] Below expected
  Packet Loss: 0.01%

Summary:
  Total paths tested: 12
  Degraded paths: 1
  Recommendation: Check switch02 port utilization
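
The --threshold example is given in MB/s, while the report prints bandwidth in Gbps, so comparing the two takes a unit conversion. The following is a minimal sketch of that arithmetic, assuming decimal units (1 Gbps = 125 MB/s). The function names are illustrative and not part of nomad.

```python
# Convert between the Gbps figures in the report and the MB/s unit
# used by --threshold.
# Assumption: decimal units, i.e. 1 Gbps = 1000 Mbit/s = 125 MB/s.


def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second."""
    return gbps * 1000 / 8


def link_utilization(measured_gbps: float, link_gbps: float) -> float:
    """Return measured bandwidth as a percentage of link capacity."""
    return 100 * measured_gbps / link_gbps


if __name__ == "__main__":
    # Values taken from the node01 -> storage01 path in the report.
    print(f"{gbps_to_mb_per_s(4.1):.1f} MB/s")
    print(f"{link_utilization(4.1, 10):.0f}% of link")
```

Under these assumptions, a threshold of 100 MB/s corresponds to 0.8 Gbps, and the 4.1 Gbps path works out to about 512 MB/s.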

Storage Diagnostics

Examine storage health and I/O patterns.

nomad diag storage
nomad diag storage --path /home
nomad diag storage --detailed

What It Checks

  • NFS performance: Read/write throughput, latency
  • Disk utilization: Space usage, inode counts
  • I/O patterns: Sequential vs random, read/write ratio
  • Queue depth: Outstanding I/O requests
  • Error rates: Failed operations, retries

Output Example

Storage Diagnostics Report
======================================================================

/home (NFS: storage01:/export/home)
  Capacity: 45.2 TB / 100 TB (45%)
  Inodes: 12M / 50M (24%)
  Read throughput: 850 MB/s (good)
  Write throughput: 420 MB/s (good)
  Latency: 2.1ms avg, 15ms p99
  Active clients: 47

/scratch (NFS: storage02:/export/scratch)
  Capacity: 82.1 TB / 100 TB (82%)  [!] High utilization
  Inodes: 45M / 50M (90%)  [!] Critical
  Read throughput: 1.2 GB/s (excellent)
  Write throughput: 180 MB/s  [!] Degraded
  Latency: 8.3ms avg, 250ms p99  [!] High latency spikes
  Active clients: 89

Recommendations:
  - /scratch approaching inode limit - clean old files
  - /scratch write performance degraded - check ZFS pool status
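
To cross-check the capacity and inode percentages from a client node, one option is to query the mounted filesystem directly. The sketch below uses Python's standard os.statvfs call (Unix only). It is an independent check, not a description of how nomad gathers these numbers, and the helper names are hypothetical.

```python
import os


def usage_percent(used: int, total: int) -> float:
    """Return used/total as a percentage; 0.0 if the total is reported as 0."""
    return 100 * used / total if total else 0.0


def mount_usage(path: str) -> dict:
    """Report block and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)
    return {
        # Blocks in use = total blocks minus free blocks.
        "capacity_pct": usage_percent(st.f_blocks - st.f_bfree, st.f_blocks),
        # Some filesystems report 0 total inodes; usage_percent returns 0.0 then.
        "inode_pct": usage_percent(st.f_files - st.f_ffree, st.f_files),
    }


if __name__ == "__main__":
    # Replace "/" with a mount point such as /scratch on a real system.
    print(mount_usage("/"))
```

For the /scratch figures in the report, usage_percent(45, 50) gives 90.0, matching the inode percentage shown.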

Node Diagnostics

Drill into node-level resource bottlenecks.

nomad diag node
nomad diag node gpu-01
nomad diag node --all

What It Checks

  • CPU: Utilization, load average, steal time
  • Memory: Usage, swap activity, OOM events
  • GPU: Utilization, temperature, memory, errors
  • I/O: Wait percentage, disk throughput
  • Network: Interface utilization, errors

Output Example

Node Diagnostics: gpu-03
======================================================================

CPU
  Utilization: 78% (32/40 cores active)
  Load Average: 28.5, 27.2, 25.8
  Steal Time: 0.0%
  Status: [OK] Normal

Memory
  Used: 187 GB / 256 GB (73%)
  Swap: 2.1 GB / 32 GB (6%)  [!] Swap activity detected
  OOM Events (24h): 0
  Status: [!] Monitor swap usage

GPU (4x NVIDIA A100)
  GPU 0: 92% util, 68C, 38/40 GB mem
  GPU 1: 88% util, 65C, 35/40 GB mem
  GPU 2: 0% util, 42C, 0/40 GB mem  [!] Idle
  GPU 3: 95% util, 71C, 39/40 GB mem
  ECC Errors: 0
  Status: [!] GPU 2 underutilized

I/O
  Wait: 3.2%
  Read: 450 MB/s
  Write: 120 MB/s
  Status: [OK] Normal

Recommendations:
  - Swap activity on memory-bound workload - consider job memory limits
  - GPU 2 idle while others near capacity - check job GPU allocation
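
Load average is easier to interpret relative to the number of cores. The sketch below shows that normalization using only the Python standard library. It is an illustration of how to read the CPU section, not nomad's own logic, and the function names are hypothetical.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    return load_avg / cores


def current_load_per_core() -> float:
    """Return the 1-minute load average per core for the local machine."""
    one_min, _five_min, _fifteen_min = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return load_per_core(one_min, cores)
```

For the gpu-03 report, load_per_core(28.5, 40) is about 0.71, that is, fewer runnable tasks than cores on average.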

Common Options

All nomad diag commands support:

Option          Description
--db PATH       Specify database path
--json          Output as JSON
--quiet         Minimal output (exit code only)
-v, --verbose   Detailed output
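
The --json option makes the reports consumable from other tools. The sketch below builds a command line from the documented options and parses the result. It assumes the JSON document is written to standard output and that options may follow positional arguments. The JSON schema is not described here, so the parsed object is returned unchanged. diag_command and run_json are hypothetical helper names.

```python
import json
import subprocess


def diag_command(subcommand, *targets, db=None, json_output=False):
    """Build an argv list for `nomad diag <subcommand>` with common options."""
    argv = ["nomad", "diag", subcommand, *targets]
    if db is not None:
        argv += ["--db", db]
    if json_output:
        argv.append("--json")
    return argv


def run_json(subcommand, *targets, db=None):
    """Run a diag command with --json and return the parsed report.

    Assumption: the JSON report is printed on stdout.
    """
    proc = subprocess.run(
        diag_command(subcommand, *targets, db=db, json_output=True),
        capture_output=True,
        text=True,
    )
    # Exit code 3 means the diagnostics themselves failed to run.
    if proc.returncode == 3:
        raise RuntimeError(f"diagnostics failed to run: {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```

For example, diag_command("node", "gpu-01", json_output=True) produces the argument list for nomad diag node gpu-01 --json.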

Exit Codes

Code   Meaning
0      All checks passed
1      Warnings detected
2      Critical issues found
3      Error running diagnostics

These exit codes make the commands suitable for scripting and automated monitoring.
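
As one way to use the exit codes in a monitoring script, the sketch below maps them to named severities and runs a command with --quiet. The DiagStatus enum and helper functions are hypothetical names. Treating unknown exit codes or a missing nomad binary as an error is an assumption, not documented behavior.

```python
import subprocess
from enum import IntEnum


class DiagStatus(IntEnum):
    """Severity levels matching the documented exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    ERROR = 3


def interpret(returncode: int) -> DiagStatus:
    """Map a process exit code to a DiagStatus.

    Assumption: any code outside 0-3 is treated as ERROR.
    """
    try:
        return DiagStatus(returncode)
    except ValueError:
        return DiagStatus.ERROR


def run_quiet(subcommand: str) -> DiagStatus:
    """Run `nomad diag <subcommand> --quiet` and return its status."""
    try:
        proc = subprocess.run(["nomad", "diag", subcommand, "--quiet"])
    except FileNotFoundError:
        # Assumption: a missing nomad binary counts as a diagnostics error.
        return DiagStatus.ERROR
    return interpret(proc.returncode)


def worst(statuses) -> DiagStatus:
    """Return the most severe status, or OK for an empty input."""
    return max(statuses, default=DiagStatus.OK)
```

A cron job could, for instance, call worst(run_quiet(s) for s in ("network", "storage", "node")) and alert when the result is CRITICAL or ERROR.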