NØMAD Architecture Summary
Data Collection Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ NØMAD Data Collection v0.2.0 │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM COLLECTORS (every 60s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ disk │ Filesystem usage (total, used, free, projections) │ │
│ │ iostat │ Device I/O: %iowait, utilization, latency │ │
│ │ mpstat │ Per-core CPU: utilization, imbalance detection │ │
│ │ vmstat │ Memory pressure, swap activity, blocked processes │ │
│ │ nfs │ NFS I/O: ops/sec, throughput, RTT, retransmissions │ │
│ │ gpu │ NVIDIA GPU: utilization, memory, temperature, power │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
│ SLURM COLLECTORS (every 60s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ slurm │ Queue state: pending, running, partition stats │ │
│ │ job_metrics │ sacct data: CPU/mem efficiency, health scores │ │
│ │ node_state │ Node allocation, drain reasons, CPU load, memory │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
│ JOB MONITOR (every 30s): │
│ ┌──────────────┬─────────────────────────────────────────────────────────┐ │
│ │ job_monitor │ Per-job I/O: NFS vs local writes from /proc/[pid]/io │ │
│ └──────────────┴─────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
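The job monitor row above samples /proc/[pid]/io every 30s to split per-job writes into NFS and local traffic. Below is a minimal sketch of one way to derive that split, assuming the wchar/write_bytes heuristic: wchar counts all bytes passed to write() syscalls, while write_bytes only counts bytes that reach the local block layer, so their difference approximates network-filesystem writes. The function name and return shape are illustrative, not NØMAD's actual interface.

```python
# Hypothetical sketch of the per-job I/O sampling idea (not NØMAD's code).
# /proc/[pid]/io exposes two useful counters:
#   wchar       - bytes passed to write() syscalls (local + NFS)
#   write_bytes - bytes actually submitted to the local block layer
# Their difference is a rough proxy for writes that never hit local disk,
# i.e. mostly NFS traffic on a compute node.
from pathlib import Path

def sample_job_io(pid: int) -> dict:
    fields = {}
    for line in Path(f"/proc/{pid}/io").read_text().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = int(value)
    total = fields["wchar"]
    local = fields["write_bytes"]
    return {
        "pid": pid,
        "total_write_bytes": total,
        "local_write_bytes": local,
        "nfs_write_bytes_est": max(total - local, 0),  # heuristic, see above
    }
```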
Feature Vector (19 dimensions)
┌─────────────────────────────────────────────────────────────────────────────┐
│ Feature Vector for Similarity Analysis │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FROM SACCT (job outcome): FROM IOSTAT (system I/O): │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ 1. health_score [0-1] │ │ 11. avg_iowait_percent [0-1] │ │
│ │ 2. cpu_efficiency [0-1] │ │ 12. peak_iowait_percent [0-1] │ │
│ │ 3. memory_efficiency [0-1] │ │ 13. avg_device_util [0-1] │ │
│ │ 4. used_gpu [0,1] │ └────────────────────────────────┘ │
│ │ 5. had_swap [0,1] │ │
│ └────────────────────────────────┘ FROM MPSTAT (CPU cores): │
│ ┌────────────────────────────────┐ │
│ FROM JOB_MONITOR (I/O behavior): │ 14. avg_core_busy [0-1] │ │
│ ┌────────────────────────────────┐ │ 15. core_imbalance_ratio [0-1] │ │
│ │ 6. total_write_gb [0-1] │ │ 16. max_core_busy [0-1] │ │
│ │ 7. write_rate_mbps [0-1] │ └────────────────────────────────┘ │
│ │ 8. nfs_ratio [0-1] │ │
│ │ 9. runtime_minutes [0-1] │ FROM VMSTAT (memory pressure): │
│ │ 10. write_intensity [0-1] │ ┌────────────────────────────────┐ │
│ └────────────────────────────────┘ │ 17. avg_memory_pressure [0-1] │ │
│ │ 18. peak_swap_activity [0-1] │ │
│ │ 19. avg_procs_blocked [0-1] │ │
│ └────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
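All 19 features are scaled to [0-1] so that no single raw unit (GB, minutes, percent) dominates the distance computation. Here is a minimal sketch of the comparison step, assuming cosine similarity over min-max scaled values; the actual scaling caps and similarity metric used by nomad similarity are not documented here, so the numbers below are illustrative assumptions.

```python
# Sketch: scale raw metrics into [0, 1], then compare two feature vectors.
# The 24 h runtime cap is an assumed normalization constant, not NØMAD's.
import math

def clamp01(value: float, cap: float) -> float:
    """Scale a raw metric into [0, 1] against an assumed cap."""
    return min(max(value / cap, 0.0), 1.0)

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# e.g. a raw runtime of 90 min scaled against an assumed 24 h cap:
runtime_scaled = clamp01(90, 24 * 60)
```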
Collector Details
| # | Collector | Source | Key Data | Graceful Skip |
|---|-----------|--------|----------|---------------|
| 1 | disk | shutil.disk_usage | total/used/free, fill rate projections | No |
| 2 | slurm | squeue, sinfo | pending/running jobs, partition stats | No |
| 3 | job_metrics | sacct | CPU/mem efficiency, exit codes, health | No |
| 4 | iostat | iostat -x | %iowait, device util, r/w latency | No |
| 5 | mpstat | mpstat -P ALL | per-core CPU, imbalance ratio | No |
| 6 | vmstat | vmstat | swap, memory pressure, blocked procs | No |
| 7 | node_state | scontrol show node | allocation, drain reasons, load | No |
| 8 | gpu | nvidia-smi | util, memory, temp, power | Yes (if no GPU) |
| 9 | nfs | nfsiostat | ops/sec, throughput, RTT | Yes (if no NFS) |
| 10 | job_monitor | /proc/[pid]/io | per-job NFS vs local I/O | No |
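The two collectors marked "Yes" skip cleanly when their backing tool is absent, rather than failing the whole collection cycle. A sketch of that pattern for the gpu collector, assuming a shutil.which probe; the query fields are standard nvidia-smi options, but the function shape and return value are illustrative, not NØMAD's actual interface.

```python
# Sketch of the "graceful skip" pattern: probe for the optional dependency
# first, and return None (skip) instead of raising when it is missing.
import shutil
import subprocess

def collect_gpu() -> dict | None:
    if shutil.which("nvidia-smi") is None:
        return None  # graceful skip: no NVIDIA driver/tooling on this node
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split(", ") for line in out.strip().splitlines()]
    return {"gpus": rows}
```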
Database Tables
System Metrics
filesystems - disk usage snapshots
iostat_cpu - system %iowait
iostat_device - per-device I/O stats
mpstat_core - per-core CPU stats
mpstat_summary - CPU imbalance metrics
vmstat - memory pressure, swap
node_state - SLURM node allocation
gpu_stats - NVIDIA GPU metrics
nfs_stats - NFS I/O metrics
Job Data
jobs - job metadata from sacct
job_metrics - time-series job stats
job_io_samples - per-job I/O snapshots
job_summary - health scores, feature vectors
Analysis
job_similarity - pairwise similarity edges
clusters - job cluster profiles
alerts - alert history
collection_log - collector run history
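job_summary is where the health scores and feature vectors land, so it is the natural starting point for ad-hoc queries. A hedged example, assuming column names (job_id, health_score, end_time) inferred from the descriptions above rather than a verified schema:

```python
# Sketch: pull the ten most recent health scores from job_summary.
# Column names are assumptions based on the table descriptions above.
import sqlite3

conn = sqlite3.connect("/var/lib/nomad/nomad.db")
rows = conn.execute(
    """
    SELECT job_id, health_score
    FROM job_summary
    ORDER BY end_time DESC
    LIMIT 10
    """
).fetchall()
for job_id, score in rows:
    print(f"{job_id}: health={score:.2f}")
conn.close()
```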
CLI Commands
# Core commands
nomad status # Full system overview
nomad syscheck # Verify requirements
nomad collect --once # Single collection cycle
nomad collect -i 60 # Continuous (every 60s)
nomad monitor -i 30 # Job I/O monitor (every 30s)
# Analysis
nomad disk /home # Filesystem trends
nomad jobs --user X # Job history
nomad similarity # Similarity analysis
nomad alerts # View alerts
# Bash helpers (source scripts/nomad.sh)
nstatus nwatch ndisk njobs nsimilarity
nalerts ncollect nmonitor nsyscheck nlog
Quick Start
# 1. Initialize database
sqlite3 /var/lib/nomad/nomad.db < nomad/db/schema.sql
# 2. Verify system
nomad syscheck
# 3. Test collection
nomad collect --once
# 4. Start continuous collection
nohup nomad collect -i 60 > /tmp/nomad-collect.log 2>&1 &
nohup nomad monitor -i 30 > /tmp/nomad-monitor.log 2>&1 &
# 5. Check status
nomad status