NØMAD Architecture Summary

Data Collection Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                         NØMAD Data Collection v0.2.0                        │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SYSTEM COLLECTORS (every 60s):                                              │
│  ┌──────────────┬─────────────────────────────────────────────────────────┐  │
│  │ disk         │ Filesystem usage (total, used, free, projections)       │  │
│  │ iostat       │ Device I/O: %iowait, utilization, latency               │  │
│  │ mpstat       │ Per-core CPU: utilization, imbalance detection          │  │
│  │ vmstat       │ Memory pressure, swap activity, blocked processes       │  │
│  │ nfs          │ NFS I/O: ops/sec, throughput, RTT, retransmissions      │  │
│  │ gpu          │ NVIDIA GPU: utilization, memory, temperature, power     │  │
│  └──────────────┴─────────────────────────────────────────────────────────┘  │
│                                                                              │
│  SLURM COLLECTORS (every 60s):                                               │
│  ┌──────────────┬─────────────────────────────────────────────────────────┐  │
│  │ slurm        │ Queue state: pending, running, partition stats          │  │
│  │ job_metrics  │ sacct data: CPU/mem efficiency, health scores           │  │
│  │ node_state   │ Node allocation, drain reasons, CPU load, memory        │  │
│  └──────────────┴─────────────────────────────────────────────────────────┘  │
│                                                                              │
│  JOB MONITOR (every 30s):                                                    │
│  ┌──────────────┬─────────────────────────────────────────────────────────┐  │
│  │ job_monitor  │ Per-job I/O: NFS vs local writes from /proc/[pid]/io    │  │
│  └──────────────┴─────────────────────────────────────────────────────────┘  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
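
The job monitor's per-job figures come from the cumulative counters in /proc/[pid]/io. Below is a minimal Python sketch of that read path; the procfs field names are real, but how NØMAD maps a SLURM job to its PID list, and how it splits NFS from local writes (which /proc/[pid]/io alone does not distinguish), are assumptions here, not confirmed internals.

from pathlib import Path

def read_proc_io(pid: int) -> dict[str, int]:
    """Parse /proc/<pid>/io into a dict of counters (bytes / syscall counts)."""
    fields = {}
    for line in Path(f"/proc/{pid}/io").read_text().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = int(value)
    return fields

def job_write_bytes(pids: list[int]) -> int:
    """Sum write_bytes across a job's processes. A process can exit between
    listing and reading, so missing entries are tolerated, not fatal."""
    total = 0
    for pid in pids:
        try:
            total += read_proc_io(pid)["write_bytes"]
        except (FileNotFoundError, ProcessLookupError, PermissionError, KeyError):
            continue
    return total

Sampled every 30s, the delta between consecutive write_bytes readings gives a job's write rate.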

Feature Vector (19 dimensions)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Feature Vector for Similarity Analysis                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  FROM SACCT (job outcome):              FROM IOSTAT (system I/O):           │
│  ┌────────────────────────────────┐     ┌────────────────────────────────┐  │
│  │  1. health_score        [0-1]  │     │ 11. avg_iowait_percent   [0-1] │  │
│  │  2. cpu_efficiency      [0-1]  │     │ 12. peak_iowait_percent  [0-1] │  │
│  │  3. memory_efficiency   [0-1]  │     │ 13. avg_device_util      [0-1] │  │
│  │  4. used_gpu            [0,1]  │     └────────────────────────────────┘  │
│  │  5. had_swap            [0,1]  │                                         │
│  └────────────────────────────────┘     FROM MPSTAT (CPU cores):            │
│                                         ┌────────────────────────────────┐  │
│  FROM JOB_MONITOR (I/O behavior):       │ 14. avg_core_busy        [0-1] │  │
│  ┌────────────────────────────────┐     │ 15. core_imbalance_ratio [0-1] │  │
│  │  6. total_write_gb      [0-1]  │     │ 16. max_core_busy        [0-1] │  │
│  │  7. write_rate_mbps     [0-1]  │     └────────────────────────────────┘  │
│  │  8. nfs_ratio           [0-1]  │                                         │
│  │  9. runtime_minutes     [0-1]  │     FROM VMSTAT (memory pressure):      │
│  │ 10. write_intensity     [0-1]  │     ┌────────────────────────────────┐  │
│  └────────────────────────────────┘     │ 17. avg_memory_pressure  [0-1] │  │
│                                         │ 18. peak_swap_activity   [0-1] │  │
│                                         │ 19. avg_procs_blocked    [0-1] │  │
│                                         └────────────────────────────────┘  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
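
A plausible reading of the layout above: every component is scaled into [0, 1] (the two flags, used_gpu and had_swap, are already 0/1), and jobs are compared over the fixed 19-slot ordering. A Python sketch under those assumptions follows; the actual scaling and distance metric are not confirmed by this summary, and cosine similarity is used here only as a common choice for this kind of analysis.

import math

FEATURES = [
    "health_score", "cpu_efficiency", "memory_efficiency", "used_gpu",
    "had_swap", "total_write_gb", "write_rate_mbps", "nfs_ratio",
    "runtime_minutes", "write_intensity", "avg_iowait_percent",
    "peak_iowait_percent", "avg_device_util", "avg_core_busy",
    "core_imbalance_ratio", "max_core_busy", "avg_memory_pressure",
    "peak_swap_activity", "avg_procs_blocked",
]

def vectorize(job: dict[str, float]) -> list[float]:
    """Order a job's metrics into the 19 slots, clamping each to [0, 1]."""
    return [min(1.0, max(0.0, job.get(name, 0.0))) for name in FEATURES]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two feature vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0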

Collector Details

#   Collector     Source               Key Data                                Graceful Skip
1   disk          shutil.disk_usage    total/used/free, fill rate projections  No
2   slurm         squeue, sinfo        pending/running jobs, partition stats   No
3   job_metrics   sacct                CPU/mem efficiency, exit codes, health  No
4   iostat        iostat -x            %iowait, device util, r/w latency       No
5   mpstat        mpstat -P ALL        per-core CPU, imbalance ratio           No
6   vmstat        vmstat               swap, memory pressure, blocked procs    No
7   node_state    scontrol show node   allocation, drain reasons, load         No
8   gpu           nvidia-smi           util, memory, temp, power               Yes (if no GPU)
9   nfs           nfsiostat            ops/sec, throughput, RTT                Yes (if no NFS)
10  job_monitor   /proc/[pid]/io       per-job NFS vs local I/O                No
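
"Graceful Skip" means the gpu and nfs collectors probe for their tooling first and record a skip rather than a failure when it is absent. A Python sketch of the GPU probe; the nvidia-smi flags used are real, but the None-means-skipped return convention is an assumption about NØMAD's collector interface.

import shutil
import subprocess

def gpu_available() -> bool:
    """Only run the GPU collector when nvidia-smi exists and answers."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                       timeout=5, check=True)
        return True
    except (subprocess.SubprocessError, OSError):
        return False

def collect_gpu() -> list[dict] | None:
    """Return None to signal a graceful skip instead of raising an error."""
    if not gpu_available():
        return None  # logged as skipped, not failed
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=10,
    ).stdout
    rows = []
    for line in out.strip().splitlines():
        util, mem, temp, power = (v.strip() for v in line.split(","))
        rows.append({"util_pct": float(util), "mem_used_mb": float(mem),
                     "temp_c": float(temp), "power_w": float(power)})
    return rows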

Database Tables

System Metrics

  • filesystems - disk usage snapshots
  • iostat_cpu - system %iowait
  • iostat_device - per-device I/O stats
  • mpstat_core - per-core CPU stats
  • mpstat_summary - CPU imbalance metrics
  • vmstat - memory pressure, swap
  • node_state - SLURM node allocation
  • gpu_stats - NVIDIA GPU metrics
  • nfs_stats - NFS I/O metrics

Job Data

  • jobs - job metadata from sacct
  • job_metrics - time-series job stats
  • job_io_samples - per-job I/O snapshots
  • job_summary - health scores, feature vectors

Analysis

  • job_similarity - pairwise similarity edges
  • clusters - job cluster profiles
  • alerts - alert history
  • collection_log - collector run history
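
Everything lives in one SQLite file, so these tables can be queried directly alongside the CLI. A Python sketch of pulling the strongest similarity edges; the table name comes from the list above, but the column names (job_a, job_b, similarity) are assumptions about the schema.

import sqlite3

DB_PATH = "/var/lib/nomad/nomad.db"

def top_similar_pairs(limit: int = 10) -> list[tuple]:
    """Return the highest-scoring pairwise edges from job_similarity."""
    con = sqlite3.connect(DB_PATH)
    try:
        return con.execute(
            "SELECT job_a, job_b, similarity "
            "FROM job_similarity ORDER BY similarity DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        con.close()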

CLI Commands

# Core commands
nomad status              # Full system overview
nomad syscheck            # Verify requirements
nomad collect --once      # Single collection cycle
nomad collect -i 60       # Continuous (every 60s)
nomad monitor -i 30       # Job I/O monitor (every 30s)

# Analysis
nomad disk /home          # Filesystem trends
nomad jobs --user X       # Job history
nomad similarity          # Similarity analysis
nomad alerts              # View alerts

# Bash helpers (source scripts/nomad.sh)
nstatus    nwatch     ndisk      njobs      nsimilarity
nalerts    ncollect   nmonitor   nsyscheck  nlog

Quick Start

# 1. Initialize database
sqlite3 /var/lib/nomad/nomad.db < nomad/db/schema.sql

# 2. Verify system
nomad syscheck

# 3. Test collection
nomad collect --once

# 4. Start continuous collection
nohup nomad collect -i 60 > /tmp/nomad-collect.log 2>&1 &
nohup nomad monitor -i 30 > /tmp/nomad-monitor.log 2>&1 &

# 5. Check status
nomad status
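
Beyond nomad status, the collection_log table can be checked directly to confirm each collector is running on schedule. A short Python sketch; the column names (collector, ts, status) are assumptions about the schema.

import sqlite3

con = sqlite3.connect("/var/lib/nomad/nomad.db")
# Most recent run per collector, with its recorded outcome.
for name, last_run, status in con.execute(
    "SELECT collector, MAX(ts), status FROM collection_log GROUP BY collector"
):
    print(f"{name:12}  last run {last_run}  status={status}")
con.close()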