NOMAD Development Roadmap¶

Timeline Overview¶

2025 Q1 (Jan-Mar)     Phase 1: Monitoring Foundation
2025 Q2 (Apr-Jun)     Phase 2: Prediction Engine
2025 Q3 (Jul-Sep)     Phase 3: Visualization & Integration
2025 Q4 (Oct-Dec)     Phase 4: Paper 1 & Release
2026 Q1-Q2            Phase 5: Advanced ML & Paper 2

Phase 1: Monitoring Foundation (Jan-Mar 2025)¶

Milestone 1.1: Core Infrastructure¶

Target: End of January

[ ] Database Layer
  [ ] SQLite schema design and implementation
  [ ] Data models (Python dataclasses)
  [ ] Query utilities
  [ ] Migration system for schema updates
[ ] Configuration System
  [ ] TOML parser and validation
  [ ] Default configuration
  [ ] Environment variable overrides
  [ ] Config hot-reload support
[ ] Logging & Error Handling
  [ ] Structured logging setup
  [ ] Log rotation
  [ ] Error categorization
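
As an illustration of the "Environment variable overrides" item, the sketch below layers environment variables over an already-parsed configuration dictionary. The `NOMAD_<SECTION>__<KEY>` naming scheme, the function name, and the sample keys are assumptions made for this example; the roadmap does not specify them.

```python
import copy
import os

TRUE_WORDS = ("1", "true", "yes", "on")


def apply_env_overrides(config, environ=None, prefix="NOMAD_"):
    """Return a copy of config with matching environment overrides applied.

    Assumed (hypothetical) convention: NOMAD_<SECTION>__<KEY>=value.
    Only keys that already exist in the defaults are overridden, and the
    new value is coerced to the type of the default (scalars only).
    """
    environ = os.environ if environ is None else environ
    result = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        section, sep, option = name[len(prefix):].lower().partition("__")
        if not sep or section not in result or option not in result[section]:
            continue  # unknown variables are ignored rather than created
        current = result[section][option]
        if isinstance(current, bool):
            result[section][option] = raw.strip().lower() in TRUE_WORDS
        else:
            result[section][option] = type(current)(raw)
    return result


defaults = {"disk": {"interval": 60, "enabled": True}}
env = {"NOMAD_DISK__INTERVAL": "30", "NOMAD_DISK__ENABLED": "false"}
overridden = apply_env_overrides(defaults, env)
```

Restricting overrides to keys present in the defaults keeps typos in variable names from silently adding new settings.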

Milestone 1.2: Collectors¶

Target: End of February

[ ] Base Collector Framework
  [ ] Abstract base class
  [ ] Collection scheduling
  [ ] Error handling and retry logic
  [ ] Metrics storage interface
[ ] Disk Collector
  [ ] Filesystem usage (df parsing)
  [ ] Quota tracking (quota command)
  [ ] Fill rate calculation
  [ ] Derivative analysis integration
  [ ] Large file detection
[ ] SLURM Collector
  [ ] Queue state (squeue)
  [ ] Job history (sacct)
  [ ] Node state (sinfo)
  [ ] Partition statistics
  [ ] Pending job analysis
[ ] Node Collector
  [ ] Node status from SLURM
  [ ] SSH-based health checks
  [ ] Temperature monitoring (sensors, nvidia-smi)
  [ ] NFS mount verification
  [ ] Service status checks
[ ] License Collector
  [ ] FlexLM query parsing
  [ ] RLM support
  [ ] Generic license server interface
  [ ] Expiration tracking
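
A minimal sketch of the Disk Collector's "df parsing" and "fill rate calculation" items follows. It assumes the POSIX output format of `df -P -k` (six columns, sizes in 1024-byte blocks); the `FsUsage` class and both function names are hypothetical, and the sample text is made up for the example.

```python
from dataclasses import dataclass
from typing import List

# Made-up sample in the POSIX `df -P -k` layout.
SAMPLE_DF_OUTPUT = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 1000000 250000 750000 25% /
nfs:/home 4000000 3000000 1000000 75% /home
"""


@dataclass
class FsUsage:
    filesystem: str
    total_kb: int
    used_kb: int
    available_kb: int
    mount: str

    @property
    def used_fraction(self) -> float:
        return self.used_kb / self.total_kb if self.total_kb else 0.0


def parse_df(output: str) -> List[FsUsage]:
    """Parse `df -P -k` text, skipping the header line."""
    entries = []
    for line in output.strip().splitlines()[1:]:
        # maxsplit=5 keeps mount points that contain spaces intact.
        fields = line.split(None, 5)
        if len(fields) != 6:
            continue
        filesystem, total, used, available, _capacity, mount = fields
        entries.append(
            FsUsage(filesystem, int(total), int(used), int(available), mount)
        )
    return entries


def fill_rate_kb_per_s(prev_used_kb: int, curr_used_kb: int, seconds: float) -> float:
    """Average growth in used space between two samples."""
    return (curr_used_kb - prev_used_kb) / seconds


usage = parse_df(SAMPLE_DF_OUTPUT)
```

Device names that contain spaces would break the column split, so a real collector may need a stricter parser.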

Milestone 1.3: Alert System¶

Target: End of March

[ ] Alert Engine
  [ ] Rule evaluation framework
  [ ] Threshold-based rules
  [ ] Derivative-based rules
  [ ] Alert deduplication
  [ ] Cooldown management
  [ ] Severity levels
[ ] Derivative Analysis
  [ ] First derivative calculation
  [ ] Second derivative calculation
  [ ] Trend classification
  [ ] Projection (linear and quadratic)
  [ ] Smoothing options
[ ] Dispatch System
  [ ] Email dispatcher
  [ ] Slack webhook dispatcher
  [ ] Generic webhook support
  [ ] Alert acknowledgment tracking
[ ] CLI Interface
  [ ] nomad init
  [ ] nomad start/stop/status
  [ ] nomad disk/queue/nodes/licenses
  [ ] nomad alerts
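
The Derivative Analysis items above (first and second derivative, linear and quadratic projection) could be prototyped with finite differences, as in the sketch below. The function names and the three-sample window are assumptions for illustration, not the planned implementation, and smoothing is left out.

```python
import math


def latest_derivatives(samples):
    """Estimate first and second derivatives from the last three samples.

    samples is a list of (time, value) pairs in increasing time order.
    """
    (t0, v0), (t1, v1), (t2, v2) = samples[-3:]
    slope_prev = (v1 - v0) / (t1 - t0)
    slope_last = (v2 - v1) / (t2 - t1)
    # The two slopes are centred (t2 - t0) / 2 apart in time.
    second = (slope_last - slope_prev) / ((t2 - t0) / 2)
    return slope_last, second


def time_to_limit(value, d1, d2, limit):
    """Time until value reaches limit, or None if it is not projected to.

    With d2 == 0 this is the linear projection; otherwise it solves
    value + d1*t + 0.5*d2*t**2 == limit for the smallest positive t.
    """
    gap = limit - value
    if gap <= 0:
        return 0.0
    if abs(d2) < 1e-12:
        return gap / d1 if d1 > 0 else None
    disc = d1 * d1 + 2 * d2 * gap
    if disc < 0:
        return None
    roots = [(-d1 + sign * math.sqrt(disc)) / d2 for sign in (1, -1)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None


# Disk usage in percent, sampled once per hour (made-up numbers).
history = [(0, 50.0), (1, 52.0), (2, 56.0)]
d1, d2 = latest_derivatives(history)
linear_eta = time_to_limit(56.0, d1, 0.0, 100.0)
quadratic_eta = time_to_limit(56.0, d1, d2, 100.0)
```

In this example the quadratic projection reaches the limit much sooner than the linear one because usage is accelerating, which is the case a derivative-based rule is meant to catch.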

Phase 1 Deliverables¶

Working monitoring daemon
Email alerts for threshold and derivative triggers
CLI for status checking
SQLite database with historical data
Basic documentation

Phase 2: Prediction Engine (Apr-Jun 2025)¶

Milestone 2.1: Job Metrics Collection¶

Target: End of April

[ ] SLURM Hooks
  [ ] Prolog script (job start)
  [ ] Epilog script (job end)
  [ ] cgroup metrics extraction
  [ ] GPU metrics (nvidia-smi)
  [ ] I/O metrics (from /proc or cgroup)
[ ] Job Data Model
  [ ] Job metadata table
  [ ] Job metrics table (time-series)
  [ ] Job summary table (computed at end)
  [ ] Health score storage

Milestone 2.2: Similarity Network¶

Target: End of May

[ ] Feature Engineering
  [ ] Raw metric extraction
  [ ] Normalization (z-score, min-max)
  [ ] Non-redundant feature set
  [ ] Feature correlation analysis
[ ] Similarity Computation
  [ ] Cosine similarity implementation
  [ ] Efficient pairwise computation
  [ ] Threshold-based edge creation
  [ ] Network storage format
[ ] Health Score Model
  [ ] Initial formula (domain knowledge)
  [ ] Continuous score (0→1)
  [ ] Calibration against outcomes
  [ ] Cluster-based prediction
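
To make the Similarity Computation items concrete, here is a small pure-Python sketch of z-score normalization, cosine similarity, and threshold-based edge creation. The function names, the 0.7 threshold, and the two-feature job vectors are assumptions for the example; a production version would more likely use NumPy for the pairwise step.

```python
import math
from itertools import combinations


def zscore_columns(rows):
    """Normalize each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_edges(features, threshold=0.7):
    """Return (job_a, job_b, similarity) for pairs at or above threshold.

    features maps a job id to its raw feature vector.
    """
    ids = list(features)
    vectors = zscore_columns([features[job] for job in ids])
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(zip(ids, vectors), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity >= threshold:
            edges.append((id_a, id_b, similarity))
    return edges


# Made-up jobs with two features each.
jobs = {"a": [1.0, 10.0], "b": [1.1, 11.0], "c": [9.0, 1.0]}
edges = build_edges(jobs)
```

Only jobs `a` and `b` end up connected here; the all-pairs loop is quadratic in the number of jobs, which is why the roadmap lists efficient pairwise computation as its own item.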

Milestone 2.3: Simulation & Validation¶

Target: End of June

[ ] Generative Model
  [ ] Fit distributions to empirical data
  [ ] Profile-based simulation
  [ ] Correlation preservation
  [ ] Synthetic job generation
[ ] Validation Framework
  [ ] Coverage analysis
  [ ] Anomaly detection
  [ ] Distribution comparison
  [ ] Temporal drift monitoring
[ ] Error Analysis
  [ ] Confusion matrix computation
  [ ] Type 1/Type 2 error rates
  [ ] ROC curve generation
  [ ] Threshold optimization
  [ ] Defaults derivation
[ ] Recommendations
  [ ] Feature impact analysis
  [ ] Threshold extraction
  [ ] User-specific suggestions
  [ ] Training identification
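
The Error Analysis items can be illustrated with a short sketch that computes a confusion matrix, Type 1 and Type 2 error rates, and a threshold chosen by Youden's J statistic. Two assumptions are made here that the roadmap does not state: a job is predicted to fail when its health score falls below the threshold, and Youden's J is the optimization criterion. All function names are hypothetical.

```python
def confusion(scores, failed, threshold):
    """Return (tp, fp, tn, fn), treating "job failed" as the positive class."""
    tp = fp = tn = fn = 0
    for score, did_fail in zip(scores, failed):
        predicted_fail = score < threshold
        if predicted_fail and did_fail:
            tp += 1
        elif predicted_fail:
            fp += 1
        elif did_fail:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def error_rates(tp, fp, tn, fn):
    """Type 1 = false positive rate, Type 2 = false negative rate."""
    type1 = fp / (fp + tn) if fp + tn else 0.0
    type2 = fn / (fn + tp) if fn + tp else 0.0
    return type1, type2


def best_threshold(scores, failed, candidates):
    """Pick the candidate that maximizes Youden's J (TPR minus FPR)."""
    def youden_j(threshold):
        type1, type2 = error_rates(*confusion(scores, failed, threshold))
        return (1 - type2) - type1
    return max(candidates, key=youden_j)


# Made-up health scores and outcomes.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
failed = [True, True, False, True, False, False]
matrix = confusion(scores, failed, 0.5)
chosen = best_threshold(scores, failed, [0.3, 0.5])
```

Sweeping the threshold over many candidates and recording both rates at each step also yields the points of the ROC curve.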

Phase 2 Deliverables¶

Per-job metrics collection via SLURM hooks
Similarity network from real cluster data
Health score prediction
Data-driven recommendations
Simulation validation framework

Phase 3: Visualization & Integration (Jul-Sep 2025)¶

Milestone 3.1: Dashboard Backend¶

Target: End of July

[ ] API Server
  [ ] FastAPI or Flask backend
  [ ] REST endpoints for all data
  [ ] WebSocket for real-time updates
  [ ] Authentication (optional)
[ ] Data Aggregation
  [ ] Time-series aggregation
  [ ] Rollup tables for performance
  [ ] Efficient queries for dashboard

Milestone 3.2: Dashboard Frontend¶

Target: End of August

[ ] Monitoring Views
  [ ] Disk usage dashboard
  [ ] Queue status display
  [ ] Node health grid
  [ ] License availability
  [ ] Alert management
[ ] Prediction Views
  [ ] 2D similarity network
  [ ] Health score distribution
  [ ] Feature correlation panel
  [ ] Recommendations display
[ ] 3D Visualization
  [ ] Three.js network rendering
  [ ] Interactive rotation/zoom
  [ ] Safe/danger zone display
  [ ] Simulation cloud overlay
  [ ] Real-time job tracking

Milestone 3.3: Integration & Testing¶

Target: End of September

[ ] End-to-End Testing
  [ ] Collector integration tests
  [ ] Alert system tests
  [ ] Prediction accuracy tests
  [ ] Dashboard functional tests
[ ] Documentation
  [ ] Installation guide
  [ ] Configuration reference
  [ ] API documentation
  [ ] User guide
[ ] Performance Optimization
  [ ] Database query optimization
  [ ] Collection efficiency
  [ ] Dashboard responsiveness

Phase 3 Deliverables¶

Complete web dashboard
3D network visualization
Real-time updates
Comprehensive documentation
Performance-tested system

Phase 4: Paper 1 & Release (Oct-Dec 2025)¶

Milestone 4.1: Case Study¶

Target: End of October

[ ] Production Cluster Deployment
  [ ] Full deployment on Production Cluster
  [ ] 2+ months of production data
  [ ] User feedback collection
[ ] Metrics Collection
  [ ] Alert effectiveness analysis
  [ ] Prediction accuracy metrics
  [ ] System overhead measurements
  [ ] User satisfaction survey
[ ] VM Simulation Environment
  [ ] Data anonymization pipeline (remove users, paths, hostnames)
  [ ] Export tool for Production Cluster data → portable dataset
  [ ] Data replay engine (feed historical data as "live" events)
  [ ] Mock SLURM commands (squeue, sacct responses from data)
  [ ] VM image or Docker container with full NOMAD stack
  [ ] Documentation for reproducibility

Milestone 4.2: Paper Writing¶

Target: End of November

[ ] Paper 1 Draft
  [ ] Introduction and motivation
  [ ] Architecture description
  [ ] Feature documentation
  [ ] Case study results
  [ ] Performance analysis
[ ] Figures and Tables
  [ ] Architecture diagram
  [ ] Screenshot gallery
  [ ] Performance charts
  [ ] Comparison tables

Milestone 4.3: Release¶

Target: End of December

[ ] Open Source Release
  [ ] Code cleanup
  [ ] License files
  [ ] GitHub repository setup
  [ ] PyPI package
[ ] Paper Submission
  [ ] JOSS or SoftwareX submission
  [ ] Reviewer response preparation

Phase 4 Deliverables¶

Production deployment on Production Cluster
Tool paper submitted
Open source release v1.0
PyPI package

Phase 5: Advanced ML & Paper 2 (2026 Q1-Q2)¶

Milestone 5.1: Advanced Models¶

Target: End of February 2026

[ ] GNN Implementation
  [ ] PyTorch Geometric setup
  [ ] Graph construction from similarity network
  [ ] Node-level prediction (job health)
  [ ] Training pipeline
[ ] LSTM Early Warning
  [ ] Time-series feature extraction
  [ ] Derivative features
  [ ] Early warning prediction
  [ ] Alert integration
[ ] Ensemble Methods
  [ ] Model combination
  [ ] Confidence estimation
  [ ] Disagreement detection

Milestone 5.2: Partnerships¶

Target: End of April 2026

[ ] Partner Outreach
  [ ] Contact potential partners
  [ ] Data sharing agreements
  [ ] Deployment assistance
[ ] Multi-Cluster Data
  [ ] Anonymization pipeline
  [ ] Cross-cluster analysis
  [ ] Universal vs local patterns

Milestone 5.3: Paper 2¶

Target: Summer 2026

[ ] Research Analysis
  [ ] Emergent pattern discovery
  [ ] Biogeographical analogy validation
  [ ] Prediction vs baseline comparison
  [ ] Cross-institution validation
[ ] Paper 2 Writing
  [ ] Methods focus
  [ ] Theoretical framework
  [ ] Multi-cluster results
  [ ] Nature Computational Science target

Phase 5 Deliverables¶

GNN and LSTM models
Multi-cluster deployment
Paper 2 submitted
Community data federation

Task Tracking¶

Priority Labels¶

🔴 P0: Critical path, blocks other work
🟠 P1: Important, should be done soon
🟡 P2: Nice to have, can be deferred
🟢 P3: Future enhancement

Status Labels¶

⬜ Not started
🟨 In progress
✅ Complete
❌ Blocked

Dependencies¶

External Dependencies¶

Python 3.9+
SQLite 3.35+
SLURM (for queue monitoring)
nvidia-smi (for GPU monitoring)
React/Three.js (for visualization)

Python Dependencies¶

# Core
toml>=0.10
click>=8.0
sqlalchemy>=2.0

# Analysis
numpy>=1.21
scipy>=1.7
pandas>=1.3

# Prediction (Phase 2)
scikit-learn>=1.0

# Advanced ML (Phase 5: GNN and LSTM models)
torch>=2.0
torch-geometric>=2.0

# Visualization (Phase 3)
fastapi>=0.100
uvicorn>=0.20
jinja2>=3.0

# Development
pytest>=7.0
ruff>=0.1
black>=23.0

Risk Mitigation¶

Risk	Impact	Mitigation
SLURM access restrictions	Can't collect job metrics	Fallback to sacct-only data
No root on cluster	Limited cgroup access	Use available SLURM data
ML model underperforms	Poor predictions	Start with simple rules, add ML later
Dashboard too complex	Delayed release	MVP first, enhance iteratively
Partner data unavailable	Paper 2 scope limited	Focus on single-cluster depth

Success Metrics¶

Phase 1¶

[ ] Monitoring daemon runs 7+ days without crash
[ ] Alerts delivered within 60 seconds of trigger
[ ] <1% CPU overhead on head node

Phase 2¶

[ ] >80% jobs have metrics collected
[ ] Prediction accuracy >70%
[ ] Recommendations improve success rate by >10%

Phase 3¶

[ ] Dashboard loads in <3 seconds
[ ] 3D visualization runs at 30+ FPS
[ ] Real-time updates within 5 seconds

Phase 4¶

[ ] Paper 1 submitted to JOSS/SoftwareX
[ ] >10 GitHub stars within 3 months
[ ] At least 1 external deployment inquiry

ZFS Support Enhancement¶

Status: Planned

Current Compatibility¶

Basic monitoring via df and iostat works on ZFS systems
Quotas via zfs get userquota not yet supported

Planned Features¶

Auto-detection: Check for /proc/spl/kstat/zfs or zpool command
ZFS Collector (nomad/collectors/zfs.py):
Pool health via zpool status
Per-vdev I/O via zpool iostat -v
ARC cache stats from /proc/spl/kstat/zfs/arcstats
Compression/dedup ratios via zfs get
ZFS Quotas: Support zfs get userquota@user dataset
Config option: storage_backend = "auto" | "traditional" | "zfs"
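
The auto-detection rule and the `storage_backend` option above could be combined roughly as follows. The function name and signature are hypothetical; only the three option values and the two detection checks come from the list above.

```python
import os
import shutil

VALID_BACKENDS = ("auto", "traditional", "zfs")


def detect_storage_backend(configured="auto", kstat_path="/proc/spl/kstat/zfs"):
    """Resolve the storage_backend setting to "traditional" or "zfs".

    An explicit setting always wins. With "auto", ZFS is assumed when the
    kernel module's kstat directory exists or a zpool binary is on PATH.
    """
    if configured not in VALID_BACKENDS:
        raise ValueError(f"unknown storage_backend: {configured!r}")
    if configured != "auto":
        return configured
    if os.path.isdir(kstat_path) or shutil.which("zpool"):
        return "zfs"
    return "traditional"


backend = detect_storage_backend("auto")
```

Both checks only show that ZFS tooling is present on the host, not that a given monitored filesystem is ZFS, so a per-mount check may still be needed.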

Priority Metrics¶

Metric	Source	Why
Pool health	`zpool status`	Critical for failure prediction
ARC hit ratio	arcstats	Memory efficiency
Latency histograms	`zpool iostat -l`	I/O performance
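
As a sketch of the ARC hit ratio metric, the code below parses text in the `name type data` row layout that arcstats is assumed to use and computes hits / (hits + misses). The sample text is made up, and both function names are hypothetical; in practice the text would be read from /proc/spl/kstat/zfs/arcstats.

```python
# Made-up sample in the assumed kstat layout: a numeric header line,
# a column header line, then one "name type data" row per statistic.
SAMPLE_ARCSTATS = """\
13 1 0x01 96 26112 1234567890 9876543210
name                            type data
hits                            4    8000
misses                          4    2000
size                            4    1073741824
"""


def parse_kstat(text):
    """Return {name: value} for every three-column row with a numeric value."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_hit_ratio(stats):
    """Fraction of ARC lookups served from cache, or None without data."""
    hits = stats.get("hits", 0)
    total = hits + stats.get("misses", 0)
    return hits / total if total else None


ratio = arc_hit_ratio(parse_kstat(SAMPLE_ARCSTATS))
```

The counters are cumulative since module load, so a collector would normally difference two consecutive readings to get the hit ratio over the sampling interval.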

Plugin Architecture¶

Status: Partially implemented

Current State¶

[x] Collectors: BaseCollector + @registry.register pattern
[x] Alert backends: NotificationBackend ABC
[ ] Analysis modules: Not yet pluggable
[ ] Edu dimensions: Not yet pluggable
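
For readers unfamiliar with the `BaseCollector` + `@registry.register` pattern named above, the following is a generic sketch of a decorator-based registry. It is not the project's actual code: the `Registry` class, its methods, the `name` attribute, and the sample collector are assumptions chosen to show the idea.

```python
from abc import ABC, abstractmethod


class Registry:
    """Maps a plugin name to its class."""

    def __init__(self):
        self._classes = {}

    def register(self, cls):
        """Class decorator: record cls under its `name` attribute."""
        if cls.name in self._classes:
            raise ValueError(f"duplicate plugin name: {cls.name}")
        self._classes[cls.name] = cls
        return cls  # return the class unchanged so it stays usable directly

    def create(self, name, **kwargs):
        return self._classes[name](**kwargs)

    def names(self):
        return sorted(self._classes)


registry = Registry()


class BaseCollector(ABC):
    name = "base"

    @abstractmethod
    def collect(self) -> dict:
        """Gather one sample of metrics."""


@registry.register
class DiskCollector(BaseCollector):
    name = "disk"

    def collect(self) -> dict:
        # Placeholder payload; a real collector would gather metrics here.
        return {"collector": self.name}


sample = registry.create("disk").collect()
```

The same shape would carry over to the planned `BaseAnalyzer` and `BaseDimension` registries, since registration only depends on the class exposing a unique name.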

Planned Refactoring¶

Phase 1: Analysis Plugins

nomad/analysis/
  base.py              # BaseAnalyzer + registry
  similarity.py        # @registry.register
  gnn.py               # @registry.register
  lstm.py              # @registry.register
  autoencoder.py       # @registry.register

Phase 2: Edu Dimension Plugins

nomad/edu/
  dimensions/
    base.py          # BaseDimension + registry
    cpu.py           # @registry.register
    memory.py        # @registry.register
    time.py          # @registry.register
    io.py            # @registry.register
    gpu.py           # @registry.register

Phase 3: Entry Points

Auto-discovery via setuptools entry points
Third-party packages: pip install nomad-bioinformatics
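
Entry points declared in a package's metadata can be read at runtime with the standard library's `importlib.metadata`. The sketch below shows one way discovery might work; the group name `nomad.plugins` and the function name are assumptions, since the roadmap does not define them.

```python
from importlib.metadata import entry_points


def discover_plugins(group="nomad.plugins"):
    """Return {entry point name: EntryPoint} for one group.

    The default group name is hypothetical. Call .load() on a returned
    entry point to import the object it refers to.
    """
    found = entry_points()
    if hasattr(found, "select"):
        # Python 3.10+: selectable interface.
        selected = found.select(group=group)
    else:
        # Python 3.9: plain dict keyed by group name.
        selected = found.get(group, [])
    return {ep.name: ep for ep in selected}


plugins = discover_plugins()
```

Returning unloaded entry points keeps discovery cheap; a plugin is imported only when the caller decides to load it.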

Benefits¶

Custom proficiency dimensions per site
Custom ML models from researchers
Third-party plugins become organic marketing

Data Readiness Estimator¶

Status: Planned

Concept¶

Users need sufficient data before ML models are reliable.

Planned Command¶

nomad readiness

Data Readiness Assessment
-----------------------------------------
Jobs collected:     127 / 500 minimum
Days of data:       3 / 14 recommended
Feature coverage:   72%

ML Model Status:
  Similarity network:  Ready (127 jobs)
  LSTM early warning:  Not ready (need 7+ days)
  GNN predictions:     Not ready (need 300+ jobs)

Estimated time to full readiness: 11 days

Features¶

Minimum data thresholds per model
Confidence intervals vs data volume
Progress indicator in CLI/dashboard
Quick-start mode (rule-based until ML-ready)
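
A rough sketch of how the readiness check might be computed from per-model thresholds is shown below. The LSTM, GNN, and full-readiness thresholds are taken from the sample output above; the 100-job threshold for the similarity network and the time estimate formula are assumptions, and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Requirement:
    min_jobs: int = 0
    min_days: int = 0


MODEL_REQUIREMENTS = {
    "Similarity network": Requirement(min_jobs=100),  # assumed threshold
    "LSTM early warning": Requirement(min_days=7),
    "GNN predictions": Requirement(min_jobs=300),
}
FULL_READINESS = Requirement(min_jobs=500, min_days=14)


def model_status(jobs, days):
    """Return {model name: True if its data thresholds are met}."""
    return {
        name: jobs >= req.min_jobs and days >= req.min_days
        for name, req in MODEL_REQUIREMENTS.items()
    }


def days_to_full_readiness(jobs, days):
    """Estimate remaining days, assuming jobs keep arriving at the same rate.

    Returns None when there is no history to derive a job rate from.
    """
    days_left = max(0, FULL_READINESS.min_days - days)
    jobs_left = max(0, FULL_READINESS.min_jobs - jobs)
    if jobs_left == 0:
        return days_left
    if jobs == 0 or days == 0:
        return None
    jobs_per_day = jobs / days
    return max(days_left, math.ceil(jobs_left / jobs_per_day))


status = model_status(jobs=127, days=3)
estimate = days_to_full_readiness(jobs=127, days=3)
```

With the sample figures (127 jobs in 3 days), the day requirement is the limiting factor and the estimate comes out at 11 days, matching the sample output.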

Rebranding: NOMAD to NOMAD¶

Status: Planned (High Priority - do before paper acceptance)

Rationale¶

"NOMAD" can be misread as "no-made" by English speakers
"NOMAD" reads correctly as the English word for wanderer
Fits the philosophy: "Travels light, adapts to its environment"
Better to change now while paper is in review and user base is small

Name Change¶

Old	New
NOMAD (NOde Monitoring And Diagnostics)	NOMAD (NOde Monitoring And Diagnostics)

Changes Required¶

Code/Package

[ ] Rename directory: nomad/ to nomad/
[ ] Update all Python imports
[ ] Update pyproject.toml (package name, entry points)
[ ] Update CLI commands: nomad to nomad
[ ] Backward-compat alias: nomad still works temporarily

Paper (nomad-jors-paper.tex)

[ ] Line 21: Title
[ ] Line 36: Abstract - change expansion to "NOde Monitoring And Diagnostics"
[ ] Lines 79-290+: All NOMAD references to NOMAD
[ ] Update any figures showing the name

Documentation

[ ] README.md
[ ] All docs/*.md files
[ ] mkdocs.yml (site_url, repo_url)

External

[ ] Rename GitHub repo: jtonini/nomad to jtonini/nomad
[ ] New PyPI package: nomad-hpc
[ ] Update Zenodo DOI (new version)
[ ] GitHub Pages URL: jtonini.github.io/nomad

Migration Script (to create)¶

# Rename directory
mv nomad nomad

# Update imports in all Python files
find . -name "*.py" -exec sed -i 's/from nomad/from nomad/g' {} \;
find . -name "*.py" -exec sed -i 's/import nomad/import nomad/g' {} \;

# Update docs
find docs -name "*.md" -exec sed -i 's/NOMAD/NOMAD/g' {} \;
find docs -name "*.md" -exec sed -i 's/nomad/nomad/g' {} \;

# Update paper
sed -i 's/NOMAD/NOMAD/g' paper/nomad-jors-paper.tex
sed -i 's/NOde Monitoring And Diagnostics/NOde Monitoring And Diagnostics/g' paper/nomad-jors-paper.tex
mv paper/nomad-jors-paper.tex paper/nomad-jors-paper.tex

Backward Compatibility¶

# pyproject.toml - support both commands during transition
[project.scripts]
nomad = "nomad.cli:main"
nomad = "nomad.cli:main"  # deprecated, prints warning, remove in v2.0