Insight Engine¶

The NØMAD Insight Engine translates analytical output into actionable, human-readable narratives. Instead of presenting raw numbers, charts, and threshold alerts, the engine explains what is happening, why it matters, and what to do about it.

The Distinction¶

Layer	Example	Role
Alerts	`WARN: /scratch usage 92%`	Reactive threshold triggers
Reports	`94.2% success rate this week`	Backward-looking, descriptive
Insights	`The scratch filesystem is filling at 5 GB/hr because GPU jobs are writing large checkpoints. At this rate, /scratch will be full by 2am.`	Interpretive, forward-looking

Reports provide evidence. Insights provide understanding. Both coexist.

Quick Start¶

# Generate demo data with stress scenarios
nomad demo --no-launch

# Get a concise operational briefing
nomad insights brief --db ~/nomad_demo.db --cluster demo-cluster --hours 168

# Full detailed report
nomad insights detail --db ~/nomad_demo.db --hours 168

# JSON output (for API/Console integration)
nomad insights json --db ~/nomad_demo.db

# Slack-formatted message
nomad insights slack --db ~/nomad_demo.db --cluster demo-cluster

# Email digest
nomad insights digest --db ~/nomad_demo.db --period daily

How It Works¶

The engine runs a four-step pipeline:

DB tables → Signal Readers → Template Narration → Correlation Engine → Output
              (Level 1)         (Level 1)            (Level 2)

Step 1: Signal Readers¶

Eight domain-specific readers query the database and produce typed Signal objects with severity, metrics, and affected entities.

Reader	Data Source	Signals Produced
Jobs	`jobs`	Success rate, partition failures, OOM, timeouts, trend
Disk	`storage_state`	Filesystem usage, fill rate projection
GPU	`jobs` (GPU subset)	GPU job failure rate, GPU OOM
Queue	`queue_state`, `jobs`	Queue pressure, wait times
Network	`network_perf`	Latency, packet loss
Alerts	`alerts`	Active alert count, flapping detection
Cloud	`cloud_metrics`	Cost summary, underutilized instances
Workstation	`workstation_state`	CPU/memory pressure

Step 2: Template Narration (Level 1)¶

Each signal is passed through a narrative template that produces a human-readable sentence. Templates adjust tone based on severity:

Info: "1,200 jobs processed, 96% success rate."
Warning: "850 jobs, 82% success rate -- below the 90% baseline."
Critical: "400 jobs, only 65% succeeded -- well below normal."

There are 19 templates covering all signal types.

Step 3: Correlation Engine (Level 2)¶

The engine examines multiple signals together to find causal or co-occurring patterns. Instead of three separate alerts, it produces one coherent finding:

Correlation Rule	Signals Combined	Insight
Disk pressure + job failures	`disk_fill_projection` + `job_success_rate`	Cascading failure risk
GPU OOM + partition failures	`gpu_oom` + `partition_failure_concentration`	VRAM capacity mismatch
Queue pressure + wait times	`queue_pressure` + `high_wait_time`	Partition bottleneck
Network issues + job failures	`high_network_latency` + `job_success_rate`	I/O-related failures
Cloud cost + underutilization	`cloud_cost_summary` + `underutilized_instance`	Cost optimization
Workstation overload + alerts	`workstation_high_cpu` + `active_alerts`	User impact

Correlated insights include a recommendation with specific actions.

Step 4: Output Formatting¶

Format	Use Case
CLI brief	Concise terminal briefing
CLI detail	Full report with metrics
JSON	API and Console integration
Slack	Channel notifications (supports webhook)
Email digest	Daily/weekly summaries

CLI Reference¶

All commands accept --db PATH, --hours N, and --cluster NAME.

`nomad insights brief`¶

Concise operational briefing with health assessment, correlated findings, and individual signals.

`nomad insights detail`¶

Full report with all signals, metrics, and affected entities.

`nomad insights json`¶

JSON output for programmatic use:

{
  "overall_health": "degraded",
  "signal_count": 15,
  "insight_count": 3,
  "insights": [...],
  "signals": [...]
}

`nomad insights slack`¶

Slack-formatted message. Add --webhook URL to post directly.

`nomad insights digest`¶

Email digest with --period daily|weekly.

Dashboard Integration¶

Available in nomad dashboard as the Insights tab, and through the /api/insights endpoint.

Architecture¶

nomad/insights/
    engine.py         — InsightEngine orchestrator
    signals.py        — 8 signal readers
    templates.py      — 19 narrative templates
    correlator.py     — 6 correlation rules
    formatters.py     — Output formatters
    inject_stress.py  — Demo stress scenarios

Implementation Levels¶

Level	Description	Status
Level 1	Template-based narratives	Implemented
Level 2	Multi-signal correlation	Implemented
Level 3	LLM-powered interpretation	Planned (CSSI Year 2-3)

Programmatic Use¶

from nomad.insights import InsightEngine

engine = InsightEngine("/path/to/nomad.db", hours=168, cluster_name="mycluster")

print(engine.overall_health)    # "good", "nominal", "degraded", "impaired"
print(engine.signal_count)
data = engine.to_dict()         # Python dict
print(engine.to_slack())        # Slack markdown
subject, body = engine.to_email("daily")