GPU Monitoring¶

NØMAD collects GPU metrics from NVIDIA GPUs using nvidia-smi as the standard source, with optional enhanced metrics via DCGM (NVIDIA Data Center GPU Manager) when available. The two sources are transparent to the user — NØMAD detects DCGM automatically and falls back to nvidia-smi silently.

Data Sources¶

Source	Metrics	Availability
`nvidia-smi`	Utilization, memory, temperature, power	Any NVIDIA GPU node
DCGM	All of the above + Real Util, pipeline counters, hardware health	Datacenter GPUs with DCGM installed

The data_source column in gpu_stats records which source was used for each record. The dashboard shows a DCGM pill next to the GPU name when enhanced metrics are active.

Real Utilization¶

nvidia-smi reports GPU utilization as the fraction of time at least one kernel was running — a kernel-launch duty cycle. A GPU can show 100% utilization while its tensor cores are completely idle.

Real Utilization is a weighted composite of DCGM pipeline activity counters that reflects what the GPU is actually doing:

Real Util = clamp(0, 100,
    W_sm     × SM_ACTIVE
  + W_tensor × TENSOR_ACTIVE
  + W_dram   × DRAM_ACTIVE
  + W_gr     × GR_ENGINE_ACTIVE)

Three presets are available, configurable in nomad.toml:

[collectors.gpu]
real_util_weights = "ai"   # ai | hpc | memory | [custom list]

Preset	Weights (SM, Tensor, DRAM, GR)	Best for
`ai`	0.35, 0.35, 0.20, 0.10	DL training, LLM inference
`hpc`	0.45, 0.15, 0.25, 0.15	Scientific computing, mixed CUDA
`memory`	0.35, 0.10, 0.40, 0.15	Bandwidth-heavy, stencil codes

Custom weights are also supported:

real_util_weights = [0.40, 0.30, 0.20, 0.10]  # SM, Tensor, DRAM, GR

Weights are auto-normalized to sum to 1.

The Real Utilization metric and workload classification framework are adapted from KempnerPulse (Kempner Institute, MIT license), based on NVIDIA DCGM profiling metric guidance.

Workload Classification¶

When DCGM data is available, each GPU record is assigned one of 12 workload categories based on pipeline counter patterns. Categories are evaluated in priority order; the first match wins.

Category	Signal	Meaning
tensor-heavy compute	Tensor ≥ 50%, SM ≥ 60%	DL training, large-scale inference. GPU used for its intended purpose.
tensor compute	Tensor ≥ 15%, SM ≥ 40%	Mixed-precision, moderate neural network activity.
FP64 / HPC compute	FP64 ≥ 20%, SM ≥ 50%	Classical scientific computing: molecular dynamics, quantum chemistry, climate simulation.
I/O or data-loading	PCIe ≥ 1 GB/s, SM < 30%	GPU waiting for data. Common during dataset loading or checkpoint I/O.
memory-bound	DRAM ≥ 50%, SM < 50%	Bandwidth is the bottleneck. The workload may benefit from data layout optimization.
compute-heavy	SM ≥ 80%	High SM occupancy. General CUDA compute, older code without tensor core use.
compute-active	SM ≥ 50%	Moderate SM activity. Mixed or general-purpose CUDA kernels.
memory-active	DRAM ≥ 40%	Significant DRAM traffic without SM saturation.
busy, low SM use	GR ≥ 40%, SM < 25%	Kernel launch overhead, synchronization barriers, or driver activity.
low utilization	GR < 15%, SM < 15%, DRAM < 15%	Barely active — job may be stalled or using the GPU incidentally.
idle	Real Util < 5%, all counters low	Nothing running.
mixed / moderate	(fallthrough)	No single dominant pattern.

The dashboard displays the workload category as a color-coded badge using the Okabe-Ito colorblind-safe palette:

Blue — tensor workloads (tensor-heavy, tensor)
Teal — FP64 / HPC compute
Amber — memory and I/O workloads
Mauve — compute workloads
Orange — I/O or data-loading
Gray — idle, low utilization, mixed

Interpreting the utilization gap¶

The difference between nvidia-smi utilization and Real Utilization is often the most actionable signal:

Small gap (< 10 pts): GPU is well-utilized; pipeline stages match the workload type.
Large gap (> 20 pts) with memory-bound: Data movement is the bottleneck. Consider larger batch sizes, prefetching, or NVLink if available.
Large gap with busy, low SM use: Many small kernels or excessive synchronization. Kernel fusion or async execution may help.
High nvidia-smi, low Real Util, idle: A single process holds the GPU but is not actively computing (e.g., a job in a data-loading loop).

Hardware Health Monitoring¶

When DCGM is active, NØMAD collects hardware health signals that precede visible performance degradation:

Signal	Meaning
PCIe replay rate	Link errors causing retransmissions. Any sustained rate > 0/s indicates link issues.
ECC correctable errors	Transient memory bit flips. Monitor for trends.
ECC uncorrectable errors	Serious memory failures. Investigate immediately.
Row remap failure	HBM memory (A100, H100) permanently degraded. Remove from production.

Health status per GPU:

Status	Condition
OK	Normal operation
WARN	PCIe replay rate > 0/s
HOT	Temperature at or above model threshold (A100: 93°C, RTX 6000 Ada: 92°C, H100: 95°C)
CRIT	Row remap failure or uncorrectable ECC errors

The dashboard shows a colored health badge (WARN/HOT/CRIT) next to the GPU name in the node detail panel. CRIT nodes should be removed from production.

SSH Mode¶

For headnode deployments where GPUs are on compute nodes, configure SSH mode:

[collectors.gpu]
enabled = true
ssh_user = "zeus"
gpu_nodes = ["node51", "node52", "node53"]

NØMAD SSHes to each node and runs nvidia-smi (and dcgmi if available) remotely. The same fallback logic applies per node.

SLURM / CUDA GPU Visibility¶

User-facing commands (nomad edu explain_job, etc.) filter GPU data to the GPUs visible to the current job. Detection priority:

CUDA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES
SLURM_STEP_GPUS
SLURM_JOB_GPUS
All GPUs (admin mode)

CLI Commands¶

`nomad diag gpu`¶

Utilization summary, workload distribution, and temperature status.

nomad diag gpu                    # All nodes, last 24h
nomad diag gpu --node node51      # Specific node
nomad diag gpu --hours 48         # Longer window
nomad diag gpu --health           # Hardware health only (PCIe, ECC, remap)

`nomad insights gpu`¶

Narrative summary of GPU usage patterns and health concerns. Flags the utilization gap when nvidia-smi reports significantly higher utilization than Real Util.

nomad insights gpu                # All nodes, last 6h
nomad insights gpu --hours 12     # Longer window
nomad insights gpu --node spdr16  # Specific node

Installing DCGM¶

DCGM is free and available for all supported datacenter GPUs.

Rocky Linux / RHEL:

dnf install -y epel-release
dnf install -y datacenter-gpu-manager
systemctl enable --now nvidia-dcgm

Ubuntu / Debian:

apt-get install -y datacenter-gpu-manager
systemctl enable --now nvidia-dcgm

Verify:

dcgmi discovery -l
dcgmi dmon -e 1001 -c 1

NØMAD detects dcgmi in PATH automatically on the next collection cycle. No configuration change is required.

Supported GPUs¶

GPU	nvidia-smi	DCGM
NVIDIA A100	✓	✓
NVIDIA A40	✓	✓
NVIDIA H100 / H200	✓	✓
NVIDIA RTX 6000 Ada	✓	✓ (DCGM 3.3.9+, field IDs 1001-1005)
NVIDIA RTX 4090 / 3090 / 3050	✓	Management only — profiling not supported
NVIDIA GTX series (Maxwell/Pascal)	✓	—
AWS p3/p4/g4dn/g5 instances	✓	✓ (if installed)
AMD GPUs	—	—

Profiling counter availability is GPU-class dependent. NVIDIA gates the profiling fields (SM_ACTIVE, TENSOR_ACTIVE, DRAM_ACTIVE, GR_ENGINE_ACTIVE) to datacenter and professional GPUs. Consumer GeForce cards (RTX 40/30/20 series) support DCGM for management metrics (utilization, memory, temperature, power) but profiling counters return error -36 ("Profiling is not supported"). The collector detects this automatically and falls back to nvidia-smi data with no DCGM enrichment fields populated. Real Utilization and workload classification are unavailable for these GPUs.

NØMAD has been verified with DCGM on: - NVIDIA RTX 6000 Ada Generation with DCGM 3.3.9 (full profiling support) - NVIDIA A100, A40, H100 (full profiling support — datacenter GPUs) - NVIDIA RTX 4090 (DCGM 4.5+ — management only, no profiling)

AMD GPU support is a future initiative (ROCm).

Insight Engine Integration¶

GPU signals are automatically included in nomad insights brief, nomad insights detail, and the dashboard Insights tab. No configuration is required — signals fire when the relevant data is present in the database.

Signals¶

Signal	Severity	Condition	Requires DCGM
GPU Job Failures	WARNING	> 20% of GPU jobs failing	No
GPU Out of Memory	WARNING/NOTICE	Any GPU jobs hit VRAM limit	No
GPU Utilization Gap	WARNING/NOTICE	nvidia-smi avg > 30%, Real Util lags > 20 pts	Yes
GPU Workload Pattern	NOTICE	memory-bound ≥ 50% or idle ≥ 70% of samples	Yes
GPU Workload Pattern	INFO	tensor-heavy or FP64 dominant ≥ 60%	Yes
GPU Hardware Health	CRITICAL	Row remap failure or uncorrectable ECC	Yes
GPU Hardware Health	WARNING	Temperature at or above model threshold	Yes
GPU Hardware Health	NOTICE	PCIe replay errors detected	Yes

Example narratives¶

GPU Utilization Gap (NOTICE):

node51 shows a 30-point gap between nvidia-smi utilization (54%) and Real Utilization (24%). The GPU appears busy but the compute pipeline is underused. Consider larger batch sizes, kernel fusion, or data prefetching.

GPU Workload Pattern — memory-bound (NOTICE):

node51 has been running memory-bound workloads 65% of the time. GPU compute pipeline is underutilized relative to memory bandwidth. Possible improvements: increase batch size, optimize data layout, or use prefetching to overlap compute and data transfer.

GPU Workload Pattern — productive (INFO):

spdr16 is running tensor-heavy compute workloads 72% of the time — GPU resources are being used effectively.

GPU Hardware Health — CRITICAL:

node51 GPU 2 has a row remap failure — HBM memory is permanently degraded. This GPU should be removed from production and scheduled for replacement.

GPU Hardware Health — NOTICE:

spdr17 GPU 0 is logging PCIe replay errors (0.023/s). This indicates link instability — check the PCIe slot, riser card, or cable. Left unaddressed, this typically escalates to link failure.

Signal thresholds¶

Utilization gap and workload pattern signals require at least 3 samples within the analysis window before firing, avoiding false positives from transient conditions. Hardware health signals fire on any occurrence within the window — PCIe errors and ECC events are always worth flagging.

-- Extended gpu_stats table
CREATE TABLE gpu_stats (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp DATETIME NOT NULL,
    node_name TEXT,
    gpu_index INTEGER,
    gpu_name TEXT,
    gpu_util_percent REAL,       -- nvidia-smi kernel duty cycle
    memory_util_percent REAL,
    memory_used_mb INTEGER,
    memory_total_mb INTEGER,
    memory_free_mb INTEGER,
    temperature_c INTEGER,
    power_draw_w REAL,
    power_limit_w REAL,
    compute_processes INTEGER,
    -- DCGM fields (NULL when nvidia-smi only)
    sm_active_pct REAL,          -- Streaming multiprocessor activity
    tensor_active_pct REAL,      -- Tensor core activity
    dram_active_pct REAL,        -- DRAM bandwidth utilization
    gr_engine_active_pct REAL,   -- Graphics engine activity
    fp64_active_pct REAL,        -- FP64 pipeline activity
    real_util_pct REAL,          -- Weighted composite (see above)
    workload_class TEXT,         -- 12-category classification
    data_source TEXT             -- 'nvidia-smi' or 'dcgm'
);

-- Hardware health time series
CREATE TABLE gpu_health (
    timestamp DATETIME NOT NULL,
    node TEXT NOT NULL,
    gpu_id INTEGER NOT NULL,
    pcie_replay_count INTEGER,
    pcie_replay_rate_per_sec REAL,
    ecc_correctable_total INTEGER,
    ecc_uncorrectable_total INTEGER,
    rows_remapped_correctable INTEGER,
    rows_remapped_uncorrectable INTEGER,
    rows_remapped_pending INTEGER,
    row_remap_failure INTEGER,
    health_status TEXT,          -- OK | WARN | HOT | CRIT
    PRIMARY KEY (timestamp, node, gpu_id)
);

Existing databases are migrated automatically on the next nomad collect run. No manual schema changes are required.