Setup Wizard¶
The nomad init wizard configures NØMAÐ for your environment. It detects available tools, asks about your infrastructure, and generates a configuration file at ~/.config/nomad/nomad.toml.
This guide walks through each step of the wizard with explanations and recommendations.
Step 1: Connection Mode¶
Option 1 — On the headnode is the most common choice. NØMAÐ runs directly on the machine that has access to SLURM commands (sinfo, squeue, sacct) and local filesystems. This is the simplest setup with no SSH configuration needed for the local cluster.
Option 2 — Remote machine is for cases where you want to run NØMAÐ from a laptop or desktop that connects to the cluster via SSH. NØMAÐ will SSH into the cluster to run commands. You will need to provide SSH credentials (hostname, user, key path).
Recommendation
Choose Option 1 whenever possible. Running on the headnode gives direct access to SLURM, filesystems, and system tools without SSH overhead.
Step 2: Deployment Strategy¶
NØMAÐ supports two deployment strategies. The wizard explains both and asks how many systems this instance will monitor.
Strategy A — One NØMAÐ per machine + sync¶
Install NØMAÐ on each machine independently. Each instance monitors its own environment and stores data in its own database. Use nomad sync on a central machine to merge all databases into a unified dashboard.
This is the recommended approach when each machine has a different role — for example, an HPC cluster with SLURM, an interactive server with Jupyter/RStudio, and a workstation management server.
In this case, answer 1 on each machine:
- HPC headnode: 1 cluster (HPC type) — monitors partitions, jobs, nodes
- Interactive server: 1 cluster (HPC type, SLURM auto-skipped) — monitors sessions, disk, CPU
- Workstation manager: 1 cluster (Workstation type) — monitors lab machines via SSH
Then on a central machine, configure nomad sync to pull all databases together:
nomad sync # Merge all site databases
nomad dashboard --db combined.db # Unified view across all sites
Strategy B — One NØMAÐ monitors everything¶
A single NØMAÐ instance monitors multiple systems from one machine, using SSH for remote access. This works when you have a machine that can reach all targets.
For example, to monitor both an HPC cluster and a set of workstations from the same headnode, answer 2:
- Cluster 1: HPC cluster (SLURM, local)
- Cluster 2: Workstation group (department machines, via SSH)
Which strategy to choose?¶
| Scenario | Strategy | Why |
|---|---|---|
| Machines have different roles (HPC, interactive, workstations) | A — one per machine + sync | Each instance uses the right collectors for its environment |
| Everything is reachable from one machine | B — single instance | Simpler setup, no sync needed |
| Machines are on different networks | A — one per machine + sync | SSH may not work across network boundaries |
| You want resilience (one site going down doesn't stop others) | A — one per machine + sync | Each site collects independently |
You can always start with Strategy B and split later
If you start with a single instance monitoring everything and later want to split, just run nomad init on each machine separately and set up nomad sync. The data format is the same either way.
Step 3: Cluster Configuration¶
For each cluster, the wizard asks for a name and type.
Cluster Name¶
Choose a short, descriptive name. This appears in the dashboard navigation tabs and in the combined database when using nomad sync. Use lowercase with no spaces.
Cluster Type¶
What type of system is this?
1) HPC cluster (managed by SLURM)
2) Workstation group (department machines, not SLURM)
Option 1 — HPC cluster enables SLURM collectors (squeue, sacct, sinfo), job metrics, node state, group membership, and job accounting. The wizard will auto-detect SLURM partitions.
Option 2 — Workstation group disables SLURM collectors and enables the WorkstationCollector, which monitors remote machines via SSH. You will be asked for department names and hostnames. Workstation data appears in the Workstations page of the dashboard.
What gets disabled for workstation groups
When you select workstation group, the following collectors are automatically disabled since they require SLURM: slurm, job_metrics, node_state, groups. The disk, vmstat, gpu, nfs, and workstation collectors remain active.
Step 3a: HPC Cluster — Partition Detection¶
If you selected HPC cluster, the wizard auto-detects SLURM partitions:
If auto-detection fails (SLURM not installed locally, or running in remote mode), you can type partition names manually:
Tip
You can find partition names by running sinfo -h -o "%P" on the cluster headnode.
Step 3b: Workstation Group — Departments and Machines¶
If you selected workstation group, the wizard asks for department organization:
Then for each department, provide the hostnames of machines to monitor:
Type the hostnames for 'parish-lab', separated by commas:
Nodes for parish-lab: aamy, adam, alexis, boyi, camryn, cooper
The WorkstationCollector will SSH to each machine and collect CPU, memory, disk, load, and user session data.
SSH Access Required
The current user must be able to SSH to each workstation with key-based authentication (no password). Test with: ssh -o BatchMode=yes hostname "hostname". The wizard writes the current username as ssh_user in the configuration.
Step 4: Filesystem Monitoring¶
Which filesystems should NØMAÐ monitor for disk usage?
Filesystems (comma-separated) [/, /home]: /, /home, /scratch
These are the local filesystems on the machine running NØMAÐ. The DiskCollector monitors these paths, computes fill rates using derivative analysis, and fires alerts when usage exceeds thresholds.
Common paths:
/— Root filesystem/home— User home directories/scratch— Scratch storage (often the most critical to monitor)/localscratch— Node-local scratch
Workstation filesystems
For workstation groups, this setting applies to the machine running NØMAÐ (not the workstations). The WorkstationCollector independently collects disk usage from each remote workstation via SSH.
Step 5: Optional Features¶
The wizard probes for available tools and asks about optional collectors:
GPU Monitoring¶
Enabled when nvidia-smi is available. Collects GPU utilization, memory, temperature, and power draw.
NFS Monitoring¶
Enabled when nfsiostat is found. Monitors NFS mount performance including read/write throughput and latency.
Interactive Sessions¶
Detects running JupyterHub or RStudio Server processes. When enabled, the InteractiveCollector tracks active sessions, memory usage, idle time, and session type.
Step 6: Alerts¶
NØMAÐ can send you email alerts when something needs
attention (disk filling up, nodes going down, etc.).
Your email address (press Enter to skip): admin@example.com
Providing an email address enables the alert pipeline. Alerts are always stored in the database and visible in the dashboard regardless of email configuration.
For email delivery, NØMAÐ uses the system mail command if available. No SMTP server configuration is required. You can set up a daily report via cron:
Advanced alert configuration
After the wizard, you can edit ~/.config/nomad/nomad.toml to configure Slack webhooks, custom thresholds, severity filters, and cooldown periods. See the Alerts documentation for details.
Step 7: Dashboard Port¶
The NØMAÐ dashboard is a web page you open in your
browser to view cluster status, node health, and alerts.
Dashboard port [8050]:
The default port 8050 works for most installations. Choose a different port if 8050 is already in use.
For remote access, set up an SSH tunnel:
Then open http://localhost:8050 in your browser.
Summary and Next Steps¶
After completing the wizard, you will see a summary:
✓ NØMAÐ configured!
Config: ~/.config/nomad/nomad.toml
Data: ~/.local/share/nomad
Clusters:
• spydur: 5 partitions, 30 nodes (local)
Enabled: GPU monitoring, NFS monitoring
What to do next:
1. Review your config: nano ~/.config/nomad/nomad.toml
2. Verify environment: nomad syscheck
3. Start collecting data: nomad collect
4. Open the dashboard: nomad dashboard
Verify Your Environment¶
This checks Python version, SLURM availability, system tools, database state, and filesystem access. Fix any warnings before starting collection.
Test Collection¶
Run a single collection cycle to verify everything works. Check for errors in the output — successful collectors show ✓ with a record count.
Start Continuous Collection¶
For production, consider running the collector as a systemd user service for automatic restart on failure.
Launch the Dashboard¶
Open http://localhost:8050 in your browser.
Multi-Site Setup¶
To monitor multiple sites from a single dashboard:
- Install and configure NØMAÐ on each site using
nomad init - On a central machine, create
~/.config/nomad/sync.toml:
[[sites]]
name = "cluster-a"
host = "headnode-a"
user = "monitor"
db_path = "~/.local/share/nomad/cluster-a.db"
[[sites]]
name = "cluster-b"
host = "headnode-b"
user = "monitor"
db_path = "~/.local/share/nomad/cluster-b.db"
- Run sync and launch the combined dashboard:
- Set up a cron for automatic syncing:
Configuration Reference¶
The wizard generates ~/.config/nomad/nomad.toml. Key sections:
| Section | Purpose |
|---|---|
[general] |
Data directory, cluster name |
[database] |
Database filename |
[collectors.disk] |
Filesystem paths to monitor |
[collectors.slurm] |
SLURM partitions (HPC only) |
[collectors.workstation] |
SSH user, workstation hostnames |
[collectors.gpu] |
GPU monitoring toggle |
[collectors.nfs] |
NFS mount monitoring |
[collectors.interactive] |
Jupyter/RStudio session detection |
[alerts] |
Email, Slack, webhook, thresholds |
[dashboard] |
Port, host binding |
[clusters.*] |
Cluster metadata and partition descriptions |
See the Configuration documentation for the full reference.