Start parallel runs on Euler
Overview
This guide explains how to run multiple Nexus-E simulations in parallel on the Euler cluster using euler_run.sbatch.
Why use this? Instead of manually copying files and running simulations one by one, this script automates running multiple database/playlist combinations in parallel, saving significant time and effort.
Key features:
- Runs on scratch storage: Each job automatically copies only the necessary files to scratch, which is fast and doesn't consume your Home quota
- Centralized tracking: All job statuses are collected in a single manifest file, including paths to nexus-e logs, SLURM output/error files, and results
- Email notifications: Receive an email when each job completes or fails (optional)
- Parallel execution: All database/playlist combinations run independently in parallel
Files involved:
| File | Purpose |
|---|---|
| euler_run.sbatch | Main batch script (submit this with sbatch) |
| modules_playlists/euler_run.config.local | Your personal settings: JOBS, REPO_PATH, EMAIL_ADDRESS (git-ignored) |
| modules_playlists/euler_run.config.example | Template to copy for creating your .local file |
| config.toml | Main Nexus-E config (database credentials, settings) |
Table of Contents
- Prerequisites (may skip)
- First Time Setup
- Common Workflows
- How It Works
- Monitoring Jobs
- Understanding the Manifest File
- Troubleshooting
- Appendix: Understanding create_a_copy Behavior
Prerequisites
Tip: Already running Nexus-E on Euler? If you've successfully run Nexus-E on Euler before, you can skip this section and go directly to First Time Setup.
Before using euler_run.sbatch, ensure you have:
1. Access & Environment
- SSH access to the Euler cluster
- Membership in the es_sansa account (or update #SBATCH --account in the script)
- uv package manager installed and accessible
2. Repository Setup
- Nexus-E framework cloned to Euler (default location: ~/repos/nexus-e-framework)
  - If you use a different path, you'll need to update REPO_PATH in your config file (see Step 2)
  - Recommendation: Keep only one repository version on Home to avoid confusion
- Your config.toml configured with correct database credentials, user initials, etc.
- Required databases accessible on the database server
3. Directory Structure
The script expects this structure:
~/repos/nexus-e-framework/           # Your repo (can be a different path)
├── euler_run.sbatch                 # The batch script
├── config.toml                      # Main config (database credentials, user settings)
├── modules_playlists/
│   ├── euler_run.config.local       # Your personal config (JOBS, REPO_PATH), created in Step 1
│   ├── centiv_2050.toml             # Playlist definitions
│   └── ...
└── ...
~/Euler_logs/                        # Created automatically for logs and manifests
First Time Setup
Step 1: Create Your Personal Configuration
cd ~/repos/nexus-e-framework # Or wherever your repo is
cp modules_playlists/euler_run.config.example modules_playlists/euler_run.config.local
Step 2: Edit Your Configuration
Open modules_playlists/euler_run.config.local and configure:
# Define which database/playlist combinations to run
JOBS=(
"test_nuclear_all_eu|centiv_2050"
"test_nuclear_all_eu|suportRES"
"my_scenario_db|centiv_2050"
)
# Path to your nexus-e-framework repository
REPO_PATH="$HOME/repos/nexus-e-framework"
# Email notifications (optional)
# Set your email to receive notifications when jobs END or FAIL
# Leave empty to disable: EMAIL_ADDRESS=""
EMAIL_ADDRESS="your.email@ethz.ch"
Understanding JOBS:
- Format: "database_name|playlist_name"
  - database_name: Name of the database in MySQL
  - playlist_name: Name of the playlist file in modules_playlists/ (without the .toml extension)
- Each job runs independently in parallel
- Note: The script automatically patches config.toml with these values for each job, so you don't need to manually edit database or playlist settings in config.toml
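To illustrate the format, the following sketch splits each entry at the | separator with bash parameter expansion. How euler_run.sbatch itself parses JOBS is not shown in this guide, so this is an assumption about the approach, not the script's actual code.

```shell
# Example JOBS array, as it would appear in euler_run.config.local
JOBS=(
    "test_nuclear_all_eu|centiv_2050"
    "test_nuclear_all_eu|suportRES"
)

for job in "${JOBS[@]}"; do
    database="${job%%|*}"   # everything before the first "|"
    playlist="${job#*|}"    # everything after the first "|"
    echo "database=$database playlist=$playlist (modules_playlists/$playlist.toml)"
done
```

Because the playlist name is used to build a file name, it must match the file in modules_playlists/ exactly, without the .toml extension.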
Understanding EMAIL_ADDRESS:
- Optional: Set your email to receive notifications when jobs complete or fail
- Leave empty (EMAIL_ADDRESS="") to disable email notifications
- Each user can set their own email in their personal euler_run.config.local file
- You'll receive an email for each job when it ends (success) or fails
Step 3: Verify config.toml
Check that your config.toml has the correct settings:
- Database server credentials ([input_database_server], [output_database_server])
- User initials (user_initials)
- create_a_copy setting (determines which database field(s) get patched):
  - true: Creates a copy of the database; original_name will be set from JOBS
  - false: Uses the database directly; both original_name and copy_name will be set from JOBS
- Any other settings except original_name/copy_name and playlist_name (these are set automatically per job)
Common Workflows
Full Calibration Run (Multiple Scenarios in Parallel)
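Run multiple scenarios simultaneously by listing every database/playlist combination in JOBS (see Step 2) and submitting euler_run.sbatch once with sbatch.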
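A minimal sketch of the submission step, assuming the default repository location from this guide and that the script is submitted from the repository root. The guard around sbatch is only there so the snippet fails gracefully when run outside Euler.

```shell
if command -v sbatch >/dev/null 2>&1; then
    # Submit once; the script then creates one SLURM job per JOBS entry
    cd "$HOME/repos/nexus-e-framework" && sbatch euler_run.sbatch
    submit_attempted=yes
else
    echo "sbatch not found: run this on an Euler login node"
    submit_attempted=no
fi
```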
How It Works
Workflow Diagram
┌─────────────────────────────────────────────────────────────┐
│ 1. You run: sbatch euler_run.sbatch                         │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Script reads modules_playlists/euler_run.config.local    │
│    - Loads JOBS array                                       │
│    - Loads REPO_PATH                                        │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Creates experiment tag (timestamp)                       │
│    - Manifest file: ~/Euler_logs/multiRun_YYYYMMDD_HHMMSS   │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Submits one SLURM job per database|playlist pair         │
│    Each job runs independently in parallel                  │
└────────────────────┬────────────────────────────────────────┘
                     │
        ┌────────────┴────────────┬───────────────┐
        ▼                         ▼               ▼
   ┌─────────┐               ┌─────────┐     ┌─────────┐
   │  Job 1  │               │  Job 2  │ ... │  Job N  │
   └────┬────┘               └────┬────┘     └────┬────┘
        │                         │               │
        └────────────┬────────────┴───────────────┘
                     │
                     ▼
┌────────────────────────────────────────────┐
│ Each job independently:                    │
│   a. Creates scratch directory             │
│      /cluster/scratch/$USER/run_$JOBID     │
│   b. Copies repo to scratch                │
│   c. Patches config.toml with DB/playlist  │
│   d. Runs: uv run nexus_e/app.py           │
│   e. Writes results to scratch             │
│   f. Updates manifest file                 │
└────────────────────────────────────────────┘
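Step 3 of the diagram can be sketched as follows. The variable names EXPERIMENT_TAG and MANIFEST are illustrative, not taken from the script; the date format is inferred from the YYYYMMDD_HHMMSS pattern, and the .manifest suffix from the file names used elsewhere in this guide.

```shell
# Timestamp tag shared by all jobs of one submission, e.g. 20250124_143022
EXPERIMENT_TAG="$(date +%Y%m%d_%H%M%S)"

# One manifest per submission, collected under ~/Euler_logs
MANIFEST="$HOME/Euler_logs/multiRun_${EXPERIMENT_TAG}.manifest"

echo "Manifest for this batch: $MANIFEST"
```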
What Happens in Each Job
- Scratch Directory: Creates /cluster/scratch/$USER/run_${SLURM_JOB_ID}/
- Copy Files: Copies from REPO_PATH:
  - Folders: CentIv/, nexus_e/, Shared/, modules_playlists/
  - All top-level files including config.toml
- Patch config.toml: Automatically sets:
  - If create_a_copy = true: [scenario].original_name = "database_name"
  - If create_a_copy = false: both [scenario].original_name AND [scenario].copy_name = "database_name"
  - [modules].playlist_name = "playlist_name"
  - output_name and simulation_folder
- Run Simulation: Executes uv run nexus_e/app.py
- Store Results: Saves to scratch/nexus-e-framework/Results/run_${DATABASE}_${PLAYLIST_NAME}/
- Update Manifest: Records completion status, results path, and webviewer link
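The patching rules above can be sketched as below. This is not the code in euler_run.sbatch: set_toml_value and patch_config are hypothetical helpers, the sketch assumes each key already exists on its own line inside the named section, and it leaves out output_name and simulation_folder because their values are not documented here.

```shell
# Hypothetical helper: set key = "value" inside one [section] of a TOML file.
set_toml_value() {
    local file="$1" section="$2" key="$3" value="$4" tmp
    tmp="$(mktemp)"
    awk -v section="[$section]" -v key="$key" -v value="$value" '
        /^\[/ { in_section = ($0 == section) }          # track the current section header
        in_section && $0 ~ ("^[ \t]*" key "[ \t]*=") {  # the key, inside that section only
            print key " = \"" value "\""
            next
        }
        { print }                                       # copy every other line unchanged
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Hypothetical wrapper mirroring the patch rules listed above.
patch_config() {
    local config="$1" database="$2" playlist="$3"
    set_toml_value "$config" scenario original_name "$database"
    # Assumption: the setting appears as a plain "create_a_copy = false" line
    if grep -Eq '^[[:space:]]*create_a_copy[[:space:]]*=[[:space:]]*false' "$config"; then
        set_toml_value "$config" scenario copy_name "$database"
    fi
    set_toml_value "$config" modules playlist_name "$playlist"
}
```

Calling patch_config on the scratch copy of config.toml with a database and playlist name rewrites only those keys; the copy in your Home repository is left untouched.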
Monitoring Jobs
Email Notifications
If you configured EMAIL_ADDRESS in your euler_run.config.local, you'll receive automatic email notifications:
- When jobs complete successfully: Email includes job ID, runtime, and basic stats
- When jobs fail: Email includes job ID and failure information
The emails are sent by SLURM automatically. Check your manifest file (path shown in the email) for detailed results locations and webviewer links.
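SLURM's standard options for this are --mail-type and --mail-user. The sketch below shows one way an optional EMAIL_ADDRESS could be turned into sbatch arguments; the mail_args array is illustrative, and how euler_run.sbatch actually passes these options is an assumption.

```shell
EMAIL_ADDRESS="your.email@ethz.ch"   # set to "" to disable notifications

mail_args=()
if [ -n "$EMAIL_ADDRESS" ]; then
    # END covers normal completion, FAIL covers failed jobs
    mail_args=(--mail-type=END,FAIL --mail-user="$EMAIL_ADDRESS")
fi

echo "Extra sbatch arguments: ${mail_args[*]}"
```

With an empty EMAIL_ADDRESS the array stays empty, so no mail options are added and SLURM sends nothing.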
Check Job Status
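A minimal sketch using squeue, the standard SLURM command for listing queued and running jobs. The guard only makes the snippet fail gracefully when it is run outside the cluster.

```shell
if command -v squeue >/dev/null 2>&1; then
    # List your own pending and running jobs
    squeue -u "$USER"
    queue_checked=yes
else
    echo "squeue not found: run this on an Euler login node"
    queue_checked=no
fi
```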
Monitor Progress in Real-Time
The manifest file is your central dashboard:
# Find your manifest (created when you submit jobs)
ls -lt ~/Euler_logs/multiRun_*.manifest | head -1
# Watch it update in real-time
watch -n 5 cat ~/Euler_logs/multiRun_20250124_143022.manifest
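To avoid typing the timestamp by hand, the newest manifest can be selected automatically. latest_manifest is a hypothetical helper and assumes manifests live in ~/Euler_logs unless another directory is passed.

```shell
latest_manifest() {
    local log_dir="${1:-$HOME/Euler_logs}"
    # ls -t sorts newest first; errors are silenced when no manifest exists yet
    ls -t "$log_dir"/multiRun_*.manifest 2>/dev/null | head -n 1
}

manifest="$(latest_manifest)"
if [ -n "$manifest" ]; then
    echo "Latest manifest: $manifest"
    # To follow it live: watch -n 5 cat "$manifest"
else
    echo "No manifest found yet"
fi
```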
Check Individual Job Logs
Each job creates three log files: the SLURM output log, the SLURM error log, and the nexus-e application log. The locations of all three are listed in the manifest file.
Understanding the Manifest File
The manifest file (~/Euler_logs/multiRun_TIMESTAMP.manifest) tracks all jobs in your batch.
Example Manifest
# Multi-run manifest for 20250124_143022
[START] Job 12345678 | DB: test_nuclear_all_eu | Playlist: base
App: /cluster/scratch/username/run_12345678/nexus-e-framework/nexus-e.log
Out: /cluster/home/username/Euler_logs/output_12345678.log
Err: /cluster/home/username/Euler_logs/error_12345678.log
Monitor: tail -f <path> or cat <path>
[COMPLETE] Job 12345678 | DB: test_nuclear_all_eu | Playlist: base
Results: /cluster/scratch/username/run_12345678/nexus-e-framework/Results/run_test_nuclear_all_eu_base
Folder: run_test_nuclear_all_eu_base
Webviewer: http://example.com/webviewer/...
[START] Job 12345679 | DB: test_nuclear_all_eu | Playlist: suportRES
App: /cluster/scratch/username/run_12345679/nexus-e-framework/nexus-e.log
Out: /cluster/home/username/Euler_logs/output_12345679.log
Err: /cluster/home/username/Euler_logs/error_12345679.log
[FAILED] Job 12345679 | DB: test_nuclear_all_eu | Playlist: suportRES
Reason: Application error detected in log (exit code was 1)
Check app log: /cluster/scratch/username/run_12345679/nexus-e-framework/nexus-e.log
Check error log: /cluster/home/username/Euler_logs/error_12345679.log
Entry Types
| Entry Type | Meaning |
|---|---|
| [START] | Job started running, includes all log paths for monitoring |
| [COMPLETE] | Job finished successfully, includes results location and webviewer link |
| [FAILED] | Job failed, includes exit code and error log path |
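Because every entry starts with one of these three tags, a batch can be summarized by counting them. manifest_summary is a hypothetical helper that assumes the tags appear at the start of a line, as in the example manifest.

```shell
manifest_summary() {
    local manifest="$1" started completed failed
    # grep -c prints the number of matching lines (0 if none)
    started="$(grep -c '^[[:space:]]*\[START]' "$manifest")"
    completed="$(grep -c '^[[:space:]]*\[COMPLETE]' "$manifest")"
    failed="$(grep -c '^[[:space:]]*\[FAILED]' "$manifest")"
    # Jobs that started but have no COMPLETE or FAILED entry yet are still running
    echo "started=$started completed=$completed failed=$failed running=$((started - completed - failed))"
}
```

Run on the example manifest above, this reports two started jobs, one completed, one failed, and none still running.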
Troubleshooting
Problem: "euler_run.config.local not found"
Symptom: Submission stops with an error reporting that euler_run.config.local was not found.
Solution:
# Create your config file
cd ~/repos/nexus-e-framework
cp modules_playlists/euler_run.config.example modules_playlists/euler_run.config.local
# Edit it with your settings
nano modules_playlists/euler_run.config.local
Problem: Job Fails Immediately
Symptom: Job shows in manifest as [FAILED] shortly after submission
Debug Steps:
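- Check the error log first (the Err: path listed for the job in the manifest)
- Check the application log (the App: path listed for the job in the manifest)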
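The log paths for failed jobs can be pulled straight from the manifest. failed_job_logs is a hypothetical helper that relies on the "Check app log:" and "Check error log:" lines shown in the example manifest.

```shell
failed_job_logs() {
    # Print the log paths that follow each [FAILED] entry in a manifest
    awk '
        /^[ \t]*\[/ { in_failed = ($0 ~ /^[ \t]*\[FAILED]/) }  # a new entry begins
        in_failed && /Check (app|error) log:/ {
            sub(/^.*log:[ \t]*/, "")                            # keep only the path
            print
        }
    ' "$1"
}

# Usage: show the last lines of every log belonging to a failed job
# failed_job_logs ~/Euler_logs/multiRun_20250124_143022.manifest |
#     while read -r log; do echo "== $log"; tail -n 20 "$log"; done
```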
Common Causes:
- Database connection error: Check credentials in config.toml
- Database doesn't exist: Verify the database name in the JOBS array
- Playlist file missing: Check that modules_playlists/playlist_name.toml exists
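The last cause can be caught before submitting. check_playlists is a hypothetical pre-flight helper, not part of euler_run.sbatch; it takes the repository path followed by the JOBS entries.

```shell
check_playlists() {
    local repo="$1"; shift
    local job playlist missing=0
    for job in "$@"; do
        playlist="${job#*|}"   # part after the "|" separator
        if [ ! -f "$repo/modules_playlists/$playlist.toml" ]; then
            echo "Missing playlist for job '$job': modules_playlists/$playlist.toml"
            missing=1
        fi
    done
    return "$missing"          # non-zero if any playlist file is missing
}

# Usage, after sourcing euler_run.config.local:
# check_playlists "$REPO_PATH" "${JOBS[@]}" && sbatch euler_run.sbatch
```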
Problem: "Repo not found at $REPO_PATH"
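Symptom: The job stops with an error reporting that the repository was not found at the configured REPO_PATH.
Solution:
Set REPO_PATH in modules_playlists/euler_run.config.local to the actual location of your repository.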
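A quick way to confirm that a candidate path is usable is to check for the entries the script expects (see Directory Structure). verify_repo_path is a hypothetical helper.

```shell
verify_repo_path() {
    local repo="$1" item status=0
    for item in euler_run.sbatch config.toml modules_playlists; do
        if [ ! -e "$repo/$item" ]; then
            echo "Missing in $repo: $item"
            status=1
        fi
    done
    return "$status"   # 0 only if all expected entries exist
}

# Usage: verify_repo_path "$HOME/repos/nexus-e-framework"
```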
Appendix: Understanding create_a_copy Behavior
The create_a_copy setting in config.toml controls how the script patches database names and how the application uses them:
When create_a_copy = true
- Script behavior: Patches [scenario].original_name with the database from JOBS
- Application behavior:
  - Reads from the original_name database
  - Creates a temporary copy with an auto-generated name (timestamp + initials + random string)
  - Runs the simulation on the copy
  - Postprocessing reads input data from original_name
  - Optionally deletes the copy after the simulation (if delete_copy_after_simulation = true)
- Use case: Safe experimentation that leaves the original database untouched
When create_a_copy = false
- Script behavior: Patches both [scenario].original_name AND [scenario].copy_name with the database from JOBS
- Application behavior:
  - Uses the copy_name database directly (no copying)
  - Runs the simulation on that database
  - Postprocessing reads input data from original_name (same database)
- Use case: Working directly with a specific database, with faster startup (no copy time)
Originally created by Ali Darudi in discussion with Claude.