misc: Phoenix runner management scripts by sbryngelson · Pull Request #1314 · MFlowCode/MFC

sbryngelson · 2026-03-15T21:27:09Z

Summary

Add a toolkit for managing GitHub Actions self-hosted runners on Phoenix login nodes
Replace the outdated misc/starting-phoenix-runners.md (SOCKS5 proxy setup no longer needed)
Scripts handle runner health checks, restart, rebalancing across login nodes, creation, and CI rerun

Scripts

Script	Purpose
`config.sh`	Shared config: node list, cgroup limit, runner discovery helpers
`check-runners.sh`	Quick per-node health check
`list-runners.sh`	Detailed runner table (name, node, status, slurm, RSS, pool)
`restart-runner.sh`	Restart one runner with proper login shell PATH
`restart-all.sh`	Restart all runners in place
`rebalance-runners.sh`	Auto-compute optimal distribution and move runners
`create-runner.sh`	Register and start a new runner
`rerun-failed.sh`	Find and rerun failed GHA workflows on open PRs

Test plan

check-runners.sh tested on live Phoenix nodes
list-runners.sh tested on live Phoenix nodes
rebalance-runners.sh dry run tested (correctly identifies moves needed)
Runner rebalancing executed successfully (3/4/4 distribution achieved)

🤖 Generated with Claude Code

Add a toolkit for managing GitHub Actions self-hosted runners on Phoenix login nodes: check-runners.sh — quick per-node health check list-runners.sh — detailed runner table (name, node, status, slurm, RSS) restart-runner.sh — restart one runner with proper login shell PATH restart-all.sh — restart all runners in place rebalance-runners.sh — auto-compute optimal distribution and move runners create-runner.sh — register and start a new runner rerun-failed.sh — find and rerun failed GHA workflows on open PRs config.sh — shared config (nodes, cgroup limit, discovery helpers) Remove the outdated starting-phoenix-runners.md which referenced the old SOCKS5 proxy setup that is no longer needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-15T21:29:02Z

Claude Code Review

Incremental review from: 91ef65e8fe723d63818d26aa664aa752b926782e
Head SHA: 84d6ec5d58023f7ac414e4e18890b0ce09de3d22

New findings since last Claude review (frontier/ scripts are new in this push):

1. Tarball download race condition — make-runner.sh + deploy-runners.sh

deploy-runners.sh launches multiple make-runner.sh instances in parallel via &. Each independently checks:

if [ ! -f "${SHARED_DIR}/${TARBALL}" ]; then
    curl -fsSL ... -o "${SHARED_DIR}/${TARBALL}"
fi

If N runners are deployed simultaneously and the tarball doesn't yet exist, all N pass the -f check at nearly the same moment and concurrently write to the same -o path. curl does not write atomically (it streams directly to the target file), so the result will be a corrupted tarball. The subsequent tar xzf will fail on every instance except possibly the last writer.

Fix: download (or verify) the tarball once before entering the parallel loop in deploy-runners.sh, or use flock in make-runner.sh:

# In deploy-runners.sh, before the for loop:
bash "${SHARED_DIR}/make-runner.sh" --download-only "${NUM}" 2>/dev/null || true

Or simpler — download once in deploy-runners.sh itself before spawning workers.

2. RUNNER_GROUP="phoenix" in make-runner.sh (likely copy-paste error)

misc/frontier/make-runner.sh line 20 sets RUNNER_GROUP="phoenix" and passes it as --runnergroup phoenix during runner configuration. Frontier runners being registered under the phoenix runner group may be intentional (if the same GitHub org group covers both clusters), but it looks like a copy-paste from the Phoenix scripts. Verify this is the correct org-level group for Frontier runners.

Previously reported, still open:

Word-splitting on for dir in $(find_runner_dirs) in Phoenix scripts (list-runners.sh, restart-all.sh, rebalance-runners.sh).
Arithmetic expansion on "?" in check-runners.sh and list-runners.sh per-node summary under set -euo pipefail.

… config Simplify scripts to be Phoenix-aware: org, runner group, label, nodes, and parent dirs are all in config.sh rather than passed as arguments. Add GitHub API helpers (gh_list_runners, gh_registration_token) so list-runners.sh shows both API status and process state. create-runner.sh now only needs a name and node. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Scripts for deploying, restarting, and stopping self-hosted runners on OLCF Frontier login nodes, modeled after the existing Phoenix runner scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ed discovery - Add misc/common/rerun-failed.sh (site-agnostic, moved from phoenix/) - Replace misc/phoenix/rerun-failed.sh with a thin wrapper to common/ - Add misc/frontier/config.sh modeled after phoenix/config.sh: - Frontier constants, SSH_OPTS with BatchMode for MOTD suppression - find_pids() filters stdout through grep -E '^[0-9]+$' to strip MOTD noise - start_runner() uses timeout+setsid+nohup (no login shell needed) - stop_runner(), find_node(), gh_list_runners(), gh_registration_token() - Rewrite make-runner.sh: sources config.sh, no runner.pid, uses start_runner() - Rewrite deploy-runners.sh: sources config.sh, calls $SCRIPT_DIR/make-runner.sh - Rewrite restart-offline-runners.sh: GitHub API query, CWD-based discovery, stop stale processes before restart, parallel restarts, final status report - Rewrite stop-runner.sh: find_node() + stop_runner() + gh_remove_runner() - Add check-runners.sh: SSH per-node health check (name/status/RSS) - Update README.md: new scripts table, no runner.pid references, updated troubleshooting guide Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e-runner - list-runners.sh: new script using parallel SSH sweep across all 11 nodes simultaneously instead of serial per-runner discovery; flags stale runner.node entries with a warning in the node column - move-runner.sh: new script to relocate a runner between login nodes with one retry on start failure - restart-offline-runners.sh: add retry logic (sleep 5 + second attempt) and runner.node self-healing (detects and corrects stale node entries when a runner is found on a different node than recorded) - make-runner.sh: replace hardcoded RUNNER_VERSION with dynamic GitHub API lookup falling back to pinned version; print selected version at startup - misc/frontier/README.md: document new scripts, update quick reference and troubleshooting sections, note runner.node self-healing behavior - misc/common/README.md: new file documenting site-agnostic shared scripts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…y logic - config.sh: add sync_runner_nodes() — parallel sweep of all login nodes that updates runner.node files before any action, ensuring accuracy even when runners are manually restarted on different nodes - make-runner.sh: use gh_latest_runner_version() with pinned fallback - restart-offline-runners.sh: call sync_runner_nodes first; self-heal stale runner.node on restart; retry start_runner once on failure - check-runners.sh: call sync_runner_nodes first - list-runners.sh: new — parallel SSH sweep + GitHub API combined view - move-runner.sh: new — move a runner to a different login node - misc/common/README.md: new — document shared scripts - misc/frontier/README.md: update for new scripts and sync behavior Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-03-15T22:34:13Z

Claude Code Review

Incremental review from: a557e7e7c020b513c2673f8dd479d77c709f0de5
Head SHA: 536f43374c8101296194ba1b18842f4fdeff2391

Previously reported issues — now fixed in this push:

✅ gh run view unguarded exit in rerun-failed.sh
✅ rerun_count dead variable removed
✅ Tarball download race condition fixed (atomic tmp+mv in both deploy-runners.sh and make-runner.sh)
✅ Word-splitting on find_runner_dirs fixed (frontier list-runners.sh; phoenix scripts also use < <(...))
✅ Arithmetic on "?" guard added to phoenix/check-runners.sh and phoenix/list-runners.sh

New finding:

misc/phoenix/config.sh — SSH_OPTS is missing BatchMode=yes and StrictHostKeyChecking=no

This push introduces SSH_OPTS="-o ConnectTimeout=5" in phoenix/config.sh, but the frontier equivalent is much more complete:

# Frontier (misc/frontier/config.sh)
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

# Phoenix (misc/phoenix/config.sh)  ← newly added in this push
SSH_OPTS="-o ConnectTimeout=5"

The missing options matter for unattended scripted use:

BatchMode=yes — prevents SSH from prompting for a password or host-key confirmation. Without it, any SSH call where key auth unexpectedly fails (e.g., agent not running, expired key) hangs the script indefinitely rather than failing fast. This is the most important one.
StrictHostKeyChecking=no — without it, first-connection to a login node whose key isn't in known_hosts blocks on an interactive prompt.
ServerAliveInterval/ServerAliveCountMax — without keepalives, a stalled SSH session to a busy login node can hang a find_pids or stop_runner call for minutes.

Suggested fix — align with frontier's opts:

SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

Still open from previous review:

RUNNER_GROUP="phoenix" in misc/frontier/config.sh line 12 — verify this is the correct org-level group for Frontier runners (looks like a copy-paste from the Phoenix scripts).

start_runner() was launching run.sh via absolute path without cd-ing into the runner directory first. The Runner.Listener process inherited the SSH login shell CWD (user home dir), so find_pids() CWD matching always failed for runners started this way. Fix: cd into $dir before running ./run.sh so the process CWD matches what find_pids() expects. Relative log path follows automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- frontier/deploy-runners.sh: pre-download tarball once before spawning parallel make-runner.sh instances; use atomic tmp+mv to prevent concurrent curl writes corrupting the tarball - frontier/make-runner.sh: same atomic tmp+mv for solo-invocation safety - common/rerun-failed.sh: guard gh run view with || continue so an expired/deleted run skips rather than exits the whole script; switch pipe loops to process substitution so continue works correctly and remove the dead rerun_count variable (incremented in a subshell, never read) - frontier/list-runners.sh, phoenix/list-runners.sh, restart-all.sh, rebalance-runners.sh: replace for dir in $(find_runner_dirs) with while IFS= read -r dir; do ... done < <(find_runner_dirs) to eliminate word-splitting and glob expansion on runner paths - phoenix/check-runners.sh, phoenix/list-runners.sh: guard rss value before arithmetic expansion — if SSH fails and rss="?", the expression $(( CGROUP_LIMIT - rss )) is a syntax error that exits under set -euo pipefail; default to 0 instead Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Move functions identical across both sites into a common library: gh_registration_token, gh_latest_runner_version, gh_remove_runner, get_runner_name, find_pids, find_node, start_runner, stop_runner Both config.sh files now source runner-lib.sh after defining their site constants (ORG, NODES, SSH_OPTS), keeping only site-specific logic locally: - frontier/config.sh: gh_list_runners, find_runner_dirs, sync_runner_nodes - phoenix/config.sh: gh_list_runners, find_runner_dirs, has_slurm Harmonize start_runner() across both sites: - Use bash -lc on both (was frontier-only) for login shell PATH - Use timeout 15 + synchronous SSH (was phoenix's background SSH + poll loop) - cd into runner dir before run.sh so CWD-based discovery works - Standardize log file to runner.log (was runner-nohup.log on phoenix) - Use $SSH_OPTS variable (added to phoenix config) throughout all phoenix scripts instead of hardcoded -o ConnectTimeout=5 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

/proc/$p/exe is intrinsic to the binary and cannot change after exec, making it more reliable than CWD which depends on how the process was launched. Also correctly excludes Runner.Worker processes that share the same directory, since their exe is bin/Runner.Worker not bin/Runner.Listener. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Mirrors phoenix/rebalance-runners.sh with frontier-specific changes: - calls sync_runner_nodes first to correct any stale runner.node files - updates runner.node after each successful move - no has_slurm check (not applicable on frontier) With 22 runners across 11 nodes the target is 2 per node. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-03-16T00:00:27Z

Claude Code Review

Incremental review from: 958cce7d0f00e8a0cb98b1d0a1cb6f1c6a9f6986
Head SHA: 0992785e458c20200d7301bee4ece71eed8ab835

New findings since last Claude review:

1. misc/phoenix/create-runner.sh — ls -d exits non-zero on first-ever runner creation

set -euo pipefail is active. The line that determines the next runner directory number:

existing=$(ls -d "$parent_dir"/actions-runner-* 2>/dev/null | sed '...' | sort -n | tail -1)

When $parent_dir contains no actions-runner-* directories yet (first runner being created), bash expands the glob to the literal pattern string and passes it to ls. ls finds no such path, prints an error to stderr (suppressed by 2>/dev/null), and exits non-zero. Under set -e + pipefail, this immediately aborts the script before any runner is created.

Minimal fix — append || true to tolerate zero-match glob:

existing=$(ls -d "$parent_dir"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1 || true)

Previously reported, still open:

start_runner no-retry in misc/common/rebalance-runners.sh — Previously flagged for misc/frontier/rebalance-runners.sh; the code is now in the shared misc/common/rebalance-runners.sh, so the same missing-retry risk now also affects misc/phoenix/rebalance-runners.sh.
SSH_OPTS in misc/phoenix/config.sh — Still "-o ConnectTimeout=5" only; missing BatchMode=yes, StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Without BatchMode=yes in particular, a password-prompt or host-key-confirmation request will hang unattended scripts indefinitely.
RUNNER_GROUP="phoenix" in misc/frontier/config.sh line 12 — Verify this is the correct org-level runner group for Frontier runners.

Extract shared logic from frontier/phoenix into common scripts: - runner-lib.sh: add gh_list_runners(), has_slurm() (portable, grep PATH for slurm keyword), sweep_all_nodes() (exe-based, parallel SSH), CGROUP_LIMIT default - check-runners.sh: new common script (exe-based discovery, slurm column, conditional cgroup footer); both sites now show slurm status - list-runners.sh: new common script (parallel sweep, slurm column, stale runner.node detection, conditional cgroup footer) - rebalance-runners.sh: new common script (optional sync_runner_nodes hook, writes runner.node after start, checks slurm after start) All site scripts (frontier/, phoenix/) reduced to thin wrappers that source config.sh then the common implementation. phoenix/create-runner.sh fixed to write runner.node after successful start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

restart-offline-runners.sh is superseded by rebalance-runners.sh, which handles offline runners as part of its distribution algorithm and also calls sync_runner_nodes to self-heal stale runner.node files. Update frontier/README.md: remove restart-offline-runners.sh references, replace CWD-based discovery language with EXE-based, add rebalance-runners.sh. Update phoenix/README.md: fix config.sh description (functions now in runner-lib.sh), fix list-runners.sh description (now parallel sweep), fix slurm PATH wording (portable grep, not /opt/slurm). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stop-runner, move-runner, restart-runner, and restart-all were each only available on one cluster. Extract implementations to common/ and add thin wrappers for both sites so all operational scripts work on both Frontier and Phoenix. Improvements over the originals: - stop-runner: searches find_runner_dirs() by name instead of hardcoded SHARED_DIR path; fixes runner_id lookup to use process substitution - move-runner: optional sync_runner_nodes hook (runs on Frontier, no-op on Phoenix); searches find_runner_dirs() by name - restart-runner: writes runner.node after successful start (was missing) - restart-all: adds optional sync_runner_nodes hook; writes runner.node after each successful start (was missing) Update common/README.md and both site READMEs to reflect full script inventory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-03-16T00:37:37Z

Claude Code Review

Incremental review from: e5934bc598434f4304fd745bfdd2218f4345d5aa
Head SHA: eb9f98a6609a68892553bbdf5a9a30d57c0dd41a

New findings since last Claude review (this push: refactored to misc/runner.sh dispatcher, removed site-specific thin wrappers, hardened stop_runner/start_runner/get_runner_name):

1. misc/common/runner.sh — SITE_SCRIPT_DIR is set but never used (dead variable)

runner.sh assigns:

SITE_SCRIPT_DIR="$MISC_DIR/$SITE"
source "$MISC_DIR/$SITE/config.sh"
source "$MISC_DIR/common/$CMD.sh" "$@"

No common script references SITE_SCRIPT_DIR. Previously, the common scripts were invoked via the old site wrapper pattern that did set this variable for use with bash "${SITE_SCRIPT_DIR}/check-runners.sh", but rebalance-runners.sh was updated this push to use source "$(dirname "${BASH_SOURCE[0]}")/check-runners.sh" instead. SITE_SCRIPT_DIR is now dead code. Low priority, but could be removed or documented if a future common script needs a site-specific callback.

Previously reported, still open:

misc/phoenix/create-runner.sh line 33 — ls -d exits non-zero under set -e + pipefail on first runner creation. When $parent_dir contains no actions-runner-* directories, bash passes the literal glob pattern to ls, which exits 1. Under set -euo pipefail, the assignment existing=$(ls -d ... | sed ... | sort ... | tail -1) triggers set -e because the pipeline exit code (from ls via pipefail) is 1. The script aborts before creating any runner. Fix: append || true to the pipeline.
misc/phoenix/config.sh — SSH_OPTS is still "-o ConnectTimeout=5" only. Missing BatchMode=yes (prevents hanging on password/host-key prompts in unattended scripts), StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Frontier's config.sh already has the complete set. BatchMode=yes is the most critical missing option for non-interactive use.

- runner-lib.sh: use sys.argv[1] in get_runner_name (prevent path injection); verify process dead after SIGKILL in stop_runner; warn on sweep_all_nodes SSH failure; log SSH failure in start_runner - rebalance-runners.sh: fix busy-runner grep with trailing slash; handle stop_runner failure before attempting move - restart-all.sh: replace space-delimited restart_list with three separate arrays (space-safe); fix busy grep trailing slash; handle stop_runner failure - restart-runner.sh: exit 1 on start failure (was silently exiting 0) - stop-runner.sh: handle gh_list_runners API failure with warning - rerun-failed.sh: guard empty failed_jobs to avoid blank bullet; add per-job display and expired-run guard to master branch block - deploy-runners.sh: track background PIDs; detect and report failures - create-runner.sh: use #!/usr/bin/env bash; atomic curl download - frontier/config.sh: document RUNNER_GROUP="phoenix" is intentional - phoenix/config.sh: use #!/usr/bin/env bash Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All shared commands (check-runners, list-runners, move-runner, rebalance-runners, restart-all, restart-runner, stop-runner, rerun-failed) previously existed as near-identical thin wrappers in both frontier/ and phoenix/. Replace them all with a single dispatcher: bash misc/runner.sh <site> <command> [args...] Site directories now contain only truly site-specific files: - frontier/: config.sh, make-runner.sh, deploy-runners.sh - phoenix/: config.sh, create-runner.sh Also fix rebalance-runners.sh to source common/check-runners.sh directly (was calling the now-deleted site wrapper). Update all READMEs to document the new dispatcher invocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Consolidates all runner management scripts under misc/runners/ to separate them from unrelated misc/ files. Structure: misc/runners/runner.sh dispatcher misc/runners/common/ shared scripts and library misc/runners/frontier/ Frontier config + make/deploy scripts misc/runners/phoenix/ Phoenix config + create script All paths are computed from ${BASH_SOURCE[0]} so no internal path changes are needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

frontier/make-runner.sh and phoenix/create-runner.sh were the same concept with different names and minor behavioral differences. Unified into common/create-runner.sh with site-specific behavior driven by two config.sh hooks: TARBALL_CACHE_DIR Frontier: "$SHARED_DIR" — tarball cached on shared Lustre, reused across parallel deployments (prevents download races). Phoenix: "" — each runner downloads independently. runner_install_dir <name> [override-dir] Frontier: "$SHARED_DIR/<name>" Phoenix: auto-numbered "$RUNNER_PARENT_DIRS[0]/actions-runner-N" frontier/make-runner.sh is now a 3-line wrapper that derives the name "frontier-<N>" from a number argument and delegates to create-runner. deploy-runners.sh updated to reference $TARBALL_CACHE_DIR instead of $SHARED_DIR directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All new runner installs now go to RUNNER_PARENT_DIRS[1]: /storage/project/r-sbryngelson3-0/sbryngelson3/mfc-runners-2 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-03-16T01:28:33Z

Claude Code Review

Incremental review from: eb9f98a6609a68892553bbdf5a9a30d57c0dd41a
Head SHA: e0ad7714228f3fc224905b4475329814c747dd3c

New findings since last Claude review:

1. misc/runners/common/create-runner.sh — orphaned $RUNNER_DIR on download or extraction failure

When TARBALL_CACHE_DIR is unset (the Phoenix path), the script creates the runner directory before the download:

mkdir -p "$RUNNER_DIR"
tarball_path="$RUNNER_DIR/runner-download.tmp.$$"
curl -fsSL "$TARBALL_URL" -o "$tarball_path"

If curl fails (network error, 404, disk full), set -euo pipefail aborts immediately, leaving an empty $RUNNER_DIR on disk. The script has no trap to clean up. On the next invocation, the [ -d "$RUNNER_DIR" ] guard at the top fires:

if [ -d "$RUNNER_DIR" ]; then
    echo "ERROR: Directory already exists: $RUNNER_DIR" >&2
    exit 1
fi

The user must then manually remove the orphaned directory before retrying. The same issue applies if extraction succeeds but config.sh (runner registration) fails — $RUNNER_DIR is left populated but unregistered.

Suggested fix — add a cleanup trap immediately after the directory is created:

mkdir -p "$RUNNER_DIR"
trap 'rm -rf "$RUNNER_DIR"' ERR
tarball_path="$RUNNER_DIR/runner-download.tmp.$$"
curl -fsSL "$TARBALL_URL" -o "$tarball_path"

Clear the trap once start succeeds:

if start_runner "$TARGET_NODE" "$RUNNER_DIR"; then
    trap - ERR   # runner is up; don't clean up on subsequent errors
    ...

Still open from previous reviews:

misc/runners/phoenix/config.sh runner_install_dir() — ls -d exits non-zero on first runner creation (previously filed against misc/phoenix/create-runner.sh; the same code is now in phoenix/config.sh line ~31). Bash expands the glob to the literal pattern when no directories match, ls exits 1, and the pipefail-active subshell in RUNNER_DIR=$(runner_install_dir ...) aborts create-runner.sh. Fix: append || true to the pipeline.
misc/runners/phoenix/config.sh SSH_OPTS still "-o ConnectTimeout=5" only — missing BatchMode=yes (hangs on password/host-key prompts in unattended use), StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Frontier's config.sh already has the complete set.
misc/runners/runner.sh line 45 — SITE_SCRIPT_DIR is set but never read — dead variable since rebalance-runners.sh was updated to use source "$(dirname "${BASH_SOURCE[0]}")/check-runners.sh" internally. Low priority.

github-actions · 2026-03-16T01:29:28Z

Claude Code Review

Head SHA: e0ad7714228f3fc224905b4475329814c747dd3c

Files Changed (19)

misc/runners/runner.sh — top-level dispatcher
misc/runners/common/runner-lib.sh — shared library (GitHub API, process mgmt, SSH sweep)
misc/runners/common/check-runners.sh
misc/runners/common/list-runners.sh
misc/runners/common/create-runner.sh
misc/runners/common/move-runner.sh
misc/runners/common/rebalance-runners.sh
misc/runners/common/restart-all.sh
misc/runners/common/restart-runner.sh
misc/runners/common/stop-runner.sh
misc/runners/common/rerun-failed.sh
misc/runners/phoenix/config.sh
misc/runners/phoenix/README.md
misc/runners/frontier/config.sh
misc/runners/frontier/deploy-runners.sh
misc/runners/frontier/make-runner.sh
misc/runners/frontier/README.md
misc/runners/common/README.md
misc/starting-phoenix-runners.md (deleted)

Summary

This PR replaces the outdated misc/starting-phoenix-runners.md with a well-structured, fully-operational runner management toolkit covering Frontier and Phoenix HPC clusters. The overall architecture — a common library, site configs, and a single dispatcher — is clean.
All previously reported issues (tarball download race condition, word-splitting on find_runner_dirs, arithmetic-expansion guard for "?" strings, gh run view unguarded exit in rerun-failed.sh, rerun_count dead variable) appear resolved in the final head.
Three findings remain from prior review rounds; this is the full-review pass to confirm current state:
1. SSH_OPTS in phoenix/config.sh is still minimal — critical for unattended scripted use.
2. runner_install_dir() in phoenix/config.sh exits non-zero under set -e on first runner creation.
3. SITE_SCRIPT_DIR in runner.sh is set but never referenced — dead variable.

Findings

1. [Bug] phoenix/config.sh — runner_install_dir() aborts on first runner creation

File: misc/runners/phoenix/config.sh, lines 33–37

existing=

With set -euo pipefail active (from create-runner.sh), when no actions-runner-* directories exist yet (first-ever runner creation), the unquoted glob is expanded by bash to the literal string /path/actions-runner-*, which ls does not find. ls exits 1. Under pipefail, the pipeline assignment exits non-zero and the script aborts before the runner is created.

Fix: append || true to tolerate the zero-match case:

existing=$(ls -d "$parent"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1 || true)

2. [Correctness] phoenix/config.sh — SSH_OPTS is incomplete for unattended use

File: misc/runners/phoenix/config.sh, line 15

Current: SSH_OPTS="-o ConnectTimeout=5"

Frontier's equivalent (misc/runners/frontier/config.sh, line 14) is:
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

The missing options matter for unattended non-interactive scripts:

BatchMode=yes — prevents SSH from hanging indefinitely when key auth unexpectedly fails (e.g., expired key, agent not forwarded). Without it, any SSH call in find_pids, sweep_all_nodes, stop_runner, or start_runner can block the script forever waiting for an interactive prompt. This is the most critical missing option.
StrictHostKeyChecking=no — first connection to a login node whose host key is not yet in known_hosts blocks on an interactive prompt.
ServerAliveInterval/ServerAliveCountMax — without keepalives, stalled SSH sessions to busy login nodes can hang find_pids or stop_runner calls for minutes.

Suggested fix — align with Frontier:

SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

3. [Minor] runner.sh — SITE_SCRIPT_DIR is set but never used

File: misc/runners/runner.sh, line 40

SITE_SCRIPT_DIR="$MISC_DIR/$SITE"
source "$MISC_DIR/$SITE/config.sh"
source "$MISC_DIR/common/$CMD.sh" "$@"

SITE_SCRIPT_DIR is exported into the environment that the sourced common scripts run in, but no common script reads it. It was carried over from an earlier dispatch pattern. No functional impact, but it is dead code. Can be removed unless a future common script will need a site-specific callback path.

Non-findings (resolved from prior review rounds)

Tarball download race condition: fixed via tmp + mv in both deploy-runners.sh and create-runner.sh.
Word-splitting on find_runner_dirs: fixed — all callers now use while IFS= read -r dir; do ... done < <(find_runner_dirs).
"?" arithmetic expansion under set -e: fixed — guarded with [[ "$rss" =~ ^[0-9]+$ ]] || rss=0 in both check-runners.sh and list-runners.sh.
gh run view unguarded exit in rerun-failed.sh: fixed — guarded with 2>/dev/null) || { ... continue; }.
RUNNER_GROUP="phoenix" in frontier/config.sh: now documented inline as intentional (# Both sites share one GitHub runner group named "phoenix").

coderabbitai · 2026-03-16T01:29:39Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b5252d72-7c8f-40bd-b962-97541cda1b62

📥 Commits

Reviewing files that changed from the base of the PR and between 007b6a9 and e0ad771.

📒 Files selected for processing (19)

misc/runners/common/README.md
misc/runners/common/check-runners.sh
misc/runners/common/create-runner.sh
misc/runners/common/list-runners.sh
misc/runners/common/move-runner.sh
misc/runners/common/rebalance-runners.sh
misc/runners/common/rerun-failed.sh
misc/runners/common/restart-all.sh
misc/runners/common/restart-runner.sh
misc/runners/common/runner-lib.sh
misc/runners/common/stop-runner.sh
misc/runners/frontier/README.md
misc/runners/frontier/config.sh
misc/runners/frontier/deploy-runners.sh
misc/runners/frontier/make-runner.sh
misc/runners/phoenix/README.md
misc/runners/phoenix/config.sh
misc/runners/runner.sh
misc/starting-phoenix-runners.md

📝 Walkthrough

Walkthrough

Introduces a comprehensive GitHub Actions runner management infrastructure for Frontier and Phoenix systems. Adds a shared library (runner-lib.sh) with common GitHub API and process management functions. Implements site-agnostic utility scripts (check-runners.sh, create-runner.sh, list-runners.sh, move-runner.sh, rebalance-runners.sh, restart-runner.sh, restart-all.sh, stop-runner.sh, rerun-failed.sh) for runner operations. Adds site-specific configurations for Frontier and Phoenix with node discovery and installation directory logic. Includes a dispatcher script (runner.sh) that routes commands to site or common implementations. Provides comprehensive documentation for both sites and removes obsolete Phoenix startup documentation.

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Adds a new misc/runners/ toolkit to manage GitHub Actions self-hosted runners on Phoenix and Frontier (health checks, listing, restarting, moving, rebalancing, creation, and rerunning failed CI), and removes the obsolete Phoenix runner startup doc (SOCKS5 proxy no longer needed).

Changes:

Introduce a site-dispatched runner management CLI (misc/runners/runner.sh) with shared library code under misc/runners/common/.
Add site configs/docs for Phoenix and Frontier plus Frontier convenience scripts for provisioning.
Remove misc/starting-phoenix-runners.md (outdated proxy-based procedure).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
misc/starting-phoenix-runners.md	Removes outdated Phoenix runner startup documentation.
misc/runners/runner.sh	Dispatcher for site/common runner management commands.
misc/runners/common/runner-lib.sh	Shared GitHub API + SSH process discovery + start/stop primitives.
misc/runners/common/check-runners.sh	Per-node runner health check.
misc/runners/common/list-runners.sh	Combined GitHub API + live process table.
misc/runners/common/rebalance-runners.sh	Rebalance algorithm for distributing runners across login nodes.
misc/runners/common/restart-runner.sh	Restart a single runner and verify SLURM is in PATH.
misc/runners/common/restart-all.sh	Restart all runners (dry-run by default).
misc/runners/common/move-runner.sh	Move a runner between login nodes by name.
misc/runners/common/stop-runner.sh	Stop and deregister a runner.
misc/runners/common/create-runner.sh	Download/register/start a new runner.
misc/runners/common/rerun-failed.sh	Rerun failed workflows on open PRs and master.
misc/runners/common/README.md	Documentation for shared commands and library.
misc/runners/phoenix/config.sh	Phoenix-specific constants and runner directory discovery.
misc/runners/phoenix/README.md	Phoenix operator documentation and usage examples.
misc/runners/frontier/config.sh	Frontier-specific constants, shared dir, and node sync helper.
misc/runners/frontier/make-runner.sh	Frontier convenience wrapper for numbered runner creation.
misc/runners/frontier/deploy-runners.sh	Parallel runner deployment across multiple login nodes.
misc/runners/frontier/README.md	Frontier operator documentation and usage examples.

misc/runners/phoenix/config.sh

+# Return the directory where a named runner should be installed.
+# Auto-increments the actions-runner-N suffix within RUNNER_PARENT_DIRS[0].
+# Args: $1 = runner name (unused; directory is numbered, not named), $2 = optional override dir
+runner_install_dir() {
+    local override="${2:-}"
+    [ -n "$override" ] && echo "$override" && return
+    local parent="${RUNNER_PARENT_DIRS[1]}"
+    local existing next_num
+    existing=$(ls -d "$parent"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1)
+    next_num=$(( ${existing:-0} + 1 ))
+    echo "$parent/actions-runner-$next_num"


misc/runners/runner.sh

+echo "ERROR: Unknown command '$CMD' for site '$SITE'." >&2
+echo "Common:   check-runners list-runners move-runner rebalance-runners restart-all restart-runner stop-runner rerun-failed" >&2
+echo "Frontier: make-runner deploy-runners" >&2
+echo "Phoenix:  create-runner" >&2


misc/runners/common/runner-lib.sh

+        ssh $SSH_OPTS "$node" '
+            for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
+                exe=$(readlink -f /proc/$p/exe 2>/dev/null || true)
+                [ -z "$exe" ] && continue
+                dir=$(dirname "$(dirname "$exe")")
+                rss=$(ps -p $p -o rss= 2>/dev/null | awk "{printf \"%.0f\", \$1/1024}" || echo 0)
+                slurm=$(tr "\0" "\n" < /proc/$p/environ 2>/dev/null | grep -c "^PATH=.*slurm" || echo 0)
+                [ "$slurm" -gt 0 ] && slurm_ok="ok" || slurm_ok="MISSING"
+                echo "RUNNER '"$node"' $dir $rss $slurm_ok"
+            done
+        ' 2>/dev/null > "$tmpdir/$node.out" &


misc/runners/common/runner-lib.sh

+start_runner() {
+    local node="$1" dir="$2"
+    timeout 15 ssh $SSH_OPTS "$node" \
+        "cd $dir && setsid bash -lc 'nohup ./run.sh >> runner.log 2>&1 < /dev/null &'" \


misc/runners/common/list-runners.sh

+    echo "=== Per-node memory ==="
+    for node in "${NODES[@]}"; do
+        count=$(ssh $SSH_OPTS "$node" \
+            "ps aux | grep Runner.Listener | grep -v grep | wc -l" 2>/dev/null || echo 0)


misc/runners/common/check-runners.sh

+        for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
+            found=1
+            exe=$(readlink -f /proc/$p/exe 2>/dev/null || echo "???")
+            dir=$(dirname "$(dirname "$exe")" 2>/dev/null || echo "???")
+            name=$(basename "$dir")
+            worker=$(ps aux | grep "Runner.Worker" | grep "$dir" | grep -v grep | awk "{print \$2}" | head -1)


misc/runners/common/check-runners.sh

+            exe=$(readlink -f /proc/$p/exe 2>/dev/null || echo "???")
+            dir=$(dirname "$(dirname "$exe")" 2>/dev/null || echo "???")
+            name=$(basename "$dir")
+            worker=$(ps aux | grep "Runner.Worker" | grep "$dir" | grep -v grep | awk "{print \$2}" | head -1)
+            [ -n "$worker" ] && status="BUSY" || status="idle"
+            rss=$(ps -p $p -o rss= 2>/dev/null | awk "{printf \"%.0f\", \$1/1024}" || echo "?")
+            slurm=$(tr "\0" "\n" < /proc/$p/environ 2>/dev/null | grep -c "^PATH=.*slurm" || echo 0)
+            [ "$slurm" -gt 0 ] && slurm_ok="ok" || slurm_ok="MISSING"
+            printf "  %-30s %5s  slurm=%-7s  %s MB\n" "$name" "$status" "$slurm_ok" "$rss"


misc/runners/runner.sh

+# Sites:    frontier  phoenix
+# Common:   check-runners  list-runners  move-runner  rebalance-runners
+#           restart-all  restart-runner  stop-runner  rerun-failed
+# Frontier: make-runner  deploy-runners
+# Phoenix:  create-runner
+#


misc/runners/common/runner-lib.sh

+    ssh $SSH_OPTS "$1" '
+        for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
+            exe=$(readlink -f /proc/$p/exe 2>/dev/null || true)
+            [ "$exe" = "'"$2"'/bin/Runner.Listener" ] && echo "$p"
+        done
+    ' 2>/dev/null | grep -E '^[0-9]+$' | tr '\n' ' ' || true


sbryngelson and others added 2 commits March 15, 2026 17:24

misc: add README for Phoenix runner management scripts

13fad55

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sbryngelson and others added 5 commits March 15, 2026 17:45

Add Frontier GitHub Actions runner management scripts

84d6ec5

Scripts for deploying, restarting, and stopping self-hosted runners on OLCF Frontier login nodes, modeled after the existing Phoenix runner scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spencer Bryngelson and others added 5 commits March 15, 2026 18:44

Spencer Bryngelson and others added 3 commits March 15, 2026 20:17

Spencer Bryngelson and others added 5 commits March 15, 2026 20:50

Direct new Phoenix runners to mfc-runners-2

e0ad771

All new runner installs now go to RUNNER_PARENT_DIRS[1]: /storage/project/r-sbryngelson3-0/sbryngelson3/mfc-runners-2 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sbryngelson marked this pull request as ready for review March 16, 2026 01:27

Copilot AI review requested due to automatic review settings March 16, 2026 01:27

sbryngelson merged commit 7ea6b5c into MFlowCode:master Mar 16, 2026
32 checks passed

sbryngelson deleted the phoenix-runner-scripts branch March 16, 2026 01:27

Copilot started reviewing on behalf of sbryngelson March 16, 2026 01:27 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

misc: Phoenix runner management scripts#1314

misc: Phoenix runner management scripts#1314
sbryngelson merged 20 commits intoMFlowCode:masterfrom
sbryngelson:phoenix-runner-scripts

sbryngelson commented Mar 15, 2026

Uh oh!

github-actions bot commented Mar 15, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 15, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

github-actions bot commented Mar 16, 2026

Uh oh!

github-actions bot commented Mar 16, 2026

Uh oh!

coderabbitai bot commented Mar 16, 2026

Review failed

Walkthrough

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

Conversation

sbryngelson commented Mar 15, 2026

Summary

Scripts

Test plan

Uh oh!

github-actions bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Uh oh!

github-actions bot commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Uh oh!

github-actions bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Uh oh!

github-actions bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review

Uh oh!

Uh oh!

github-actions bot commented Mar 16, 2026

Claude Code Review

Uh oh!

github-actions bot commented Mar 16, 2026

Claude Code Review

Files Changed (19)

Summary

Findings

Non-findings (resolved from prior review rounds)

Uh oh!

coderabbitai bot commented Mar 16, 2026

Review failed

Walkthrough

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Mar 15, 2026 •

edited

Loading

github-actions bot commented Mar 15, 2026 •

edited

Loading

github-actions bot commented Mar 16, 2026 •

edited

Loading

github-actions bot commented Mar 16, 2026 •

edited

Loading