Skip to content

misc: Phoenix runner management scripts#1314

Merged
sbryngelson merged 20 commits intoMFlowCode:masterfrom
sbryngelson:phoenix-runner-scripts
Mar 16, 2026
Merged

misc: Phoenix runner management scripts#1314
sbryngelson merged 20 commits intoMFlowCode:masterfrom
sbryngelson:phoenix-runner-scripts

Conversation

@sbryngelson
Copy link
Member

Summary

  • Add a toolkit for managing GitHub Actions self-hosted runners on Phoenix login nodes
  • Replace the outdated misc/starting-phoenix-runners.md (SOCKS5 proxy setup no longer needed)
  • Scripts handle runner health checks, restart, rebalancing across login nodes, creation, and CI rerun

Scripts

Script Purpose
config.sh Shared config: node list, cgroup limit, runner discovery helpers
check-runners.sh Quick per-node health check
list-runners.sh Detailed runner table (name, node, status, slurm, RSS, pool)
restart-runner.sh Restart one runner with proper login shell PATH
restart-all.sh Restart all runners in place
rebalance-runners.sh Auto-compute optimal distribution and move runners
create-runner.sh Register and start a new runner
rerun-failed.sh Find and rerun failed GHA workflows on open PRs

Test plan

  • check-runners.sh tested on live Phoenix nodes
  • list-runners.sh tested on live Phoenix nodes
  • rebalance-runners.sh dry run tested (correctly identifies moves needed)
  • Runner rebalancing executed successfully (3/4/4 distribution achieved)

🤖 Generated with Claude Code

sbryngelson and others added 2 commits March 15, 2026 17:24
Add a toolkit for managing GitHub Actions self-hosted runners on
Phoenix login nodes:

  check-runners.sh    — quick per-node health check
  list-runners.sh     — detailed runner table (name, node, status, slurm, RSS)
  restart-runner.sh   — restart one runner with proper login shell PATH
  restart-all.sh      — restart all runners in place
  rebalance-runners.sh — auto-compute optimal distribution and move runners
  create-runner.sh    — register and start a new runner
  rerun-failed.sh     — find and rerun failed GHA workflows on open PRs
  config.sh           — shared config (nodes, cgroup limit, discovery helpers)

Remove the outdated starting-phoenix-runners.md which referenced the
old SOCKS5 proxy setup that is no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 15, 2026

Claude Code Review

Incremental review from: 91ef65e8fe723d63818d26aa664aa752b926782e
Head SHA: 84d6ec5d58023f7ac414e4e18890b0ce09de3d22

New findings since last Claude review (frontier/ scripts are new in this push):

1. Tarball download race condition — make-runner.sh + deploy-runners.sh

deploy-runners.sh launches multiple make-runner.sh instances in parallel via &. Each independently checks:

if [ ! -f "${SHARED_DIR}/${TARBALL}" ]; then
    curl -fsSL ... -o "${SHARED_DIR}/${TARBALL}"
fi

If N runners are deployed simultaneously and the tarball doesn't yet exist, all N pass the -f check at nearly the same moment and concurrently write to the same -o path. curl does not write atomically (it streams directly to the target file), so the result will be a corrupted tarball. The subsequent tar xzf will fail on every instance except possibly the last writer.

Fix: download (or verify) the tarball once before entering the parallel loop in deploy-runners.sh, or use flock in make-runner.sh:

# In deploy-runners.sh, before the for loop:
bash "${SHARED_DIR}/make-runner.sh" --download-only "${NUM}" 2>/dev/null || true

Or simpler — download once in deploy-runners.sh itself before spawning workers.

2. RUNNER_GROUP="phoenix" in make-runner.sh (likely copy-paste error)

misc/frontier/make-runner.sh line 20 sets RUNNER_GROUP="phoenix" and passes it as --runnergroup phoenix during runner configuration. Frontier runners being registered under the phoenix runner group may be intentional (if the same GitHub org group covers both clusters), but it looks like a copy-paste from the Phoenix scripts. Verify this is the correct org-level group for Frontier runners.

Previously reported, still open:

  • Word-splitting on for dir in $(find_runner_dirs) in Phoenix scripts (list-runners.sh, restart-all.sh, rebalance-runners.sh).
  • Arithmetic expansion on "?" in check-runners.sh and list-runners.sh per-node summary under set -euo pipefail.

sbryngelson and others added 5 commits March 15, 2026 17:45
… config

Simplify scripts to be Phoenix-aware: org, runner group, label, nodes,
and parent dirs are all in config.sh rather than passed as arguments.
Add GitHub API helpers (gh_list_runners, gh_registration_token) so
list-runners.sh shows both API status and process state. create-runner.sh
now only needs a name and node.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scripts for deploying, restarting, and stopping self-hosted runners on OLCF
Frontier login nodes, modeled after the existing Phoenix runner scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ed discovery

- Add misc/common/rerun-failed.sh (site-agnostic, moved from phoenix/)
- Replace misc/phoenix/rerun-failed.sh with a thin wrapper to common/
- Add misc/frontier/config.sh modeled after phoenix/config.sh:
  - Frontier constants, SSH_OPTS with BatchMode for MOTD suppression
  - find_pids() filters stdout through grep -E '^[0-9]+$' to strip MOTD noise
  - start_runner() uses timeout+setsid+nohup (no login shell needed)
  - stop_runner(), find_node(), gh_list_runners(), gh_registration_token()
- Rewrite make-runner.sh: sources config.sh, no runner.pid, uses start_runner()
- Rewrite deploy-runners.sh: sources config.sh, calls $SCRIPT_DIR/make-runner.sh
- Rewrite restart-offline-runners.sh: GitHub API query, CWD-based discovery,
  stop stale processes before restart, parallel restarts, final status report
- Rewrite stop-runner.sh: find_node() + stop_runner() + gh_remove_runner()
- Add check-runners.sh: SSH per-node health check (name/status/RSS)
- Update README.md: new scripts table, no runner.pid references, updated
  troubleshooting guide

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-runner

- list-runners.sh: new script using parallel SSH sweep across all 11 nodes
  simultaneously instead of serial per-runner discovery; flags stale runner.node
  entries with a warning in the node column
- move-runner.sh: new script to relocate a runner between login nodes with one
  retry on start failure
- restart-offline-runners.sh: add retry logic (sleep 5 + second attempt) and
  runner.node self-healing (detects and corrects stale node entries when a runner
  is found on a different node than recorded)
- make-runner.sh: replace hardcoded RUNNER_VERSION with dynamic GitHub API lookup
  falling back to pinned version; print selected version at startup
- misc/frontier/README.md: document new scripts, update quick reference and
  troubleshooting sections, note runner.node self-healing behavior
- misc/common/README.md: new file documenting site-agnostic shared scripts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…y logic

- config.sh: add sync_runner_nodes() — parallel sweep of all login nodes
  that updates runner.node files before any action, ensuring accuracy even
  when runners are manually restarted on different nodes
- make-runner.sh: use gh_latest_runner_version() with pinned fallback
- restart-offline-runners.sh: call sync_runner_nodes first; self-heal
  stale runner.node on restart; retry start_runner once on failure
- check-runners.sh: call sync_runner_nodes first
- list-runners.sh: new — parallel SSH sweep + GitHub API combined view
- move-runner.sh: new — move a runner to a different login node
- misc/common/README.md: new — document shared scripts
- misc/frontier/README.md: update for new scripts and sync behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 15, 2026

Claude Code Review

Incremental review from: a557e7e7c020b513c2673f8dd479d77c709f0de5
Head SHA: 536f43374c8101296194ba1b18842f4fdeff2391

Previously reported issues — now fixed in this push:

  • gh run view unguarded exit in rerun-failed.sh
  • rerun_count dead variable removed
  • ✅ Tarball download race condition fixed (atomic tmp+mv in both deploy-runners.sh and make-runner.sh)
  • ✅ Word-splitting on find_runner_dirs fixed (frontier list-runners.sh; phoenix scripts also use < <(...))
  • ✅ Arithmetic on "?" guard added to phoenix/check-runners.sh and phoenix/list-runners.sh

New finding:

misc/phoenix/config.shSSH_OPTS is missing BatchMode=yes and StrictHostKeyChecking=no

This push introduces SSH_OPTS="-o ConnectTimeout=5" in phoenix/config.sh, but the frontier equivalent is much more complete:

# Frontier (misc/frontier/config.sh)
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

# Phoenix (misc/phoenix/config.sh)  ← newly added in this push
SSH_OPTS="-o ConnectTimeout=5"

The missing options matter for unattended scripted use:

  • BatchMode=yes — prevents SSH from prompting for a password or host-key confirmation. Without it, any SSH call where key auth unexpectedly fails (e.g., agent not running, expired key) hangs the script indefinitely rather than failing fast. This is the most important one.
  • StrictHostKeyChecking=no — without it, first-connection to a login node whose key isn't in known_hosts blocks on an interactive prompt.
  • ServerAliveInterval/ServerAliveCountMax — without keepalives, a stalled SSH session to a busy login node can hang a find_pids or stop_runner call for minutes.

Suggested fix — align with frontier's opts:

SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

Still open from previous review:

  • RUNNER_GROUP="phoenix" in misc/frontier/config.sh line 12 — verify this is the correct org-level group for Frontier runners (looks like a copy-paste from the Phoenix scripts).

Spencer Bryngelson and others added 5 commits March 15, 2026 18:44
start_runner() was launching run.sh via absolute path without cd-ing
into the runner directory first. The Runner.Listener process inherited
the SSH login shell CWD (user home dir), so find_pids() CWD matching
always failed for runners started this way.

Fix: cd into $dir before running ./run.sh so the process CWD matches
what find_pids() expects. Relative log path follows automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- frontier/deploy-runners.sh: pre-download tarball once before spawning
  parallel make-runner.sh instances; use atomic tmp+mv to prevent
  concurrent curl writes corrupting the tarball
- frontier/make-runner.sh: same atomic tmp+mv for solo-invocation safety
- common/rerun-failed.sh: guard gh run view with || continue so an
  expired/deleted run skips rather than exits the whole script; switch
  pipe loops to process substitution so continue works correctly and
  remove the dead rerun_count variable (incremented in a subshell, never
  read)
- frontier/list-runners.sh, phoenix/list-runners.sh, restart-all.sh,
  rebalance-runners.sh: replace for dir in $(find_runner_dirs) with
  while IFS= read -r dir; do ... done < <(find_runner_dirs) to eliminate
  word-splitting and glob expansion on runner paths
- phoenix/check-runners.sh, phoenix/list-runners.sh: guard rss value
  before arithmetic expansion — if SSH fails and rss="?", the expression
  $(( CGROUP_LIMIT - rss )) is a syntax error that exits under
  set -euo pipefail; default to 0 instead

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move functions identical across both sites into a common library:
  gh_registration_token, gh_latest_runner_version, gh_remove_runner,
  get_runner_name, find_pids, find_node, start_runner, stop_runner

Both config.sh files now source runner-lib.sh after defining their
site constants (ORG, NODES, SSH_OPTS), keeping only site-specific
logic locally:
  - frontier/config.sh: gh_list_runners, find_runner_dirs, sync_runner_nodes
  - phoenix/config.sh:  gh_list_runners, find_runner_dirs, has_slurm

Harmonize start_runner() across both sites:
  - Use bash -lc on both (was frontier-only) for login shell PATH
  - Use timeout 15 + synchronous SSH (was phoenix's background SSH + poll loop)
  - cd into runner dir before run.sh so CWD-based discovery works
  - Standardize log file to runner.log (was runner-nohup.log on phoenix)
  - Use $SSH_OPTS variable (added to phoenix config) throughout all
    phoenix scripts instead of hardcoded -o ConnectTimeout=5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/proc/$p/exe is intrinsic to the binary and cannot change after exec,
making it more reliable than CWD which depends on how the process was
launched. Also correctly excludes Runner.Worker processes that share
the same directory, since their exe is bin/Runner.Worker not
bin/Runner.Listener.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors phoenix/rebalance-runners.sh with frontier-specific changes:
- calls sync_runner_nodes first to correct any stale runner.node files
- updates runner.node after each successful move
- no has_slurm check (not applicable on frontier)

With 22 runners across 11 nodes the target is 2 per node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 16, 2026

Claude Code Review

Incremental review from: 958cce7d0f00e8a0cb98b1d0a1cb6f1c6a9f6986
Head SHA: 0992785e458c20200d7301bee4ece71eed8ab835

New findings since last Claude review:

1. misc/phoenix/create-runner.shls -d exits non-zero on first-ever runner creation

set -euo pipefail is active. The line that determines the next runner directory number:

existing=$(ls -d "$parent_dir"/actions-runner-* 2>/dev/null | sed '...' | sort -n | tail -1)

When $parent_dir contains no actions-runner-* directories yet (first runner being created), bash expands the glob to the literal pattern string and passes it to ls. ls finds no such path, prints an error to stderr (suppressed by 2>/dev/null), and exits non-zero. Under set -e + pipefail, this immediately aborts the script before any runner is created.

Minimal fix — append || true to tolerate zero-match glob:

existing=$(ls -d "$parent_dir"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1 || true)

Previously reported, still open:

  • start_runner no-retry in misc/common/rebalance-runners.sh — Previously flagged for misc/frontier/rebalance-runners.sh; the code is now in the shared misc/common/rebalance-runners.sh, so the same missing-retry risk now also affects misc/phoenix/rebalance-runners.sh.
  • SSH_OPTS in misc/phoenix/config.sh — Still "-o ConnectTimeout=5" only; missing BatchMode=yes, StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Without BatchMode=yes in particular, a password-prompt or host-key-confirmation request will hang unattended scripts indefinitely.
  • RUNNER_GROUP="phoenix" in misc/frontier/config.sh line 12 — Verify this is the correct org-level runner group for Frontier runners.

Spencer Bryngelson and others added 3 commits March 15, 2026 20:17
Extract shared logic from frontier/phoenix into common scripts:
- runner-lib.sh: add gh_list_runners(), has_slurm() (portable, grep PATH
  for slurm keyword), sweep_all_nodes() (exe-based, parallel SSH), CGROUP_LIMIT default
- check-runners.sh: new common script (exe-based discovery, slurm column,
  conditional cgroup footer); both sites now show slurm status
- list-runners.sh: new common script (parallel sweep, slurm column,
  stale runner.node detection, conditional cgroup footer)
- rebalance-runners.sh: new common script (optional sync_runner_nodes hook,
  writes runner.node after start, checks slurm after start)

All site scripts (frontier/, phoenix/) reduced to thin wrappers that source
config.sh then the common implementation. phoenix/create-runner.sh fixed to
write runner.node after successful start.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
restart-offline-runners.sh is superseded by rebalance-runners.sh, which
handles offline runners as part of its distribution algorithm and also
calls sync_runner_nodes to self-heal stale runner.node files.

Update frontier/README.md: remove restart-offline-runners.sh references,
replace CWD-based discovery language with EXE-based, add rebalance-runners.sh.
Update phoenix/README.md: fix config.sh description (functions now in
runner-lib.sh), fix list-runners.sh description (now parallel sweep),
fix slurm PATH wording (portable grep, not /opt/slurm).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stop-runner, move-runner, restart-runner, and restart-all were each only
available on one cluster. Extract implementations to common/ and add thin
wrappers for both sites so all operational scripts work on both Frontier
and Phoenix.

Improvements over the originals:
- stop-runner: searches find_runner_dirs() by name instead of hardcoded
  SHARED_DIR path; fixes runner_id lookup to use process substitution
- move-runner: optional sync_runner_nodes hook (runs on Frontier, no-op
  on Phoenix); searches find_runner_dirs() by name
- restart-runner: writes runner.node after successful start (was missing)
- restart-all: adds optional sync_runner_nodes hook; writes runner.node
  after each successful start (was missing)

Update common/README.md and both site READMEs to reflect full script inventory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Mar 16, 2026

Claude Code Review

Incremental review from: e5934bc598434f4304fd745bfdd2218f4345d5aa
Head SHA: eb9f98a6609a68892553bbdf5a9a30d57c0dd41a

New findings since last Claude review (this push: refactored to misc/runner.sh dispatcher, removed site-specific thin wrappers, hardened stop_runner/start_runner/get_runner_name):

1. misc/common/runner.shSITE_SCRIPT_DIR is set but never used (dead variable)

runner.sh assigns:

SITE_SCRIPT_DIR="$MISC_DIR/$SITE"
source "$MISC_DIR/$SITE/config.sh"
source "$MISC_DIR/common/$CMD.sh" "$@"

No common script references SITE_SCRIPT_DIR. Previously, the common scripts were invoked via the old site wrapper pattern that did set this variable for use with bash "${SITE_SCRIPT_DIR}/check-runners.sh", but rebalance-runners.sh was updated this push to use source "$(dirname "${BASH_SOURCE[0]}")/check-runners.sh" instead. SITE_SCRIPT_DIR is now dead code. Low priority, but could be removed or documented if a future common script needs a site-specific callback.

Previously reported, still open:

  • misc/phoenix/create-runner.sh line 33 — ls -d exits non-zero under set -e + pipefail on first runner creation. When $parent_dir contains no actions-runner-* directories, bash passes the literal glob pattern to ls, which exits 1. Under set -euo pipefail, the assignment existing=$(ls -d ... | sed ... | sort ... | tail -1) triggers set -e because the pipeline exit code (from ls via pipefail) is 1. The script aborts before creating any runner. Fix: append || true to the pipeline.

  • misc/phoenix/config.shSSH_OPTS is still "-o ConnectTimeout=5" only. Missing BatchMode=yes (prevents hanging on password/host-key prompts in unattended scripts), StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Frontier's config.sh already has the complete set. BatchMode=yes is the most critical missing option for non-interactive use.

Spencer Bryngelson and others added 5 commits March 15, 2026 20:50
- runner-lib.sh: use sys.argv[1] in get_runner_name (prevent path
  injection); verify process dead after SIGKILL in stop_runner; warn
  on sweep_all_nodes SSH failure; log SSH failure in start_runner
- rebalance-runners.sh: fix busy-runner grep with trailing slash;
  handle stop_runner failure before attempting move
- restart-all.sh: replace space-delimited restart_list with three
  separate arrays (space-safe); fix busy grep trailing slash; handle
  stop_runner failure
- restart-runner.sh: exit 1 on start failure (was silently exiting 0)
- stop-runner.sh: handle gh_list_runners API failure with warning
- rerun-failed.sh: guard empty failed_jobs to avoid blank bullet;
  add per-job display and expired-run guard to master branch block
- deploy-runners.sh: track background PIDs; detect and report failures
- create-runner.sh: use #!/usr/bin/env bash; atomic curl download
- frontier/config.sh: document RUNNER_GROUP="phoenix" is intentional
- phoenix/config.sh: use #!/usr/bin/env bash

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All shared commands (check-runners, list-runners, move-runner,
rebalance-runners, restart-all, restart-runner, stop-runner,
rerun-failed) previously existed as near-identical thin wrappers in
both frontier/ and phoenix/. Replace them all with a single dispatcher:

  bash misc/runner.sh <site> <command> [args...]

Site directories now contain only truly site-specific files:
- frontier/: config.sh, make-runner.sh, deploy-runners.sh
- phoenix/:  config.sh, create-runner.sh

Also fix rebalance-runners.sh to source common/check-runners.sh
directly (was calling the now-deleted site wrapper).

Update all READMEs to document the new dispatcher invocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidates all runner management scripts under misc/runners/ to
separate them from unrelated misc/ files. Structure:

  misc/runners/runner.sh        dispatcher
  misc/runners/common/          shared scripts and library
  misc/runners/frontier/        Frontier config + make/deploy scripts
  misc/runners/phoenix/         Phoenix config + create script

All paths are computed from ${BASH_SOURCE[0]} so no internal path
changes are needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
frontier/make-runner.sh and phoenix/create-runner.sh were the same
concept with different names and minor behavioral differences. Unified
into common/create-runner.sh with site-specific behavior driven by two
config.sh hooks:

  TARBALL_CACHE_DIR
    Frontier: "$SHARED_DIR" — tarball cached on shared Lustre,
    reused across parallel deployments (prevents download races).
    Phoenix:  "" — each runner downloads independently.

  runner_install_dir <name> [override-dir]
    Frontier: "$SHARED_DIR/<name>"
    Phoenix:  auto-numbered "$RUNNER_PARENT_DIRS[0]/actions-runner-N"

frontier/make-runner.sh is now a 3-line wrapper that derives the name
"frontier-<N>" from a number argument and delegates to create-runner.

deploy-runners.sh updated to reference $TARBALL_CACHE_DIR instead of
$SHARED_DIR directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All new runner installs now go to RUNNER_PARENT_DIRS[1]:
/storage/project/r-sbryngelson3-0/sbryngelson3/mfc-runners-2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sbryngelson sbryngelson marked this pull request as ready for review March 16, 2026 01:27
Copilot AI review requested due to automatic review settings March 16, 2026 01:27
@sbryngelson sbryngelson merged commit 7ea6b5c into MFlowCode:master Mar 16, 2026
32 checks passed
@sbryngelson sbryngelson deleted the phoenix-runner-scripts branch March 16, 2026 01:27
@github-actions
Copy link

Claude Code Review

Incremental review from: eb9f98a6609a68892553bbdf5a9a30d57c0dd41a
Head SHA: e0ad7714228f3fc224905b4475329814c747dd3c

New findings since last Claude review:

1. misc/runners/common/create-runner.sh — orphaned $RUNNER_DIR on download or extraction failure

When TARBALL_CACHE_DIR is unset (the Phoenix path), the script creates the runner directory before the download:

mkdir -p "$RUNNER_DIR"
tarball_path="$RUNNER_DIR/runner-download.tmp.$$"
curl -fsSL "$TARBALL_URL" -o "$tarball_path"

If curl fails (network error, 404, disk full), set -euo pipefail aborts immediately, leaving an empty $RUNNER_DIR on disk. The script has no trap to clean up. On the next invocation, the [ -d "$RUNNER_DIR" ] guard at the top fires:

if [ -d "$RUNNER_DIR" ]; then
    echo "ERROR: Directory already exists: $RUNNER_DIR" >&2
    exit 1
fi

The user must then manually remove the orphaned directory before retrying. The same issue applies if extraction succeeds but config.sh (runner registration) fails — $RUNNER_DIR is left populated but unregistered.

Suggested fix — add a cleanup trap immediately after the directory is created:

mkdir -p "$RUNNER_DIR"
trap 'rm -rf "$RUNNER_DIR"' ERR
tarball_path="$RUNNER_DIR/runner-download.tmp.$$"
curl -fsSL "$TARBALL_URL" -o "$tarball_path"

Clear the trap once start succeeds:

if start_runner "$TARGET_NODE" "$RUNNER_DIR"; then
    trap - ERR   # runner is up; don't clean up on subsequent errors
    ...

Still open from previous reviews:

  • misc/runners/phoenix/config.sh runner_install_dir()ls -d exits non-zero on first runner creation (previously filed against misc/phoenix/create-runner.sh; the same code is now in phoenix/config.sh line ~31). Bash expands the glob to the literal pattern when no directories match, ls exits 1, and the pipefail-active subshell in RUNNER_DIR=$(runner_install_dir ...) aborts create-runner.sh. Fix: append || true to the pipeline.

  • misc/runners/phoenix/config.sh SSH_OPTS still "-o ConnectTimeout=5" only — missing BatchMode=yes (hangs on password/host-key prompts in unattended use), StrictHostKeyChecking=no, and ServerAliveInterval/ServerAliveCountMax. Frontier's config.sh already has the complete set.

  • misc/runners/runner.sh line 45 — SITE_SCRIPT_DIR is set but never read — dead variable since rebalance-runners.sh was updated to use source "$(dirname "${BASH_SOURCE[0]}")/check-runners.sh" internally. Low priority.

@github-actions
Copy link

Claude Code Review

Head SHA: e0ad7714228f3fc224905b4475329814c747dd3c

Files Changed (19)

  • misc/runners/runner.sh — top-level dispatcher
  • misc/runners/common/runner-lib.sh — shared library (GitHub API, process mgmt, SSH sweep)
  • misc/runners/common/check-runners.sh
  • misc/runners/common/list-runners.sh
  • misc/runners/common/create-runner.sh
  • misc/runners/common/move-runner.sh
  • misc/runners/common/rebalance-runners.sh
  • misc/runners/common/restart-all.sh
  • misc/runners/common/restart-runner.sh
  • misc/runners/common/stop-runner.sh
  • misc/runners/common/rerun-failed.sh
  • misc/runners/phoenix/config.sh
  • misc/runners/phoenix/README.md
  • misc/runners/frontier/config.sh
  • misc/runners/frontier/deploy-runners.sh
  • misc/runners/frontier/make-runner.sh
  • misc/runners/frontier/README.md
  • misc/runners/common/README.md
  • misc/starting-phoenix-runners.md (deleted)

Summary

  • This PR replaces the outdated misc/starting-phoenix-runners.md with a well-structured, fully-operational runner management toolkit covering Frontier and Phoenix HPC clusters. The overall architecture — a common library, site configs, and a single dispatcher — is clean.
  • All previously reported issues (tarball download race condition, word-splitting on find_runner_dirs, arithmetic-expansion guard for "?" strings, gh run view unguarded exit in rerun-failed.sh, rerun_count dead variable) appear resolved in the final head.
  • Three findings remain from prior review rounds; this is the full-review pass to confirm current state:
    1. SSH_OPTS in phoenix/config.sh is still minimal — critical for unattended scripted use.
    2. runner_install_dir() in phoenix/config.sh exits non-zero under set -e on first runner creation.
    3. SITE_SCRIPT_DIR in runner.sh is set but never referenced — dead variable.

Findings

1. [Bug] phoenix/config.shrunner_install_dir() aborts on first runner creation

File: misc/runners/phoenix/config.sh, lines 33–37

existing=

With set -euo pipefail active (from create-runner.sh), when no actions-runner-* directories exist yet (first-ever runner creation), the unquoted glob is expanded by bash to the literal string /path/actions-runner-*, which ls does not find. ls exits 1. Under pipefail, the pipeline assignment exits non-zero and the script aborts before the runner is created.

Fix: append || true to tolerate the zero-match case:

existing=$(ls -d "$parent"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1 || true)

2. [Correctness] phoenix/config.shSSH_OPTS is incomplete for unattended use

File: misc/runners/phoenix/config.sh, line 15

Current: SSH_OPTS="-o ConnectTimeout=5"

Frontier's equivalent (misc/runners/frontier/config.sh, line 14) is:
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=10 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

The missing options matter for unattended non-interactive scripts:

  • BatchMode=yes — prevents SSH from hanging indefinitely when key auth unexpectedly fails (e.g., expired key, agent not forwarded). Without it, any SSH call in find_pids, sweep_all_nodes, stop_runner, or start_runner can block the script forever waiting for an interactive prompt. This is the most critical missing option.
  • StrictHostKeyChecking=no — first connection to a login node whose host key is not yet in known_hosts blocks on an interactive prompt.
  • ServerAliveInterval/ServerAliveCountMax — without keepalives, stalled SSH sessions to busy login nodes can hang find_pids or stop_runner calls for minutes.

Suggested fix — align with Frontier:

SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5 -o BatchMode=yes -o ServerAliveInterval=10 -o ServerAliveCountMax=3"

3. [Minor] runner.shSITE_SCRIPT_DIR is set but never used

File: misc/runners/runner.sh, line 40

SITE_SCRIPT_DIR="$MISC_DIR/$SITE"
source "$MISC_DIR/$SITE/config.sh"
source "$MISC_DIR/common/$CMD.sh" "$@"

SITE_SCRIPT_DIR is exported into the environment that the sourced common scripts run in, but no common script reads it. It was carried over from an earlier dispatch pattern. No functional impact, but it is dead code. Can be removed unless a future common script will need a site-specific callback path.


Non-findings (resolved from prior review rounds)

  • Tarball download race condition: fixed via tmp + mv in both deploy-runners.sh and create-runner.sh.
  • Word-splitting on find_runner_dirs: fixed — all callers now use while IFS= read -r dir; do ... done < <(find_runner_dirs).
  • "?" arithmetic expansion under set -e: fixed — guarded with [[ "$rss" =~ ^[0-9]+$ ]] || rss=0 in both check-runners.sh and list-runners.sh.
  • gh run view unguarded exit in rerun-failed.sh: fixed — guarded with 2>/dev/null) || { ... continue; }.
  • RUNNER_GROUP="phoenix" in frontier/config.sh: now documented inline as intentional (# Both sites share one GitHub runner group named "phoenix").

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b5252d72-7c8f-40bd-b962-97541cda1b62

📥 Commits

Reviewing files that changed from the base of the PR and between 007b6a9 and e0ad771.

📒 Files selected for processing (19)
  • misc/runners/common/README.md
  • misc/runners/common/check-runners.sh
  • misc/runners/common/create-runner.sh
  • misc/runners/common/list-runners.sh
  • misc/runners/common/move-runner.sh
  • misc/runners/common/rebalance-runners.sh
  • misc/runners/common/rerun-failed.sh
  • misc/runners/common/restart-all.sh
  • misc/runners/common/restart-runner.sh
  • misc/runners/common/runner-lib.sh
  • misc/runners/common/stop-runner.sh
  • misc/runners/frontier/README.md
  • misc/runners/frontier/config.sh
  • misc/runners/frontier/deploy-runners.sh
  • misc/runners/frontier/make-runner.sh
  • misc/runners/phoenix/README.md
  • misc/runners/phoenix/config.sh
  • misc/runners/runner.sh
  • misc/starting-phoenix-runners.md

📝 Walkthrough

Walkthrough

Introduces a comprehensive GitHub Actions runner management infrastructure for Frontier and Phoenix systems. Adds a shared library (runner-lib.sh) with common GitHub API and process management functions. Implements site-agnostic utility scripts (check-runners.sh, create-runner.sh, list-runners.sh, move-runner.sh, rebalance-runners.sh, restart-runner.sh, restart-all.sh, stop-runner.sh, rerun-failed.sh) for runner operations. Adds site-specific configurations for Frontier and Phoenix with node discovery and installation directory logic. Includes a dispatcher script (runner.sh) that routes commands to site or common implementations. Provides comprehensive documentation for both sites and removes obsolete Phoenix startup documentation.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new misc/runners/ toolkit to manage GitHub Actions self-hosted runners on Phoenix and Frontier (health checks, listing, restarting, moving, rebalancing, creation, and rerunning failed CI), and removes the obsolete Phoenix runner startup doc (SOCKS5 proxy no longer needed).

Changes:

  • Introduce a site-dispatched runner management CLI (misc/runners/runner.sh) with shared library code under misc/runners/common/.
  • Add site configs/docs for Phoenix and Frontier plus Frontier convenience scripts for provisioning.
  • Remove misc/starting-phoenix-runners.md (outdated proxy-based procedure).

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
misc/starting-phoenix-runners.md Removes outdated Phoenix runner startup documentation.
misc/runners/runner.sh Dispatcher for site/common runner management commands.
misc/runners/common/runner-lib.sh Shared GitHub API + SSH process discovery + start/stop primitives.
misc/runners/common/check-runners.sh Per-node runner health check.
misc/runners/common/list-runners.sh Combined GitHub API + live process table.
misc/runners/common/rebalance-runners.sh Rebalance algorithm for distributing runners across login nodes.
misc/runners/common/restart-runner.sh Restart a single runner and verify SLURM is in PATH.
misc/runners/common/restart-all.sh Restart all runners (dry-run by default).
misc/runners/common/move-runner.sh Move a runner between login nodes by name.
misc/runners/common/stop-runner.sh Stop and deregister a runner.
misc/runners/common/create-runner.sh Download/register/start a new runner.
misc/runners/common/rerun-failed.sh Rerun failed workflows on open PRs and master.
misc/runners/common/README.md Documentation for shared commands and library.
misc/runners/phoenix/config.sh Phoenix-specific constants and runner directory discovery.
misc/runners/phoenix/README.md Phoenix operator documentation and usage examples.
misc/runners/frontier/config.sh Frontier-specific constants, shared dir, and node sync helper.
misc/runners/frontier/make-runner.sh Frontier convenience wrapper for numbered runner creation.
misc/runners/frontier/deploy-runners.sh Parallel runner deployment across multiple login nodes.
misc/runners/frontier/README.md Frontier operator documentation and usage examples.

Comment on lines +29 to +39
# Return the directory where a named runner should be installed.
# Auto-increments the actions-runner-N suffix within RUNNER_PARENT_DIRS[0].
# Args: $1 = runner name (unused; directory is numbered, not named), $2 = optional override dir
runner_install_dir() {
local override="${2:-}"
[ -n "$override" ] && echo "$override" && return
local parent="${RUNNER_PARENT_DIRS[1]}"
local existing next_num
existing=$(ls -d "$parent"/actions-runner-* 2>/dev/null | sed 's/.*actions-runner-//' | sort -n | tail -1)
next_num=$(( ${existing:-0} + 1 ))
echo "$parent/actions-runner-$next_num"
Comment on lines +51 to +54
echo "ERROR: Unknown command '$CMD' for site '$SITE'." >&2
echo "Common: check-runners list-runners move-runner rebalance-runners restart-all restart-runner stop-runner rerun-failed" >&2
echo "Frontier: make-runner deploy-runners" >&2
echo "Phoenix: create-runner" >&2
Comment on lines +105 to +115
ssh $SSH_OPTS "$node" '
for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
exe=$(readlink -f /proc/$p/exe 2>/dev/null || true)
[ -z "$exe" ] && continue
dir=$(dirname "$(dirname "$exe")")
rss=$(ps -p $p -o rss= 2>/dev/null | awk "{printf \"%.0f\", \$1/1024}" || echo 0)
slurm=$(tr "\0" "\n" < /proc/$p/environ 2>/dev/null | grep -c "^PATH=.*slurm" || echo 0)
[ "$slurm" -gt 0 ] && slurm_ok="ok" || slurm_ok="MISSING"
echo "RUNNER '"$node"' $dir $rss $slurm_ok"
done
' 2>/dev/null > "$tmpdir/$node.out" &
start_runner() {
local node="$1" dir="$2"
timeout 15 ssh $SSH_OPTS "$node" \
"cd $dir && setsid bash -lc 'nohup ./run.sh >> runner.log 2>&1 < /dev/null &'" \
echo "=== Per-node memory ==="
for node in "${NODES[@]}"; do
count=$(ssh $SSH_OPTS "$node" \
"ps aux | grep Runner.Listener | grep -v grep | wc -l" 2>/dev/null || echo 0)
Comment on lines +21 to +26
for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
found=1
exe=$(readlink -f /proc/$p/exe 2>/dev/null || echo "???")
dir=$(dirname "$(dirname "$exe")" 2>/dev/null || echo "???")
name=$(basename "$dir")
worker=$(ps aux | grep "Runner.Worker" | grep "$dir" | grep -v grep | awk "{print \$2}" | head -1)
Comment on lines +23 to +31
exe=$(readlink -f /proc/$p/exe 2>/dev/null || echo "???")
dir=$(dirname "$(dirname "$exe")" 2>/dev/null || echo "???")
name=$(basename "$dir")
worker=$(ps aux | grep "Runner.Worker" | grep "$dir" | grep -v grep | awk "{print \$2}" | head -1)
[ -n "$worker" ] && status="BUSY" || status="idle"
rss=$(ps -p $p -o rss= 2>/dev/null | awk "{printf \"%.0f\", \$1/1024}" || echo "?")
slurm=$(tr "\0" "\n" < /proc/$p/environ 2>/dev/null | grep -c "^PATH=.*slurm" || echo 0)
[ "$slurm" -gt 0 ] && slurm_ok="ok" || slurm_ok="MISSING"
printf " %-30s %5s slurm=%-7s %s MB\n" "$name" "$status" "$slurm_ok" "$rss"
Comment on lines +11 to +16
# Sites: frontier phoenix
# Common: check-runners list-runners move-runner rebalance-runners
# restart-all restart-runner stop-runner rerun-failed
# Frontier: make-runner deploy-runners
# Phoenix: create-runner
#
Comment on lines +65 to +70
ssh $SSH_OPTS "$1" '
for p in $(ps aux | grep Runner.Listener | grep -v grep | awk "{print \$2}"); do
exe=$(readlink -f /proc/$p/exe 2>/dev/null || true)
[ "$exe" = "'"$2"'/bin/Runner.Listener" ] && echo "$p"
done
' 2>/dev/null | grep -E '^[0-9]+$' | tr '\n' ' ' || true
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants