fix: evict stale Claude frontend via liveness probe (#68) #73
raysonmeng wants to merge 1 commit into master
…old Claude frontend

When Claude Code crashed or was killed hard, the OS never surfaced FIN on the control WebSocket, so the daemon's `readyState` stayed OPEN forever and new sessions were rejected with close code 4001 indefinitely.

Fix: challenge-on-contest admission. When a new frontend arrives and the slot is occupied, ping the incumbent and wait up to 3 s for a pong. If no pong arrives, evict the incumbent with a new close code 4002 (`CLOSE_CODE_EVICTED_STALE`) and admit the newcomer.

Client-side changes:
- `daemon-client.ts` now treats 4002 the same as 4001: it emits `rejected` with the close code and stops the reconnect loop.
- `bridge.ts` shows a distinct bilingual "evicted as stale" notification for 4002, so the user knows their session was taken over by a newer session rather than rejected as a late-comer.

Extracted the probe loop into `src/liveness-probe.ts` with an injected clock and sleep, so the timeout semantics (baseline pong timestamp, early abort on non-OPEN state, ping throw handling) are covered by 7 unit tests without a real WebSocket.

- Unit tests: liveness-probe (7), `daemon-client.test.ts` mirrored 4001/4002
- E2E plan: `src/unit-test/e2e/issue-68-stale-frontend.md`, 5 scenarios
- Env knob: `AGENTBRIDGE_LIVENESS_PROBE_TIMEOUT_MS` (default 3000)
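The probe semantics described in the commit message (baseline pong timestamp, early abort on a non-OPEN socket, ping that throws) can be sketched roughly as follows. This is a minimal illustration, not the contents of `src/liveness-probe.ts`: the names `probeLiveness`, `ProbeTarget`, `ProbeDeps`, and the `pollMs` parameter are assumptions introduced here, and the real module may be shaped differently.

```typescript
// Hypothetical minimal socket surface; the daemon's real WebSocket type may differ.
interface ProbeTarget {
  readyState: number;
  ping(): void;
}

// Hypothetical dependency bundle so tests can drive time without a real WebSocket.
interface ProbeDeps {
  now(): number;                    // injected clock, in milliseconds
  sleep(ms: number): Promise<void>; // injected sleep
  lastPongAt(): number;             // timestamp of the latest pong seen on this socket
}

const OPEN = 1; // numeric value of WebSocket.OPEN

type ProbeResult = "alive" | "stale";

async function probeLiveness(
  target: ProbeTarget,
  deps: ProbeDeps,
  timeoutMs = 3000,
  pollMs = 50, // assumed polling interval, not taken from the PR
): Promise<ProbeResult> {
  // A socket that is already closing or closed cannot answer a ping.
  if (target.readyState !== OPEN) return "stale";

  // Baseline: only a pong newer than this counts, so an old pong is not trusted.
  const baseline = deps.lastPongAt();

  try {
    target.ping();
  } catch {
    // A ping that throws is treated as a dead connection.
    return "stale";
  }

  const deadline = deps.now() + timeoutMs;
  while (deps.now() < deadline) {
    if (target.readyState !== OPEN) return "stale"; // state changed mid-probe
    if (deps.lastPongAt() > baseline) return "alive";
    await deps.sleep(pollMs);
  }
  // One last look in case the pong landed during the final sleep.
  return deps.lastPongAt() > baseline ? "alive" : "stale";
}
```

Because the clock and sleep are passed in, a test can advance virtual time inside `sleep` and reach the timeout instantly, which is presumably what makes the timeout cases deterministic.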
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 84d983ec36
    attachClaude(ws).catch((err) => {
      log(`attachClaude threw for #${ws.data.clientId}: ${err?.message ?? err}`);
    });
Block contestant traffic until admission decision
Starting `attachClaude` asynchronously here creates a new window where the socket is connected but not yet admitted or rejected while the liveness probe runs (up to 3 s). During that window, `claude_to_codex` messages are still processed without verifying `ws === attachedClaude`, so a second session that is ultimately rejected with 4001 can still inject work into Codex before the close arrives, breaking the single-active-session guarantee.
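One possible way to close the window described in this comment is to check sender identity before forwarding any message. The sketch below is only an illustration of that guard: `AdmissionGate`, `Frontend`, `Message`, and `handle` are hypothetical names, and the daemon's real handler may prefer to buffer rather than drop traffic from a socket that is still being contested.

```typescript
// Hypothetical shapes for illustration; the daemon's real types will differ.
interface Frontend {
  id: number;
}

interface Message {
  type: string;
  payload: string;
}

class AdmissionGate {
  // Plays the role of `attachedClaude` in the review comment.
  private attached: Frontend | null = null;

  admit(ws: Frontend): void {
    this.attached = ws;
  }

  release(ws: Frontend): void {
    if (this.attached === ws) this.attached = null;
  }

  // Forwards the message only when the sender is the admitted frontend.
  // Returns false when the message was dropped.
  handle(ws: Frontend, msg: Message, forward: (m: Message) => void): boolean {
    if (ws !== this.attached) {
      // Sender is still contesting the slot or was already rejected.
      return false;
    }
    forward(msg);
    return true;
  }
}
```

With a guard like this, a contestant that connects while the probe is running cannot reach Codex until `admit` has been called for it.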
Summary
Fixes #68: P0 from `docs/issues-2026-04-18-codex-stuck-and-resume.md` (Issue A).

When Claude Code crashed or was killed hard (e.g., OOM, SIGKILL), the OS never surfaced FIN on the control WebSocket, so the daemon's `readyState` stayed `OPEN` forever and new sessions got rejected with close code `4001` for up to an hour (reproduced in the failing log: conn #2 through #19 all rejected).
Fix: challenge-on-contest admission. When a second frontend arrives and the slot is "occupied", the daemon pings the incumbent and waits up to 3 s for a pong. If no pong arrives, it evicts the incumbent with a new close code `4002` (`CLOSE_CODE_EVICTED_STALE`) and admits the newcomer, so the new session takes over.
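The admission decision can be sketched as below. This is an assumption-laden illustration rather than the code in `src/daemon.ts`: `FrontendSlot`, `Conn`, `Probe`, and the constant name `CLOSE_CODE_REJECTED` are invented for the example (the PR only names `CLOSE_CODE_EVICTED_STALE`), and rejecting a concurrent contestant with 4001 while a challenge is running is one possible serialization policy, not necessarily the one the PR uses.

```typescript
const CLOSE_CODE_REJECTED = 4001;      // constant name assumed; the PR only gives the number
const CLOSE_CODE_EVICTED_STALE = 4002; // named in the PR

// Hypothetical minimal connection surface.
interface Conn {
  close(code: number, reason?: string): void;
}

// Any liveness check that resolves to a verdict, e.g. a ping/pong probe.
type Probe = (incumbent: Conn) => Promise<"alive" | "stale">;

class FrontendSlot {
  private attached: Conn | null = null;
  private challengeInProgress = false;
  private readonly probe: Probe;

  constructor(probe: Probe) {
    this.probe = probe;
  }

  // Resolves to true if the newcomer now owns the slot.
  async admit(newcomer: Conn): Promise<boolean> {
    // Serialize challenges so two contestants cannot both attach.
    if (this.challengeInProgress) {
      newcomer.close(CLOSE_CODE_REJECTED);
      return false;
    }

    const incumbent = this.attached;
    if (incumbent === null) {
      this.attached = newcomer;
      return true;
    }

    this.challengeInProgress = true;
    try {
      const verdict = await this.probe(incumbent);
      if (verdict === "alive") {
        // Incumbent answered the ping: the newcomer is a late-comer.
        newcomer.close(CLOSE_CODE_REJECTED);
        return false;
      }
      // No pong in time: evict the stale incumbent and hand over the slot.
      incumbent.close(CLOSE_CODE_EVICTED_STALE);
      this.attached = newcomer;
      return true;
    } finally {
      this.challengeInProgress = false;
    }
  }
}
```

Setting the flag before the first `await` matters: a second `admit` call that arrives while the probe is pending sees the flag and is turned away instead of starting a competing challenge.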
Key changes
- `src/control-protocol.ts`: new `CLOSE_CODE_EVICTED_STALE = 4002`.
- `src/daemon.ts`: challenge-on-contest admission; `attachClaude` is now async; tracks `lastPongAt` per socket; new `pong` handler; `challengeInProgress` serialization so concurrent contestants can't double-attach.
- `src/liveness-probe.ts` (new): pure-ish probe primitive with injected clock.
- `src/daemon-client.ts`: treats 4002 the same as 4001 (emits `rejected` + code), so an evicted session stops its reconnect loop instead of flapping.
- `src/bridge.ts`: distinct bilingual user-facing message for 4002 ("evicted as stale") vs 4001 ("another session already connected"), so the user knows which branch applies.
- Env knob: `AGENTBRIDGE_LIVENESS_PROBE_TIMEOUT_MS` (default `3000`).

Tests
- `src/unit-test/liveness-probe.test.ts` (7 tests): pong-before-timeout, no-pong-timeout, not-OPEN-early-return, ping-throws, readyState-transitions-mid-probe, stale-pong-not-trusted (baseline semantic), injected-clock-deterministic-timeout.
- `src/unit-test/daemon-client.test.ts`: extended existing 4001 coverage with a mirrored 4002 test and asserts the `rejected` event now carries the close code.
- `bun run check` clean: typecheck + tests + plugin-sync + plugin version align.

Test plan
See `src/unit-test/e2e/issue-68-stale-frontend.md` for the 5-scenario plan:

- `kill -9` the Claude PID, start a new `agentbridge claude` within 30 s, verify new session becomes ready within ~3-5 s)
- `agentbridge kill` sentinel still wins (probe doesn't revive a session)

Please run `Test 3` at minimum; it's the exact repro for the reported bug.

🤖 Generated with Claude Code