fix: evict stale Claude frontend via liveness probe (#68) #73

Open
raysonmeng wants to merge 1 commit into master from fix/daemon-stale-frontend

@raysonmeng
Owner

Summary

Fixes #68 — P0 from docs/issues-2026-04-18-codex-stuck-and-resume.md (Issue A).

When Claude Code crashed or was killed hard (e.g., OOM, SIGKILL), the OS never
surfaced FIN on the control WebSocket, so the daemon's readyState stayed OPEN
forever and new sessions got rejected with close code 4001 for up to an hour
(reproduced in the failing log: conn #2 through #19 all rejected).

Fix: challenge-on-contest admission. When a second frontend arrives and the
slot is "occupied", the daemon pings the incumbent and waits up to 3 s for a
pong. If no pong, it evicts the incumbent with a new close code
4002 (CLOSE_CODE_EVICTED_STALE) and admits the newcomer.

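Fixes #68: Issue A (P0) from docs/issues-2026-04-18-codex-stuck-and-resume.md.
When Claude Code crashes or is force-killed, the OS may not deliver a FIN, so
the daemon believes the old connection is still OPEN and rejects new sessions
with 4001 indefinitely. The fix: when a new frontend arrives, ping the existing
connection; if no pong arrives within 3 seconds, evict the old connection with
4002 and let the new session take over.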

Key changes

  • src/control-protocol.ts — new CLOSE_CODE_EVICTED_STALE = 4002.
  • src/daemon.ts — challenge-on-contest admission; attachClaude is now async;
    tracks lastPongAt per socket; new pong handler; challengeInProgress
    serialization so concurrent contestants can't double-attach.
  • src/liveness-probe.ts (new) — pure-ish probe primitive with injected clock
    + sleep for deterministic unit tests.
  • src/daemon-client.ts — treats 4002 the same as 4001 (emits rejected + code),
    so an evicted session stops its reconnect loop instead of flapping.
  • src/bridge.ts — distinct bilingual user-facing message for 4002 ("evicted as
    stale") vs 4001 ("another session already connected"), so the user knows which
    branch applies.
  • Env knob: AGENTBRIDGE_LIVENESS_PROBE_TIMEOUT_MS (default 3000).
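
To make the admission flow above easier to review, here is a minimal sketch of challenge-on-contest with serialized challenges. It is not the code in src/daemon.ts: `AdmissionController`, `Frontend`, `Probe`, and the constant name `CLOSE_CODE_REJECTED` are hypothetical, and only the close code values 4001 and 4002 come from this PR.

```typescript
// Sketch only: hypothetical names, not the actual daemon.ts implementation.
const CLOSE_CODE_REJECTED = 4001;      // constant name assumed; value from the PR
const CLOSE_CODE_EVICTED_STALE = 4002; // name and value from the PR

interface Frontend {
  id: number;
  close(code: number, reason: string): void;
}

// Resolves to true when the incumbent answered the ping in time.
type Probe = (incumbent: Frontend) => Promise<boolean>;

class AdmissionController {
  private attached: Frontend | null = null;
  private challengeInProgress: Promise<void> | null = null;
  private probe: Probe;

  constructor(probe: Probe) {
    this.probe = probe;
  }

  get current(): Frontend | null {
    return this.attached;
  }

  // Returns true if the newcomer now owns the slot.
  async attach(newcomer: Frontend): Promise<boolean> {
    // Serialize contestants: wait until no challenge is in flight.
    while (this.challengeInProgress) await this.challengeInProgress;

    if (this.attached === null) {
      this.attached = newcomer;
      return true;
    }

    const incumbent = this.attached;
    let release: () => void = () => {};
    this.challengeInProgress = new Promise<void>((resolve) => {
      release = resolve;
    });
    try {
      const alive = await this.probe(incumbent);
      if (alive) {
        // Incumbent is live: the newcomer is the one turned away.
        newcomer.close(CLOSE_CODE_REJECTED, "another session already connected");
        return false;
      }
      // No pong within the timeout: evict the incumbent, admit the newcomer.
      incumbent.close(CLOSE_CODE_EVICTED_STALE, "evicted as stale");
      this.attached = newcomer;
      return true;
    } finally {
      this.challengeInProgress = null;
      release();
    }
  }
}
```

In this sketch a second contestant that arrives mid-challenge waits, then challenges whoever won the slot, so at most one frontend is attached at any time.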

Tests

  • Unit — src/unit-test/liveness-probe.test.ts (7 tests): pong-before-timeout,
    no-pong-timeout, not-OPEN-early-return, ping-throws, readyState-transitions-mid-probe,
    stale-pong-not-trusted (baseline semantic), injected-clock-deterministic-timeout.
  • Unit — src/unit-test/daemon-client.test.ts: extended existing 4001 coverage
    with a mirrored 4002 test and asserts the rejected event now carries the
    close code.
  • Full suite: 183 pass / 0 fail / 510 expect() calls.
  • bun run check clean: typecheck + tests + plugin-sync + plugin version align.

Test plan

See src/unit-test/e2e/issue-68-stale-frontend.md for the 5-scenario plan:

  • Test 1 — happy path unchanged (single-session admission still works)
  • Test 2 — rejection still blocks a live second session (4001 regression guard)
  • Test 3 — stale frontend evicted when old process is killed hard (the core fix;
    kill -9 the Claude PID, start a new agentbridge claude within 30 s, verify
    new session becomes ready within ~3-5 s)
  • Test 4 — concurrent contestants serialize safely (one wins, no double-attach)
  • Test 5 — agentbridge kill sentinel still wins (probe doesn't revive a session)

Please run Test 3 at minimum — it's the exact repro for the reported bug.

🤖 Generated with Claude Code

…stale Claude frontend

When Claude Code crashed or was killed hard, the OS never surfaced FIN on
the control WebSocket, so the daemon's `readyState` stayed OPEN forever
and new sessions got rejected with close code 4001 indefinitely.

Fix: challenge-on-contest admission. When a new frontend arrives and the
slot is occupied, ping the incumbent and wait up to 3 s for a pong. If no
pong, evict the incumbent with a new close code 4002
(`CLOSE_CODE_EVICTED_STALE`) and admit the newcomer.

Client-side changes:
- `daemon-client.ts` now treats 4002 the same as 4001 — emits `rejected`
  with the close code, stops reconnect loop.
- `bridge.ts` shows a distinct bilingual "evicted as stale" notification
  for 4002 so the user knows their session was kicked, not rejected as a
  late-comer.
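
As a rough illustration of the client-side behaviour described above, the sketch below maps a close code to an action. `CloseAction` and `onControlSocketClose` are hypothetical names, and the message strings are placeholders rather than the bilingual text that bridge.ts actually shows.

```typescript
// Sketch only: hypothetical names and placeholder messages.
type CloseAction =
  | { kind: "rejected"; code: number; message: string } // stop reconnecting
  | { kind: "reconnect" };                               // keep the retry loop

function onControlSocketClose(code: number): CloseAction {
  switch (code) {
    case 4001:
      // A live session already owns the slot.
      return { kind: "rejected", code, message: "Another session is already connected." };
    case 4002:
      // This session failed the liveness probe and was replaced.
      return { kind: "rejected", code, message: "This session was evicted as stale." };
    default:
      // Any other close is assumed to be transient in this sketch.
      return { kind: "reconnect" };
  }
}
```

Carrying the code on the `rejected` result is what lets the UI layer pick a different message for each case without re-deriving the reason.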

Extracted the probe loop into `src/liveness-probe.ts` with an injected
clock and sleep, so the timeout semantics (baseline pong timestamp, early
abort on non-OPEN state, ping throw handling) are covered by 7 unit tests
without a real WebSocket.
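
For readers who have not opened the new file, a probe with these semantics could look roughly like the sketch below. It is an approximation under stated assumptions, not the contents of `src/liveness-probe.ts`: `ProbeTarget`, `ProbeDeps`, `probeLiveness`, the polling interval, and the choice to report a non-OPEN socket as stale are all hypothetical.

```typescript
// Sketch only: hypothetical names; the real probe may differ.
const OPEN = 1; // WebSocket readyState value for OPEN

interface ProbeTarget {
  readyState: number;
  lastPongAt: number; // timestamp of the most recent pong, 0 if none yet
  ping(): void;       // may throw if the socket is already broken
}

interface ProbeDeps {
  now(): number;                    // injected clock, in milliseconds
  sleep(ms: number): Promise<void>; // injected sleep
}

type ProbeResult = "alive" | "stale";

async function probeLiveness(
  target: ProbeTarget,
  timeoutMs: number,
  deps: ProbeDeps,
  pollMs: number = 50,
): Promise<ProbeResult> {
  // A socket that is not OPEN cannot answer, so return early (assumed: stale).
  if (target.readyState !== OPEN) return "stale";

  // Baseline: a pong seen before this probe started must not count.
  const baseline = target.lastPongAt;
  const deadline = deps.now() + timeoutMs;

  try {
    target.ping();
  } catch {
    return "stale"; // a throwing ping is treated as a dead peer
  }

  while (deps.now() < deadline) {
    if (target.lastPongAt > baseline) return "alive";
    if (target.readyState !== OPEN) return "stale";
    await deps.sleep(pollMs);
  }
  // Final check in case the pong landed during the last sleep.
  return target.lastPongAt > baseline ? "alive" : "stale";
}
```

With a fake clock whose `sleep` simply advances `now`, a 3 s timeout can be exercised without any real waiting, which is the point of injecting both.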

- Unit tests: liveness-probe (7), daemon-client.test.ts mirrored 4001/4002
- E2E plan: `src/unit-test/e2e/issue-68-stale-frontend.md` — 5 scenarios
- Env knob: `AGENTBRIDGE_LIVENESS_PROBE_TIMEOUT_MS` (default 3000)
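
One way the env knob could be read defensively is sketched below. `resolveProbeTimeoutMs` is a hypothetical helper, and the fallback-on-invalid behaviour is an assumption; only the variable name and the default of 3000 come from this PR.

```typescript
// Sketch only: hypothetical helper, not the daemon's actual parsing code.
function resolveProbeTimeoutMs(
  env: Record<string, string | undefined>,
  fallback: number = 3000,
): number {
  const raw = env["AGENTBRIDGE_LIVENESS_PROBE_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Reject NaN, Infinity, zero, and negatives rather than probing with them.
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : fallback;
}
```

Under this sketch the daemon would call it once at startup with its environment object (for example `process.env`), so a malformed value degrades to the default instead of disabling the probe.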

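When Claude Code crashes or is force-killed, the OS may never deliver a FIN on
the control WebSocket, so the daemon's readyState reports OPEN forever and new
sessions are rejected with 4001 indefinitely.

Fix: challenge-on-contest admission. When a new frontend arrives and the slot
is occupied, ping the existing connection and wait up to 3 seconds for a pong.
On timeout, evict the old connection with the new close code 4002
(CLOSE_CODE_EVICTED_STALE) and admit the new one.

The client handles 4002 in step: like 4001, it enters the rejected state and
stops reconnecting, but it shows a different bilingual (Chinese/English)
message so the user can tell whether the session was taken over by a new
session (4002) or rejected as a late-comer (4001).

The probe loop is extracted into src/liveness-probe.ts with an injectable
clock and sleep; 7 unit tests cover edge cases such as the baseline pong,
early exit on a non-OPEN state, and a throwing ping.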

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 84d983ec36

Comment thread src/daemon.ts
Comment on lines +289 to +291
    attachClaude(ws).catch((err) => {
      log(`attachClaude threw for #${ws.data.clientId}: ${err?.message ?? err}`);
    });

P1: Block contestant traffic until admission decision

Starting attachClaude asynchronously here creates a new window where the socket is connected but not yet admitted/rejected while the liveness probe runs (up to 3s). During that window, claude_to_codex messages are still processed without verifying ws === attachedClaude, so a second session that is ultimately rejected with 4001 can still inject work into Codex before the close arrives, breaking the single-active-session guarantee.
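
One possible shape for such a guard, as a sketch only: hold messages from a connection whose admission is still undecided, flush them if it is admitted, and discard them if it is rejected. `AdmissionGate`, `Conn`, and `Forward` are hypothetical names and do not appear in the PR; the real daemon might equally well drop undecided traffic outright.

```typescript
// Sketch only: hypothetical names, not code from this PR.
interface Conn {
  readonly id: number;
}
type Forward = (payload: string) => void;

class AdmissionGate {
  private attached: Conn | null = null;
  private pending = new Map<Conn, string[]>();
  private forward: Forward;

  constructor(forward: Forward) {
    this.forward = forward;
  }

  // Called when a socket connects, before the admission decision is made.
  onConnect(conn: Conn): void {
    this.pending.set(conn, []);
  }

  onMessage(conn: Conn, payload: string): void {
    if (conn === this.attached) {
      this.forward(payload); // admitted session: pass through
      return;
    }
    const queue = this.pending.get(conn);
    if (queue) queue.push(payload); // undecided: hold, do not forward
    // Otherwise the connection was rejected or is unknown: drop silently.
  }

  admit(conn: Conn): void {
    const queued = this.pending.get(conn) ?? [];
    this.pending.delete(conn);
    this.attached = conn;
    for (const payload of queued) this.forward(payload);
  }

  reject(conn: Conn): void {
    this.pending.delete(conn); // held messages are discarded with the queue
  }
}
```

A production version would presumably cap the held queue, since a contestant could otherwise buffer unbounded data during the probe window.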


Development

Successfully merging this pull request may close these issues.

daemon: stale frontend socket blocks reconnection after Claude Code exits
