fix: clean up stale Codex app-server processes on Windows pre-flight #76

Open
Tominori666 wants to merge 1 commit into raysonmeng:master from Tominori666:fix/codex-orphan-cleanup

@Tominori666

Summary

When the AgentBridge daemon is hard-killed (e.g. Stop-Process, taskkill, IDE-driven plugin reload), its spawned codex.exe child does not die with it on Windows. The orphan keeps holding port 4500, so the next daemon startup spawns a fresh codex.exe that exits with os error 10048 (port in use) — surfaced to the user as the recurring "PARTIAL state, codex exit code 66" failure that requires a manual agentbridge kill every time.

The existing pre-flight cleanup in CodexAdapter.start() was Unix-only (lsof / ps / kill), so on Windows the occupied-port check fell through silently and the spawn proceeded into the EADDRINUSE failure.
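The gap described above can be illustrated with a minimal sketch of platform-aware port lookup. The names `portPidsCommand` and `parsePids` are hypothetical and are not taken from the PR; the Windows pipeline through `Select-Object -ExpandProperty OwningProcess` and the `-NoProfile` flag are assumptions about how the listed cmdlet might be invoked, not the PR's actual code.

```typescript
// Hypothetical sketch: portPidsCommand and parsePids are illustrative names,
// not functions confirmed to exist in the PR.
type OsPlatform = "win32" | "linux" | "darwin";

interface ShellCommand {
  file: string;
  args: string[];
}

// Build the command that lists PIDs bound to `port` on the given platform.
// A Unix-only version would always return the lsof branch, which is why the
// check found nothing on Windows.
function portPidsCommand(port: number, platform: OsPlatform): ShellCommand {
  if (platform === "win32") {
    const script =
      `Get-NetTCPConnection -LocalPort ${port} -State Listen -ErrorAction SilentlyContinue` +
      " | Select-Object -ExpandProperty OwningProcess";
    return { file: "powershell.exe", args: ["-NoProfile", "-Command", script] };
  }
  return { file: "lsof", args: ["-ti", `:${port}`] };
}

// Parse one-PID-per-line output into unique positive integers.
// Handles both \n and \r\n line endings and ignores blank lines.
function parsePids(stdout: string): number[] {
  return stdout
    .split(/\r?\n/)
    .map((line) => parseInt(line.trim(), 10))
    .filter((pid) => pid > 0)
    .filter((pid, index, all) => all.indexOf(pid) === index);
}
```

Keeping command construction and output parsing as pure functions means both platform branches can be unit-tested on any OS without spawning a process.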

Changes

  • src/codex-adapter.ts: replace Unix-only helpers with platform-aware implementations:
    • getPortPids(port): Windows uses Get-NetTCPConnection -LocalPort PORT -State Listen; Unix uses lsof -ti :PORT
    • getProcessCommandLine(pid): Windows uses Get-CimInstance Win32_Process; Unix uses ps -p PID -o args=
    • killProcess(pid): Windows uses Stop-Process -Force; Unix uses kill PID
    • isCodexAppServerCommandLine(cmdline): single source of truth for deciding whether a port owner is a Codex app-server
  • Foreign port owners still fail explicitly with Port X is already in use by non-Codex process(es)… because we never silently kill processes we didn't spawn.
  • src/unit-test/codex-adapter.test.ts — added coverage for the stale-Codex cleanup path and the foreign-process refusal path.
  • plugins/agentbridge/server/daemon.js — rebuilt bundle.
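The kill-or-refuse decision in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `PortOwner` and `planPortCleanup` are hypothetical names, the regular expression is one guess at how `isCodexAppServerCommandLine` might match, and treating an unreadable command line as foreign is an assumption. Only the error message prefix comes from the description; the real message continues with details that are not shown there.

```typescript
// Hypothetical sketch; only the name isCodexAppServerCommandLine and the
// error message prefix come from the PR description.

// True when the command line looks like `codex app-server` or
// `...\codex.exe app-server`, with or without a quoted executable path.
function isCodexAppServerCommandLine(cmdline: string): boolean {
  return /(^|[\\\/"'\s])codex(\.exe)?["']?\s+(.*\s)?app-server(\s|$)/i.test(cmdline);
}

interface PortOwner {
  pid: number;
  // null when the command line could not be read (assumed to mean "foreign").
  commandLine: string | null;
}

// Return the PIDs that are safe to kill, or throw if any owner of the port
// is not a Codex app-server. Nothing is killed when a foreign owner exists.
function planPortCleanup(port: number, owners: PortOwner[]): number[] {
  const foreign = owners.filter(
    (owner) => owner.commandLine === null || !isCodexAppServerCommandLine(owner.commandLine)
  );
  if (foreign.length > 0) {
    // Prefix taken from the PR description; the rest of the real message is not shown there.
    throw new Error(`Port ${port} is already in use by non-Codex process(es)`);
  }
  return owners.map((owner) => owner.pid);
}
```

Separating the plan from the kill keeps the safety rule in one place: a caller would pass the returned PIDs to `killProcess` only after `planPortCleanup` has returned without throwing.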

Caveat

This is recovery on the next startup, not a Job Object guarantee that children die when the daemon dies. Graceful daemon shutdown still calls codex.stop(), unchanged. A Win32 Job Object (with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE) would be cleaner, because the OS would kill the children at the moment the daemon dies, but it is a much larger change. This patch gets ~90% of the value with ~10% of the risk and unblocks the chronic friction immediately.

Relation to #75

Independent. #75 fixes stale Claude attach ownership at the WebSocket / control-protocol layer (probe + ack). This fixes stale Codex app-server process ownership at the OS / port layer. Both can ship independently and address different failure modes that look similar from the outside.

Verification

  • bun run typecheck OK
  • bun test src/unit-test/codex-adapter.test.ts → 44 pass
  • bun run build:plugin OK
  • bun run verify:plugin-sync OK
  • Reproduced locally on Windows 11 / Bun on a company PC: without this patch the orphan-codex-on-4500 failure occurs on every daemon hard-restart; with this patch it is resolved.

Test plan

  • Maintainer-side smoke test: hard-kill the daemon while the codex TUI is attached, then verify that the next agentbridge startup recovers cleanly without a manual agentbridge kill.
  • Confirm that the foreign-process refusal still triggers when the port is held by an unrelated PID (e.g. python -m http.server 4500).

🤖 Generated with Claude Code

When the AgentBridge daemon is hard-killed (Stop-Process / taskkill), its
spawned codex.exe child does not die with it on Windows. The orphan keeps
holding port 4500, so the next daemon startup spawns a new codex.exe that
exits with `os error 10048` (port in use) — surfaced to the user as the
chronic "PARTIAL state, codex exit code 66" failure that needs manual
`agentbridge kill` every time.

Existing pre-flight cleanup in CodexAdapter.start() was Unix-only
(lsof/ps/kill), so on Windows the occupied-port check silently
fell through.

Changes:
- src/codex-adapter.ts: replace Unix-only helpers with platform-aware
  getPortPids / getProcessCommandLine / killProcess / isCodexAppServerCommandLine.
  Windows uses Get-NetTCPConnection + Get-CimInstance + Stop-Process.
  Foreign port owners still throw explicitly — only matching `codex
  app-server` PIDs are killed.
- src/unit-test/codex-adapter.test.ts: add coverage for stale Codex
  cleanup and non-Codex refusal.
- plugins/agentbridge/server/daemon.js: rebuilt bundle.

Caveat: this is recovery-on-next-startup, not a Job Object kill-on-daemon-
death guarantee. Graceful shutdown still calls codex.stop() unchanged.
A Job Object follow-up would be cleaner but is much larger surgery; this
gets ~90% of the value with ~10% of the risk.

Independent of raysonmeng#75 — that PR fixes stale Claude attach ownership at the
WebSocket layer; this fixes stale Codex app-server process ownership at
the OS layer. Both can ship independently.

Verification:
- bun run typecheck OK
- bun test src/unit-test/codex-adapter.test.ts → 44 pass
- bun run build:plugin OK
- bun run verify:plugin-sync OK

Co-Authored-By: Codex <noreply@openai.com>