fix: clean up stale Codex app-server processes on Windows pre-flight #76
Open
Tominori666 wants to merge 1 commit into raysonmeng:master
When the AgentBridge daemon is hard-killed (Stop-Process / taskkill), its spawned codex.exe child does not die with it on Windows. The orphan keeps holding port 4500, so the next daemon startup spawns a new codex.exe that exits with `os error 10048` (port in use). The user sees this as the chronic "PARTIAL state, codex exit code 66" failure that needs a manual `agentbridge kill` every time.

Existing pre-flight cleanup in CodexAdapter.start() was Unix-only (lsof/ps/kill), so on Windows the occupied-port check silently fell through.

Changes:
- src/codex-adapter.ts: replace Unix-only helpers with platform-aware getPortPids / getProcessCommandLine / killProcess / isCodexAppServerCommandLine. Windows uses Get-NetTCPConnection + Get-CimInstance + Stop-Process. Foreign port owners still throw explicitly; only matching `codex app-server` PIDs are killed.
- src/unit-test/codex-adapter.test.ts: add coverage for stale Codex cleanup and non-Codex refusal.
- plugins/agentbridge/server/daemon.js: rebuilt bundle.

Caveat: this is recovery-on-next-startup, not a Job Object kill-on-daemon-death guarantee. Graceful shutdown still calls codex.stop() unchanged. A Job Object follow-up would be cleaner but is much larger surgery; this gets ~90% of the value with ~10% of the risk.

Independent of raysonmeng#75: that PR fixes stale Claude attach ownership at the WebSocket layer; this fixes stale Codex app-server process ownership at the OS layer. Both can ship independently.

Verification:
- bun run typecheck OK
- bun test src/unit-test/codex-adapter.test.ts → 44 pass
- bun run build:plugin OK
- bun run verify:plugin-sync OK

Co-Authored-By: Codex <noreply@openai.com>
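The kill-or-refuse rule in the commit message (only matching `codex app-server` PIDs are killed, foreign port owners throw) can be sketched as a pure function. This is a minimal illustration, not the code from `src/codex-adapter.ts`. The `PortOwner` shape, the `planStaleCodexCleanup` name, the regular expression, the exact error text after "process(es)", and the choice to treat an unreadable command line as foreign are all assumptions.

```typescript
// Hypothetical shape for one process found listening on the port.
interface PortOwner {
  pid: number;
  // null when the command line could not be read (assumption: treated as foreign).
  commandLine: string | null;
}

// Assumed matcher: a command line counts as a Codex app-server when it names
// a `codex` (or `codex.exe`) binary followed by the `app-server` subcommand.
function isCodexAppServerCommandLine(commandLine: string): boolean {
  return /\bcodex(\.exe)?\b.*\bapp-server\b/i.test(commandLine);
}

// Returns the PIDs that are safe to kill, or throws if any owner of the port
// is not recognisably a Codex app-server. Nothing is killed here; the caller
// decides what to do with the returned list.
function planStaleCodexCleanup(port: number, owners: PortOwner[]): number[] {
  const foreign = owners.filter(
    (owner) =>
      owner.commandLine === null ||
      !isCodexAppServerCommandLine(owner.commandLine),
  );
  if (foreign.length > 0) {
    const pids = foreign.map((owner) => owner.pid).join(", ");
    // Error wording after "process(es)" is illustrative only.
    throw new Error(
      `Port ${port} is already in use by non-Codex process(es): ${pids}`,
    );
  }
  return owners.map((owner) => owner.pid);
}
```

Keeping the decision separate from the actual kill makes both branches easy to unit-test without touching real processes: one case with a stale `codex.exe app-server` owner, one with something like `python -m http.server 4500`.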
This was referenced May 1, 2026
Summary
When the AgentBridge daemon is hard-killed (e.g. `Stop-Process`, `taskkill`, IDE-driven plugin reload), its spawned `codex.exe` child does not die with it on Windows. The orphan keeps holding port 4500, so the next daemon startup spawns a fresh `codex.exe` that exits with `os error 10048` (port in use). The user sees this as the recurring "PARTIAL state, codex exit code 66" failure that requires a manual `agentbridge kill` every time.

The existing pre-flight cleanup in `CodexAdapter.start()` was Unix-only (`lsof`/`ps`/`kill`), so on Windows the occupied-port check fell through silently and the spawn proceeded into the `EADDRINUSE` failure.

Changes
- `src/codex-adapter.ts`: replace Unix-only helpers with platform-aware implementations:
  - `getPortPids(port)`: Windows: `Get-NetTCPConnection -LocalPort -State Listen` / Unix: `lsof -ti :PORT`
  - `getProcessCommandLine(pid)`: Windows: `Get-CimInstance Win32_Process` / Unix: `ps -p PID -o args=`
  - `killProcess(pid)`: Windows: `Stop-Process -Force` / Unix: `kill PID`
  - `isCodexAppServerCommandLine(cmdline)`: single source of truth for matching `codex app-server` processes
  - Foreign port owners still throw explicitly (`Port X is already in use by non-Codex process(es)…`); we never silently kill processes we didn't spawn.
- `src/unit-test/codex-adapter.test.ts`: added coverage for the stale-Codex cleanup path and the foreign-process refusal path.
- `plugins/agentbridge/server/daemon.js`: rebuilt bundle.

Caveat
This is recovery-on-next-startup, not a Job Object kill-on-daemon-death guarantee. Graceful daemon shutdown still calls `codex.stop()`, unchanged. A Win32 Job Object (with `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`) would be cleaner, because the OS would kill children at the moment of daemon death, but it is much larger surgery. This patch gets ~90% of the value with ~10% of the risk and unblocks the chronic friction immediately.

Relation to #75
Independent. #75 fixes stale Claude attach ownership at the WebSocket / control-protocol layer (probe + ack). This fixes stale Codex app-server process ownership at the OS / port layer. Both can ship independently and address different failure modes that look similar from the outside.
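The platform-aware helpers listed under Changes pair each Unix command with a PowerShell equivalent. A minimal sketch of that mapping follows, assuming the commands are built as argument vectors and run elsewhere. The `Command` type, the function names ending in `Command`, `parsePids`, the `powershell.exe -NoProfile -NonInteractive` wrapper, and the `Select-Object -ExpandProperty` projections are illustrative choices, not taken from the PR.

```typescript
type Platform = "win32" | "unix";

// Hypothetical description of a process to spawn: executable plus arguments.
interface Command {
  file: string;
  args: string[];
}

// Wraps a PowerShell script so it runs without loading the user profile.
function powershell(script: string): Command {
  return {
    file: "powershell.exe",
    args: ["-NoProfile", "-NonInteractive", "-Command", script],
  };
}

// Lists the PIDs listening on a port, one PID per output line.
function portPidsCommand(platform: Platform, port: number): Command {
  if (platform === "win32") {
    return powershell(
      `Get-NetTCPConnection -LocalPort ${port} -State Listen -ErrorAction SilentlyContinue | ` +
        `Select-Object -ExpandProperty OwningProcess`,
    );
  }
  return { file: "lsof", args: ["-ti", `:${port}`] };
}

// Prints the full command line of a process.
function processCommandLineCommand(platform: Platform, pid: number): Command {
  if (platform === "win32") {
    return powershell(
      `Get-CimInstance Win32_Process -Filter "ProcessId = ${pid}" | ` +
        `Select-Object -ExpandProperty CommandLine`,
    );
  }
  return { file: "ps", args: ["-p", String(pid), "-o", "args="] };
}

// Terminates a process (forced on Windows, default signal on Unix).
function killProcessCommand(platform: Platform, pid: number): Command {
  if (platform === "win32") {
    return powershell(`Stop-Process -Id ${pid} -Force`);
  }
  return { file: "kill", args: [String(pid)] };
}

// Parses one-PID-per-line output into unique positive integers; tolerates
// CRLF line endings and blank lines.
function parsePids(stdout: string): number[] {
  const pids = stdout
    .split(/\r?\n/)
    .map((line) => Number.parseInt(line.trim(), 10))
    .filter((pid) => Number.isInteger(pid) && pid > 0);
  return Array.from(new Set(pids));
}
```

In a Node or Bun process the `Platform` value could be derived from `process.platform === "win32"`, and each `Command` passed to something like `child_process.execFile(file, args, callback)`. Building argument vectors rather than shell strings keeps the port and PID values out of shell interpolation on the Unix side.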
Verification
- `bun run typecheck` OK
- `bun test src/unit-test/codex-adapter.test.ts` → 44 pass
- `bun run build:plugin` OK
- `bun run verify:plugin-sync` OK

Test plan
- … `agentbridge` startup recovers cleanly without manual `agentbridge kill`.
- … `python -m http.server 4500`).

🤖 Generated with Claude Code