agent/engine: add idle-message timeout to receive_response loop#66
Merged
pufit merged 2 commits intoMay 13, 2026
Merged
Conversation
Symptom: chats stall indefinitely with is_running=true and no progress.
The Claude CLI subprocess is alive but blocked on socket I/O (stuck
Anthropic API call, broken stdio pipe, etc.). receive_response()'s
async generator never yields, the engine's per-session lock stays
held, queued user messages back up forever, and /stop has no effect
because the asyncio.Task is sitting in an uncancellable await.
The synthetic-done fix in eff4394 covered three other shapes of "no
terminal event sent" (post-stream DB exception, lost WS message,
streaming-state asymmetry). It can't help here: the loop never exits
in the first place, so the finally block that ships the synthetic
done never runs.
Fix: wrap each __anext__ of receive_response in asyncio.wait_for. If
no SDK message of any kind arrives within agent.cli_idle_timeout_seconds
(default 900s = 15 min), close the iterator and raise TimeoutError.
That falls through to the existing CLI-crash retry path in _run_inner
(line 1846-1865 on origin/main), which discards the client, reconnects,
and either retries (no content yet) or surfaces the error (content
already streamed).
The timeout is per-message, not per-turn — a Bash with timeout=600000ms
streams tool_use/tool_result chunks that reset the idle timer, so
legitimate long tool calls aren't affected. Set
agent.cli_idle_timeout_seconds=0 in config.local.yaml to restore the
old behaviour.
Two adjacent fixes shipped in the same commit because they're triggered
by the same set of stalls:
1. plans.py:254 — POST /api/plans/{id}/approve schedules _run_impl
with asyncio.create_task without engine.register_task, so /stop
can't cancel a stuck impl session. Register the task.
2. sessions.py register_task — the prior done-callback popped the
session_id unconditionally. A racing register_task call (or now,
the plan-approve register on top of the engine.run() implicit
register) could leave the new task without a stop registry entry
when the old task finished. Identity-check the task before popping.
Tests:
- test_engine.py: four cases for _iter_response_with_timeout
(normal stream, idle timeout raises and closes iterator, timeout=0
disables, empty stream).
- test_sessions.py: one case for register_task replacement (old
task's done-callback must not clobber the new entry).
Live repro before this patch: session eb740973 hung 15+ min, CLI
PID 24391 sleeping in __arm64_sys_epoll_pwait2; SIGKILL on the CLI
unblocked the lock and the queued user message processed on retry.
This was referenced May 6, 2026
Contributor
Author
|
Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other. In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes. |
This was referenced May 13, 2026
pufit
approved these changes
May 13, 2026
pufit
pushed a commit
that referenced
this pull request
May 13, 2026
User-reported glitch: "I ask for something and there is no response, then I ask again and it answers to the previous question." Five reliability PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id, #66 idle timeout, #67 sticky session) each close one underlying cause. Two gaps remain that none of those PRs cover. Gap 1: client-side send silently drops payloads. web/src/api/websocket.ts checked readyState === OPEN and no-op'd otherwise. The 3s reconnect window leaves a hole: send() returns to the caller and chatStore.sendMessage has already optimistically appended the user message and set isStreaming=true. The user thinks the agent is thinking but the message never reached the server, so the next reply lands against a stale prompt. Track readyState explicitly. CONNECTING or reconnect-scheduled now queues the payload (bounded to 5 entries; oldest evicted) and flushes from onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the caller can revert. chatStore.sendMessage pops the optimistic user message on 'dropped' and surfaces an inline assistant error so the user can retry. Gap 2: gateway initial-bind never replayed the broadcaster buffer. The switch_session handler already shipped session_status with buffered_events on session switch, but the initial-connect handshake at server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop) and the in-flight stream was lost from the client's view even though the events sat in broadcaster._session_buffers waiting to be replayed. Lift the duplicated send-status construction into _send_session_status and call it from both branches. Initial-bind gates on broadcaster.is_buffering so idle sessions stay silent; switch_session calls unconditionally so the client refreshes is_running/status on every selection. The frontend handleSessionStatus already restores streamingBlocks, panels, todos, and interaction state from the buffer (handled by #69), so this is purely additive at the gateway. Tests: - 9 new asserts in tests/test_gateway_ws.py covering the helper output, the initial-bind gate, the switch_session regression path, and a load-fidelity check for buffer ordering. - Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated (test_bootstrap docker-env detection and test_cli_upgrade docker mode, both noted in notes/repo-conventions/nerve.md). - web/ tsc --noEmit clean, npm run build clean. Out of scope (followups, not blocking): - Stale-listener cleanup on swallowed send_json exceptions (server.py:298-301). - Application-level message_received ack from engine after sessions.add_message. - _session_locks TTL on session archive.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes chat sessions that stall indefinitely with
is_running=trueand no progress. Root cause: the Claude CLI subprocess can hang on socket I/O (stuck Anthropic API call, broken stdio pipe, etc.) andclient.receive_response()'s async generator blocks forever. The engine has no other signal that the CLI is dead, the per-session lock stays held, queued user messages back up, and/stophas no effect because the asyncio task is sitting in an uncancellableawait.The synthetic-done backstop from #64 covers three other "no terminal event" shapes (post-stream DB exception, lost WS, streaming-state asymmetry) but can't help when the loop never exits in the first place.
Fix
Wrap each
__anext__ofreceive_response()inasyncio.wait_for. If no SDK message of any kind arrives withinagent.cli_idle_timeout_seconds(default 900s = 15 min), close the iterator and raiseTimeoutError. That falls through to the existing CLI-crash retry path in_run_inner, which discards the client, reconnects, and either retries (if no content yet) or surfaces the error (if content already streamed).The timeout is per-message, not per-turn. A
Bashwithtimeout=600000msstreamstool_use/tool_resultchunks that reset the idle timer, so legitimate long tool calls don't trip it. Setagent.cli_idle_timeout_seconds: 0inconfig.local.yamlto disable.Adjacent fixes shipped in the same patch
Both are triggered by the same family of stalls and are too small to split:
plans.py:254—POST /api/plans/{id}/approvescheduled_run_implviaasyncio.create_taskwithoutengine.register_task, so/stopcouldn't cancel a stuck impl session. The asyncio task was invisible to the stop registry. Now registered.sessions.pyregister_task— the previous done-callback popped thesession_idunconditionally. A racingregister_taskcall (or, now, the plan-approve register layered withengine.run's implicit register) could leave the new task without a stop-registry entry when the old task finished. The callback now identity-checks the task before popping, and emits a warning when register_task replaces a still-live task.Tests
New tests pass; full suite has one pre-existing unrelated capitalization mismatch in
test_cli_upgrade.pynot touched here.test_engine.py::test_iter_response_yields_messages_normallytest_engine.py::test_iter_response_raises_on_idle_timeouttest_engine.py::test_iter_response_disabled_when_timeout_zerotest_engine.py::test_iter_response_handles_empty_streamtest_sessions.py::test_register_task_replacement_does_not_clobber_new_entryLive repro before this patch
Session
eb740973hung 15+ minutes after the user sent a follow-up message:is_running=truesince 02:33:05; no DB activity for 15 minSstate, blocked in__arm64_sys_epoll_pwait2kill -9 24391immediately unblocked the lock; the existing exception handler retried and the user message processed on retryThis patch makes that recovery automatic after
cli_idle_timeout_seconds.Test plan
test_cli_upgrade.py::test_docker_mode_bails_out— capitalization)claude/engine-sdk-resume-guard(this PR cherry-picked on top) and observe a new stall: confirmcli_idle_timeout_secondsfires and the error path runs cleanly