
agent/engine: add idle-message timeout to receive_response loop#66

Merged
pufit merged 2 commits into ClickHouse:main from alex-fedotyev:claude/fix-cli-hang-timeout
May 13, 2026

Conversation

@alex-fedotyev
Contributor

Summary

Fixes chat sessions that stall indefinitely with is_running=true and no progress. Root cause: the Claude CLI subprocess can hang on socket I/O (stuck Anthropic API call, broken stdio pipe, etc.) and client.receive_response()'s async generator blocks forever. The engine has no other signal that the CLI is dead, the per-session lock stays held, queued user messages back up, and /stop has no effect because the asyncio task is sitting in an uncancellable await.

The synthetic-done backstop from #64 covers three other "no terminal event" shapes (post-stream DB exception, lost WS, streaming-state asymmetry) but can't help when the loop never exits in the first place.

Fix

Wrap each __anext__ of receive_response() in asyncio.wait_for. If no SDK message of any kind arrives within agent.cli_idle_timeout_seconds (default 900s = 15 min), close the iterator and raise TimeoutError. That falls through to the existing CLI-crash retry path in _run_inner, which discards the client, reconnects, and either retries (if no content yet) or surfaces the error (if content already streamed).

The timeout is per-message, not per-turn. A Bash with timeout=600000ms streams tool_use/tool_result chunks that reset the idle timer, so legitimate long tool calls don't trip it. Set agent.cli_idle_timeout_seconds: 0 in config.local.yaml to disable.

Adjacent fixes shipped in the same patch

Both are triggered by the same family of stalls and are too small to split:

  1. plans.py:254 — POST /api/plans/{id}/approve scheduled _run_impl via asyncio.create_task without engine.register_task, so /stop couldn't cancel a stuck impl session. The asyncio task was invisible to the stop registry. Now registered.

  2. sessions.py register_task — the previous done-callback popped the session_id unconditionally. A racing register_task call (or, now, the plan-approve register layered with engine.run's implicit register) could leave the new task without a stop-registry entry when the old task finished. The callback now identity-checks the task before popping, and emits a warning when register_task replaces a still-live task.
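The identity-check fix in item 2 can be sketched as follows. The `SessionRegistry` class and its attribute names are illustrative, not the actual sessions.py code; what matters is that the done-callback only pops the registry entry if it still points at the task that finished.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class SessionRegistry:
    """Maps session_id -> currently running asyncio.Task so /stop can cancel it."""

    def __init__(self) -> None:
        self._tasks: dict[str, asyncio.Task] = {}

    def register_task(self, session_id: str, task: asyncio.Task) -> None:
        old = self._tasks.get(session_id)
        if old is not None and not old.done():
            # Replacing a still-live task is suspicious; surface it.
            logger.warning("register_task replacing live task for %s", session_id)
        self._tasks[session_id] = task

        def _on_done(done: asyncio.Task) -> None:
            # Identity check: only pop the entry if it still points at *this*
            # task. The old unconditional pop would clobber a newer
            # registration when the old task finished after a replacement.
            if self._tasks.get(session_id) is done:
                self._tasks.pop(session_id, None)

        task.add_done_callback(_on_done)
```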

Tests

New tests pass; full suite has one pre-existing unrelated capitalization mismatch in test_cli_upgrade.py not touched here.

  • test_engine.py::test_iter_response_yields_messages_normally
  • test_engine.py::test_iter_response_raises_on_idle_timeout
  • test_engine.py::test_iter_response_disabled_when_timeout_zero
  • test_engine.py::test_iter_response_handles_empty_stream
  • test_sessions.py::test_register_task_replacement_does_not_clobber_new_entry

Live repro before this patch

Session eb740973 hung 15+ minutes after the user sent a follow-up message:

  • is_running=true since 02:33:05; no DB activity for 15 min
  • queued user message at 02:47:04 sitting behind the per-session asyncio lock
  • CLI subprocess (PID 24391) alive in S state, blocked in __arm64_sys_epoll_pwait2
  • kill -9 24391 immediately unblocked the lock; the existing exception handler retried and the user message processed on retry

This patch makes that recovery automatic after cli_idle_timeout_seconds.

Test plan

  • Unit tests for the new helper and the register_task race
  • Full suite green except one pre-existing unrelated failure (test_cli_upgrade.py::test_docker_mode_bails_out — capitalization)
  • Apply locally on claude/engine-sdk-resume-guard (this PR cherry-picked on top) and observe a new stall: confirm cli_idle_timeout_seconds fires and the error path runs cleanly

Symptom: chats stall indefinitely with is_running=true and no progress.
The Claude CLI subprocess is alive but blocked on socket I/O (stuck
Anthropic API call, broken stdio pipe, etc.). receive_response()'s
async generator never yields, the engine's per-session lock stays
held, queued user messages back up forever, and /stop has no effect
because the asyncio.Task is sitting in an uncancellable await.

The synthetic-done fix in eff4394 covered three other shapes of "no
terminal event sent" (post-stream DB exception, lost WS message,
streaming-state asymmetry). It can't help here: the loop never exits
in the first place, so the finally block that ships the synthetic
done never runs.

Fix: wrap each __anext__ of receive_response in asyncio.wait_for. If
no SDK message of any kind arrives within agent.cli_idle_timeout_seconds
(default 900s = 15 min), close the iterator and raise TimeoutError.
That falls through to the existing CLI-crash retry path in _run_inner
(line 1846-1865 on origin/main), which discards the client, reconnects,
and either retries (no content yet) or surfaces the error (content
already streamed).

The timeout is per-message, not per-turn — a Bash with timeout=600000ms
streams tool_use/tool_result chunks that reset the idle timer, so
legitimate long tool calls aren't affected. Set
agent.cli_idle_timeout_seconds=0 in config.local.yaml to restore the
old behaviour.

Two adjacent fixes shipped in the same commit because they're triggered
by the same set of stalls:

1. plans.py:254 — POST /api/plans/{id}/approve schedules _run_impl
   via asyncio.create_task without registering it through
   engine.register_task, so /stop can't cancel a stuck impl session.
   Register the task.

2. sessions.py register_task — the prior done-callback popped the
   session_id unconditionally. A racing register_task call (or now,
   the plan-approve register on top of the engine.run() implicit
   register) could leave the new task without a stop registry entry
   when the old task finished. Identity-check the task before popping.

Tests:
- test_engine.py: four cases for _iter_response_with_timeout
  (normal stream, idle timeout raises and closes iterator, timeout=0
  disables, empty stream).
- test_sessions.py: one case for register_task replacement (old
  task's done-callback must not clobber the new entry).

Live repro before this patch: session eb740973 hung 15+ min, CLI
PID 24391 sleeping in __arm64_sys_epoll_pwait2; SIGKILL on the CLI
unblocked the lock and the queued user message processed on retry.
@alex-fedotyev
Contributor Author

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

@pufit pufit merged commit 3338f66 into ClickHouse:main May 13, 2026
pufit pushed a commit that referenced this pull request May 13, 2026
User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller and chatStore.sendMessage has already optimistically appended the
user message and set isStreaming=true. The user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.
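A sketch of the lifted helper and its two call sites, gated as described above. The function signatures and field names here are assumptions reconstructed from the commit message, not the real server.py/broadcaster API.

```python
async def _send_session_status(ws, session, broadcaster) -> None:
    """Build and send one session_status frame, including any buffered
    in-flight events so a reconnecting client can restore its view."""
    await ws.send_json({
        "type": "session_status",
        "session_id": session.id,
        "is_running": session.is_running,
        "buffered_events": broadcaster.buffered_events(session.id),
    })

async def on_initial_bind(ws, session, broadcaster) -> None:
    # Initial connect: only ship status when a stream is in flight,
    # so idle sessions stay silent.
    if broadcaster.is_buffering(session.id):
        await _send_session_status(ws, session, broadcaster)

async def on_switch_session(ws, session, broadcaster) -> None:
    # Session switch: call unconditionally so the client refreshes
    # is_running/status on every selection.
    await _send_session_status(ws, session, broadcaster)
```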

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
