
agent/engine: add idle-message timeout to receive_response loop#66

Merged
pufit merged 2 commits into ClickHouse:main from alex-fedotyev:claude/fix-cli-hang-timeout
May 13, 2026

Conversation

@alex-fedotyev
Contributor

Summary

Fixes chat sessions that stall indefinitely with is_running=true and no progress. Root cause: the Claude CLI subprocess can hang on socket I/O (stuck Anthropic API call, broken stdio pipe, etc.) and client.receive_response()'s async generator blocks forever. The engine has no other signal that the CLI is dead, the per-session lock stays held, queued user messages back up, and /stop has no effect because the asyncio task is sitting in an uncancellable await.

The synthetic-done backstop from #64 covers three other "no terminal event" shapes (post-stream DB exception, lost WS, streaming-state asymmetry) but can't help when the loop never exits in the first place.

Fix

Wrap each __anext__ of receive_response() in asyncio.wait_for. If no SDK message of any kind arrives within agent.cli_idle_timeout_seconds (default 900s = 15 min), close the iterator and raise TimeoutError. That falls through to the existing CLI-crash retry path in _run_inner, which discards the client, reconnects, and either retries (if no content yet) or surfaces the error (if content already streamed).

The timeout is per-message, not per-turn. A Bash with timeout=600000ms streams tool_use/tool_result chunks that reset the idle timer, so legitimate long tool calls don't trip it. Set agent.cli_idle_timeout_seconds: 0 in config.local.yaml to disable.

Adjacent fixes shipped in the same patch

Both are triggered by the same family of stalls and are too small to split:

  1. plans.py:254 — POST /api/plans/{id}/approve scheduled _run_impl via asyncio.create_task without engine.register_task, so /stop couldn't cancel a stuck impl session. The asyncio task was invisible to the stop registry. Now registered.

  2. sessions.py register_task — the previous done-callback popped the session_id unconditionally. A racing register_task call (or, now, the plan-approve register layered with engine.run's implicit register) could leave the new task without a stop-registry entry when the old task finished. The callback now identity-checks the task before popping, and emits a warning when register_task replaces a still-live task.
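The identity-check fix in item 2 can be sketched as follows. The `SessionRegistry` class and its attribute names are illustrative, not the actual sessions.py code; what matters is that the done-callback only pops the registry entry if it still points at the task that finished.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class SessionRegistry:
    """Maps session_id -> currently running asyncio.Task so /stop can cancel it."""

    def __init__(self) -> None:
        self._tasks: dict[str, asyncio.Task] = {}

    def register_task(self, session_id: str, task: asyncio.Task) -> None:
        old = self._tasks.get(session_id)
        if old is not None and not old.done():
            # Replacing a still-live task is suspicious; surface it.
            logger.warning("register_task replacing live task for %s", session_id)
        self._tasks[session_id] = task

        def _on_done(done: asyncio.Task) -> None:
            # Identity check: only pop the entry if it still points at *this*
            # task. The old unconditional pop would clobber a newer
            # registration when the old task finished after a replacement.
            if self._tasks.get(session_id) is done:
                self._tasks.pop(session_id, None)

        task.add_done_callback(_on_done)
```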

Tests

New tests pass; full suite has one pre-existing unrelated capitalization mismatch in test_cli_upgrade.py not touched here.

  • test_engine.py::test_iter_response_yields_messages_normally
  • test_engine.py::test_iter_response_raises_on_idle_timeout
  • test_engine.py::test_iter_response_disabled_when_timeout_zero
  • test_engine.py::test_iter_response_handles_empty_stream
  • test_sessions.py::test_register_task_replacement_does_not_clobber_new_entry

Live repro before this patch

Session eb740973 hung 15+ minutes after the user sent a follow-up message:

  • is_running=true since 02:33:05; no DB activity for 15 min
  • queued user message at 02:47:04 sitting behind the per-session asyncio lock
  • CLI subprocess (PID 24391) alive in S state, blocked in __arm64_sys_epoll_pwait2
  • kill -9 24391 immediately unblocked the lock; the existing exception handler retried and the user message processed on retry

This patch makes that recovery automatic after cli_idle_timeout_seconds.

Test plan

  • Unit tests for the new helper and the register_task race
  • Full suite green except one pre-existing unrelated failure (test_cli_upgrade.py::test_docker_mode_bails_out — capitalization)
  • Apply locally on claude/engine-sdk-resume-guard (this PR cherry-picked on top) and observe a new stall: confirm cli_idle_timeout_seconds fires and the error path runs cleanly

Symptom: chats stall indefinitely with is_running=true and no progress.
The Claude CLI subprocess is alive but blocked on socket I/O (stuck
Anthropic API call, broken stdio pipe, etc.). receive_response()'s
async generator never yields, the engine's per-session lock stays
held, queued user messages back up forever, and /stop has no effect
because the asyncio.Task is sitting in an uncancellable await.

The synthetic-done fix in eff4394 covered three other shapes of "no
terminal event sent" (post-stream DB exception, lost WS message,
streaming-state asymmetry). It can't help here: the loop never exits
in the first place, so the finally block that ships the synthetic
done never runs.

Fix: wrap each __anext__ of receive_response in asyncio.wait_for. If
no SDK message of any kind arrives within agent.cli_idle_timeout_seconds
(default 900s = 15 min), close the iterator and raise TimeoutError.
That falls through to the existing CLI-crash retry path in _run_inner
(line 1846-1865 on origin/main), which discards the client, reconnects,
and either retries (no content yet) or surfaces the error (content
already streamed).

The timeout is per-message, not per-turn — a Bash with timeout=600000ms
streams tool_use/tool_result chunks that reset the idle timer, so
legitimate long tool calls aren't affected. Set
agent.cli_idle_timeout_seconds=0 in config.local.yaml to restore the
old behaviour.

Two adjacent fixes shipped in the same commit because they're triggered
by the same set of stalls:

1. plans.py:254 — POST /api/plans/{id}/approve schedules _run_impl
   via asyncio.create_task without registering it through
   engine.register_task, so /stop can't cancel a stuck impl session.
   Register the task.

2. sessions.py register_task — the prior done-callback popped the
   session_id unconditionally. A racing register_task call (or now,
   the plan-approve register on top of the engine.run() implicit
   register) could leave the new task without a stop registry entry
   when the old task finished. Identity-check the task before popping.

Tests:
- test_engine.py: four cases for _iter_response_with_timeout
  (normal stream, idle timeout raises and closes iterator, timeout=0
  disables, empty stream).
- test_sessions.py: one case for register_task replacement (old
  task's done-callback must not clobber the new entry).

Live repro before this patch: session eb740973 hung 15+ min, CLI
PID 24391 sleeping in __arm64_sys_epoll_pwait2; SIGKILL on the CLI
unblocked the lock and the queued user message processed on retry.
@alex-fedotyev
Contributor Author

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

@pufit pufit merged commit 3338f66 into ClickHouse:main May 13, 2026
pufit pushed a commit that referenced this pull request May 13, 2026
User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller and chatStore.sendMessage has already optimistically appended the
user message and set isStreaming=true. The user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.
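A sketch of the lifted helper and its two call sites, gated as described above. The function signatures and field names here are assumptions reconstructed from the commit message, not the real server.py/broadcaster API.

```python
async def _send_session_status(ws, session, broadcaster) -> None:
    """Build and send one session_status frame, including any buffered
    in-flight events so a reconnecting client can restore its view."""
    await ws.send_json({
        "type": "session_status",
        "session_id": session.id,
        "is_running": session.is_running,
        "buffered_events": broadcaster.buffered_events(session.id),
    })

async def on_initial_bind(ws, session, broadcaster) -> None:
    # Initial connect: only ship status when a stream is in flight,
    # so idle sessions stay silent.
    if broadcaster.is_buffering(session.id):
        await _send_session_status(ws, session, broadcaster)

async def on_switch_session(ws, session, broadcaster) -> None:
    # Session switch: call unconditionally so the client refreshes
    # is_running/status on every selection.
    await _send_session_status(ws, session, broadcaster)
```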

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
