fix(web): refresh todos panel from buffered events on reconnect #69

Merged
pufit merged 1 commit into ClickHouse:main from alex-fedotyev:alex/fix-todo-panel-stale-buffer-replay
May 13, 2026

@alex-fedotyev
Contributor

Summary

Fixes #68. The todos panel in the web UI froze on a stale snapshot when a client reconnected in the middle of an active turn (page refresh, WS drop, or a tab backgrounded long enough to drop the connection). New TodoWrite calls kept landing in the WS buffer and in the persisted history, but the panel kept showing the count from the last TodoWrite that was persisted before the reconnect.

Cause

handleSessionStatus rebuilds streamingBlocks from msg.buffered_events via applyStreamEvent, but never updates currentTodos. The only currentTodos writers were:

  • live handleToolUse (only runs while WS is live, not during buffer replay), and
  • extractTodosFromMessages(hydrated) in switchSession (only sees persisted DB messages).

Any TodoWrite that arrived in the buffer for an in-flight turn before the reconnect bypassed both writers.

Fix

  • Add extractTodosFromBuffer(events) next to extractTodosFromMessages. Walks the buffered events backward, returns the latest top-level TodoWrite input, or null if there isn't one. Filters out sub-agent (parent_tool_use_id) entries to match the live handler's behaviour.
  • Call it from handleSessionStatus after rebuilding blocks. Update currentTodos only when the helper returns non-null; otherwise preserve whatever was already set from persisted history.

extractTodosFromBuffer does not skip the all-completed case (unlike extractTodosFromMessages), so the panel can show the final state and let its own 5s auto-hide animate.
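A minimal sketch of the helper and its call site, for illustration only. The `Todo` and `BufferedEvent` shapes, the `tool` and `input.todos` field names, and the `mergeTodos` wrapper are assumptions made for this example; the actual types in web/src/ may differ.

```typescript
// Hypothetical shapes: the real event and todo types in web/src/ may differ.
interface Todo {
  content: string;
  status: "pending" | "in_progress" | "completed";
}

interface BufferedEvent {
  type: string; // e.g. "tool_use"
  tool?: string; // tool name for tool_use events (assumed field name)
  input?: { todos?: Todo[] };
  parent_tool_use_id?: string | null; // set when a sub-agent issued the call
}

// Walk the buffer backward and return the newest top-level TodoWrite input,
// or null when the buffer holds none.
function extractTodosFromBuffer(events: BufferedEvent[]): Todo[] | null {
  for (let i = events.length - 1; i >= 0; i--) {
    const ev = events[i];
    if (!ev || ev.type !== "tool_use" || ev.tool !== "TodoWrite") continue;
    if (ev.parent_tool_use_id) continue; // skip sub-agent entries
    const todos = ev.input?.todos;
    // All-completed lists are returned as well, so the panel can show them.
    if (Array.isArray(todos)) return todos;
  }
  return null;
}

// Caller side (hypothetical wrapper): overwrite only when the buffer has a
// TodoWrite; otherwise keep what persisted history already provided.
function mergeTodos(current: Todo[], events: BufferedEvent[]): Todo[] {
  const fromBuffer = extractTodosFromBuffer(events);
  return fromBuffer !== null ? fromBuffer : current;
}
```

Returning null rather than an empty array lets the caller tell "no TodoWrite in the buffer" apart from "a TodoWrite that cleared the list".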

Test plan

  • npx tsc --noEmit clean in web/.
  • npm run build clean in web/.
  • Manual: Start a session with many TodoWrite calls in one turn, refresh the browser mid-turn, confirm the panel reflects the latest todos rather than the persisted snapshot.

Notes

No new unit tests; web/src/ has no test framework configured. Happy to add one in a follow-up if a runner gets wired in.

[no-changeset: allow] (no changeset infra in this repo).

handleSessionStatus rebuilds streamingBlocks from the WS buffer when a
client reconnects mid-turn (page refresh, WS drop, tab backgrounded), but
it never updated currentTodos. The todos panel was driven only by the
persisted-history extractor (extractTodosFromMessages on switchSession)
plus the live tool_use handler. If a TodoWrite landed in the buffer
between the last persisted assistant turn and the live reconnect, the
panel froze on the older snapshot from the persisted history while the
underlying messages already had the newer todos.

Add extractTodosFromBuffer alongside extractTodosFromMessages and call it
from handleSessionStatus. Returns null when no top-level TodoWrite is in
the buffer so the caller preserves whatever was already set from persisted
history. Unlike extractTodosFromMessages it does not skip the all-completed
case so the panel can show the final state and let its own 5s auto-hide
animate.

Repro before the fix:
1. Start a session that runs many TodoWrite calls in one turn.
2. Refresh the browser tab in the middle of the turn.
3. Panel sticks at the latest persisted TodoWrite even though new ones
   keep arriving in the buffer and (later) on the live stream.

@pufit merged commit 0cf5648 into ClickHouse:main on May 13, 2026

pufit pushed a commit that referenced this pull request on May 13, 2026

User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller and chatStore.sendMessage has already optimistically appended the
user message and set isStreaming=true. The user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.
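
The queueing rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in web/src/api/websocket.ts: `QueuedSender`, `SocketLike`, the `SendResult` type, and the `flush` method are hypothetical names, and the numeric constants mirror the standard WebSocket readyState values.

```typescript
type SendResult = "sent" | "queued" | "dropped";

// Minimal socket surface so the sketch does not depend on a browser WebSocket.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

// Same numeric values as WebSocket.CONNECTING / OPEN / CLOSING / CLOSED.
const WS_CONNECTING = 0;
const WS_OPEN = 1;
const WS_CLOSING = 2;
const WS_CLOSED = 3;
const MAX_QUEUED = 5; // bound from the description above

class QueuedSender {
  private queue: string[] = [];
  private socket: SocketLike | null;
  private reconnectScheduled: boolean;

  constructor(socket: SocketLike | null, reconnectScheduled: boolean) {
    this.socket = socket;
    this.reconnectScheduled = reconnectScheduled;
  }

  send(payload: string): SendResult {
    const socket = this.socket;
    if (socket && socket.readyState === WS_OPEN) {
      socket.send(payload);
      return "sent";
    }
    if (socket && socket.readyState === WS_CLOSING) return "dropped";
    if ((socket && socket.readyState === WS_CONNECTING) || this.reconnectScheduled) {
      if (this.queue.length >= MAX_QUEUED) this.queue.shift(); // evict oldest
      this.queue.push(payload);
      return "queued";
    }
    return "dropped"; // CLOSED with no reconnect pending
  }

  // Intended to be called from the socket's onopen handler.
  flush(socket: SocketLike): void {
    this.socket = socket;
    this.reconnectScheduled = false;
    for (const payload of this.queue.splice(0)) socket.send(payload);
  }
}
```

A caller would treat "dropped" as the signal to roll back its optimistic UI update, which is what chatStore.sendMessage is described as doing.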

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.
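
The gating difference between the two call sites can be modelled as below. The real gateway is Python (server.py); this TypeScript sketch is only a language-neutral model of the control flow, and `BroadcasterLike`, `buildSessionStatus`, `onInitialBind`, and `onSwitchSession` are hypothetical names. Only the `session_status`, `buffered_events`, and `is_running` names come from the description above.

```typescript
// Hypothetical model of the broadcaster; the real object lives in the Python gateway.
interface BroadcasterLike {
  isBuffering(sessionId: string): boolean;
  bufferedEvents(sessionId: string): unknown[];
}

interface SessionStatus {
  type: "session_status";
  session_id: string;
  is_running: boolean;
  buffered_events: unknown[];
}

type Send = (msg: SessionStatus) => void;

// Shared construction, standing in for the lifted _send_session_status helper.
function buildSessionStatus(b: BroadcasterLike, sessionId: string, isRunning: boolean): SessionStatus {
  return {
    type: "session_status",
    session_id: sessionId,
    is_running: isRunning,
    buffered_events: b.bufferedEvents(sessionId),
  };
}

// Initial bind: stay silent for idle sessions, replay only while buffering.
function onInitialBind(b: BroadcasterLike, sessionId: string, isRunning: boolean, send: Send): void {
  if (!b.isBuffering(sessionId)) return;
  send(buildSessionStatus(b, sessionId, isRunning));
}

// Session switch: always send, so the client refreshes is_running on every selection.
function onSwitchSession(b: BroadcasterLike, sessionId: string, isRunning: boolean, send: Send): void {
  send(buildSessionStatus(b, sessionId, isRunning));
}
```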

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.

Development

Successfully merging this pull request may close these issues.

Web UI: todos panel stuck on stale snapshot after mid-turn reconnect