agent/sessions: keep channel sticky to active sessions on hang#67
Open
alex-fedotyev wants to merge 1 commit into
Open
agent/sessions: keep channel sticky to active sessions on hang#67alex-fedotyev wants to merge 1 commit into
alex-fedotyev wants to merge 1 commit into
Conversation
Symptom: a chat that hangs gets orphaned. The user types "continue" to nudge the stuck session and the next message lands in a brand-new, empty session instead. All context from the original session is gone. Root cause: get_active_session checks _is_within_sticky_period before returning the channel's mapped session, and _is_within_sticky_period only looks at last_activity_at vs the 120-min cutoff. last_activity_at is written by mark_active(), which fires from engine.run at two moments: when a fresh client is created, and after a turn completes successfully. A hung turn never reaches the post-turn mark_active, so last_activity_at freezes at turn-start. Once 120 min elapse from that frozen timestamp, the cutoff trips, the function returns False, and get_active_session mints a new session id and overwrites the channel_sessions row. The hung session is now orphaned. Web sessions don't see this because gateway/server.py routes through get_last_session, which has no sticky check. Fix: treat any session with status=active as sticky regardless of the timestamp. An active session is by definition the channel's current owner; either it's making progress (and the next mark_active will refresh the timestamp) or it's hung. The engine's idle-message timeout in receive_response (cli_idle_timeout_seconds, default 15 min, shipped in ClickHouse#66) bounds how long a truly hung session can hold the channel: when the timeout fires the engine flips the session to idle/error in its exception path, and the next inbound message after that point falls through to the time-based check and rolls over to a fresh session normally. Tests: - test_auto_session_reused_when_active_despite_old_timestamp: an active session with a back-dated last_activity_at is still reused. - test_auto_session_rotated_when_idle_after_sticky_period: once the hung session has been recovered to idle, the time-based cutoff applies and the channel rolls over. The existing test_auto_session_rotated_after_sticky_period continues to cover the rotation path for sessions that were never marked active in the first place.
This was referenced May 13, 2026
Contributor
Author
|
Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other. In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes. |
This was referenced May 13, 2026
pufit
pushed a commit
that referenced
this pull request
May 13, 2026
User-reported glitch: "I ask for something and there is no response, then I ask again and it answers to the previous question." Five reliability PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id, #66 idle timeout, #67 sticky session) each close one underlying cause. Two gaps remain that none of those PRs cover. Gap 1: client-side send silently drops payloads. web/src/api/websocket.ts checked readyState === OPEN and no-op'd otherwise. The 3s reconnect window leaves a hole: send() returns to the caller and chatStore.sendMessage has already optimistically appended the user message and set isStreaming=true. The user thinks the agent is thinking but the message never reached the server, so the next reply lands against a stale prompt. Track readyState explicitly. CONNECTING or reconnect-scheduled now queues the payload (bounded to 5 entries; oldest evicted) and flushes from onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the caller can revert. chatStore.sendMessage pops the optimistic user message on 'dropped' and surfaces an inline assistant error so the user can retry. Gap 2: gateway initial-bind never replayed the broadcaster buffer. The switch_session handler already shipped session_status with buffered_events on session switch, but the initial-connect handshake at server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop) and the in-flight stream was lost from the client's view even though the events sat in broadcaster._session_buffers waiting to be replayed. Lift the duplicated send-status construction into _send_session_status and call it from both branches. Initial-bind gates on broadcaster.is_buffering so idle sessions stay silent; switch_session calls unconditionally so the client refreshes is_running/status on every selection. The frontend handleSessionStatus already restores streamingBlocks, panels, todos, and interaction state from the buffer (handled by #69), so this is purely additive at the gateway. Tests: - 9 new asserts in tests/test_gateway_ws.py covering the helper output, the initial-bind gate, the switch_session regression path, and a load-fidelity check for buffer ordering. - Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated (test_bootstrap docker-env detection and test_cli_upgrade docker mode, both noted in notes/repo-conventions/nerve.md). - web/ tsc --noEmit clean, npm run build clean. Out of scope (followups, not blocking): - Stale-listener cleanup on swallowed send_json exceptions (server.py:298-301). - Application-level message_received ack from engine after sessions.add_message. - _session_locks TTL on session archive.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
A chat that hangs gets orphaned. The user types "continue" to nudge
the stuck session and the next message lands in a brand-new, empty
session instead. All context from the original session is gone.
This is the open follow-up #1 from the PR #66 decision doc.
Root cause
get_active_sessioninnerve/agent/sessions.py:215decides reusevs. rotate based on
_is_within_sticky_period(session). The helperonly looks at
last_activity_atversus the 120-min cutoff. Thattimestamp is written by
mark_active(), which fires fromengine.runat two moments:
engine.py:1242)engine.py:2062)A hung turn never reaches step 2, so
last_activity_atfreezes atturn-start. Once 120 min elapse, the cutoff trips, the helper returns
False, and
get_active_sessionmints a new session id and overwritesthe
channel_sessionsrow. The hung session is now orphaned and theuser's "continue" lands in a fresh empty session.
Web sessions dodge this because
gateway/server.py:289routes throughget_last_session, which has no sticky check. Telegram and anyfuture channel through
ChannelRouterhit the buggy path.Fix
Treat any session with
status=activeas sticky regardless of thetimestamp. An active session is by definition the channel's current
owner: either it is making progress (and the next
mark_activewillrefresh the timestamp) or it is hung. The engine's idle-message
timeout in
receive_response(cli_idle_timeout_seconds, default15 min, shipped in #66) bounds how long a truly hung session can
hold the channel: when the timeout fires, the engine's exception
path flips the session to idle/error, and the next inbound message
after that falls through to the time-based check and rolls over to
a fresh session as before.
The carve-out is one branch at the top of
_is_within_sticky_period.The rest of the function is unchanged.
Tests
test_auto_session_reused_when_active_despite_old_timestamp—an active session with a back-dated
last_activity_atis stillreused.
test_auto_session_rotated_when_idle_after_sticky_period— oncethe hung session has been recovered to idle, the time-based cutoff
applies and the channel rolls over to a fresh session.
The existing
test_auto_session_rotated_after_sticky_periodcontinues to cover the rotation path for sessions that were never
marked active in the first place.
All 46 tests in
test_sessions.pypass.Test plan
test_sessions.pysuite green (46/46)message routes to the original session and queues behind the lock
rather than spawning a new empty session