Skip to content

agent/sessions: keep channel sticky to active sessions on hang#67

Open
alex-fedotyev wants to merge 1 commit into
ClickHouse:mainfrom
alex-fedotyev:claude/sticky-keeps-active-sessions
Open

agent/sessions: keep channel sticky to active sessions on hang#67
alex-fedotyev wants to merge 1 commit into
ClickHouse:mainfrom
alex-fedotyev:claude/sticky-keeps-active-sessions

Conversation

@alex-fedotyev
Copy link
Copy Markdown
Contributor

Symptom

A chat that hangs gets orphaned. The user types "continue" to nudge
the stuck session and the next message lands in a brand-new, empty
session instead. All context from the original session is gone.
This is the open follow-up #1 from the PR #66 decision doc.

Root cause

get_active_session in nerve/agent/sessions.py:215 decides reuse
vs. rotate based on _is_within_sticky_period(session). The helper
only looks at last_activity_at versus the 120-min cutoff. That
timestamp is written by mark_active(), which fires from engine.run
at two moments:

  1. when a fresh client is created (engine.py:1242)
  2. after a turn completes successfully (engine.py:2062)

A hung turn never reaches step 2, so last_activity_at freezes at
turn-start. Once 120 min elapse, the cutoff trips, the helper returns
False, and get_active_session mints a new session id and overwrites
the channel_sessions row. The hung session is now orphaned and the
user's "continue" lands in a fresh empty session.

Web sessions dodge this because gateway/server.py:289 routes through
get_last_session, which has no sticky check. Telegram and any
future channel through ChannelRouter hit the buggy path.

Fix

Treat any session with status=active as sticky regardless of the
timestamp. An active session is by definition the channel's current
owner: either it is making progress (and the next mark_active will
refresh the timestamp) or it is hung. The engine's idle-message
timeout in receive_response (cli_idle_timeout_seconds, default
15 min, shipped in #66) bounds how long a truly hung session can
hold the channel: when the timeout fires, the engine's exception
path flips the session to idle/error, and the next inbound message
after that falls through to the time-based check and rolls over to
a fresh session as before.

The carve-out is one branch at the top of _is_within_sticky_period.
The rest of the function is unchanged.

Tests

  • test_auto_session_reused_when_active_despite_old_timestamp
    an active session with a back-dated last_activity_at is still
    reused.
  • test_auto_session_rotated_when_idle_after_sticky_period — once
    the hung session has been recovered to idle, the time-based cutoff
    applies and the channel rolls over to a fresh session.

The existing test_auto_session_rotated_after_sticky_period
continues to cover the rotation path for sessions that were never
marked active in the first place.

All 46 tests in test_sessions.py pass.

Test plan

  • Unit tests for the new branch and the unchanged idle path
  • Full test_sessions.py suite green (46/46)
  • Apply locally and observe the next stall: confirm a follow-up
    message routes to the original session and queues behind the lock
    rather than spawning a new empty session

Symptom: a chat that hangs gets orphaned. The user types "continue"
to nudge the stuck session and the next message lands in a brand-new,
empty session instead. All context from the original session is gone.

Root cause: get_active_session checks _is_within_sticky_period before
returning the channel's mapped session, and _is_within_sticky_period
only looks at last_activity_at vs the 120-min cutoff. last_activity_at
is written by mark_active(), which fires from engine.run at two
moments: when a fresh client is created, and after a turn completes
successfully. A hung turn never reaches the post-turn mark_active, so
last_activity_at freezes at turn-start. Once 120 min elapse from that
frozen timestamp, the cutoff trips, the function returns False, and
get_active_session mints a new session id and overwrites the
channel_sessions row. The hung session is now orphaned.

Web sessions don't see this because gateway/server.py routes through
get_last_session, which has no sticky check.

Fix: treat any session with status=active as sticky regardless of the
timestamp. An active session is by definition the channel's current
owner; either it's making progress (and the next mark_active will
refresh the timestamp) or it's hung. The engine's idle-message
timeout in receive_response (cli_idle_timeout_seconds, default 15
min, shipped in ClickHouse#66) bounds how long a truly hung session can hold
the channel: when the timeout fires the engine flips the session to
idle/error in its exception path, and the next inbound message after
that point falls through to the time-based check and rolls over to a
fresh session normally.

Tests:
- test_auto_session_reused_when_active_despite_old_timestamp: an
  active session with a back-dated last_activity_at is still reused.
- test_auto_session_rotated_when_idle_after_sticky_period: once the
  hung session has been recovered to idle, the time-based cutoff
  applies and the channel rolls over.

The existing test_auto_session_rotated_after_sticky_period continues
to cover the rotation path for sessions that were never marked
active in the first place.
@alex-fedotyev
Copy link
Copy Markdown
Contributor Author

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

pufit pushed a commit that referenced this pull request May 13, 2026
User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller and chatStore.sendMessage has already optimistically appended the
user message and set isStreaming=true. The user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant