agent/sessions: keep channel sticky to active sessions on hang by alex-fedotyev · Pull Request #67 · ClickHouse/nerve

alex-fedotyev · 2026-05-06T03:46:24Z

Symptom

A chat that hangs gets orphaned. The user types "continue" to nudge
the stuck session and the next message lands in a brand-new, empty
session instead. All context from the original session is gone.
This is the open follow-up #1 from the PR #66 decision doc.

Root cause

get_active_session in nerve/agent/sessions.py:215 decides reuse
vs. rotate based on _is_within_sticky_period(session). The helper
only looks at last_activity_at versus the 120-min cutoff. That
timestamp is written by mark_active(), which fires from engine.run
at two moments:

when a fresh client is created (engine.py:1242)
after a turn completes successfully (engine.py:2062)

A hung turn never reaches step 2, so last_activity_at freezes at
turn-start. Once 120 min elapse, the cutoff trips, the helper returns
False, and get_active_session mints a new session id and overwrites
the channel_sessions row. The hung session is now orphaned and the
user's "continue" lands in a fresh empty session.

Web sessions dodge this because gateway/server.py:289 routes through
get_last_session, which has no sticky check. Telegram and any
future channel through ChannelRouter hit the buggy path.

Fix

Treat any session with status=active as sticky regardless of the
timestamp. An active session is by definition the channel's current
owner: either it is making progress (and the next mark_active will
refresh the timestamp) or it is hung. The engine's idle-message
timeout in receive_response (cli_idle_timeout_seconds, default
15 min, shipped in #66) bounds how long a truly hung session can
hold the channel: when the timeout fires, the engine's exception
path flips the session to idle/error, and the next inbound message
after that falls through to the time-based check and rolls over to
a fresh session as before.

The carve-out is one branch at the top of _is_within_sticky_period.
The rest of the function is unchanged.

Tests

test_auto_session_reused_when_active_despite_old_timestamp —
an active session with a back-dated last_activity_at is still
reused.
test_auto_session_rotated_when_idle_after_sticky_period — once
the hung session has been recovered to idle, the time-based cutoff
applies and the channel rolls over to a fresh session.

The existing test_auto_session_rotated_after_sticky_period
continues to cover the rotation path for sessions that were never
marked active in the first place.

All 46 tests in test_sessions.py pass.

Test plan

Unit tests for the new branch and the unchanged idle path
Full test_sessions.py suite green (46/46)
Apply locally and observe the next stall: confirm a follow-up
message routes to the original session and queues behind the lock
rather than spawning a new empty session

Symptom: a chat that hangs gets orphaned. The user types "continue" to nudge the stuck session and the next message lands in a brand-new, empty session instead. All context from the original session is gone. Root cause: get_active_session checks _is_within_sticky_period before returning the channel's mapped session, and _is_within_sticky_period only looks at last_activity_at vs the 120-min cutoff. last_activity_at is written by mark_active(), which fires from engine.run at two moments: when a fresh client is created, and after a turn completes successfully. A hung turn never reaches the post-turn mark_active, so last_activity_at freezes at turn-start. Once 120 min elapse from that frozen timestamp, the cutoff trips, the function returns False, and get_active_session mints a new session id and overwrites the channel_sessions row. The hung session is now orphaned. Web sessions don't see this because gateway/server.py routes through get_last_session, which has no sticky check. Fix: treat any session with status=active as sticky regardless of the timestamp. An active session is by definition the channel's current owner; either it's making progress (and the next mark_active will refresh the timestamp) or it's hung. The engine's idle-message timeout in receive_response (cli_idle_timeout_seconds, default 15 min, shipped in ClickHouse#66) bounds how long a truly hung session can hold the channel: when the timeout fires the engine flips the session to idle/error in its exception path, and the next inbound message after that point falls through to the time-based check and rolls over to a fresh session normally. Tests: - test_auto_session_reused_when_active_despite_old_timestamp: an active session with a back-dated last_activity_at is still reused. - test_auto_session_rotated_when_idle_after_sticky_period: once the hung session has been recovered to idle, the time-based cutoff applies and the channel rolls over. The existing test_auto_session_rotated_after_sticky_period continues to cover the rotation path for sessions that were never marked active in the first place.

alex-fedotyev · 2026-05-13T02:20:50Z

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

User-reported glitch: "I ask for something and there is no response, then I ask again and it answers to the previous question." Five reliability PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id, #66 idle timeout, #67 sticky session) each close one underlying cause. Two gaps remain that none of those PRs cover. Gap 1: client-side send silently drops payloads. web/src/api/websocket.ts checked readyState === OPEN and no-op'd otherwise. The 3s reconnect window leaves a hole: send() returns to the caller and chatStore.sendMessage has already optimistically appended the user message and set isStreaming=true. The user thinks the agent is thinking but the message never reached the server, so the next reply lands against a stale prompt. Track readyState explicitly. CONNECTING or reconnect-scheduled now queues the payload (bounded to 5 entries; oldest evicted) and flushes from onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the caller can revert. chatStore.sendMessage pops the optimistic user message on 'dropped' and surfaces an inline assistant error so the user can retry. Gap 2: gateway initial-bind never replayed the broadcaster buffer. The switch_session handler already shipped session_status with buffered_events on session switch, but the initial-connect handshake at server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop) and the in-flight stream was lost from the client's view even though the events sat in broadcaster._session_buffers waiting to be replayed. Lift the duplicated send-status construction into _send_session_status and call it from both branches. Initial-bind gates on broadcaster.is_buffering so idle sessions stay silent; switch_session calls unconditionally so the client refreshes is_running/status on every selection. The frontend handleSessionStatus already restores streamingBlocks, panels, todos, and interaction state from the buffer (handled by #69), so this is purely additive at the gateway. Tests: - 9 new asserts in tests/test_gateway_ws.py covering the helper output, the initial-bind gate, the switch_session regression path, and a load-fidelity check for buffer ordering. - Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated (test_bootstrap docker-env detection and test_cli_upgrade docker mode, both noted in notes/repo-conventions/nerve.md). - web/ tsc --noEmit clean, npm run build clean. Out of scope (followups, not blocking): - Stale-listener cleanup on swallowed send_json exceptions (server.py:298-301). - Application-level message_received ack from engine after sessions.add_message. - _session_locks TTL on session archive.

This was referenced May 13, 2026

agent/tools: fix shorthand-schema bug that aborts agent turns mid-task #63

Merged

agent/streaming: ship synthetic done when a turn ends without one #64

Merged

agent/engine: guard against stale sdk_session_id when .jsonl is missing #65

Merged

This was referenced May 13, 2026

agent/engine: add idle-message timeout to receive_response loop #66

Merged

fix(chat): queue WS sends, replay buffer on initial bind #71

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent/sessions: keep channel sticky to active sessions on hang#67

agent/sessions: keep channel sticky to active sessions on hang#67
alex-fedotyev wants to merge 1 commit into
ClickHouse:mainfrom
alex-fedotyev:claude/sticky-keeps-active-sessions

alex-fedotyev commented May 6, 2026

Uh oh!

alex-fedotyev commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

alex-fedotyev commented May 6, 2026

Symptom

Root cause

Fix

Tests

Test plan

Uh oh!

alex-fedotyev commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant