
agent/engine: guard against stale sdk_session_id when .jsonl is missing (#65)

Merged
pufit merged 1 commit into ClickHouse:main from alex-fedotyev:claude/engine-sdk-resume-guard on May 13, 2026
Conversation

@alex-fedotyev
Contributor

What and why

When a Nerve container restarts without a persisted /root/.claude bind mount, the Claude Code .jsonl conversation files are wiped. The Nerve DB still holds the old sdk_session_id, so the next turn invokes the CLI with --resume <stale-id> and the CLI exits 1 immediately: "No conversation found with session ID." The session hangs mid-turn.

This PR has two fixes:

1. Engine guard (nerve/agent/engine.py)

Added _sdk_resume_file_exists(sdk_session_id) which checks whether ~/.claude/projects/<encoded-cwd>/<sdk_session_id>.jsonl is on disk before passing --resume to the CLI. If the file is missing, the stale sdk_session_id is cleared in the DB and the turn starts a fresh conversation instead of crashing. On any unexpected error the method returns True (fail-open) so genuine bugs still surface via the CLI.

Fork-source sessions are exempt: a fork's resume target is the source session's .jsonl, which is tracked on the source session's own row.

2. Bootstrap mount (nerve/bootstrap.py)

Added ~/.nerve/claude:/root/.claude to the generated docker-compose so conversation history survives docker compose down/up. The mount lives under ~/.nerve to keep the agent's CLI state isolated from the host user's personal ~/.claude (macOS stores OAuth tokens there).

The entrypoint now runs mkdir -p /root/.claude && chmod 700 /root/.claude on startup so the bind-mounted directory is always accessible before any tool touches it.
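A sketch of where the new pieces land in the generated compose file (the service name and the agent command are illustrative placeholders; only the volumes entry and the mkdir/chmod step come from this PR):

```yaml
services:
  nerve:                          # illustrative service name
    volumes:
      # Persist Claude Code transcripts across down/up; siloed under
      # ~/.nerve so the host user's personal ~/.claude is never touched.
      - ~/.nerve/claude:/root/.claude
    # Ensure the bind-mounted directory exists with safe permissions
    # before any tool touches it ("start-agent" is a placeholder).
    entrypoint: sh -c "mkdir -p /root/.claude && chmod 700 /root/.claude && exec start-agent"
```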

Tests

Five new cases in tests/test_engine.py:

  • file present → True
  • file absent → False
  • os.path.isfile raises → True (fail-open)
  • / characters in the workspace path encoded to - in the projects subdir
  • the path always ends with <session_id>.jsonl

All 23 tests pass.

Related

Companion to #64 (synthetic done events for stuck sessions). That PR closes the streaming-state bug; this one closes the resume-target bug.

When a Nerve container is restarted without persisting /root/.claude,
the .jsonl conversation files are wiped but the DB still holds the old
sdk_session_id. On the next turn the CLI is invoked with --resume and
immediately exits 1 with "No conversation found with session ID", which
hangs the session mid-turn.

Two fixes:

1. Engine guard (_sdk_resume_file_exists): before passing --resume to
   the CLI, check that the .jsonl exists. If it is missing, clear the
   stale sdk_session_id from the DB and start a fresh conversation
   instead of crashing. Forks are exempt because their context is owned
   by the source session.

2. Bootstrap (.claude mount): add ~/.nerve/claude:/root/.claude to the
   generated docker-compose so conversation history survives restarts.
   The mount is siloed under ~/.nerve to keep the agent's CLI state
   separate from the host user's personal ~/.claude directory.

Tests cover file-present, file-absent, fail-open on exception, and
workspace-slash encoding.

Co-Authored-By: Claude Opus <model> <noreply@anthropic.com>
@alex-fedotyev
Contributor Author

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

pufit merged commit 88cac0b into ClickHouse:main on May 13, 2026
pufit pushed a commit that referenced this pull request on May 13, 2026
User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller even though nothing was transmitted, and chatStore.sendMessage
has already optimistically appended the user message and set
isStreaming=true. The user thinks the agent is thinking, but the message
never reached the server, so the next reply lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.
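The queueing behavior can be sketched as the following pattern, abstracting away the WebSocket API (class and attribute names are illustrative; the real fix is TypeScript in web/src/api/websocket.ts):

```python
from collections import deque

class QueuedSender:
    """Bounded outbound queue: payloads sent during the reconnect window
    are held and flushed on open instead of being silently dropped."""

    MAX_QUEUE = 5  # bounded; oldest entry evicted on overflow

    def __init__(self, transport):
        # transport exposes .state in {"CONNECTING", "OPEN", "CLOSING",
        # "CLOSED"} and .send(payload); reconnect_scheduled mirrors the
        # 3s retry timer on the real socket wrapper.
        self.transport = transport
        self.reconnect_scheduled = False
        self._queue = deque(maxlen=self.MAX_QUEUE)

    def send(self, payload) -> str:
        state = self.transport.state
        if state == "OPEN":
            self.transport.send(payload)
            return "sent"
        if state == "CONNECTING" or (state == "CLOSED" and self.reconnect_scheduled):
            self._queue.append(payload)  # deque(maxlen=5) evicts the oldest
            return "queued"
        # CLOSING, or CLOSED with no reconnect scheduled: tell the caller
        # so it can revert the optimistic append and show an inline error.
        return "dropped"

    def on_open(self):
        # Flush queued payloads in order once the socket reopens.
        while self._queue:
            self.transport.send(self._queue.popleft())
```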

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
