Surface real cause when broker startup fails#828
Conversation
Improve broker startup handling when calling getSession(): create a brokerExited promise that rejects with a formatted broker startup error (including stderr tail and exit code) and race it against client.getSession() so socket failures during the initial Relaycast handshake surface as the broker exit instead of a vague network error. Keep existing 503 polling (1s interval) for the broker-warming case, suppress unhandledRejection from the race, and simplify 503 detection by normalizing the error message.
📝 WalkthroughWalkthroughThis PR enhances broker startup failure diagnostics by classifying errors via cause-chain unwrapping, detecting broker exit during SDK handshake via Promise.race, and reporting detailed failures with telemetry. Three new error analysis functions extract network codes, build descriptions, and map stages; the SDK client now races getSession against brokerExited to surface broker-death errors separately from warmup timeouts; and the startup handler integrates these utilities for richer error reporting. ChangesBroker Startup Error Classification and Detection
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/cli/lib/broker-lifecycle.test.ts (1)
73-98: 💤 Low valueConsider adding tests for the remaining
classifyBrokerStartStagebranches.Two stage classifications are untested:
'spawn'for "Broker did not report API port" (line 163 in broker-lifecycle.ts)'resolve_binary'for ENOENT with "broker" in message (line 165)📝 Suggested additional tests
it('classifies broker-exited-before-ready as a spawn failure', () => { const message = 'Broker process exited with code 1 before becoming ready (pid=123; …)'; expect(classifyBrokerStartStage(new Error(message), message, false)).toBe('spawn'); }); + it('classifies API port timeout as a spawn failure', () => { + const message = 'Broker did not report API port within 45000ms'; + expect(classifyBrokerStartStage(new Error(message), message, false)).toBe('spawn'); + }); + + it('classifies missing broker binary as resolve_binary', () => { + const err = Object.assign(new Error('ENOENT: no such file broker'), { code: 'ENOENT' }); + expect(classifyBrokerStartStage(err, 'ENOENT: no such file broker', false)).toBe('resolve_binary'); + }); + it('falls back to startup for everything else', () => {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cli/lib/broker-lifecycle.test.ts` around lines 73 - 98, Tests are missing coverage for two classifyBrokerStartStage branches: add one test in src/cli/lib/broker-lifecycle.test.ts that passes an Error with message "Broker did not report API port" (or the exact string used in broker-lifecycle.ts) and asserts classifyBrokerStartStage(..., message, false) returns 'spawn'; add another test that constructs an Error whose code is 'ENOENT' and whose message contains "broker" (simulate resolving a missing binary) and assert classifyBrokerStartStage(..., err.message, false) returns 'resolve_binary'; use the existing test patterns (Object.assign to set .code when needed) and place them alongside the other it(...) cases referencing classifyBrokerStartStage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/cli/lib/broker-lifecycle.test.ts`:
- Around line 73-98: Tests are missing coverage for two classifyBrokerStartStage
branches: add one test in src/cli/lib/broker-lifecycle.test.ts that passes an
Error with message "Broker did not report API port" (or the exact string used in
broker-lifecycle.ts) and asserts classifyBrokerStartStage(..., message, false)
returns 'spawn'; add another test that constructs an Error whose code is
'ENOENT' and whose message contains "broker" (simulate resolving a missing
binary) and assert classifyBrokerStartStage(..., err.message, false) returns
'resolve_binary'; use the existing test patterns (Object.assign to set .code
when needed) and place them alongside the other it(...) cases referencing
classifyBrokerStartStage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 13b967d6-7736-457e-8f37-43de2c5f06f7
📒 Files selected for processing (3)
packages/sdk/src/client.tssrc/cli/lib/broker-lifecycle.test.tssrc/cli/lib/broker-lifecycle.ts
Code Review SummaryVerdict: Approve ✅ No critical or warning-level issues found. The PR addresses the real bug (losing error context when a broker dies during startup) with a clean, minimal fix. Review Highlightspackages/sdk/src/client.ts
src/cli/lib/broker-lifecycle.ts
src/cli/lib/broker-lifecycle.test.ts
Tests
Reviewed by Hermes Agent |
Summary
Reported bug: on a second machine,
agent-relay up --workspace-key rk_live_…fails with bareFailed to start broker: fetch failed. The actual cause is the broker exiting mid-handshake (e.g. Relaycast unreachable) — but the error chain dropped every detail.Three small fixes:
packages/sdk/src/client.ts— racegetSession()againstchild.exitduring the initial handshake. If the broker dies, we surface its formatted error (pid, command, stderr tail, exit code) instead of waiting on a now-dead socket and reportingfetch failed. Kept the existing 503 polling for the legitimate "broker warming up" case; no exponential backoff.src/cli/lib/broker-lifecycle.ts—describeErrorwalkserr.causeto surface the underlying network code (ECONNREFUSED,ENOTFOUND, etc.), and the catch block now firestrack('broker_start_failed', { stage, error_class })— that PostHog event was defined inpackages/telemetry/src/events.tsbut never wired up.src/cli/lib/broker-lifecycle.test.ts— 13 unit tests covering cause-chain unwrapping, network code preference, cycle protection, and each stage classifier branch.Test plan
npx vitest run src/cli/lib— 25/25 passnpx tsc --noEmit -p tsconfig.json— clean (cloud package errors are pre-existing onmain)cd packages/sdk && npm run build— cleanAGENT_RELAY_DISABLE_RELAYCAST=1(or similar Relaycast-unreachable simulation) and confirm the new error includes the rust stderr tailbroker_start_failedlands in PostHog with the rightstageanderror_class🤖 Generated with Claude Code