Skip to content

Surface real cause when broker startup fails#828

Merged
willwashburn merged 2 commits intomainfrom
fix/broker-start-error-detail
May 9, 2026
Merged

Surface real cause when broker startup fails#828
willwashburn merged 2 commits intomainfrom
fix/broker-start-error-detail

Conversation

@willwashburn
Copy link
Copy Markdown
Member

Summary

Reported bug: on a second machine, agent-relay up --workspace-key rk_live_… fails with bare Failed to start broker: fetch failed. The actual cause is the broker exiting mid-handshake (e.g. Relaycast unreachable) — but the error chain dropped every detail.

Three small fixes:

  • packages/sdk/src/client.ts — race getSession() against child.exit during the initial handshake. If the broker dies, we surface its formatted error (pid, command, stderr tail, exit code) instead of waiting on a now-dead socket and reporting fetch failed. Kept the existing 503 polling for the legitimate "broker warming up" case; no exponential backoff.
  • src/cli/lib/broker-lifecycle.tsdescribeError walks err.cause to surface the underlying network code (ECONNREFUSED, ENOTFOUND, etc.), and the catch block now fires track('broker_start_failed', { stage, error_class }) — that PostHog event was defined in packages/telemetry/src/events.ts but never wired up.
  • src/cli/lib/broker-lifecycle.test.ts — 13 unit tests covering cause-chain unwrapping, network code preference, cycle protection, and each stage classifier branch.

Test plan

  • npx vitest run src/cli/lib — 25/25 pass
  • npx tsc --noEmit -p tsconfig.json — clean (cloud package errors are pre-existing on main)
  • cd packages/sdk && npm run build — clean
  • Reproduce the failure path manually with AGENT_RELAY_DISABLE_RELAYCAST=1 (or similar Relaycast-unreachable simulation) and confirm the new error includes the rust stderr tail
  • Verify broker_start_failed lands in PostHog with the right stage and error_class

🤖 Generated with Claude Code

Improve broker startup handling when calling getSession(): create a brokerExited promise that rejects with a formatted broker startup error (including stderr tail and exit code) and race it against client.getSession() so socket failures during the initial Relaycast handshake surface as the broker exit instead of a vague network error. Keep existing 503 polling (1s interval) for the broker-warming case, suppress unhandledRejection from the race, and simplify 503 detection by normalizing the error message.
@willwashburn willwashburn requested a review from khaliqgant as a code owner May 8, 2026 22:50
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enhances broker startup failure diagnostics by classifying errors via cause-chain unwrapping, detecting broker exit during SDK handshake via Promise.race, and reporting detailed failures with telemetry. Three new error analysis functions extract network codes, build descriptions, and map stages; the SDK client now races getSession against brokerExited to surface broker-death errors separately from warmup timeouts; and the startup handler integrates these utilities for richer error reporting.

Changes

Broker Startup Error Classification and Detection

Layer / File(s) Summary
Error Classification Utilities
src/cli/lib/broker-lifecycle.ts
Exports describeError() to unwrap nested err.cause chains, classifyBrokerStartError() to extract network error codes with constructor-name fallback, and classifyBrokerStartStage() to map error codes and message patterns to stage labels (spawn, connect, startup, dash). Module imports updated to include telemetry and error-class utilities.
Error Classification Tests
src/cli/lib/broker-lifecycle.test.ts
Vitest suites verify describeError unwrapping of deep causes and handling of non-Error inputs; classifyBrokerStartError code precedence and fallback behavior; and classifyBrokerStartStage detection of dashboard port conflicts, "already running" messages, fetch failures, and spawn-related exit messages.
Broker Exit Detection
packages/sdk/src/client.ts
AgentRelayClient.spawn() creates a brokerExited promise that rejects on child process exit with formatted error containing binary, cwd, args, and buffered stdout/stderr. Each getSession() retry is raced against brokerExited to surface broker-death errors while preserving 503 warmup retry behavior.
Failure Reporting Integration
src/cli/lib/broker-lifecycle.ts
runUpCommand() catch block now computes error stage and error_class using classification utilities, reports broker_start_failed telemetry with both fields, and surfaces detailed error descriptions via describeError() instead of raw messages. Special-case messaging for dashboard conflicts and "already running" scenarios is preserved.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • barryollama

Poem

🐰 A broker that exits mid-shake,
No longer hides its mistake—
With cause chains unwound,
True errors are found,
And telemetry charts truth we make! 📊✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: improving broker startup error reporting by exposing the underlying cause.
Description check ✅ Passed The description follows the template with a comprehensive Summary section and detailed Test Plan section documenting completed and pending tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/broker-start-error-detail

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/cli/lib/broker-lifecycle.test.ts (1)

73-98: 💤 Low value

Consider adding tests for the remaining classifyBrokerStartStage branches.

Two stage classifications are untested:

  • 'spawn' for "Broker did not report API port" (line 163 in broker-lifecycle.ts)
  • 'resolve_binary' for ENOENT with "broker" in message (line 165)
📝 Suggested additional tests
   it('classifies broker-exited-before-ready as a spawn failure', () => {
     const message = 'Broker process exited with code 1 before becoming ready (pid=123; …)';
     expect(classifyBrokerStartStage(new Error(message), message, false)).toBe('spawn');
   });
 
+  it('classifies API port timeout as a spawn failure', () => {
+    const message = 'Broker did not report API port within 45000ms';
+    expect(classifyBrokerStartStage(new Error(message), message, false)).toBe('spawn');
+  });
+
+  it('classifies missing broker binary as resolve_binary', () => {
+    const err = Object.assign(new Error('ENOENT: no such file broker'), { code: 'ENOENT' });
+    expect(classifyBrokerStartStage(err, 'ENOENT: no such file broker', false)).toBe('resolve_binary');
+  });
+
   it('falls back to startup for everything else', () => {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli/lib/broker-lifecycle.test.ts` around lines 73 - 98, Tests are missing
coverage for two classifyBrokerStartStage branches: add one test in
src/cli/lib/broker-lifecycle.test.ts that passes an Error with message "Broker
did not report API port" (or the exact string used in broker-lifecycle.ts) and
asserts classifyBrokerStartStage(..., message, false) returns 'spawn'; add
another test that constructs an Error whose code is 'ENOENT' and whose message
contains "broker" (simulate resolving a missing binary) and assert
classifyBrokerStartStage(..., err.message, false) returns 'resolve_binary'; use
the existing test patterns (Object.assign to set .code when needed) and place
them alongside the other it(...) cases referencing classifyBrokerStartStage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/cli/lib/broker-lifecycle.test.ts`:
- Around line 73-98: Tests are missing coverage for two classifyBrokerStartStage
branches: add one test in src/cli/lib/broker-lifecycle.test.ts that passes an
Error with message "Broker did not report API port" (or the exact string used in
broker-lifecycle.ts) and asserts classifyBrokerStartStage(..., message, false)
returns 'spawn'; add another test that constructs an Error whose code is
'ENOENT' and whose message contains "broker" (simulate resolving a missing
binary) and assert classifyBrokerStartStage(..., err.message, false) returns
'resolve_binary'; use the existing test patterns (Object.assign to set .code
when needed) and place them alongside the other it(...) cases referencing
classifyBrokerStartStage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 13b967d6-7736-457e-8f37-43de2c5f06f7

📥 Commits

Reviewing files that changed from the base of the PR and between 0e536f4 and 03c174e.

📒 Files selected for processing (3)
  • packages/sdk/src/client.ts
  • src/cli/lib/broker-lifecycle.test.ts
  • src/cli/lib/broker-lifecycle.ts

@willwashburn willwashburn requested a review from barryollama May 9, 2026 02:00
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

@barryollama
Copy link
Copy Markdown

Code Review Summary

Verdict: Approve

No critical or warning-level issues found. The PR addresses the real bug (losing error context when a broker dies during startup) with a clean, minimal fix.

Review Highlights

packages/sdk/src/client.ts

  • The Promise.race between getSession() and brokerExited elegantly distinguishes between "broker warming up" (case 1) and "broker died" (case 2). This is a clear improvement that surfaces actual broker exit codes instead of cryptic "fetch failed" errors.
  • Suppressing the unhandled rejection with .catch(() => {}) is appropriate here — the same promise is being raced intentionally.
  • The comment block explaining both failure modes is excellent documentation.

src/cli/lib/broker-lifecycle.ts

  • describeError() correctly walks the cause chain to extract network codes (ECONNREFUSED, ENOTFOUND, etc.) — exactly what's needed to make fetch failures actionable.
  • classifyBrokerStartError() and classifyBrokerStartStage() provide good telemetry data that was previously missing.
  • The track('broker_start_failed') call is now properly wired up.

src/cli/lib/broker-lifecycle.test.ts

  • Comprehensive test coverage: bare errors, fetch causes, DNS errors, cycle protection, network code preference, stage classification.
  • All 13 new tests pass.

Tests

  • Full test suite: 784/784 passed
  • New broker-lifecycle tests: 13/13 passed
  • TypeScript compilation clean

Reviewed by Hermes Agent

@willwashburn willwashburn merged commit 622ec7c into main May 9, 2026
47 checks passed
@willwashburn willwashburn deleted the fix/broker-start-error-detail branch May 9, 2026 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants