fix(observability): retry session recording upload#1381
fix(observability): retry session recording upload#1381toubatbrian wants to merge 2 commits intomainfrom
Conversation
Port livekit/agents#5627 to JS. The session recording upload now retries up to 3 times when the server response includes a google.rpc.RetryInfo detail, sleeping for the server-provided delay between attempts. The multipart form is rebuilt on each retry; header / chat history / audio bytes are read once and reused. A small protobuf wire-format reader is added to parse RetryInfo without pulling in google-protos-common.
|
|
🦋 Changeset detectedLatest commit: cb6b677 The changes in this PR will be included in the next version bump. This PR includes changesets to release 28 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb3dd0c4ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| offset += skipField(any, offset, wireType); | ||
| } | ||
| } | ||
| if (typeUrl !== RETRY_INFO_TYPE_URL || value === null) return null; |
There was a problem hiding this comment.
Match RetryInfo Any type by suffix, not exact URL
google.protobuf.Any type URLs are only required to end with the fully-qualified message name, so checking for exact equality with type.googleapis.com/google.rpc.RetryInfo can miss valid RetryInfo details that use a different prefix. In those responses, retries are skipped and uploads fail immediately despite the server explicitly returning retry guidance. Please match typeUrl by its last path segment (or suffix) to mirror Any.Unpack semantics.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in cb6b677. The check now compares the segment after the last / in Any.type_url against the fully-qualified message name (google.rpc.RetryInfo), matching the Python Any.Unpack semantics.
Generated by Claude Code
Address review feedback on #1381: google.protobuf.Any.type_url only requires the segment after the last "/" to equal the fully-qualified message name, so an exact match against "type.googleapis.com/google.rpc.RetryInfo" would skip retries when the server uses a different prefix. Match by suffix to mirror Python's Any.Unpack semantics.
Summary
Port of livekit/agents#5627 — fix(observability): retry session recording upload — to agents-js.
When the LiveKit Cloud
/observability/recordings/v0endpoint rejects a session-recording upload with a retryable error, the server returns agoogle.rpc.Statusbody containing aRetryInfodetail with aretry_delay. Before this change, agents-js treated all non-2xx responses as fatal and dropped the recording. After this change, the client honorsRetryInfoand retries the upload up to 3 times.What changed
agents/src/telemetry/traces.ts—uploadSessionReport():buildFormData()factory creates a freshFormDataper attempt (theform-datapackage consumes its inputs on submit, so a new instance is required for each retry).submitOnce()wrapsformData.submit(...)in a promise that always reads the response body to completion and returns{ statusCode, statusMessage, body }. The body is needed both for error reporting and forRetryInfoparsing.maxRetries = 3(4 attempts total). On a non-2xx response it callsparseRetryDelayMs(body). If a delay is returned, it logs a warning andawaits asetTimeoutof that many milliseconds before the next attempt. If noRetryInfois present, or attempts are exhausted, it throws with the status and body for diagnostics.Implementation nuances vs. Python
The Python PR uses
google.rpc.error_details_pb2.RetryInfoandgoogle.rpc.status_pb2.Statusfromgoogleapis-common-protos. There is no equivalent runtime dependency on the JS side, and pulling one in for a single ~30-byte message is overkill. Instead, this PR adds a tiny inline protobuf wire-format reader (~80 lines) that walksStatus → details (Any) → RetryInfo → retry_delay (Duration)and returns the delay.Other notable adaptations:
number) rather than the Python helper's seconds (float). The retry loop sleeps viasetTimeout(..., retryDelayMs). The log message still shows seconds ((retryDelayMs / 1000).toFixed(1)) to match the Python log format.type.googleapis.com/google.rpc.RetryInfo. The Python implementation callsAny.Unpack(retry_info), which checks the same type URL under the hood.Bufferso it can be passed to the parser unchanged.fs.readFile(audioRecordingPath)throws,audioBytesstays asBuffer.alloc(0)(same as before) and theaudiopart is omitted from the form on every retry.int32 nanos. The Duration parser readsnanosas a varint and casts toNumber, matching the protobuf encoding for non-negativeint32values.RetryInfodurations are always non-negative in practice.Files
agents/src/telemetry/traces.ts— refactoruploadSessionReport+ addparseRetryDelayMsand helpers..changeset/retry-recording-upload.md— patch changeset for@livekit/agents.Test plan
pnpm --filter @livekit/agents exec tsc --noEmitpasses.pnpm --filter @livekit/agents buildpasses.pnpm exec vitest run src/telemetry/traces.test.tspasses (5/5).pnpm exec prettier --checkpasses for changed files.RetryInfodetail in a staging environment and confirm the upload retries with the server-provided delay.cc @toubatbrian @livekit/agent-devs
This PR was opened automatically by a Claude Code routine. See the source PR comment on livekit/agents#5627 for context.
https://claude.ai/code/session_01LdAzmRxahVaMMGXAgStdrT
Generated by Claude Code