Skip to content

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611

Draft
delthas wants to merge 1 commit intodevelopment/8.3from
improvement/ARSN-572/trace-context
Draft

feat(ARSN-572): Propagate W3C trace context to MongoDB metadata writes#2611
delthas wants to merge 1 commit intodevelopment/8.3from
improvement/ARSN-572/trace-context

Conversation

@delthas
Copy link
Copy Markdown
Contributor

@delthas delthas commented Apr 13, 2026

Not human-reviewed yet. Not asking for reviews at the moment!

Context

Cloudserver recently gained OpenTelemetry tracing (CLDSRV-884, cloudserver PR scality/cloudserver#6140). That gives us a single end-to-end trace per S3 request spanning cloudserver → vault → hdproxy → hyperiod.

But many state mutations in the S3 stack are written to MongoDB and consumed asynchronously by worker services via the oplog:

  • Backbeat — replication, lifecycle expiration/transition, garbage collection
  • Sorbet — cold-storage restore completion
  • Other oplog consumers

Today the oplog entry carries no trace context, so async work performed hours or days after the S3 request cannot be tied back to the request that caused it. A user's POST …?restore produces a trace that ends at the metadata write; the actual cold-tier retrieval by Sorbet is invisible from the original trace.

Solution

Stamp a W3C trace context (traceparent/tracestate) on every mutating metadata write, auto-derived from the currently-active OTEL context.

The field lives on ObjectMDData next to originOp:

traceContext?: {
    traceparent?: string;
    tracestate?: string;
};

A new helper captureCurrentTraceContext() reads the active OTEL context via @opentelemetry/api and serializes it via propagation.inject(). It's invoked at the three MongoClientInterface sites that currently set originOp:

  • internalPutObject
  • repair
  • internalDeleteObject

Consumers (backbeat, sorbet — out of scope for this PR) will later:

  1. Read value.traceContext.traceparent from the oplog entry
  2. propagation.extract() it to reconstruct a parent context
  3. Wrap per-entry processing in context.with(extractedCtx, ...)

This closes the loop on end-to-end tracing across the async boundary.

Design decisions

1. Auto-inject at the arsenal write boundary (vs. param plumbing).
The alternative was to thread a new traceContext parameter through the ~15 cloudserver call sites that already set originOp. Rejected because:

  • @opentelemetry/context-async-hooks (loaded by the OTEL SDK at cloudserver startup) already propagates the request context through arsenal's async call chain for free. Re-reading it at the mongo write is idiomatic OTEL.
  • Zero cloudserver surface change — we can iterate on arsenal independently.
  • Parallels existing behavior where arsenal reads originOp out of params and attaches it to the ObjectMD.
  • Avoids replicating the maintenance cost that makes originOp easy to forget on new API endpoints today.

2. Only @opentelemetry/api as a runtime dep (~10 KB, no SDK).
The api package is specifically designed to be safe for libraries to depend on. When no SDK is registered (today's state for anyone not opting into cloudserver OTEL), context.active() returns a no-op context, trace.getSpan() returns undefined, and captureCurrentTraceContext() returns undefined. setTraceContext(undefined) no-ops; the field stays absent in the write. Zero runtime cost when OTEL is off.

3. Optional field, no schema migration.
traceContext is absent by default — consistent with other optional ObjectMD fields like legalHold, retentionDate. No default-value change in the ObjectMDData construction. Existing consumers that don't understand the field simply ignore it; fully backward compatible.

4. No-op when no traceparent is present.
setTraceContext(undefined) and setTraceContext({ tracestate: 'x' }) (no traceparent) both no-op — the field only appears when the caller had an active trace. This prevents junk entries from any edge cases in the OTEL API.

Interaction with async flows (cold restore as example)

Cold storage restore exercises both a sync and an async metadata write:

  1. ObjectRestore:Post (sync, user-initiated via POST /bucket/key?restore) — cloudserver sets originOp; arsenal now also stamps traceContext pointing to the user's request trace. ✅
  2. ObjectRestore:Completed (async, Sorbet-triggered via cloudserver's /_/backbeat/* route) — behavior depends on Sorbet:
    • If Sorbet forwards traceparent on its HTTP call (ideally extracted from step 1's stored traceContext): arsenal picks it up via context.with(remoteContext, ...) in cloudserver's server.js, and the Completed write ties back to the original user restore. ✅
    • If Sorbet calls without traceparent (today's state): safely no-ops — Completed write has no traceContext, no regression. ✅

Same pattern applies to lifecycle expiration/transition, CRR replication, and GC.

Verification

Unit tests

tests/unit/models/ObjectMD.spec.js — 4 new tests covering:

  • Valid traceparent round-trips through set/getTraceContext
  • setTraceContext(undefined) is a no-op
  • setTraceContext({ tracestate: ... }) without traceparent is a no-op
  • getValue() serializes traceContext only when set

All 128 ObjectMD tests pass (was 124 + 4 new).

End-to-end (follow-up on a test cluster)

  1. Bump cloudserver's arsenal pin to a build of this branch; roll the cloudserver deployment on an Artesca cluster with ENABLE_OTEL=true
  2. Make a PutObject; note the trace ID in Jaeger
  3. Inspect the MongoDB oplog: o.value.traceContext.traceparent should be present, with characters 3-35 matching the Jaeger trace ID
  4. Negative test with ENABLE_OTEL=false → field should be absent

Out of scope (follow-ups)

  • Backbeat consumer — read value.traceContext.traceparent from oplog entries, extract & wrap processing
  • Sorbet consumer (cold storage workers) — same pattern; must also forward traceparent on its callbacks to cloudserver's backbeat route
  • BucketMD — bucket mutations also hit the oplog but don't have originOp. Parallel but separate change if we want trace context on bucket writes
  • Ring 9 path — Ring 9's metadata goes to bucketd, not MongoDB; this PR is Artesca-specific

Size

Count
New files 1
Files modified 3
LOC added (excl. tests) ~60
New runtime deps @opentelemetry/api (~10 KB, API-only)

Issue: ARSN-572

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Hello delthas,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Copy link
Copy Markdown
Contributor

bert-e commented Apr 13, 2026

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 13, 2026

LGTM

Review by Claude Code

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from d42f30b to 01693cd Compare April 13, 2026 15:38
Comment thread lib/models/ObjectMD.ts
Comment on lines +1457 to +1459
if (tc && tc.traceparent) {
this._data.traceContext = tc;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this method is called on an ObjectMD constructed from an existing document (as in putObjectNoVerWithOplogUpdate and internalDeleteObject), passing undefined leaves any stale traceContext from the prior write intact. A delete with no active span would carry forward the PUT's trace context to the oplog entry, misleading downstream consumers.

Suggested change
if (tc && tc.traceparent) {
this._data.traceContext = tc;
}
if (tc && tc.traceparent) {
this._data.traceContext = tc;
} else {
delete this._data.traceContext;
}

— Claude Code

Comment on lines +1640 to +1643
const tc = captureCurrentTraceContext();
if (tc) {
objVal.traceContext = tc;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same stale carry-forward issue: if captureCurrentTraceContext() returns undefined, any existing objVal.traceContext from the prior metadata persists into the oplog entry.

Suggested change
const tc = captureCurrentTraceContext();
if (tc) {
objVal.traceContext = tc;
}
const tc = captureCurrentTraceContext();
if (tc) {
objVal.traceContext = tc;
} else {
delete objVal.traceContext;
}

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 13, 2026

  • Stale trace context carry-forward: setTraceContext(undefined) is a no-op, so when ObjectMD is constructed from an existing document that already has a traceContext, a new operation without an active span will carry the old trace context into its oplog entry. Affects putObjectNoVerWithOplogUpdate, repair, and internalDeleteObject.
    • Clear the field when no trace context is provided (delete this._data.traceContext / delete objVal.traceContext)
    • Add a test that sets a trace context, then calls setTraceContext(undefined), and asserts the field is cleared

Review by Claude Code

@scality scality deleted a comment from bert-e Apr 13, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 68.57143% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.46%. Comparing base (600fb06) to head (39233c8).

Files with missing lines Patch % Lines
lib/storage/metadata/captureTraceContext.ts 40.00% 9 Missing ⚠️
...orage/metadata/mongoclient/MongoClientInterface.ts 83.33% 2 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                 @@
##           development/8.3    #2611      +/-   ##
===================================================
- Coverage            73.47%   73.46%   -0.01%     
===================================================
  Files                  222      223       +1     
  Lines                18175    18210      +35     
  Branches              3785     3770      -15     
===================================================
+ Hits                 13354    13378      +24     
- Misses                4816     4827      +11     
  Partials                 5        5              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from 01693cd to ca456c9 Compare April 13, 2026 15:56
out.tracestate = carrier.tracestate;
}
return out;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No unit tests for this function. It is the core new logic (OTEL API interaction) and has zero test coverage — the ObjectMD tests only cover the model setter/getter, not the capture logic itself.

Key behaviors to cover:
- Returns undefined when no span is active (OTEL not enabled)
- Returns correct traceparent/tracestate when a span exists
- Returns undefined when propagation.inject yields no traceparent

These are straightforward to test by mocking the @opentelemetry/api exports (context, trace, propagation).

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 13, 2026

  • Missing unit tests for captureCurrentTraceContext (lib/storage/metadata/captureTraceContext.ts)
    • The new function has zero direct test coverage — ObjectMD tests only exercise the model setter/getter
    • Add tests mocking @opentelemetry/api to cover: no active span → undefined, active span → correct traceparent/tracestate, propagation yields no traceparent → undefined

Review by Claude Code

@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from ca456c9 to a30bce8 Compare April 14, 2026 07:19
out.tracestate = carrier.tracestate;
}
return out;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

captureCurrentTraceContext has no unit tests. While the ObjectMD setter/getter is well covered, this function is the core integration point with the OTEL API. A test that registers a mock propagator and verifies the output (and the undefined path when no span is active) would catch regressions if the @opentelemetry/api contract changes.

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

LGTM — clean design. One minor gap:

- captureCurrentTraceContext has no unit tests
- Add a test with a mock OTEL propagator to cover both the active-span and no-span paths

Review by Claude Code

Add a traceContext field to ObjectMD carrying the currently-active
OTEL trace context (W3C traceparent/tracestate). Inject it
automatically at the three metadata write chokepoints where originOp
is set today: internalPutObject, repair, and internalDeleteObject.

The value ends up in the MongoDB oplog; downstream consumers
(backbeat, sorbet) can extract it to continue the trace across the
async boundary, closing the loop on end-to-end tracing for flows
that cross the oplog.

When OTEL is not active on the caller (no SDK, or request outside
a traced context), captureCurrentTraceContext returns undefined,
setTraceContext no-ops, and the field stays absent — zero cost.

Only adds @opentelemetry/api as a runtime dependency (the API-only
surface package, becomes a no-op when no SDK is registered).

Issue: ARSN-572
@delthas delthas force-pushed the improvement/ARSN-572/trace-context branch from a30bce8 to 39233c8 Compare April 14, 2026 07:23
out.tracestate = carrier.tracestate;
}
return out;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No unit tests for this function. It contains non-trivial logic (active-span check, propagation injection, traceparent/tracestate filtering) that should be exercised independently of the ObjectMD model tests. Mocking @opentelemetry/api (or using its built-in no-op behavior) would let you cover:
- No active span → returns undefined
- Active span with traceparent → returns { traceparent }
- Active span with traceparent + tracestate → returns both
- Active span but propagation yields no traceparent → returns undefined

— Claude Code

@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

  • Missing unit tests for captureCurrentTraceContext() — the core new function has non-trivial branching logic (no span, span without traceparent, span with/without tracestate) but no dedicated tests. The ObjectMD tests only cover the model layer.
    - Add a test file (e.g. tests/unit/storage/metadata/captureTraceContext.spec.ts) that mocks or uses the OTEL no-op API to exercise all branches.

    Review by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants