Skip to content

feat: add TraceLogging ETW telemetry (experimental)#493

Open
RamonArjona4 wants to merge 5 commits into
mainfrom
user/ramonarjona4/tracelogging
Open

feat: add TraceLogging ETW telemetry (experimental)#493
RamonArjona4 wants to merge 5 commits into
mainfrom
user/ramonarjona4/tracelogging

Conversation

@RamonArjona4

@RamonArjona4 RamonArjona4 commented Jun 4, 2026

Copy link
Copy Markdown
Member

📖 Description

Add experimental telemetry support using the tracelogging crate for pure Rust ETW event emission. No C++ shim or WIL dependency required.

  • New mxc_telemetry crate with define_provider! and write_event! for MXC.Execution and MXC.Error events using COMMON_MXC_PARAMS struct fields.
  • build.rs generates provider_def.rs with optional group_id based on MXC_TELEMETRY_PROVIDER_GROUP_GUID environment variable for internal builds.
  • wxc_common::telemetry module includes config resolution, sanitize_error_message() (strips Windows paths, Unix paths, UNC paths, and URL credentials), and event types with a bounded FailureReason enum.
  • Schema, config parser, and SDK types are extended with telemetry.enabled.
  • Telemetry call sites in wxc and lxc binaries emit execution and error events.
  • Experimental feature requires the --experimental flag and telemetry.enabled in the JSON config.
  • On non-Windows platforms, telemetry functions compile as no-ops.
  • Design doc at docs/telemetry.md includes architecture, feature comparison table, and GUID override mechanism.

Build integration (pipeline GUID substitution, license override script) will follow in a separate PR.

🔗 References

#492

🔍 Validation

Built locally and ran unit tests.

Validated that ETW events fire when build configuration is not set to allow telemetry.

Validated that telemetry appears as expected in UTC when telemetry provider is correctly enabled.

✅ Checklist

  • Signed the Contributor License Agreement
  • Linked to an issue
  • Updated documentation (if applicable)
  • Updated Copilot instructions (if build, architecture, or conventions changed)

📋 Issue Type

  • Bug fix
  • Feature
  • Task

Microsoft reviewers: PR builds do not auto-run (ADO policy). Comment /azp run to start MXC-PR-Build. See docs/pull-requests.md.

Microsoft Reviewers: Open in CodeFlow

Copilot AI review requested due to automatic review settings June 4, 2026 00:11

This comment was marked as resolved.

Comment thread .azure-pipelines/templates/Rust.Build.Job.yml Outdated
Comment thread .azure-pipelines/templates/Rust.Build.Job.yml Outdated
Comment thread docs/telemetry-wil-integration.md Outdated
Comment thread docs/telemetry-wil-integration.md Outdated
Comment thread docs/telemetry-wil-integration.md Outdated
Comment thread src/mxc_wil_telemetry/shim/mxc_telemetry_shim.cpp Outdated
Comment thread tests/examples/28_telemetry_enabled.json Outdated
@microsoft-github-policy-service microsoft-github-policy-service Bot added the Needs-Author-Feedback Issue needs attention from issue or PR author label Jun 5, 2026
@MGudgin

This comment was marked as resolved.

@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from 2e88af1 to 89a7d69 Compare June 9, 2026 19:29
@RamonArjona4 RamonArjona4 requested a review from Copilot June 9, 2026 19:44
@microsoft-github-policy-service microsoft-github-policy-service Bot added Needs-Attention Issue needs attention from Microsoft and removed Needs-Author-Feedback Issue needs attention from issue or PR author labels Jun 9, 2026
Comment thread docs/telemetry-wil-integration.md Outdated

This comment was marked as resolved.

@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch 7 times, most recently from d348769 to d6eccb0 Compare June 9, 2026 23:12
@RamonArjona4 RamonArjona4 requested a review from Copilot June 9, 2026 23:23
@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from d6eccb0 to 5edf9ed Compare June 9, 2026 23:27

This comment was marked as resolved.

@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from 5edf9ed to 27c74b9 Compare June 9, 2026 23:55

This comment was marked as resolved.

@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from 87339b9 to 81dc63f Compare June 16, 2026 03:47
@RamonArjona4 RamonArjona4 marked this pull request as ready for review June 16, 2026 17:36
@RamonArjona4 RamonArjona4 requested a review from a team as a code owner June 16, 2026 17:36
@RamonArjona4 RamonArjona4 requested a review from bbonaby June 16, 2026 17:36
@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from 81dc63f to 39559fd Compare June 17, 2026 21:38
@RamonArjona4 RamonArjona4 removed the Needs-Attention Issue needs attention from Microsoft label Jun 17, 2026
Add pure Rust TraceLogging ETW telemetry for MXC using the tracelogging
crate. No C++ shim, WIL, or FFI required.

- New mxc_telemetry crate with build.rs-generated provider definition
- MXC.Execution and MXC.Error events with MICROSOFT_KEYWORD_MEASURES
- Common event fields (Version, Channel, IsDebugging, UTCReplace_AppSessionGuid)
- PII sanitization for error messages (path stripping, URL credential removal)
- Optional provider group GUID via MXC_TELEMETRY_PROVIDER_GROUP_GUID env var
- Telemetry gated behind experimental.telemetry.enabled config + --experimental flag
- Cross-platform: full ETW on Windows, no-op stubs on Linux/macOS
- ETW smoke test script and example config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RamonArjona4 RamonArjona4 force-pushed the user/ramonarjona4/tracelogging branch from 39559fd to b6e02db Compare June 17, 2026 21:44
Comment thread src/core/lxc/src/main.rs Outdated
RamonArjona4 and others added 2 commits June 23, 2026 11:35
Address open PR review feedback on the MXC.Error telemetry event:

- Data minimization (PII): MXC.Error no longer emits the free-form error
  message. The event now carries only the bounded `error_type` category and
  the numeric `exit_code`, so paths, usernames, and credentials cannot leak
  through telemetry. Removes the best-effort `sanitize_error_message()`
  char-walker (and its tests) entirely, closing the class of sanitizer gaps.
- De-duplication (MGudgin): hoist the ~55-line telemetry emit block that was
  duplicated verbatim in `wxc` and `lxc` into a single shared
  `wxc_common::telemetry::emit_completion()`; both binaries now call it.
- Docs: reconcile README and the telemetry doc with the stronger guarantee
  (no free-form error text emitted; only category + exit code).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The doc describes the pure-Rust TraceLogging architecture; the old
"wil-integration" filename was a stale leftover from the WIL C++ shim
design that was removed during the pivot. Update the README link too.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RamonArjona4 RamonArjona4 changed the title feat: add WIL-based TraceLogging ETW telemetry (experimental) feat: add TraceLogging ETW telemetry (experimental) Jun 23, 2026
Comment thread docs/telemetry.md
Comment thread docs/telemetry.md
… and address review #493

Addresses Tiers 1-3 + items 8-9 of the PR #493 adversarial review:

- build.rs: gate MXC_TELEMETRY_PROVIDER_GROUP_GUID validation on Windows
  targets only so a stray/malformed env var can't break cross-platform
  builds; extract pure code-gen into provider_codegen.rs and unit-test it.
- mxc_telemetry Cargo.toml: inherit tracelogging from [workspace.dependencies]
  (pinned 1.2) instead of a crate-local '1'.
- telemetry/mod.rs: use ContainmentBackend::wire_name() instead of a
  duplicated backend_name() map; add emit_early_exit() and a classify_failure
  mapping unit test.
- mxc_telemetry lib.rs: serialize register/unregister under a Mutex and honor
  register()'s status (return false on failure) so wrapper state can't diverge.
- wxc/lxc main.rs: route post-init config/policy/init early-exit failures
  through telemetry::emit_early_exit so they emit MXC.Execution/MXC.Error
  (makes ConfigError/PolicyError reachable).
- docs/telemetry.md: narrow the 'every execution' claim (state-aware not
  instrumented), mark the pipeline-GUID wiring as a follow-up.
- smoke test: derive the provider GUID from the name, fix the init/emit
  comment, and fail (not 'inconclusive') on zero captured events.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@RamonArjona4

Copy link
Copy Markdown
Member Author

Follow-up: seed racelogging into the ADO MxcDependencies feed (review item #4)

This PR adds two new transitive crates that resolve from crates.io on GitHub Actions (hence all GH checks are green), but will 401 on the anonymous ADO MxcDependencies feed until an authenticated fetch seeds them:

  • racelogging 1.2.4
  • racelogging_macros 1.2.3

Both are gated behind [target.'cfg(windows)'.dependencies], so only Windows ADO jobs pull them. The official build uses the separate internal Mxc-Azure-Feed and is unaffected; the fork-PR / Unofficial / Lint / Mac jobs that use the anonymous public feed are the ones that need the seed.

@bbonaby — could you trigger a seeding fetch (or confirm the feed already mirrors these) before this merges, so the ADO MXC-PR-Build pipeline doesn't 401 on the new crates? No code change is required here.

@RamonArjona4

Copy link
Copy Markdown
Member Author

Deferred to a separate PR: docs/schema.md version-table drift (review item #2)

The review flagged that docs/schema.md's supported-version table is out of sync with the canonical schemas/schema-version.json. This drift is pre-existing — this PR only added 3 telemetry-related lines and did not touch the version table — so I'm intentionally not fixing it here to keep the telemetry change focused.

Tracking it as a follow-up doc PR rather than expanding the scope of this one.

/// Privacy data tag for Product and Service Usage.
/// Same value as WIL's `MicrosoftTelemetry.h` `PDT_ProductAndServiceUsage`.
/// Applied via a `PartA_PrivTags` field per the `TelemetryPrivacyDataTag` pattern.
pub(crate) const PDT_PRODUCT_AND_SERVICE_USAGE: u64 = 0x0000_0000_0200_0000;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only privacy tag that will be used?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now yes, given that we've only got a very few events that conform to this type. Given that the pure Rust approach was encouraged, I'm trying to thread a needle between having too little of WIL and reimplementing all of WIL on top of the Rust TraceLogging crates.

Future work should probably including doing a translation of WIL into Rust, but that's probably more than we want to bite off here.

str8("Version", &state.version),
str8("Channel", &state.channel),
bool8("IsDebugging", &is_debug_build),
bool8("UTCReplace_AppSessionGuid", &true),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these needed here? These were explicitly requested for WinAppSDK, I don't even know if Channel is a concept in MXC. Similar question about these other common params, maybe there are other common params that are needed for MXC

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional in that Channel and IsDebug build are taken from the WinAppSDK because it seemed like a good way to differentiate between telemetry being emitted from an a build under test versus telemetry being emitted from a build that's published to the world.

tracelogging::write_event!(
MXC_PROVIDER,
"MXC.Error",
level(Warning),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an error even type but logged in as warning. Are there any critical events that need to be ALWAYS logged in MXC (e.g., crashes)? If yes, you should log them accordingly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment around the intention behind the levels we're choosing. The name "Error" here is admittedly non-descript and is meant as an indication of infrastructure failure distinct from workload exit status.
A couple of things that will need to be captured in future work:

  • We'll need to have a future PR to capture the lifecycle of IsolationSession.
  • We'll need to install std::panic::hook to capture MXC panics in telemetry.

Add comments explaining why MXC.Execution is Informational (routine completion
record) and MXC.Error is Warning rather than Error/Critical (it reports an
expected operational failure of a sandboxed run, not an MXC defect; downstream
pipelines treat Error/Critical as product faults). No behavior change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread docs/telemetry.md
@@ -0,0 +1,177 @@
# MXC Telemetry — Pure Rust TraceLogging Architecture

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: can we move this into a docs/telemetry folder?

| `Version` | string | MXC crate version from `CARGO_PKG_VERSION` |
| `Channel` | string | `"dev"` for debug builds, `"release"` for release |
| `IsDebugging` | bool | `cfg!(debug_assertions)` — true for debug builds |
| `UTCReplace_AppSessionGuid` | bool | Always `true` — tells UTC to replace the app session GUID with a per-session identifier for privacy |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: app session guid and per-session identifier what's the difference?

/// Validates that `s` is a well-formed GUID (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).
/// Prevents code injection via the environment variable since the value is
/// interpolated into generated Rust source that is `include!()`'d.
fn is_valid_guid(s: &str) -> bool {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually replace this function by using the uuid crate directly for this.

// Example returns true if valid and false otherwise
Uuid::parse_str("550e8400-e29b-41d4-a716-446655440000").is_ok() 

Comment thread README.md

### Data collection

This project may collect usage data and send it to Microsoft to help improve our products and services. Note, however, that **no data collection is performed for local or open-source builds by default** — the public source code ships without a TraceLogging provider group GUID, so events are emitted to the local ETW subsystem only and are not routed to any Microsoft collection pipeline. Internal builds that set the `MXC_TELEMETRY_PROVIDER_GROUP_GUID` environment variable at build time will route events through the UTC pipeline.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHould we make sure we clearly say that shipped production code will send data to Micorsoft?

"commandLine": "cmd.exe /c echo Hello from telemetry-enabled sandbox"
},
"processContainer": {
"capabilities": ["permissiveLearningMode"]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is the permissive learning mode capability needed in here?

// executable (not a DLL), so unload ordering is not a concern.
let status = unsafe { MXC_PROVIDER.register() };
if status != 0 {
// Registration failed; leave `*registered` false so the wrapper

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: should we use the logger we have in this code base to log the error to the console if available?

Comment on lines +111 to +113
// The presence of an error message signals an infrastructure error (as
// opposed to a script that merely exited non-zero). We use it only as a
// boolean signal — the message text itself is never emitted.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: Not for this PR, but I think we should probably make sure MXC never exits with an error code without first printing an error message. Nothing we can do about not printing before panics but better to solidify our error handling to prevent any second guessing the telemetry we output.

Comment thread docs/telemetry.md
| Provider name `"Microsoft.MXC"` | ✅ | Standard ETW naming |
| Provider GUID `{7f10def4-a258-5fea-510e-2c3bb976687f}` | ✅ | Derived from the name; identifies the provider, harmless |
| `build.rs` env var mechanism | ✅ | Mechanism is public |
| `MXC_TELEMETRY_PROVIDER_GROUP_GUID` env var name | ✅ | Key is public; value is private |

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just set this up as a pipeline variable for our official pipeline here you should be able to click the variables button at the top right and see/confirm the guid I added.

Based on what you have in your build.rs below I believe this should actually work out the box as that environment variable should be visible to cargo at build time. I'll send you a link to the sharepoint which will show you how you can run an official build manually using this local branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants