feat: add TraceLogging ETW telemetry (experimental)#493
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
2e88af1 to
89a7d69
Compare
d348769 to
d6eccb0
Compare
d6eccb0 to
5edf9ed
Compare
5edf9ed to
27c74b9
Compare
87339b9 to
81dc63f
Compare
81dc63f to
39559fd
Compare
Add pure Rust TraceLogging ETW telemetry for MXC using the tracelogging crate. No C++ shim, WIL, or FFI required. - New mxc_telemetry crate with build.rs-generated provider definition - MXC.Execution and MXC.Error events with MICROSOFT_KEYWORD_MEASURES - Common event fields (Version, Channel, IsDebugging, UTCReplace_AppSessionGuid) - PII sanitization for error messages (path stripping, URL credential removal) - Optional provider group GUID via MXC_TELEMETRY_PROVIDER_GROUP_GUID env var - Telemetry gated behind experimental.telemetry.enabled config + --experimental flag - Cross-platform: full ETW on Windows, no-op stubs on Linux/macOS - ETW smoke test script and example config Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
39559fd to
b6e02db
Compare
Address open PR review feedback on the MXC.Error telemetry event: - Data minimization (PII): MXC.Error no longer emits the free-form error message. The event now carries only the bounded `error_type` category and the numeric `exit_code`, so paths, usernames, and credentials cannot leak through telemetry. Removes the best-effort `sanitize_error_message()` char-walker (and its tests) entirely, closing the class of sanitizer gaps. - De-duplication (MGudgin): hoist the ~55-line telemetry emit block that was duplicated verbatim in `wxc` and `lxc` into a single shared `wxc_common::telemetry::emit_completion()`; both binaries now call it. - Docs: reconcile README and the telemetry doc with the stronger guarantee (no free-form error text emitted; only category + exit code). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The doc describes the pure-Rust TraceLogging architecture; the old "wil-integration" filename was a stale leftover from the WIL C++ shim design that was removed during the pivot. Update the README link too. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… and address review #493 Addresses Tiers 1-3 + items 8-9 of the PR #493 adversarial review: - build.rs: gate MXC_TELEMETRY_PROVIDER_GROUP_GUID validation on Windows targets only so a stray/malformed env var can't break cross-platform builds; extract pure code-gen into provider_codegen.rs and unit-test it. - mxc_telemetry Cargo.toml: inherit tracelogging from [workspace.dependencies] (pinned 1.2) instead of a crate-local '1'. - telemetry/mod.rs: use ContainmentBackend::wire_name() instead of a duplicated backend_name() map; add emit_early_exit() and a classify_failure mapping unit test. - mxc_telemetry lib.rs: serialize register/unregister under a Mutex and honor register()'s status (return false on failure) so wrapper state can't diverge. - wxc/lxc main.rs: route post-init config/policy/init early-exit failures through telemetry::emit_early_exit so they emit MXC.Execution/MXC.Error (makes ConfigError/PolicyError reachable). - docs/telemetry.md: narrow the 'every execution' claim (state-aware not instrumented), mark the pipeline-GUID wiring as a follow-up. - smoke test: derive the provider GUID from the name, fix the init/emit comment, and fail (not 'inconclusive') on zero captured events. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Follow-up: seed racelogging into the ADO MxcDependencies feed (review item #4) This PR adds two new transitive crates that resolve from crates.io on GitHub Actions (hence all GH checks are green), but will 401 on the anonymous ADO MxcDependencies feed until an authenticated fetch seeds them:
Both are gated behind [target.'cfg(windows)'.dependencies], so only Windows ADO jobs pull them. The official build uses the separate internal Mxc-Azure-Feed and is unaffected; the fork-PR / Unofficial / Lint / Mac jobs that use the anonymous public feed are the ones that need the seed. @bbonaby — could you trigger a seeding fetch (or confirm the feed already mirrors these) before this merges, so the ADO MXC-PR-Build pipeline doesn't 401 on the new crates? No code change is required here. |
|
Deferred to a separate PR: docs/schema.md version-table drift (review item #2) The review flagged that docs/schema.md's supported-version table is out of sync with the canonical schemas/schema-version.json. This drift is pre-existing — this PR only added 3 telemetry-related lines and did not touch the version table — so I'm intentionally not fixing it here to keep the telemetry change focused. Tracking it as a follow-up doc PR rather than expanding the scope of this one. |
| /// Privacy data tag for Product and Service Usage. | ||
| /// Same value as WIL's `MicrosoftTelemetry.h` `PDT_ProductAndServiceUsage`. | ||
| /// Applied via a `PartA_PrivTags` field per the `TelemetryPrivacyDataTag` pattern. | ||
| pub(crate) const PDT_PRODUCT_AND_SERVICE_USAGE: u64 = 0x0000_0000_0200_0000; |
There was a problem hiding this comment.
Is this the only privacy tag that will be used?
There was a problem hiding this comment.
For now yes, given that we've only got a very few events that conform to this type. Given that the pure Rust approach was encouraged, I'm trying to thread a needle between having too little of WIL and reimplementing all of WIL on top of the Rust TraceLogging crates.
Future work should probably including doing a translation of WIL into Rust, but that's probably more than we want to bite off here.
| str8("Version", &state.version), | ||
| str8("Channel", &state.channel), | ||
| bool8("IsDebugging", &is_debug_build), | ||
| bool8("UTCReplace_AppSessionGuid", &true), |
There was a problem hiding this comment.
Why are these needed here? These were explicitly requested for WinAppSDK, I don't even know if Channel is a concept in MXC. Similar question about these other common params, maybe there are other common params that are needed for MXC
There was a problem hiding this comment.
This is intentional in that Channel and IsDebug build are taken from the WinAppSDK because it seemed like a good way to differentiate between telemetry being emitted from an a build under test versus telemetry being emitted from a build that's published to the world.
| tracelogging::write_event!( | ||
| MXC_PROVIDER, | ||
| "MXC.Error", | ||
| level(Warning), |
There was a problem hiding this comment.
This is an error even type but logged in as warning. Are there any critical events that need to be ALWAYS logged in MXC (e.g., crashes)? If yes, you should log them accordingly.
There was a problem hiding this comment.
Added a comment around the intention behind the levels we're choosing. The name "Error" here is admittedly non-descript and is meant as an indication of infrastructure failure distinct from workload exit status.
A couple of things that will need to be captured in future work:
- We'll need to have a future PR to capture the lifecycle of IsolationSession.
- We'll need to install std::panic::hook to capture MXC panics in telemetry.
Add comments explaining why MXC.Execution is Informational (routine completion record) and MXC.Error is Warning rather than Error/Critical (it reports an expected operational failure of a sandboxed run, not an MXC defect; downstream pipelines treat Error/Critical as product faults). No behavior change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| @@ -0,0 +1,177 @@ | |||
| # MXC Telemetry — Pure Rust TraceLogging Architecture | |||
There was a problem hiding this comment.
suggestion: can we move this into a docs/telemetry folder?
| | `Version` | string | MXC crate version from `CARGO_PKG_VERSION` | | ||
| | `Channel` | string | `"dev"` for debug builds, `"release"` for release | | ||
| | `IsDebugging` | bool | `cfg!(debug_assertions)` — true for debug builds | | ||
| | `UTCReplace_AppSessionGuid` | bool | Always `true` — tells UTC to replace the app session GUID with a per-session identifier for privacy | |
There was a problem hiding this comment.
question: app session guid and per-session identifier what's the difference?
| /// Validates that `s` is a well-formed GUID (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx). | ||
| /// Prevents code injection via the environment variable since the value is | ||
| /// interpolated into generated Rust source that is `include!()`'d. | ||
| fn is_valid_guid(s: &str) -> bool { |
There was a problem hiding this comment.
We can actually replace this function by using the uuid crate directly for this.
// Example returns true if valid and false otherwise
Uuid::parse_str("550e8400-e29b-41d4-a716-446655440000").is_ok() |
|
||
| ### Data collection | ||
|
|
||
| This project may collect usage data and send it to Microsoft to help improve our products and services. Note, however, that **no data collection is performed for local or open-source builds by default** — the public source code ships without a TraceLogging provider group GUID, so events are emitted to the local ETW subsystem only and are not routed to any Microsoft collection pipeline. Internal builds that set the `MXC_TELEMETRY_PROVIDER_GROUP_GUID` environment variable at build time will route events through the UTC pipeline. |
There was a problem hiding this comment.
SHould we make sure we clearly say that shipped production code will send data to Micorsoft?
| "commandLine": "cmd.exe /c echo Hello from telemetry-enabled sandbox" | ||
| }, | ||
| "processContainer": { | ||
| "capabilities": ["permissiveLearningMode"] |
There was a problem hiding this comment.
question: is the permissive learning mode capability needed in here?
| // executable (not a DLL), so unload ordering is not a concern. | ||
| let status = unsafe { MXC_PROVIDER.register() }; | ||
| if status != 0 { | ||
| // Registration failed; leave `*registered` false so the wrapper |
There was a problem hiding this comment.
question: should we use the logger we have in this code base to log the error to the console if available?
| // The presence of an error message signals an infrastructure error (as | ||
| // opposed to a script that merely exited non-zero). We use it only as a | ||
| // boolean signal — the message text itself is never emitted. |
There was a problem hiding this comment.
thought: Not for this PR, but I think we should probably make sure MXC never exits with an error code without first printing an error message. Nothing we can do about not printing before panics but better to solidify our error handling to prevent any second guessing the telemetry we output.
| | Provider name `"Microsoft.MXC"` | ✅ | Standard ETW naming | | ||
| | Provider GUID `{7f10def4-a258-5fea-510e-2c3bb976687f}` | ✅ | Derived from the name; identifies the provider, harmless | | ||
| | `build.rs` env var mechanism | ✅ | Mechanism is public | | ||
| | `MXC_TELEMETRY_PROVIDER_GROUP_GUID` env var name | ✅ | Key is public; value is private | |
There was a problem hiding this comment.
I just set this up as a pipeline variable for our official pipeline here you should be able to click the variables button at the top right and see/confirm the guid I added.
Based on what you have in your build.rs below I believe this should actually work out the box as that environment variable should be visible to cargo at build time. I'll send you a link to the sharepoint which will show you how you can run an official build manually using this local branch.
📖 Description
Add experimental telemetry support using the tracelogging crate for pure Rust ETW event emission. No C++ shim or WIL dependency required.
Build integration (pipeline GUID substitution, license override script) will follow in a separate PR.
🔗 References
#492
🔍 Validation
Built locally and ran unit tests.
Validated that ETW events fire when build configuration is not set to allow telemetry.
Validated that telemetry appears as expected in UTC when telemetry provider is correctly enabled.
✅ Checklist
📋 Issue Type
Microsoft reviewers: PR builds do not auto-run (ADO policy). Comment /azp run to start MXC-PR-Build. See docs/pull-requests.md.
Microsoft Reviewers: Open in CodeFlow