fix(sources): prevent windows_event_log permanent freeze from signal-event lost wakeup#25195
fix(sources): prevent windows_event_log permanent freeze from signal-event lost wakeup#25195tot19 wants to merge 4 commits intovectordotdev:masterfrom
Conversation
…rden regression test Address review feedback on vectordotdev#25195: - Re-SetEvent on the EvtNext error-return path. Without this, a transient EvtNext failure leaves the pre-drain-reset signal cleared with events possibly still un-drained — the same class of lost-wakeup the PR fixes for the success path. - Replace the primitive-level Barrier test with an integration-style test that drives real pull_events against a real EvtSubscribe handle. A new cfg(test) DRAIN_STEP_HOOK fires SetEvent on the subscription's signal handle inside the drain loop, simulating the OS signaling a fresh event mid-drain. After pull_events returns, wait_for_events_blocking must return EventsAvailable; under the old post-drain ResetEvent order it would time out. This test would now actually catch a reversion.
c1f8c06 to
8fe9083
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1f8c06263
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…ription Addresses Codex review feedback on vectordotdev#25195. DRAIN_STEP_HOOK is a process-global and cargo runs tests in parallel, so a concurrent test's pull_events could trigger this test's one-shot hook first, flip `fired`, and SetEvent on the wrong signal handle — intermittently timing out test_pull_events_preserves_setevent_during_drain and perturbing the sibling pull_events tests. Capture the subscription's own signal handle up front and make the hook no-op when invoked for any other handle. Add a cfg(test)-only first_channel_signal_raw accessor for the test to read it.
Fixes check-fmt CI failure on vectordotdev#25195.
|
All contributors have signed the CLA ✍️ ✅ |
00c9882 to
cc6cedb
Compare
…rden regression test Address review feedback on vectordotdev#25195: - Re-SetEvent on the EvtNext error-return path. Without this, a transient EvtNext failure leaves the pre-drain-reset signal cleared with events possibly still un-drained — the same class of lost-wakeup the PR fixes for the success path. - Replace the primitive-level Barrier test with an integration-style test that drives real pull_events against a real EvtSubscribe handle. A new cfg(test) DRAIN_STEP_HOOK fires SetEvent on the subscription's signal handle inside the drain loop, simulating the OS signaling a fresh event mid-drain. After pull_events returns, wait_for_events_blocking must return EventsAvailable; under the old post-drain ResetEvent order it would time out. This test would now actually catch a reversion.
…ription Addresses Codex review feedback on vectordotdev#25195. DRAIN_STEP_HOOK is a process-global and cargo runs tests in parallel, so a concurrent test's pull_events could trigger this test's one-shot hook first, flip `fired`, and SetEvent on the wrong signal handle — intermittently timing out test_pull_events_preserves_setevent_during_drain and perturbing the sibling pull_events tests. Capture the subscription's own signal handle up front and make the hook no-op when invoked for any other handle. Add a cfg(test)-only first_channel_signal_raw accessor for the test to read it.
Fixes check-fmt CI failure on vectordotdev#25195.
cc6cedb to
bc62f20
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bc62f20803
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
bc62f20 to
6a04ccf
Compare
…keup regression tests The Windows Event Log service signals the pull-mode wait handle via SetEvent each time a new matching event is recorded. Because the handle is manual-reset, any SetEvent that fires between the last EvtNext call and the post-drain ResetEvent is silently lost — the subscription then hangs until the next OS event arrives (vectordotdev#25194). Fix: reset the handle *before* entering the drain loop. Signals raised during the drain are preserved because SetEvent on an already-signaled handle is a no-op. Re-arm (SetEvent) on early exits so the next pull_events revisits the channel without waiting for a fresh OS notification: - budget exhaustion - bookmark failure mid-batch - transient EvtNext error Regression tests: - test_pull_events_preserves_setevent_during_drain: installs DRAIN_STEP_HOOK to fire SetEvent mid-drain and asserts wait_for_events_blocking returns EventsAvailable, not Timeout. - test_speculative_pull_recovers_without_signal: manually clears the channel signal via ResetEvent, confirms wait times out, then asserts pull_events still returns events — proving the speculative timeout pull in mod.rs self-heals independently of signal state. Also: comment re-subscription break paths (ERROR_EVT_QUERY_RESULT_STALE and INVALID_POSITION) noting the speculative pull as a safety net if the re-subbed channel does not immediately re-signal; add serialization note to DRAIN_STEP_HOOK.
6a04ccf to
873d0c8
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40ef9b5641
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b16c28e09
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
5bf9fc5 to
1a90ee1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a90ee1ef9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…cessing, fix error handling Four related improvements to mod.rs: 1. Speculative pull on WaitResult::Timeout: call pull_events on every timeout cycle as a belt-and-suspenders self-heal. EvtNext returns ERROR_NO_MORE_ITEMS immediately on an empty channel (near-zero cost). If events are recovered a warning is emitted. Guarantees recovery within one timeout period regardless of the root cause of the lost wakeup. 2. Extract with_subscription_blocking helper: wraps the spawn_blocking ownership-transfer pattern (move subscription in, run blocking fn, return subscription + result). All three blocking calls (wait, normal pull, speculative pull) now use this helper instead of inlining spawn_blocking. 3. Extract process_event_batch helper: the parse/emit/send_batch/finalize sequence was duplicated verbatim between the EventsAvailable arm and the speculative-timeout arm. Extracted into a shared free async function. Rate limiting is applied consistently in both paths via the helper. 4. Fix error-handling asymmetry: the speculative pull Err branch previously only logged warn! and continued, so a non-recoverable error (access denied, channel not found) would spam warnings indefinitely. Now mirrors the EventsAvailable path: emit WindowsEventLogQueryError, break on non-recoverable errors, apply exponential backoff on recoverable ones.
1a90ee1 to
c87bb82
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c87bb825b4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
c87bb82 to
0e8b88a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0e8b88ae66
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…gainst flakiness Address two independent flakiness sources in the vectordotdev#25194 regression tests so the suite is stable on real Windows CI runners. test_pull_events_preserves_setevent_during_drain: - Replaced a 1000ms blocking wait with an immediate 0ms poll after pull_events returns, so the check measures only the reset/preserve behavior of pull_events and is not contaminated by unrelated Windows system events signaling the handle during a nonzero wait window. - Keyed the DRAIN_STEP_HOOK fire-once to the subscription's own signal handle so concurrent pull_events calls from other tests can't flip the hook first and SetEvent the wrong handle. test_speculative_pull_recovers_without_signal: - Same 500ms→0ms poll change, opposite direction: real events arriving during the wait would have re-signaled the manually-cleared handle and flipped the expected Timeout into a real signal result. - Seed a deterministic record via 'eventcreate' before subscription creation so the non-empty-events assertion is independent of whatever backlog the runner happens to have. Freshly provisioned images can have an empty Application log, which would otherwise cause pull_events(100) to legitimately return empty and false-fail the test.
0e8b88a to
bb48d3a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb48d3a124
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…sfy Send bound on source future EventLogSubscription and ChannelSubscription had unsafe impl Send but no Sync impl. Since &T: Send requires T: Sync, process_event_batch holding &EventLogSubscription across an .await made the entire source future !Send, causing a compile error (ICE + future-not-Send) in release builds. All mutation on these types requires &mut self; &self methods are read-only or delegate to already-Sync types (RateLimiter). The underlying Windows kernel handles are safe for concurrent access. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hey @pront — wanted to flag this one when you have a moment. This PR fixes the No rush at all — just wanted to make sure it's on your radar or you can point me to the right reviewer if this isn't your area. Thanks! |
Summary
Fixes a lost-wakeup race in the
windows_event_logsource that causes it to permanently freeze after a brief period of inactivity (reported in #25194 by @SwimSalt).The Windows Event Log service signals pull-mode subscriptions by calling
SetEventon a manual-reset event handle each time a new matching event is recorded. BecauseSetEventon an already-signaled handle is a no-op, the current "drain then reset" order inpull_eventsis racy:EvtNextreturnsERROR_NO_MORE_ITEMS— signal still set from before the drain.SetEvent— no-op (already set). OS considers its notification delivered.ResetEvent. Signal cleared. The notification from step 2 is lost.WaitForMultipleObjectskeeps timing out — the event sits unconsumed until the next one arrives.The fix reorders the signal handling to the well-known-correct Win32 pattern: reset before the drain so any
SetEventraised during the drain window is preserved, causing the nextWaitForMultipleObjectsto return immediately. To preserve today's "stay awake to revisit this channel" semantic when we break out of the drain early (channel budget exhausted or bookmark update failed mid-batch), we re-SetEventin those paths.Microsoft's canonical pull-subscription sample uses the same racy "drain then reset" order and would benefit from the same correction.
Vector configuration
Same config as in #25194:
How did you test this PR?
cargo check --no-default-features --features sources-windows_event_log --target x86_64-pc-windows-gnu— passes.cargo check --tests ...— tests compile for the Windows target.test_signal_not_lost_when_set_during_draininsrc/sources/windows_event_log/subscription.rsthat pins the invariant usingstd::sync::Barrierfor deterministic thread ordering. With the fix applied the finalWaitForSingleObjectreturnsWAIT_OBJECT_0; under the old post-drain reset order it returnsWAIT_TIMEOUT.Change Type
Is this a breaking change?
Does this PR include user facing changes?
no-changeloglabel to this PR.References