Skip to content

fix(sources): prevent windows_event_log permanent freeze from signal-event lost wakeup#25195

Open
tot19 wants to merge 4 commits intovectordotdev:masterfrom
tot19:tot19/fix-windows-event-log-lost-wakeup-25194
Open

fix(sources): prevent windows_event_log permanent freeze from signal-event lost wakeup#25195
tot19 wants to merge 4 commits intovectordotdev:masterfrom
tot19:tot19/fix-windows-event-log-lost-wakeup-25194

Conversation

@tot19
Copy link
Copy Markdown
Contributor

@tot19 tot19 commented Apr 15, 2026

Summary

Fixes a lost-wakeup race in the windows_event_log source that causes it to permanently freeze after a brief period of inactivity (reported in #25194 by @SwimSalt).

The Windows Event Log service signals pull-mode subscriptions by calling SetEvent on a manual-reset event handle each time a new matching event is recorded. Because SetEvent on an already-signaled handle is a no-op, the current "drain then reset" order in pull_events is racy:

  1. EvtNext returns ERROR_NO_MORE_ITEMS — signal still set from before the drain.
  2. A new event arrives. OS calls SetEvent — no-op (already set). OS considers its notification delivered.
  3. Vector calls ResetEvent. Signal cleared. The notification from step 2 is lost.
  4. WaitForMultipleObjects keeps timing out — the event sits unconsumed until the next one arrives.

The fix reorders the signal handling to the well-known-correct Win32 pattern: reset before the drain so any SetEvent raised during the drain window is preserved, causing the next WaitForMultipleObjects to return immediately. To preserve today's "stay awake to revisit this channel" semantic when we break out of the drain early (channel budget exhausted or bookmark update failed mid-batch), we re-SetEvent in those paths.

Microsoft's canonical pull-subscription sample uses the same racy "drain then reset" order and would benefit from the same correction.

Vector configuration

Same config as in #25194:

sources:
  windows_events:
    type: windows_event_log
    channels: [Application, System, Security]

How did you test this PR?

  • cargo check --no-default-features --features sources-windows_event_log --target x86_64-pc-windows-gnu — passes.
  • cargo check --tests ... — tests compile for the Windows target.
  • Added regression test test_signal_not_lost_when_set_during_drain in src/sources/windows_event_log/subscription.rs that pins the invariant using std::sync::Barrier for deterministic thread ordering. With the fix applied the final WaitForSingleObject returns WAIT_OBJECT_0; under the old post-drain reset order it returns WAIT_TIMEOUT.
  • Would appreciate a manual confirmation from @SwimSalt on a real Windows host (see the issue for reproduction steps).

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

@tot19 tot19 requested a review from a team as a code owner April 15, 2026 07:37
@github-actions github-actions Bot added the domain: sources Anything related to the Vector's sources label Apr 15, 2026
@tot19 tot19 changed the title fix(windows_event_log source): prevent permanent freeze from signal-event lost wakeup fix(sources): prevent windows_event_log permanent freeze from signal-event lost wakeup Apr 15, 2026
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
…rden regression test

Address review feedback on vectordotdev#25195:

- Re-SetEvent on the EvtNext error-return path. Without this, a transient
  EvtNext failure leaves the pre-drain-reset signal cleared with events
  possibly still un-drained — the same class of lost-wakeup the PR fixes
  for the success path.

- Replace the primitive-level Barrier test with an integration-style test
  that drives real pull_events against a real EvtSubscribe handle. A new
  cfg(test) DRAIN_STEP_HOOK fires SetEvent on the subscription's signal
  handle inside the drain loop, simulating the OS signaling a fresh event
  mid-drain. After pull_events returns, wait_for_events_blocking must
  return EventsAvailable; under the old post-drain ResetEvent order it
  would time out. This test would now actually catch a reversion.
@github-actions github-actions Bot added the domain: ci Anything related to Vector's CI environment label Apr 15, 2026
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from c1f8c06 to 8fe9083 Compare April 15, 2026 07:55
@github-actions github-actions Bot removed the domain: ci Anything related to Vector's CI environment label Apr 15, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c1f8c06263

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/subscription.rs
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
…ription

Addresses Codex review feedback on vectordotdev#25195. DRAIN_STEP_HOOK is a
process-global and cargo runs tests in parallel, so a concurrent test's
pull_events could trigger this test's one-shot hook first, flip `fired`,
and SetEvent on the wrong signal handle — intermittently timing out
test_pull_events_preserves_setevent_during_drain and perturbing the
sibling pull_events tests.

Capture the subscription's own signal handle up front and make the hook
no-op when invoked for any other handle. Add a cfg(test)-only
first_channel_signal_raw accessor for the test to read it.
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 15, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch 3 times, most recently from 00c9882 to cc6cedb Compare April 15, 2026 09:22
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
…rden regression test

Address review feedback on vectordotdev#25195:

- Re-SetEvent on the EvtNext error-return path. Without this, a transient
  EvtNext failure leaves the pre-drain-reset signal cleared with events
  possibly still un-drained — the same class of lost-wakeup the PR fixes
  for the success path.

- Replace the primitive-level Barrier test with an integration-style test
  that drives real pull_events against a real EvtSubscribe handle. A new
  cfg(test) DRAIN_STEP_HOOK fires SetEvent on the subscription's signal
  handle inside the drain loop, simulating the OS signaling a fresh event
  mid-drain. After pull_events returns, wait_for_events_blocking must
  return EventsAvailable; under the old post-drain ResetEvent order it
  would time out. This test would now actually catch a reversion.
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
…ription

Addresses Codex review feedback on vectordotdev#25195. DRAIN_STEP_HOOK is a
process-global and cargo runs tests in parallel, so a concurrent test's
pull_events could trigger this test's one-shot hook first, flip `fired`,
and SetEvent on the wrong signal handle — intermittently timing out
test_pull_events_preserves_setevent_during_drain and perturbing the
sibling pull_events tests.

Capture the subscription's own signal handle up front and make the hook
no-op when invoked for any other handle. Add a cfg(test)-only
first_channel_signal_raw accessor for the test to read it.
tot19 added a commit to tot19/vector that referenced this pull request Apr 15, 2026
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from cc6cedb to bc62f20 Compare April 15, 2026 09:57
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc62f20803

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/mod.rs Outdated
Comment thread src/sources/windows_event_log/mod.rs
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from bc62f20 to 6a04ccf Compare April 15, 2026 10:16
…keup regression tests

The Windows Event Log service signals the pull-mode wait handle via SetEvent
each time a new matching event is recorded. Because the handle is manual-reset,
any SetEvent that fires between the last EvtNext call and the post-drain
ResetEvent is silently lost — the subscription then hangs until the next OS
event arrives (vectordotdev#25194).

Fix: reset the handle *before* entering the drain loop. Signals raised during
the drain are preserved because SetEvent on an already-signaled handle is a
no-op. Re-arm (SetEvent) on early exits so the next pull_events revisits the
channel without waiting for a fresh OS notification:
  - budget exhaustion
  - bookmark failure mid-batch
  - transient EvtNext error

Regression tests:
- test_pull_events_preserves_setevent_during_drain: installs DRAIN_STEP_HOOK to
  fire SetEvent mid-drain and asserts wait_for_events_blocking returns
  EventsAvailable, not Timeout.
- test_speculative_pull_recovers_without_signal: manually clears the channel
  signal via ResetEvent, confirms wait times out, then asserts pull_events still
  returns events — proving the speculative timeout pull in mod.rs self-heals
  independently of signal state.

Also: comment re-subscription break paths (ERROR_EVT_QUERY_RESULT_STALE and
INVALID_POSITION) noting the speculative pull as a safety net if the re-subbed
channel does not immediately re-signal; add serialization note to DRAIN_STEP_HOOK.
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from 6a04ccf to 873d0c8 Compare April 15, 2026 10:18
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40ef9b5641

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/subscription.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b16c28e09

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/subscription.rs
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch 2 times, most recently from 5bf9fc5 to 1a90ee1 Compare April 17, 2026 08:05
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a90ee1ef9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/mod.rs
…cessing, fix error handling

Four related improvements to mod.rs:

1. Speculative pull on WaitResult::Timeout: call pull_events on every timeout
   cycle as a belt-and-suspenders self-heal. EvtNext returns ERROR_NO_MORE_ITEMS
   immediately on an empty channel (near-zero cost). If events are recovered a
   warning is emitted. Guarantees recovery within one timeout period regardless
   of the root cause of the lost wakeup.

2. Extract with_subscription_blocking helper: wraps the spawn_blocking
   ownership-transfer pattern (move subscription in, run blocking fn, return
   subscription + result). All three blocking calls (wait, normal pull,
   speculative pull) now use this helper instead of inlining spawn_blocking.

3. Extract process_event_batch helper: the parse/emit/send_batch/finalize
   sequence was duplicated verbatim between the EventsAvailable arm and the
   speculative-timeout arm. Extracted into a shared free async function.
   Rate limiting is applied consistently in both paths via the helper.

4. Fix error-handling asymmetry: the speculative pull Err branch previously
   only logged warn! and continued, so a non-recoverable error (access denied,
   channel not found) would spam warnings indefinitely. Now mirrors the
   EventsAvailable path: emit WindowsEventLogQueryError, break on
   non-recoverable errors, apply exponential backoff on recoverable ones.
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from 1a90ee1 to c87bb82 Compare April 17, 2026 08:22
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c87bb825b4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/subscription.rs
Comment thread src/sources/windows_event_log/subscription.rs Outdated
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from c87bb82 to 0e8b88a Compare April 17, 2026 08:33
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e8b88ae66

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/mod.rs
…gainst flakiness

Address two independent flakiness sources in the vectordotdev#25194 regression tests
so the suite is stable on real Windows CI runners.

test_pull_events_preserves_setevent_during_drain:
- Replaced a 1000ms blocking wait with an immediate 0ms poll after
  pull_events returns, so the check measures only the reset/preserve
  behavior of pull_events and is not contaminated by unrelated Windows
  system events signaling the handle during a nonzero wait window.
- Keyed the DRAIN_STEP_HOOK fire-once to the subscription's own signal
  handle so concurrent pull_events calls from other tests can't flip
  the hook first and SetEvent the wrong handle.

test_speculative_pull_recovers_without_signal:
- Same 500ms→0ms poll change, opposite direction: real events arriving
  during the wait would have re-signaled the manually-cleared handle
  and flipped the expected Timeout into a real signal result.
- Seed a deterministic record via 'eventcreate' before subscription
  creation so the non-empty-events assertion is independent of whatever
  backlog the runner happens to have. Freshly provisioned images can
  have an empty Application log, which would otherwise cause
  pull_events(100) to legitimately return empty and false-fail the test.
@tot19 tot19 force-pushed the tot19/fix-windows-event-log-lost-wakeup-25194 branch from 0e8b88a to bb48d3a Compare April 17, 2026 11:04
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb48d3a124

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/sources/windows_event_log/subscription.rs
…sfy Send bound on source future

EventLogSubscription and ChannelSubscription had unsafe impl Send but no
Sync impl. Since &T: Send requires T: Sync, process_event_batch holding
&EventLogSubscription across an .await made the entire source future !Send,
causing a compile error (ICE + future-not-Send) in release builds.

All mutation on these types requires &mut self; &self methods are read-only
or delegate to already-Sync types (RateLimiter). The underlying Windows
kernel handles are safe for concurrent access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@tot19
Copy link
Copy Markdown
Contributor Author

tot19 commented Apr 21, 2026

Hey @pront — wanted to flag this one when you have a moment.

This PR fixes the windows_event_log lost-wakeup freeze reported in #25194. A community member (@SwimSalt) has been testing it on Windows Server and confirmed the freeze is gone. The compilation issues that came up during their testing are now resolved as of the latest commit.

No rush at all — just wanted to make sure it's on your radar or you can point me to the right reviewer if this isn't your area. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: sources Anything related to the Vector's sources

Projects

None yet

Development

Successfully merging this pull request may close these issues.

windows_event_log source permanently freezes after ~3 minutes of inactivity (PR #24305)

1 participant