
fix(prometheus_remote_write sink): prevent deadlock from finalizer leak in metric normalization cache#25214

Open
GreyLilac09 wants to merge 3 commits into vectordotdev:master from GreyLilac09:dhz/strip-finalizer

Conversation


@GreyLilac09 (Contributor) commented Apr 16, 2026

Summary

The prometheus_remote_write sink deadlocks after 10-15 minutes of operation when using a disk buffer. This is a regression introduced in 0.49.0 when the MetricSet normalization cache was rewritten to use LruCache<MetricSeries, MetricEntry> to support TTL-based
expiration (expire_metrics_secs).

Root cause: The normalization methods incremental_to_absolute() and absolute_to_incremental() store metrics in the MetricSet cache with their Arc<EventFinalizer> references intact. The disk buffer can only acknowledge an event once ALL Arc references to
its finalizer are dropped. The cache holds one reference per unique series indefinitely (until replaced by a newer metric for the same series). Once the disk buffer fills and drop_newest starts dropping new events, no new metrics enter the normalizer, so no cache
replacements occur, no finalizers are released, no buffer space is reclaimed, and the pipeline permanently stalls.
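The reference-counting mechanics behind the deadlock can be sketched with a toy std-only model (an Arc stands in for the finalizer; none of Vector's real types are used here). The buffer can only acknowledge once the strong count falls back to one, and the cache entry keeps it pinned above that until the entry is replaced or evicted:

```rust
use std::sync::Arc;

// Toy model (std-only, not Vector's real types): a finalizer is an Arc,
// and the disk buffer can acknowledge an event only once every clone of
// that Arc has been dropped.

/// Returns (refcount while the cache holds a clone, refcount after eviction).
fn refcounts_with_cache() -> (usize, usize) {
    let notifier = Arc::new(());

    // The event carries a finalizer reference; the normalization cache
    // keeps another clone per unique series.
    let event_finalizer = Arc::clone(&notifier);
    let cached_finalizer = Arc::clone(&event_finalizer);

    drop(event_finalizer); // the event finishes its trip through the pipeline
    let while_cached = Arc::strong_count(&notifier); // still 2: the cache pins it

    drop(cached_finalizer); // only replacement/eviction releases the reference
    let after_eviction = Arc::strong_count(&notifier); // back to 1: ack can fire
    (while_cached, after_eviction)
}

fn main() {
    let (while_cached, after_eviction) = refcounts_with_cache();
    assert_eq!((while_cached, after_eviction), (2, 1));
    println!("while cached: {while_cached}, after eviction: {after_eviction}");
}
```

Once drop_newest kicks in, the `drop(cached_finalizer)` step never happens for any series, so every count stays stuck above one.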

Fix 1 (deadlock): Strip finalizers from metrics before storing them in the normalization cache. This is done at the call sites in incremental_to_absolute (where clones are cached) and absolute_to_incremental (where consumed metrics are cached as baselines),
rather than in the generic MetricSet::insert helper — because insert is also used by the MetricsBuffer batching path (sematext, influxdb, aws_cloudwatch_metrics) where cached metrics are later retrieved with their finalizers and sent through the pipeline.

The stripping is safe because:

  • In incremental_to_absolute: the original metric retains its finalizers and flows through the pipeline normally; only the clone stored in the cache is stripped.
  • In absolute_to_incremental: the metric is consumed (function returns None) to establish a baseline — it never enters the pipeline, so dropping a finalizer with Dropped status (the default) does not change the batch's Delivered outcome.
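The call-site change can be sketched with stand-in types (the names `metadata_mut()`/`take_finalizers()` mirror Vector's API, but `Metric` and `Metadata` here are toy structs, not the real ones):

```rust
use std::mem;
use std::sync::Arc;

// Toy stand-ins for Vector's Metric/EventMetadata types.
#[derive(Clone)]
struct Metadata { finalizers: Vec<Arc<()>> }
impl Metadata {
    fn take_finalizers(&mut self) -> Vec<Arc<()>> { mem::take(&mut self.finalizers) }
}
#[derive(Clone)]
struct Metric { metadata: Metadata }
impl Metric {
    fn metadata_mut(&mut self) -> &mut Metadata { &mut self.metadata }
}

/// incremental_to_absolute caches a clone of the metric; stripping the
/// clone's finalizers keeps the cache from pinning the disk-buffer ack.
/// Returns (finalizers left on the original, finalizers on the cached clone).
fn strip_clone_for_cache(metric: &Metric) -> (usize, usize) {
    let mut cached = metric.clone();
    cached.metadata_mut().take_finalizers();
    (metric.metadata.finalizers.len(), cached.metadata.finalizers.len())
}

fn main() {
    let metric = Metric { metadata: Metadata { finalizers: vec![Arc::new(())] } };
    let (original, cached) = strip_clone_for_cache(&metric);
    // The original still carries its finalizer through the pipeline;
    // only the cached clone is stripped.
    assert_eq!((original, cached), (1, 0));
    println!("original: {original}, cached: {cached}");
}
```

Doing this at the two normalization call sites, rather than inside `MetricSet::insert`, leaves the batching path's finalizers untouched.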

Fix 2 (panic): Setting expire_metrics_secs: 0 causes a panic at normalize.rs:154 because NormalizerConfig::validate() rejects time_to_live: Some(0). The fix filters out zero/negative values in normalized_with_ttl, treating them as "no TTL" instead of
propagating them to a panicking unwrap.
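The TTL guard amounts to a small filter; a minimal sketch (assuming a numeric seconds value, with a hypothetical `effective_ttl` helper standing in for the logic in normalized_with_ttl):

```rust
// Sketch of the Fix 2 guard: treat a zero or negative expire_metrics_secs
// as "no TTL" instead of handing Some(0) to a validator that panics on it.
// `effective_ttl` is a hypothetical name for illustration only.
fn effective_ttl(expire_metrics_secs: Option<f64>) -> Option<f64> {
    expire_metrics_secs.filter(|secs| *secs > 0.0)
}

fn main() {
    assert_eq!(effective_ttl(Some(60.0)), Some(60.0)); // normal TTL is kept
    assert_eq!(effective_ttl(Some(0.0)), None);        // zero -> no expiration
    assert_eq!(effective_ttl(Some(-5.0)), None);       // negative -> no expiration
    assert_eq!(effective_ttl(None), None);             // unset stays unset
    println!("ok");
}
```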

This fix benefits all metric sinks that use make_absolute/make_incremental (prometheus_remote_write, datadog, influxdb, sematext, gcp stackdriver, appsignal, aws cloudwatch, statsd, greptimedb).

Vector configuration

How did you test this PR?

  • make check-clippy — passes clean
  • cargo test --lib sinks::util::buffer::metrics — all 21 tests pass
  • Code review tracing the Arc lifecycle through the normalizer, batch, driver, and disk buffer acknowledgement path
  • New test normalizer_does_not_hold_finalizer_references: creates metrics with BatchNotifier finalizers, normalizes them, simulates the Driver acking delivery, and asserts the BatchStatusReceiver resolves to Delivered. Verified it fails without the fix (try_recv returns
    Err(Empty)) and passes with the fix.
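The shape of that test can be modeled with std channels standing in for BatchNotifier/BatchStatusReceiver (a sketch, not the real test: here the status resolves when the last Arc reference to the finalizer drops):

```rust
use std::sync::{mpsc, Arc};

// Toy model of normalizer_does_not_hold_finalizer_references: a Drop impl
// plays the role of finalization, sending "Delivered" on a std channel.
struct Finalizer { tx: mpsc::Sender<&'static str> }
impl Drop for Finalizer {
    fn drop(&mut self) { let _ = self.tx.send("Delivered"); }
}

/// Simulate the driver acking and dropping the event, then check whether
/// the batch status has resolved. `cache_keeps_clone` models the pre-fix
/// normalization cache holding an extra finalizer reference.
fn status_after_event_drop(cache_keeps_clone: bool) -> bool {
    let (tx, rx) = mpsc::channel();
    let finalizer = Arc::new(Finalizer { tx });
    let _cache_clone = cache_keeps_clone.then(|| Arc::clone(&finalizer));
    drop(finalizer); // the driver acks delivery and drops the event
    rx.try_recv() == Ok("Delivered")
}

fn main() {
    // Without the fix the cache clone survives, so try_recv sees Err(Empty).
    assert!(!status_after_event_drop(true));
    // With the fix there is no cached reference and the status resolves.
    assert!(status_after_event_drop(false));
    println!("ok");
}
```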

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details in the dd-rust-license-tool docs.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef956ee786


    // Arc<EventFinalizer> references, as that prevents the disk buffer
    // from acknowledging events — leading to a deadlock once the buffer
    // fills up and no new events can replace cache entries.
    metric.metadata_mut().take_finalizers();


P1: Restrict finalizer stripping to normalization-only state

MetricSet::insert is shared by both normalization and batching paths, but this change now drops metadata finalizers for every insert. In particular, MetricsBuffer::push calls MetricSet::insert_update (src/sinks/util/buffer/metrics/mod.rs:52-65), which reaches this insert path for absolute/new series entries, so those events lose their finalizers immediately instead of carrying them through batching. That can cause acknowledgements to be finalized before request results are known, and failed sends for those metrics cannot be reflected in finalizer status. Please avoid stripping finalizers in the generic insert helper and apply it only to the normalization cache path.


GreyLilac09 changed the title from "bugfix(prometheus_remote_write sink): prevent deadlock from finalizer leak in metric normalization cache" to "fix(prometheus_remote_write sink): prevent deadlock from finalizer leak in metric normalization cache" on Apr 16, 2026

Labels

domain: sinks Anything related to the Vector's sinks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prometheus_remote_write sink deadlocks after ~10-15 minutes with high cardinality metrics (regression from 0.48.0)

1 participant