fix(prometheus_remote_write sink): prevent deadlock from finalizer leak in metric normalization cache#25214
GreyLilac09 wants to merge 3 commits into vectordotdev:master from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef956ee786
```rust
// Arc<EventFinalizer> references, as that prevents the disk buffer
// from acknowledging events — leading to a deadlock once the buffer
// fills up and no new events can replace cache entries.
metric.metadata_mut().take_finalizers();
```
Restrict finalizer stripping to normalization-only state
MetricSet::insert is shared by both normalization and batching paths, but this change now drops metadata finalizers for every insert. In particular, MetricsBuffer::push calls MetricSet::insert_update (src/sinks/util/buffer/metrics/mod.rs:52-65), which reaches this insert path for absolute/new series entries, so those events lose their finalizers immediately instead of carrying them through batching. That can cause acknowledgements to be finalized before request results are known, and failed sends for those metrics cannot be reflected in finalizer status. Please avoid stripping finalizers in the generic insert helper and apply it only to the normalization cache path.
Summary
The `prometheus_remote_write` sink deadlocks after 10-15 minutes of operation when using a disk buffer. This is a regression introduced in 0.49.0, when the `MetricSet` normalization cache was rewritten to use `LruCache<MetricSeries, MetricEntry>` to support TTL-based expiration (`expire_metrics_secs`).

Root cause: the normalization methods `incremental_to_absolute()` and `absolute_to_incremental()` store metrics in the `MetricSet` cache with their `Arc<EventFinalizer>` references intact. The disk buffer can only acknowledge an event once ALL `Arc` references to its finalizer are dropped, but the cache holds one reference per unique series indefinitely (until replaced by a newer metric for the same series). Once the disk buffer fills and `drop_newest` starts dropping new events, no new metrics enter the normalizer, so no cache replacements occur, no finalizers are released, no buffer space is reclaimed, and the pipeline permanently stalls.
Fix 1 (deadlock): strip finalizers from metrics before storing them in the normalization cache. This is done at the call sites in `incremental_to_absolute` (where clones are cached) and `absolute_to_incremental` (where consumed metrics are cached as baselines), rather than in the generic `MetricSet::insert` helper, because `insert` is also used by the `MetricsBuffer` batching path (sematext, influxdb, aws_cloudwatch_metrics), where cached metrics are later retrieved with their finalizers and sent through the pipeline.

The stripping is safe because:
- `incremental_to_absolute`: the original metric retains its finalizers and flows through the pipeline normally; only the clone stored in the cache is stripped.
- `absolute_to_incremental`: the metric is consumed (the function returns `None`) to establish a baseline. It never enters the pipeline, so dropping a finalizer with `Dropped` status (the default) does not change the batch's `Delivered` outcome.

Fix 2 (panic): setting `expire_metrics_secs: 0` causes a panic at `normalize.rs:154` because `NormalizerConfig::validate()` rejects `time_to_live: Some(0)`. The fix filters out zero/negative values in `normalized_with_ttl`, treating them as "no TTL" instead of propagating them to a panicking `unwrap`.

This fix benefits all metric sinks that use `make_absolute`/`make_incremental` (prometheus_remote_write, datadog, influxdb, sematext, gcp stackdriver, appsignal, aws cloudwatch, statsd, greptimedb).

Vector configuration
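A configuration along the following lines would exercise the affected code path. All values, the endpoint, and the exact field placement are assumptions for illustration, not taken from the PR:

```yaml
# Illustrative Vector config (values and placement are assumptions).
sinks:
  prom:
    type: prometheus_remote_write
    inputs: ["metrics_in"]
    endpoint: "http://localhost:9090/api/v1/write"
    acknowledgements:
      enabled: true            # end-to-end acks, so finalizers matter
    buffer:
      type: disk
      max_size: 536870912      # 512 MiB
      when_full: drop_newest   # the mode under which the deadlock manifested
    expire_metrics_secs: 300   # TTL for the normalization cache
```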
How did you test this PR?
Err(Empty)) and passes with the fix.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
If no changelog entry is needed, please add the `no-changelog` label to this PR.

References
Notes
- Mention `@vectordotdev/vector` to reach out to us regarding this PR.
- If you don't use the `pre-push` hook, please see this template.
- `make fmt`
- `make check-clippy` (if there are failures it's possible some of them can be fixed with `make clippy-fix`)
- `make test`
- To update your branch: `git merge origin master` and `git push`.
- If this PR changes dependencies (`Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.