fix(prometheus_remote_write sink): prevent deadlock from finalizer leak in metric normalization cache#25214
GreyLilac09 wants to merge 3 commits into vectordotdev:master from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef956ee786
```rust
// Arc<EventFinalizer> references, as that prevents the disk buffer
// from acknowledging events — leading to a deadlock once the buffer
// fills up and no new events can replace cache entries.
metric.metadata_mut().take_finalizers();
```
Restrict finalizer stripping to normalization-only state
MetricSet::insert is shared by both normalization and batching paths, but this change now drops metadata finalizers for every insert. In particular, MetricsBuffer::push calls MetricSet::insert_update (src/sinks/util/buffer/metrics/mod.rs:52-65), which reaches this insert path for absolute/new series entries, so those events lose their finalizers immediately instead of carrying them through batching. That can cause acknowledgements to be finalized before request results are known, and failed sends for those metrics cannot be reflected in finalizer status. Please avoid stripping finalizers in the generic insert helper and apply it only to the normalization cache path.
Summary
The `prometheus_remote_write` sink deadlocks after 10-15 minutes of operation when using a disk buffer. This is a regression introduced in 0.49.0, when the `MetricSet` normalization cache was rewritten to use `LruCache<MetricSeries, MetricEntry>` to support TTL-based expiration (`expire_metrics_secs`).

Root cause: the normalization methods `incremental_to_absolute()` and `absolute_to_incremental()` store metrics in the `MetricSet` cache with their `Arc<EventFinalizer>` references intact. The disk buffer can only acknowledge an event once ALL `Arc` references to its finalizer are dropped, but the cache holds one reference per unique series indefinitely (until replaced by a newer metric for the same series). Once the disk buffer fills and `drop_newest` starts dropping new events, no new metrics enter the normalizer, so no cache replacements occur, no finalizers are released, no buffer space is reclaimed, and the pipeline permanently stalls.
Fix 1 (deadlock): strip finalizers from metrics before storing them in the normalization cache. This is done at the call sites in `incremental_to_absolute` (where clones are cached) and `absolute_to_incremental` (where consumed metrics are cached as baselines), rather than in the generic `MetricSet::insert` helper, because `insert` is also used by the `MetricsBuffer` batching path (sematext, influxdb, aws_cloudwatch_metrics), where cached metrics are later retrieved with their finalizers and sent through the pipeline.

The stripping is safe because:
- `incremental_to_absolute`: the original metric retains its finalizers and flows through the pipeline normally; only the clone stored in the cache is stripped.
- `absolute_to_incremental`: the metric is consumed (the function returns `None`) to establish a baseline. It never enters the pipeline, so dropping a finalizer with `Dropped` status (the default) does not change the batch's `Delivered` outcome.

Fix 2 (panic): setting `expire_metrics_secs: 0` causes a panic at `normalize.rs:154` because `NormalizerConfig::validate()` rejects `time_to_live: Some(0)`. The fix filters out zero/negative values in `normalized_with_ttl`, treating them as "no TTL" instead of propagating them to a panicking `unwrap`.

This fix benefits all metric sinks that use `make_absolute`/`make_incremental` (prometheus_remote_write, datadog, influxdb, sematext, gcp stackdriver, appsignal, aws cloudwatch, statsd, greptimedb).

Vector configuration
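A configuration along the following lines would exercise the affected code path. All values, the endpoint, and the exact field placement are assumptions for illustration, not taken from the PR:

```yaml
# Illustrative Vector config (values and placement are assumptions).
sinks:
  prom:
    type: prometheus_remote_write
    inputs: ["metrics_in"]
    endpoint: "http://localhost:9090/api/v1/write"
    acknowledgements:
      enabled: true            # end-to-end acks, so finalizers matter
    buffer:
      type: disk
      max_size: 536870912      # 512 MiB
      when_full: drop_newest   # the mode under which the deadlock manifested
    expire_metrics_secs: 300   # TTL for the normalization cache
```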
How did you test this PR?
Err(Empty)) and passes with the fix.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
If no changelog entry is needed, please add the `no-changelog` label to this PR.

References
Notes
- Mention `@vectordotdev/vector` to reach out to us regarding this PR.
- If you don't use the `pre-push` hook, please see this template.
- `make fmt`
- `make check-clippy` (if there are failures it's possible some of them can be fixed with `make clippy-fix`)
- `make test`
- To update your branch: `git merge origin master` and `git push`.
- If this PR changes dependencies (`Cargo.lock`), please run `make build-licenses` to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.