monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 25 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 on March 2, 2026 16:32
@codecov

codecov Bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 90.27778% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.42%. Comparing base (145f7a6) to head (7f45175).
⚠️ Report is 33 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 70.00% 3 Missing ⚠️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
extensions/lifecycle/LifecycleMetrics.js 98.70% <100.00%> (+0.48%) ⬆️
...tensions/lifecycle/conductor/LifecycleConductor.js 84.16% <100.00%> (+0.45%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 79.28% <70.00%> (-0.59%) ⬇️

... and 5 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 80.57% <ø> (-0.13%) ⬇️
Ingestion 70.53% <ø> (-0.62%) ⬇️
Lifecycle 78.78% <90.27%> (+0.16%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
- Coverage            74.48%   74.42%   -0.06%     
===================================================
  Files                  200      200              
  Lines                13603    13659      +56     
===================================================
+ Hits                 10132    10166      +34     
- Misses                3461     3483      +22     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.10% <0.00%> (-0.04%) ⬇️
api:routes 8.92% <0.00%> (-0.04%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.06% <8.33%> (-0.97%) ⬇️
ingestion 12.44% <0.00%> (-0.11%) ⬇️
lib 7.58% <0.00%> (-0.04%) ⬇️
lifecycle 18.95% <69.44%> (+0.10%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.46% <8.33%> (-0.04%) ⬇️
unit 51.16% <84.72%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c, on March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 on March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 on March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title from "Improvement/bb 740 monitor lifecycle conductor" to "Improvement/BB-740 monitor lifecycle conductor" Mar 18, 2026
@francoisferrand francoisferrand changed the title from "Improvement/BB-740 monitor lifecycle conductor" to "monitor lifecycle conductor" Mar 18, 2026
const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
this._scanId = uuid();
Contributor

Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea on March 19, 2026 09:36
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 on March 19, 2026 09:43
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@claude

claude Bot commented Apr 15, 2026

  • Unbounded Prometheus label cardinality: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a per-scan UUID) as a label. Old scan time series are never cleaned up, so cardinality grows without bound over time.
    - Remove the conductor_scan_id label and overwrite a single gauge per scan, or call .remove() on the previous scan's labels when a new scan starts.
  • scanStartTimestamp is never set by the conductor: _taskToMessage puts conductorScanId but not scanStartTimestamp into contextInfo, so the bucket processor always falls back to Date.now(). The Grafana panel will not show true end-to-end scan elapsed time.
    - Add scanStartTimestamp to contextInfo in _taskToMessage, or document that this is bucket-processor-local timing.

    Review by Claude Code

Comment thread extensions/lifecycle/LifecycleMetrics.js
@claude

claude Bot commented Apr 16, 2026

  • Unbounded Prometheus label cardinality on bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount: using conductor_scan_id (UUID) as a label creates new time series every scan cycle with no cleanup, causing unbounded cardinality growth.
    - Remove old label values when a new scan starts, or drop the per-scan-id label in favor of a single gauge that resets each scan (keep the scan ID in logs only).

    Review by Claude Code

@benzekrimaha
Contributor Author

Current intent is:

  • keep per-scan metric series (conductor_scan_id) for observability,
  • keep scan ID in logs as well,
  • accept this temporary cardinality risk for now.
    => If we need a stricter bound, we can address it in follow-up with an agreed policy (bounded retention or a separate aggregate metric without scan label).

@bert-e
Contributor

bert-e commented Apr 22, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.3

  • 9.4.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.4

  • 9.4.0

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

Comment thread extensions/lifecycle/tasks/LifecycleTask.js Outdated
Comment on lines +72 to +73
const bucketProcessorScanBucketDoneCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_bucket_processor_scan_bucket_done_count',
Contributor

  • this is not a gauge: it only ever increases, and counts the total number of buckets scanned for a specific scan → this is really a counter
  • this metric looks redundant with conductorBucketListings and especially s3_lifecycle_conductor_bucket_list_success_total

Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated
const conductorLatestBatchSize = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_size',
help: 'Size of the latest lifecycle batch for each type',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_TYPE],
Contributor

@francoisferrand francoisferrand Apr 22, 2026

  • LIFECYCLE_LABEL_TYPE is the kind of lifecycle operation, so it should not be reused for another kind of value; that just creates confusion.
  • Do we actually need both the number of buckets and the number of workflows? We really process buckets, not workflows, so I don't see what the extra metric provides.
  • For consistency, this metric should be moved along with the other conductorLatestBatch* metrics.

}

/**
* Called when the first message of a new conductorScanId arrives.
Contributor

cf. comment above (in bucket processor): we need to design something which reports relevant information when everything works properly (one scan after the other) but also when it does not.

→ the idea to track the number of buckets processed per scanID is good in that sense, as it will always work: normally with a single series, but with more if we have some bug.
→ adding logic to detect "this is the start" and "this is the end", however, is not stable: it will work only if everything is working, and eventually won't give us meaningful debugging info (and we debug precisely when we have an issue).

Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated
benzekrimaha and others added 6 commits April 22, 2026 15:04
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
In LifecycleMetrics:
- Replace conductorLatestBatchSize gauge (s3_lifecycle_latest_batch_size,
  labeled by type) with conductorLatestBatchBucketCount gauge
  (s3_lifecycle_latest_batch_bucket_count, origin label only), removing
  the redundant workflow count series.
- Change bucketProcessorScanBucketDoneCount from Gauge to Counter and
  rename to s3_lifecycle_bucket_processor_scan_bucket_done_total.
- Remove onBucketProcessorScanStart and the scan start time gauge
  (s3_lifecycle_bucket_processor_scan_start_time) which relied on
  fragile first-message detection.
- Rename onBucketProcessorBucketDone to onBucketProcessorBucketProcessed.
- Simplify onConductorScanComplete to accept only bucketCount.

In LifecycleConductor:
- Update onConductorScanComplete call to pass only totalBucketsListed.

In LifecycleBucketProcessor:
- Remove scanStartTimestamp and onBucketProcessorScanStart call.
- Update to use renamed onBucketProcessorBucketProcessed method.
name: 's3_lifecycle_bucket_processor_scan_bucket_done_total',
help: 'Total number of buckets processed by this bucket processor, per scan',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
});

The conductor_scan_id label on this counter creates a new Prometheus time series for every scan (a fresh UUID each time). Since counters are never pruned within a process lifetime, label cardinality grows without bound — e.g. ~288 new series/day at 5-min scan intervals, per bucket processor pod. This is a classic cardinality bomb that will eventually pressure Prometheus memory and storage.

Consider tracking scan progress with a gauge (resettable per scan) instead of a per-scan-ID counter, or dropping the conductor_scan_id label from the metric and relying on logs for per-scan correlation.

— Claude Code

[LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
} catch (err) {

Error handler references wrong method name. Should match the enclosing method.

```suggestion
'LifecycleMetrics.onBucketProcessorBucketProcessed',
```

@claude

claude Bot commented Apr 23, 2026

  • Unbounded Prometheus label cardinality: bucketProcessorScanBucketDoneCount uses conductor_scan_id (a per-scan UUID) as a counter label, creating a new time series on every scan that is never pruned within the process lifetime. This will cause unbounded cardinality growth in Prometheus.
    • Use a gauge (resettable per scan) instead of a per-scan-ID counter, or drop the conductor_scan_id label from the metric and rely on logs for per-scan correlation.
  • Method name mismatch in error handler: onBucketProcessorBucketProcessed references 'LifecycleMetrics.onBucketProcessorBucketDone' in its error handler instead of the actual method name.
    • Fix to 'LifecycleMetrics.onBucketProcessorBucketProcessed'.

Review by Claude Code

Rename s3_lifecycle_bucket_processor_scan_bucket_done_total to
s3_lifecycle_bucket_processor_messages_processed_total and remove the
conductor_scan_id label.

The previous name was misleading: the counter is incremented once per
bucket-tasks topic message processed, not once per unique bucket. In v1
a bucket can produce multiple truncated-listing continuations, and in v2
a bucket can be split across current/noncurrent/orphan listing chains;
each slice is a separate Kafka message and each was double-counted under
the old name.

The conductor_scan_id label is removed because there is no reliable way
to bound its cardinality in practice. The bucket processor never sees
the "last message" of a scan: continuation messages can be re-routed
across partitions to other instances, so each bucket-processor instance
only ever observes a partial subset of the scan and cannot decide when
to stop emitting its label series. prom-client has no native series TTL,
so labelling here would let the in-process series count grow unbounded
across pods over time.

Method renamed: onBucketProcessorBucketProcessed → onBucketProcessorMessageProcessed.
Dashboard panel and Grafana query updated to aggregate by pod only.

Made-with: Cursor
Add s3_lifecycle_bucket_processor_last_conductor_scan_id_info, an
info-style gauge (value always 1) carrying the most recent
conductor_scan_id observed by the bucket processor as a label.

Cardinality is bounded to a single active series per pod by calling
reset() before every set(); Prometheus then expires the previous series
naturally. This deliberately avoids the unbounded-cardinality pitfall
that a per-scan label on a counter would have, while preserving the
ability to detect more than one conductor scan in flight simultaneously:
aggregating count(s3_lifecycle_bucket_processor_last_conductor_scan_id_info)
by (conductor_scan_id) across pods yields the number of distinct active
scans, and a value > 1 is the alerting signal.

The bucket processor calls onBucketProcessorScanIdChanged whenever the
conductor_scan_id observed on an incoming message differs from the
previous one. A Grafana panel "Bucket Processor Active Conductor Scans"
visualises the aggregated count.

Made-with: Cursor
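The reset-before-set pattern this commit describes can be sketched with a toy gauge. This is a minimal illustration, not the real code path: the actual implementation goes through ZenkoMetrics, and only onBucketProcessorScanIdChanged is a name taken from the commit message.

```javascript
// Toy gauge keyed by label set, mimicking how a Prometheus client
// stores one series per distinct label combination.
class ToyGauge {
    constructor() { this.series = new Map(); }
    set(labels, value) { this.series.set(JSON.stringify(labels), value); }
    reset() { this.series.clear(); }
    seriesCount() { return this.series.size; }
}

const lastScanIdInfo = new ToyGauge();

// Called whenever an incoming message carries a new conductor_scan_id.
// Resetting first drops the previous scan's series, so the process only
// ever exposes one active series no matter how many scans it observes.
function onBucketProcessorScanIdChanged(conductorScanId) {
    lastScanIdInfo.reset();
    lastScanIdInfo.set({
        origin: 'bucket_processor',
        conductor_scan_id: conductorScanId,
    }, 1);
}

onBucketProcessorScanIdChanged('scan-aaaa');
onBucketProcessorScanIdChanged('scan-bbbb');
console.log(lastScanIdInfo.seriesCount()); // 1 — only the latest scan ID survives
```

Across pods, counting these info series grouped by conductor_scan_id then gives the number of distinct scan IDs currently in flight.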
Add s3_lifecycle_conductor_last_batch_duration_seconds, a gauge set by
the conductor at scan completion to the elapsed time of the just-finished
scan in seconds.

Computing the duration as (end_time - start_time) in PromQL is fragile:
when a new scan starts, latest_batch_start_time is refreshed before
latest_batch_end_time, so any dashboard subtracting them returns a
negative value for the duration of the in-progress scan. Letting the
conductor publish the finished-scan duration directly removes that whole
class of mid-scan artifacts and matches the standard Prometheus pattern
of "<thing>_last_<measurement>_seconds" for last-completed-event metrics.

The conductor passes elapsedMs / 1000 to onConductorScanComplete; the
parameter is optional so the metric is only set when a finite, non-
negative value is supplied. The "Full Scan Elapsed Time" Grafana stat
panel is renamed to "Last Conductor Batch Duration" and now reads from
the new gauge directly.

Made-with: Cursor
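The optional-parameter guard described above can be sketched as follows. The function name matches the commit; the returned object is a hypothetical stand-in for the real gauge update.

```javascript
// Hypothetical sketch: the duration parameter is optional, and the
// duration "gauge" is only set when a finite, non-negative value is
// supplied, so in-progress or bogus values never reach the metric.
function onConductorScanComplete(bucketCount, durationSeconds) {
    const published = { bucketCount, durationSeconds: null };
    if (Number.isFinite(durationSeconds) && durationSeconds >= 0) {
        published.durationSeconds = durationSeconds; // gauge.set(...) in real code
    }
    return published;
}

console.log(onConductorScanComplete(42, 123.456).durationSeconds); // 123.456
console.log(onConductorScanComplete(42).durationSeconds);          // null — metric left unset
console.log(onConductorScanComplete(42, -1).durationSeconds);      // null — negative rejected
```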
Clarify in the help text and JSDoc that
s3_lifecycle_latest_batch_start_time is the conductor's scheduling
heartbeat (consumed by the LifecycleLateScan alert) and must not be
used to derive scan duration: while a scan is in progress, end_time is
from the previous run and start_time has just been refreshed, so
end_time - start_time is negative.

s3_lifecycle_conductor_last_batch_duration_seconds (added in the
preceding commit) is now the only sanctioned way to chart scan
duration. The latest_batch_start_time and latest_batch_end_time gauges
are kept because they are needed by the LifecycleLateScan alert
("conductor stopped scheduling") and as a "last completion" timestamp
respectively, but their help text now warns against the previous
end - start pattern.

No code behaviour change.

Made-with: Cursor
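The mid-scan artifact this commit warns about can be shown with a toy timeline (illustrative epoch seconds, not real metric values):

```javascript
// Previous scan: started at t=1000, finished at t=1060 (60 s long).
let latestBatchStartTime = 1000;
let latestBatchEndTime = 1060;

// A new scan begins: start_time is refreshed immediately, while
// end_time still belongs to the previous run...
latestBatchStartTime = 2000;

// ...so the naive dashboard subtraction goes negative mid-scan:
console.log(latestBatchEndTime - latestBatchStartTime); // -940

// Publishing the finished-scan duration directly avoids the artifact:
const lastBatchDurationSeconds = 60; // set by the conductor at completion
console.log(lastBatchDurationSeconds); // 60
```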
@claude

claude Bot commented Apr 24, 2026

LGTM — clean, well-structured monitoring addition. Scan ID propagation is thorough, metric cardinality is properly bounded (reset-before-set on the info gauge), error handling follows existing patterns, and tests cover all new methods including edge cases. The heartbeat move to scan start (from completion) is the correct semantic for detecting a stalled conductor.

Review by Claude Code

const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
&& result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
Contributor

@benzekrimaha ?

resetting the current scan ID when it changes WILL break if we have a bug, and make the metrics mostly useless...

Comment on lines +455 to +463
this.listBuckets(messageSendQueue, scanId, log, (err, listedBucketsCount) => {
LifecycleMetrics.onBucketListing(log, err);
- return next(err, nBucketsListed);
+ nBucketsListed = listedBucketsCount;
+ return next(err, listedBucketsCount);
});
},
- (nBucketsListed, next) => {
+ (listedBucketsCount, next) => {
async.until(
- () => nBucketsQueued === nBucketsListed,
+ () => nBucketsQueued === listedBucketsCount,
Contributor

});

const bucketProcessorMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_messages_processed_total',
Contributor

I think we already have this metric, in BackbeatConsumer?

Especially since there is no "scanId" parameter, it does not seem to add much information.

labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Info-style gauge: a single active series per bucket-processor instance,
Contributor

if we keep an unbounded series, why keep this one (which always has the value 1) instead of the previous one (the number of messages for a specific scanID)?

Comment on lines +333 to +337
bucketProcessorLastConductorScanId.reset();
bucketProcessorLastConductorScanId.set({
[LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
}, 1);
Contributor

Storing a metric does not mean it is scraped by Prometheus.
With this approach, if we have 2 scans (ScanIDs) in parallel, depending on the workload pattern there is a very high probability that only a single ScanID will appear in the metrics.

4 participants