monitor lifecycle conductor#2723

Open
benzekrimaha wants to merge 25 commits into development/9.3 from improvement/BB-740-monitor-lifecycle-conductor

Conversation

@benzekrimaha
Contributor

@benzekrimaha benzekrimaha commented Mar 2, 2026

Issue: BB-740

@bert-e
Contributor

bert-e commented Mar 2, 2026

Hello benzekrimaha,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Mar 2, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.1

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 380069a to 25ea9d5 on March 2, 2026 16:32
@codecov

codecov Bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 90.27778% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.42%. Comparing base (145f7a6) to head (7f45175).
⚠️ Report is 33 commits behind head on development/9.3.

Files with missing lines Patch % Lines
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 70.00% 3 Missing ⚠️
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 66.66% 1 Missing ⚠️
extensions/lifecycle/tasks/LifecycleTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 66.66% 1 Missing ⚠️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
extensions/lifecycle/LifecycleMetrics.js 98.70% <100.00%> (+0.48%) ⬆️
...tensions/lifecycle/conductor/LifecycleConductor.js 84.16% <100.00%> (+0.45%) ⬆️
extensions/lifecycle/tasks/LifecycleTaskV2.js 88.88% <ø> (ø)
...sions/lifecycle/tasks/LifecycleDeleteObjectTask.js 92.25% <66.66%> (-0.51%) ⬇️
extensions/lifecycle/tasks/LifecycleTask.js 91.41% <66.66%> (-0.14%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateExpirationTask.js 80.51% <66.66%> (-0.57%) ⬇️
...s/lifecycle/tasks/LifecycleUpdateTransitionTask.js 91.17% <66.66%> (-0.75%) ⬇️
...ecycle/bucketProcessor/LifecycleBucketProcessor.js 79.28% <70.00%> (-0.59%) ⬇️

... and 5 files with indirect coverage changes

Components Coverage Δ
Bucket Notification 80.37% <ø> (ø)
Core Library 80.57% <ø> (-0.13%) ⬇️
Ingestion 70.53% <ø> (-0.62%) ⬇️
Lifecycle 78.78% <90.27%> (+0.16%) ⬆️
Oplog Populator 85.83% <ø> (ø)
Replication 59.61% <ø> (-0.04%) ⬇️
Bucket Scanner 85.76% <ø> (ø)
@@                 Coverage Diff                 @@
##           development/9.3    #2723      +/-   ##
===================================================
- Coverage            74.48%   74.42%   -0.06%     
===================================================
  Files                  200      200              
  Lines                13603    13659      +56     
===================================================
+ Hits                 10132    10166      +34     
- Misses                3461     3483      +22     
  Partials                10       10              
Flag Coverage Δ
api:retry 9.10% <0.00%> (-0.04%) ⬇️
api:routes 8.92% <0.00%> (-0.04%) ⬇️
bucket-scanner 85.76% <ø> (ø)
ft_test:queuepopulator 9.06% <8.33%> (-0.97%) ⬇️
ingestion 12.44% <0.00%> (-0.11%) ⬇️
lib 7.58% <0.00%> (-0.04%) ⬇️
lifecycle 18.95% <69.44%> (+0.10%) ⬆️
notification 1.02% <0.00%> (-0.01%) ⬇️
oplogPopulator 0.14% <0.00%> (-0.01%) ⬇️
replication 18.46% <8.33%> (-0.04%) ⬇️
unit 51.16% <84.72%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown.

@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch 5 times, most recently from 8316f88 to 408c96c, on March 11, 2026 16:03
@benzekrimaha benzekrimaha marked this pull request as ready for review March 11, 2026 16:35
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 408c96c to e1c5b13 on March 11, 2026 16:48
@francoisferrand francoisferrand requested a review from delthas March 18, 2026 09:19
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from e1c5b13 to aefb677 on March 18, 2026 10:04
@benzekrimaha benzekrimaha changed the title from "Improvement/bb 740 monitor lifecycle conductor" to "Improvement/BB-740 monitor lifecycle conductor" Mar 18, 2026
@francoisferrand francoisferrand changed the title from "Improvement/BB-740 monitor lifecycle conductor" to "monitor lifecycle conductor" Mar 18, 2026
const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
this._scanId = uuid();
Contributor

Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?

Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from 725c3df to 11a94ea on March 19, 2026 09:36
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@benzekrimaha benzekrimaha force-pushed the improvement/BB-740-monitor-lifecycle-conductor branch from a2128cf to a464b39 on March 19, 2026 09:43
Comment thread extensions/lifecycle/conductor/LifecycleConductor.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread tests/unit/lifecycle/LifecycleConductor.spec.js Outdated
Comment thread extensions/lifecycle/bucketProcessor/LifecycleBucketProcessor.js Outdated
@claude

claude Bot commented Apr 15, 2026

  • Unbounded Prometheus label cardinality: bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount use conductor_scan_id (a per-scan UUID) as a label. Old scan time series are never cleaned up, so cardinality grows without bound over time.
    - Remove the conductor_scan_id label and overwrite a single gauge per scan, or call .remove() on the previous scan's labels when a new scan starts.
  • scanStartTimestamp is never set by the conductor: _taskToMessage puts conductorScanId but not scanStartTimestamp into contextInfo, so the bucket processor always falls back to Date.now(). The Grafana panel will not show true end-to-end scan elapsed time.
    - Add scanStartTimestamp to contextInfo in _taskToMessage, or document that this is bucket-processor-local timing.

    Review by Claude Code

Comment thread extensions/lifecycle/LifecycleMetrics.js
@claude

claude Bot commented Apr 16, 2026

  • Unbounded Prometheus label cardinality on bucketProcessorScanStartTime and bucketProcessorScanBucketDoneCount: using conductor_scan_id (UUID) as a label creates new time series every scan cycle with no cleanup, causing unbounded cardinality growth.
    - Remove old label values when a new scan starts, or drop the per-scan-id label in favor of a single gauge that resets each scan (keep the scan ID in logs only).

    Review by Claude Code

@benzekrimaha
Contributor Author

Current intent is:

  • keep per-scan metric series (conductor_scan_id) for observability,
  • keep scan ID in logs as well,
  • accept this temporary cardinality risk for now.
    => If we need a stricter bound, we can address it in follow-up with an agreed policy (bounded retention or a separate aggregate metric without scan label).

@bert-e
Contributor

bert-e commented Apr 22, 2026

Incorrect fix version

The Fix Version/s in issue BB-740 contains:

  • 9.3.3

  • 9.4.0

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 9.3.4

  • 9.4.0

Please check the Fix Version/s of BB-740, or the target
branch of this pull request.

Comment thread extensions/lifecycle/tasks/LifecycleTask.js Outdated
Comment on lines +72 to +73
const bucketProcessorScanBucketDoneCount = ZenkoMetrics.createGauge({
name: 's3_lifecycle_bucket_processor_scan_bucket_done_count',
Contributor

  • this is not a gauge: it only ever increases, and counts the total number of buckets scanned for a specific scan → this is really a counter
  • this metric looks redundant with conductorBucketListings and especially s3_lifecycle_conductor_bucket_list_success_total

Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated
const conductorLatestBatchSize = ZenkoMetrics.createGauge({
name: 's3_lifecycle_latest_batch_size',
help: 'Size of the latest lifecycle batch for each type',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_TYPE],
Contributor

@francoisferrand francoisferrand Apr 22, 2026

  • LIFECYCLE_LABEL_TYPE is the kind of lifecycle operation, so it should not be reused for another kind of value; that just creates confusion.
  • Do we actually need both the number of buckets and the number of workflows? We really process buckets, not workflows, so I don't see what the extra metric provides.
  • For consistency, this metric should be moved along with the other conductorLatestBatch* metrics.

}

/**
* Called when the first message of a new conductorScanId arrives.
Contributor

cf. comment above (in bucket processor): we need to design something which reports relevant information when everything works properly (one scan after the other) but also when it does not.

→ the idea to track the number of buckets processed per scanID is good in that sense, as it will always work: normally with a single series, but with more if we have some bug.
→ adding logic to detect "this is the start" and "this is the end", however, is not stable: it will work only if everything is working, and eventually won't give us meaningful debugging info (and we debug precisely when we have an issue).

Comment thread extensions/lifecycle/LifecycleMetrics.js Outdated
benzekrimaha and others added 6 commits April 22, 2026 15:04
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
In LifecycleMetrics:
- Replace conductorLatestBatchSize gauge (s3_lifecycle_latest_batch_size,
  labeled by type) with conductorLatestBatchBucketCount gauge
  (s3_lifecycle_latest_batch_bucket_count, origin label only), removing
  the redundant workflow count series.
- Change bucketProcessorScanBucketDoneCount from Gauge to Counter and
  rename to s3_lifecycle_bucket_processor_scan_bucket_done_total.
- Remove onBucketProcessorScanStart and the scan start time gauge
  (s3_lifecycle_bucket_processor_scan_start_time) which relied on
  fragile first-message detection.
- Rename onBucketProcessorBucketDone to onBucketProcessorBucketProcessed.
- Simplify onConductorScanComplete to accept only bucketCount.

In LifecycleConductor:
- Update onConductorScanComplete call to pass only totalBucketsListed.

In LifecycleBucketProcessor:
- Remove scanStartTimestamp and onBucketProcessorScanStart call.
- Update to use renamed onBucketProcessorBucketProcessed method.
name: 's3_lifecycle_bucket_processor_scan_bucket_done_total',
help: 'Total number of buckets processed by this bucket processor, per scan',
labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
});

The conductor_scan_id label on this counter creates a new Prometheus time series for every scan (a fresh UUID each time). Since counters are never pruned within a process lifetime, label cardinality grows without bound — e.g. ~288 new series/day at 5-min scan intervals, per bucket processor pod. This is a classic cardinality bomb that will eventually pressure Prometheus memory and storage.

Consider tracking scan progress with a gauge (resettable per scan) instead of a per-scan-ID counter, or dropping the conductor_scan_id label from the metric and relying on logs for per-scan correlation.

— Claude Code

[LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
} catch (err) {

Error handler references wrong method name. Should match the enclosing method.

```suggestion
'LifecycleMetrics.onBucketProcessorBucketProcessed',
```

@claude

claude Bot commented Apr 23, 2026

  • Unbounded Prometheus label cardinality: bucketProcessorScanBucketDoneCount uses conductor_scan_id (a per-scan UUID) as a counter label, creating a new time series on every scan that is never pruned within the process lifetime. This will cause unbounded cardinality growth in Prometheus.
    • Use a gauge (resettable per scan) instead of a per-scan-ID counter, or drop the conductor_scan_id label from the metric and rely on logs for per-scan correlation.
  • Method name mismatch in error handler: onBucketProcessorBucketProcessed references 'LifecycleMetrics.onBucketProcessorBucketDone' in its error handler instead of the actual method name.
    • Fix to 'LifecycleMetrics.onBucketProcessorBucketProcessed'.

Review by Claude Code

Rename s3_lifecycle_bucket_processor_scan_bucket_done_total to
s3_lifecycle_bucket_processor_messages_processed_total and remove the
conductor_scan_id label.

The previous name was misleading: the counter is incremented once per
bucket-tasks topic message processed, not once per unique bucket. In v1
a bucket can produce multiple truncated-listing continuations, and in v2
a bucket can be split across current/noncurrent/orphan listing chains;
each slice is a separate Kafka message and each was double-counted under
the old name.

The conductor_scan_id label is removed because there is no reliable way
to bound its cardinality in practice. The bucket processor never sees
the "last message" of a scan: continuation messages can be re-routed
across partitions to other instances, so each bucket-processor instance
only ever observes a partial subset of the scan and cannot decide when
to stop emitting its label series. prom-client has no native series TTL,
so labelling here would let the in-process series count grow unbounded
across pods over time.

Method renamed: onBucketProcessorBucketProcessed → onBucketProcessorMessageProcessed.
Dashboard panel and Grafana query updated to aggregate by pod only.

Made-with: Cursor
Add s3_lifecycle_bucket_processor_last_conductor_scan_id_info, an
info-style gauge (value always 1) carrying the most recent
conductor_scan_id observed by the bucket processor as a label.

Cardinality is bounded to a single active series per pod by calling
reset() before every set(); Prometheus then expires the previous series
naturally. This deliberately avoids the unbounded-cardinality pitfall
that a per-scan label on a counter would have, while preserving the
ability to detect more than one conductor scan in flight simultaneously:
aggregating count(s3_lifecycle_bucket_processor_last_conductor_scan_id_info)
by (conductor_scan_id) across pods yields the number of distinct active
scans, and a value > 1 is the alerting signal.

The bucket processor calls onBucketProcessorScanIdChanged whenever the
conductor_scan_id observed on an incoming message differs from the
previous one. A Grafana panel "Bucket Processor Active Conductor Scans"
visualises the aggregated count.

Made-with: Cursor
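The reset-before-set pattern this commit describes can be sketched with a toy gauge. This is a minimal illustration, not the real code path: the actual implementation goes through ZenkoMetrics, and only onBucketProcessorScanIdChanged is a name taken from the commit message.

```javascript
// Toy gauge keyed by label set, mimicking how a Prometheus client
// stores one series per distinct label combination.
class ToyGauge {
    constructor() { this.series = new Map(); }
    set(labels, value) { this.series.set(JSON.stringify(labels), value); }
    reset() { this.series.clear(); }
    seriesCount() { return this.series.size; }
}

const lastScanIdInfo = new ToyGauge();

// Called whenever an incoming message carries a new conductor_scan_id.
// Resetting first drops the previous scan's series, so the process only
// ever exposes one active series no matter how many scans it observes.
function onBucketProcessorScanIdChanged(conductorScanId) {
    lastScanIdInfo.reset();
    lastScanIdInfo.set({
        origin: 'bucket_processor',
        conductor_scan_id: conductorScanId,
    }, 1);
}

onBucketProcessorScanIdChanged('scan-aaaa');
onBucketProcessorScanIdChanged('scan-bbbb');
console.log(lastScanIdInfo.seriesCount()); // 1 — only the latest scan ID survives
```

Across pods, counting these info series grouped by conductor_scan_id then gives the number of distinct scan IDs currently in flight.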
Add s3_lifecycle_conductor_last_batch_duration_seconds, a gauge set by
the conductor at scan completion to the elapsed time of the just-finished
scan in seconds.

Computing the duration as (end_time - start_time) in PromQL is fragile:
when a new scan starts, latest_batch_start_time is refreshed before
latest_batch_end_time, so any dashboard subtracting them returns a
negative value for the duration of the in-progress scan. Letting the
conductor publish the finished-scan duration directly removes that whole
class of mid-scan artifacts and matches the standard Prometheus pattern
of "<thing>_last_<measurement>_seconds" for last-completed-event metrics.

The conductor passes elapsedMs / 1000 to onConductorScanComplete; the
parameter is optional so the metric is only set when a finite, non-
negative value is supplied. The "Full Scan Elapsed Time" Grafana stat
panel is renamed to "Last Conductor Batch Duration" and now reads from
the new gauge directly.

Made-with: Cursor
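The optional-parameter guard described above can be sketched as follows. The function name matches the commit; the returned object is a hypothetical stand-in for the real gauge update.

```javascript
// Hypothetical sketch: the duration parameter is optional, and the
// duration "gauge" is only set when a finite, non-negative value is
// supplied, so in-progress or bogus values never reach the metric.
function onConductorScanComplete(bucketCount, durationSeconds) {
    const published = { bucketCount, durationSeconds: null };
    if (Number.isFinite(durationSeconds) && durationSeconds >= 0) {
        published.durationSeconds = durationSeconds; // gauge.set(...) in real code
    }
    return published;
}

console.log(onConductorScanComplete(42, 123.456).durationSeconds); // 123.456
console.log(onConductorScanComplete(42).durationSeconds);          // null — metric left unset
console.log(onConductorScanComplete(42, -1).durationSeconds);      // null — negative rejected
```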
Clarify in the help text and JSDoc that
s3_lifecycle_latest_batch_start_time is the conductor's scheduling
heartbeat (consumed by the LifecycleLateScan alert) and must not be
used to derive scan duration: while a scan is in progress, end_time is
from the previous run and start_time has just been refreshed, so
end_time - start_time is negative.

s3_lifecycle_conductor_last_batch_duration_seconds (added in the
preceding commit) is now the only sanctioned way to chart scan
duration. The latest_batch_start_time and latest_batch_end_time gauges
are kept because they are needed by the LifecycleLateScan alert
("conductor stopped scheduling") and as a "last completion" timestamp
respectively, but their help text now warns against the previous
end - start pattern.

No code behaviour change.

Made-with: Cursor
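The mid-scan artifact this commit warns about can be shown with a toy timeline (illustrative epoch seconds, not real metric values):

```javascript
// Previous scan: started at t=1000, finished at t=1060 (60 s long).
let latestBatchStartTime = 1000;
let latestBatchEndTime = 1060;

// A new scan begins: start_time is refreshed immediately, while
// end_time still belongs to the previous run...
latestBatchStartTime = 2000;

// ...so the naive dashboard subtraction goes negative mid-scan:
console.log(latestBatchEndTime - latestBatchStartTime); // -940

// Publishing the finished-scan duration directly avoids the artifact:
const lastBatchDurationSeconds = 60; // set by the conductor at completion
console.log(lastBatchDurationSeconds); // 60
```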
@claude

claude Bot commented Apr 24, 2026

LGTM — clean, well-structured monitoring addition. Scan ID propagation is thorough, metric cardinality is properly bounded (reset-before-set on the info gauge), error handling follows existing patterns, and tests cover all new methods including edge cases. The heartbeat move to scan start (from completion) is the correct semantic for detecting a stalled conductor.

Review by Claude Code

const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
&& result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
Contributor

@benzekrimaha ?

resetting the current scan ID when it changes WILL break if we have a bug, and make the metrics mostly useless...

Comment on lines +455 to +463
this.listBuckets(messageSendQueue, scanId, log, (err, listedBucketsCount) => {
LifecycleMetrics.onBucketListing(log, err);
- return next(err, nBucketsListed);
+ nBucketsListed = listedBucketsCount;
+ return next(err, listedBucketsCount);
});
},
- (nBucketsListed, next) => {
+ (listedBucketsCount, next) => {
async.until(
- () => nBucketsQueued === nBucketsListed,
+ () => nBucketsQueued === listedBucketsCount,
Contributor

});

const bucketProcessorMessagesProcessed = ZenkoMetrics.createCounter({
name: 's3_lifecycle_bucket_processor_messages_processed_total',
Contributor

I think we already have this metric, in BackbeatConsumer?

Especially since there is no "scanId" parameter, it does not seem to add much information.

labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Info-style gauge: a single active series per bucket-processor instance,
Contributor

if we keep an unbounded series, why keep this one (which always has the value 1) instead of the previous one (the number of messages for a specific scanID)?

Comment on lines +333 to +337
bucketProcessorLastConductorScanId.reset();
bucketProcessorLastConductorScanId.set({
[LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
[LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
}, 1);
Contributor

Storing a metric does not mean it is scraped by Prometheus.
With this approach, if we have 2 scans (ScanIDs) in parallel, depending on the workload pattern there is a very high probability that only a single ScanID will appear in the metrics.

4 participants