Conversation
Hello benzekrimaha, my role is to assist you with the merge of this pull request.
Force-pushed 380069a to 25ea9d5
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
... and 5 files with indirect coverage changes

```
@@            Coverage Diff             @@
##           development/9.3    #2723      +/-   ##
===================================================
- Coverage           74.48%   74.42%   -0.06%
===================================================
  Files                 200      200
  Lines               13603    13659      +56
===================================================
+ Hits                10132    10166      +34
- Misses               3461     3483      +22
  Partials               10       10
```
Force-pushed 8316f88 to 408c96c
Force-pushed 408c96c to e1c5b13
Force-pushed e1c5b13 to aefb677
```diff
  const log = this.logger.newRequestLogger();
- const start = new Date();
+ const start = Date.now();
  this._scanId = uuid();
```
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
Force-pushed 725c3df to 11a94ea
Force-pushed a2128cf to a464b39
```
const bucketProcessorScanBucketDoneCount = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_bucket_processor_scan_bucket_done_count',
```
- this is not a gauge: it only ever increases, and counts the total number of buckets scanned for a specific scan → this is really a counter
- this metric looks redundant with conductorBucketListings, and especially with s3_lifecycle_conductor_bucket_list_success_total
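The counter-vs-gauge distinction the reviewer is making can be shown with a toy in-memory counter (this is an illustrative class, not prom-client or ZenkoMetrics): a counter's value only ever goes up, which is exactly the semantics of "total number of buckets scanned".

```javascript
// Toy counter enforcing monotonic semantics: a decrement is an error,
// whereas a gauge would silently allow it.
class Counter {
    constructor() { this.value = 0; }
    inc(n = 1) {
        if (!(n >= 0)) throw new Error('counter can only increase');
        this.value += n;
    }
}

const bucketsScanned = new Counter();
bucketsScanned.inc();    // one bucket scanned
bucketsScanned.inc(3);   // a batch of three more
console.log(bucketsScanned.value); // 4
```

In Prometheus terms, a counter also unlocks `rate()`-style queries, which make no sense over a gauge that can move both ways.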
```
const conductorLatestBatchSize = ZenkoMetrics.createGauge({
    name: 's3_lifecycle_latest_batch_size',
    help: 'Size of the latest lifecycle batch for each type',
    labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_TYPE],
```
- LIFECYCLE_LABEL_TYPE is the kind of lifecycle operation, so it should not be reused for another kind of value; that just creates confusion.
- Do we actually need to have both the number of buckets and the number of workflows? We really process buckets, not workflows, so I don't see what the extra metric provides...
- For consistency, this metric should be moved along with the other conductorLatestBatch* metrics.
```
}

/**
 * Called when the first message of a new conductorScanId arrives.
```
cf. comment above (on the bucket processor): we need to design something which will report relevant information when it works properly (one scan after the other), but also when that is not the case.
→ the idea to track the number of buckets processed per scanID is good in that sense, as it will always work: normally with a single series, but with more if we have some bug.
→ adding some logic to detect "this is the start" and "this is the end", however, is not stable, and will work only if everything is working... and eventually will not allow us to have meaningful debugging info (as we debug when we have an issue).
Co-authored-by: Francois Ferrand <francois.ferrand@scality.com>
In LifecycleMetrics:
- Replace conductorLatestBatchSize gauge (s3_lifecycle_latest_batch_size, labeled by type) with conductorLatestBatchBucketCount gauge (s3_lifecycle_latest_batch_bucket_count, origin label only), removing the redundant workflow count series.
- Change bucketProcessorScanBucketDoneCount from Gauge to Counter and rename to s3_lifecycle_bucket_processor_scan_bucket_done_total.
- Remove onBucketProcessorScanStart and the scan start time gauge (s3_lifecycle_bucket_processor_scan_start_time), which relied on fragile first-message detection.
- Rename onBucketProcessorBucketDone to onBucketProcessorBucketProcessed.
- Simplify onConductorScanComplete to accept only bucketCount.

In LifecycleConductor:
- Update the onConductorScanComplete call to pass only totalBucketsListed.

In LifecycleBucketProcessor:
- Remove scanStartTimestamp and the onBucketProcessorScanStart call.
- Update to use the renamed onBucketProcessorBucketProcessed method.
Issue: BB-740
```
    name: 's3_lifecycle_bucket_processor_scan_bucket_done_total',
    help: 'Total number of buckets processed by this bucket processor, per scan',
    labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID],
});
```
The conductor_scan_id label on this counter creates a new Prometheus time series for every scan (a fresh UUID each time). Since counters are never pruned within a process lifetime, label cardinality grows without bound — e.g. ~288 new series/day at 5-min scan intervals, per bucket processor pod. This is a classic cardinality bomb that will eventually pressure Prometheus memory and storage.
Consider tracking scan progress with a gauge (resettable per scan) instead of a per-scan-ID counter, or dropping the conductor_scan_id label from the metric and relying on logs for per-scan correlation.
— Claude Code
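The "resettable per scan" alternative suggested above can be sketched with a toy label-keyed gauge (illustrative classes and names only, not prom-client, though prom-client's real `Gauge` does expose a `reset()` with the same effect): resetting at scan start means the per-scan series never accumulate.

```javascript
// Toy gauge keyed by serialized labels; reset() drops every series,
// so stale scan IDs do not pile up across scans.
class Gauge {
    constructor() { this.series = new Map(); }
    set(labels, value) { this.series.set(JSON.stringify(labels), value); }
    reset() { this.series.clear(); }
}

const scanProgress = new Gauge();

function onScanStart(scanId) {
    scanProgress.reset(); // forget the previous scan's series
    scanProgress.set({ conductor_scan_id: scanId }, 0);
}

onScanStart('scan-1');
onScanStart('scan-2'); // the 'scan-1' series is gone
console.log(scanProgress.series.size); // 1
```

The in-process series count stays at one regardless of how many scans the pod has lived through, which is the property the per-scan-ID counter cannot offer.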
```
    [LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
    [LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
});
} catch (err) {
```
Error handler references wrong method name. Should match the enclosing method.
```suggestion
'LifecycleMetrics.onBucketProcessorBucketProcessed',
```

Review by Claude Code
Rename s3_lifecycle_bucket_processor_scan_bucket_done_total to s3_lifecycle_bucket_processor_messages_processed_total and remove the conductor_scan_id label.

The previous name was misleading: the counter is incremented once per bucket-tasks topic message processed, not once per unique bucket. In v1 a bucket can produce multiple truncated-listing continuations, and in v2 a bucket can be split across current/noncurrent/orphan listing chains; each slice is a separate Kafka message and each was double-counted under the old name.

The conductor_scan_id label is removed because there is no reliable way to bound its cardinality in practice. The bucket processor never sees the "last message" of a scan: continuation messages can be re-routed across partitions to other instances, so each bucket-processor instance only ever observes a partial subset of the scan and cannot decide when to stop emitting its label series. prom-client has no native series TTL, so labelling here would let the in-process series count grow unbounded across pods over time.

Method renamed: onBucketProcessorBucketProcessed → onBucketProcessorMessageProcessed. Dashboard panel and Grafana query updated to aggregate by pod only.

Made-with: Cursor
Add s3_lifecycle_bucket_processor_last_conductor_scan_id_info, an info-style gauge (value always 1) carrying the most recent conductor_scan_id observed by the bucket processor as a label.

Cardinality is bounded to a single active series per pod by calling reset() before every set(); Prometheus then expires the previous series naturally. This deliberately avoids the unbounded-cardinality pitfall that a per-scan label on a counter would have, while preserving the ability to detect more than one conductor scan in flight simultaneously: aggregating count(s3_lifecycle_bucket_processor_last_conductor_scan_id_info) by (conductor_scan_id) across pods yields the number of distinct active scans, and a value > 1 is the alerting signal.

The bucket processor calls onBucketProcessorScanIdChanged whenever the conductor_scan_id observed on an incoming message differs from the previous one. A Grafana panel "Bucket Processor Active Conductor Scans" visualises the aggregated count.

Made-with: Cursor
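The detection query described in this commit message would read roughly as follows in PromQL (expression inferred from the description above, not copied from the actual Grafana panel):

```promql
count by (conductor_scan_id) (
    s3_lifecycle_bucket_processor_last_conductor_scan_id_info
)
```

More than one resulting series means more than one distinct conductor scan ID is currently being reported across the bucket-processor pods, which is the anomaly worth alerting on.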
Add s3_lifecycle_conductor_last_batch_duration_seconds, a gauge set by the conductor at scan completion to the elapsed time of the just-finished scan in seconds.

Computing the duration as (end_time - start_time) in PromQL is fragile: when a new scan starts, latest_batch_start_time is refreshed before latest_batch_end_time, so any dashboard subtracting them returns a negative value for the duration of the in-progress scan. Letting the conductor publish the finished-scan duration directly removes that whole class of mid-scan artifacts and matches the standard Prometheus pattern of "<thing>_last_<measurement>_seconds" for last-completed-event metrics.

The conductor passes elapsedMs / 1000 to onConductorScanComplete; the parameter is optional so the metric is only set when a finite, non-negative value is supplied. The "Full Scan Elapsed Time" Grafana stat panel is renamed to "Last Conductor Batch Duration" and now reads from the new gauge directly.

Made-with: Cursor
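The "finite, non-negative" guard described above can be sketched like this. The function below is a simplified stand-in for the real onConductorScanComplete (the returned array merely records which gauges would be set; the actual method presumably calls the prom-client gauges instead):

```javascript
// Hypothetical simplified signature: the duration gauge is only set
// when a finite, non-negative value is supplied.
function onConductorScanComplete(bucketCount, lastBatchDurationSeconds) {
    const gaugesSet = [];
    gaugesSet.push(['latest_batch_bucket_count', bucketCount]);
    if (Number.isFinite(lastBatchDurationSeconds) && lastBatchDurationSeconds >= 0) {
        gaugesSet.push(['last_batch_duration_seconds', lastBatchDurationSeconds]);
    }
    return gaugesSet;
}

// the caller passes elapsedMs / 1000
console.log(onConductorScanComplete(42, 1500 / 1000).length); // 2
console.log(onConductorScanComplete(42, undefined).length);   // 1
console.log(onConductorScanComplete(42, -1).length);          // 1
```

Rejecting NaN, Infinity, and negatives at the metrics boundary keeps a clock glitch or a missing argument from poisoning the gauge.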
Clarify in the help text and JSDoc that
s3_lifecycle_latest_batch_start_time is the conductor's scheduling
heartbeat (consumed by the LifecycleLateScan alert) and must not be
used to derive scan duration: while a scan is in progress, end_time is
from the previous run and start_time has just been refreshed, so
end_time - start_time is negative.
s3_lifecycle_conductor_last_batch_duration_seconds (added in the
preceding commit) is now the only sanctioned way to chart scan
duration. The latest_batch_start_time and latest_batch_end_time gauges
are kept because they are needed by the LifecycleLateScan alert
("conductor stopped scheduling") and as a "last completion" timestamp
respectively, but their help text now warns against the previous
end - start pattern.
No code behaviour change.
Made-with: Cursor
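The negative mid-scan artifact warned about above is simple arithmetic, illustrated here with made-up epoch timestamps: while a scan is in progress, start_time has just been refreshed but end_time still belongs to the previous run.

```javascript
// Made-up timestamps; only the ordering matters.
const latestBatchEndTime = 1600;   // previous scan finished at t=1600
const latestBatchStartTime = 2000; // current scan started at t=2000

// the "end - start" dashboard pattern during an in-progress scan:
const naiveDuration = latestBatchEndTime - latestBatchStartTime;
console.log(naiveDuration); // -400: negative, hence unusable as a duration
```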
LGTM — clean, well-structured monitoring addition. Scan ID propagation is thorough, metric cardinality is properly bounded (reset-before-set on the info gauge), error handling follows existing patterns, and tests cover all new methods including edge cases. The heartbeat move to scan start (from completion) is the correct semantic for detecting a stalled conductor.
```
const conductorScanId = result.contextInfo && result.contextInfo.conductorScanId;
const scanStartTimestamp = (result.contextInfo
    && result.contextInfo.scanStartTimestamp) || Date.now();
if (conductorScanId && conductorScanId !== this._metricsScanId) {
```
resetting the current scan ID when it changes WILL break if we have a bug, and makes the metrics mostly useless...
```diff
 this.listBuckets(messageSendQueue, scanId, log, (err, listedBucketsCount) => {
     LifecycleMetrics.onBucketListing(log, err);
-    return next(err, nBucketsListed);
+    nBucketsListed = listedBucketsCount;
+    return next(err, listedBucketsCount);
 });
 },
-(nBucketsListed, next) => {
+(listedBucketsCount, next) => {
     async.until(
-        () => nBucketsQueued === nBucketsListed,
+        () => nBucketsQueued === listedBucketsCount,
 });
```
```
const bucketProcessorMessagesProcessed = ZenkoMetrics.createCounter({
    name: 's3_lifecycle_bucket_processor_messages_processed_total',
```
I think we already have this metric, in BackbeatConsumer ?
esp. since there is no "scanId" parameter, it does not seem to add much information
```
    labelNames: [LIFECYCLE_LABEL_ORIGIN],
});

// Info-style gauge: a single active series per bucket-processor instance,
```
if we keep an unbounded series, why keep this one (which always has the value 1) instead of the previous one (number of messages for a specific scanID)?
```
bucketProcessorLastConductorScanId.reset();
bucketProcessorLastConductorScanId.set({
    [LIFECYCLE_LABEL_ORIGIN]: 'bucket_processor',
    [LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID]: conductorScanId,
}, 1);
```
Storing a metric does not mean it is scraped by Prometheus.
With this approach, if we have 2 scans (ScanIDs) in parallel, then depending on the workload pattern there is a very high probability that only a single ScanID will ever appear in the metrics.
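The reviewer's scenario can be simulated with a toy info gauge (illustrative class and names, not prom-client): with reset-before-set, two interleaved scans overwrite each other, so any single scrape observes exactly one scan ID, whichever message arrived last.

```javascript
// Toy info-style gauge: reset() clears all series, set() records one.
class InfoGauge {
    constructor() { this.series = new Map(); }
    reset() { this.series.clear(); }
    set(labels, value) { this.series.set(labels.conductor_scan_id, value); }
}

const g = new InfoGauge();

// reset-before-set, as in the code under review
function onMessage(scanId) {
    g.reset();
    g.set({ conductor_scan_id: scanId }, 1);
}

// a Prometheus scrape sees whatever series exist at that instant
function scrape() { return [...g.series.keys()]; }

onMessage('scan-A');
onMessage('scan-B'); // a parallel scan's message arrives
onMessage('scan-A');
console.log(scrape()); // only 'scan-A' is visible, scan-B was wiped
```

Unless a scrape happens to land in the narrow window while scan-B's series is current, the parallel scan never shows up, which undermines the "count distinct scan IDs" detection the gauge was meant to enable.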
Issue: BB-740