
WIP: THREESCALE-14224 - Fix perpetual reconcile loops: deployment churn, status writes, HPA deletes#1167

Draft
urbanikb wants to merge 5 commits into 3scale:master from urbanikb:THREESCALE-14224

Conversation

@urbanikb
Contributor

What

Fixes three independent sources of unnecessary reconcile work that caused either continuous API writes or missed reconcile triggers.

1. Deployment update churn (K8s default normalization)

The K8s API server applies defaults (probe integers, HTTPGet scheme, container termination fields, volume DefaultModes, pod-level policies, deployment strategy) to stored Deployments. Without normalization, every reconcile pass detected a false-positive diff between the in-memory desired state and the API-server-enriched existing state, issuing an unnecessary update on every cycle.

Introduces normalizeDeploymentDefaults (and its call chain through normalizePodSpecDefaults, normalizeContainerDefaults, normalizeProbeDefaults, normalizeVolumeDefaults) in pkg/reconcilers/k8s_defaults.go, called at the top of DeploymentMutator before any DMutateFn runs.

2. Status reconciler write loop and unreliable route watch

  • Replaced unconditional Status().Update() with a no-op when newStatus equals the current status; the previous && apiManagerAvailable guard prevented the short-circuit when routes were not ready, causing a write every reconcile cycle even when nothing had changed.
  • Replaced Requeue: true (feeds the rate-limiter backoff queue) with RequeueAfter: 30s when Available is False, as a safety net for missed watch events.
  • Replaced the namespace-wide route informer and unreliable ownerRef-chain mapper with a label-filtered watch (3scale.net/created-by=zync) and a domain-name mapper that enqueues the APIManager whose expected tenant hosts overlap with the changed route's host.

3. HPA reconciler unnecessary delete calls

ReconcileHpa() was calling DeleteResource() directly when HPA is disabled, issuing a Client().Delete() API call every reconcile cycle regardless of whether the HPA still existed. After the first successful delete, subsequent cycles hit a 404 indefinitely. Replaced with TagObjectToDelete + ReconcileResource, consistent with how PDB and monitoring resources are handled.

Testing

  • Added test/envtest/ suite with 21 tests — one per normalized field — using a real API server to confirm each default is handled and no false-positive update is produced.
  • Added make test-envtest target included in the GitHub Actions unit-tests job.
  • Integration test (live cluster) confirmed: 0 deployment updates after initial create.

Status

  • All three fixes implemented and tested
  • Observed baseline churn on a 2.16 cluster; verifying the fix eliminates the churn before marking ready for review

🤖 Generated with Claude Code

urbanikb and others added 5 commits April 16, 2026 08:03
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ReconcileHpa() was calling DeleteResource() directly when HPA is disabled,
which issues a Client().Delete() API call every reconcile cycle regardless
of whether the HPA exists. After the first successful delete, subsequent
cycles hit a 404 on every cycle indefinitely.

Replace with TagObjectToDelete + ReconcileResource, which does a
cache-backed Get first and is a no-op when the object is already gone.
This is consistent with how PDB and monitoring resources are handled.

Add table-driven tests covering all three HPA targets (backend-listener,
backend-worker, apicast-production) across enabled, disabled, and
async-disable annotation scenarios.

Related: THREESCALE-14224

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace unconditional Status().Update() loop: skip write when
  newStatus equals current status; previously the && apiManagerAvailable
  guard prevented the short-circuit when routes were not ready, causing
  a write every reconcile cycle even when nothing had changed

- Replace Requeue:true (feeds rate limiter AddRateLimited) with
  RequeueAfter:30s (uses AddAfter, no backoff accumulation) when
  Available is False, as a safety net for missed watch events

- Replace namespace-wide route informer and unreliable ownerRef-chain
  mapper with a label-filtered watch (3scale.net/created-by=zync) and
  a domain-name mapper that enqueues the APIManager whose expected
  tenant hosts overlap with the changed route's host

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… reconcile

Probe fields left at zero (TimeoutSeconds, PeriodSeconds, SuccessThreshold,
FailureThreshold) are defaulted by the API server on write. DeploymentProbesMutator
uses reflect.DeepEqual, so the desired spec (zeros) never matched the live spec
(K8s defaults), causing an update every reconcile cycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@codecov-commenter

Codecov Report

❌ Patch coverage is 78.00687% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.97%. Comparing base (4963add) to head (e938384).
⚠️ Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/reconcilers/k8s_defaults.go | 78.94% | 16 Missing and 8 partials ⚠️ |
| controllers/apps/apimanager_controller.go | 0.00% | 16 Missing ⚠️ |
| controllers/apps/apimanager_status_reconciler.go | 85.32% | 12 Missing and 4 partials ⚠️ |
| ...lers/apps/zync_route_to_apimanager_event_mapper.go | 74.07% | 5 Missing and 2 partials ⚠️ |
| pkg/helper/route.go | 92.30% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1167      +/-   ##
==========================================
- Coverage   41.84%   40.97%   -0.88%     
==========================================
  Files         203      204       +1     
  Lines       20859    26067    +5208     
==========================================
+ Hits         8729    10681    +1952     
- Misses      11350    14570    +3220     
- Partials      780      816      +36     
| Flag | Coverage Δ |
|---|---|
| unit | 40.97% <78.00%> (-0.88%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
|---|---|
| apis/apps/v1alpha1 (u) | 62.50% <ø> (+3.94%) ⬆️ |
| apis/capabilities/v1alpha1 (u) | 2.73% <ø> (-0.77%) ⬇️ |
| apis/capabilities/v1beta1 (u) | 19.77% <ø> (-0.45%) ⬇️ |
| controllers (i) | 12.30% <74.34%> (+2.98%) ⬆️ |
| pkg (u) | 60.35% <82.01%> (-1.36%) ⬇️ |
| Files with missing lines | Coverage Δ |
|---|---|
| apis/apps/v1alpha1/apimanager_types.go | 63.40% <ø> (+3.82%) ⬆️ |
| pkg/3scale/amp/component/apicast.go | 64.90% <ø> (-3.59%) ⬇️ |
| pkg/3scale/amp/component/backend.go | 81.75% <ø> (-2.83%) ⬇️ |
| pkg/3scale/amp/component/memcached.go | 100.00% <ø> (ø) |
| pkg/3scale/amp/component/system.go | 72.60% <ø> (-2.08%) ⬇️ |
| pkg/3scale/amp/component/system_searchd.go | 45.22% <ø> (-2.96%) ⬇️ |
| pkg/3scale/amp/component/zync.go | 72.71% <100.00%> (-1.99%) ⬇️ |
| pkg/3scale/amp/operator/apicast_reconciler.go | 64.59% <100.00%> (-2.29%) ⬇️ |
| ...e/amp/operator/base_apimanager_logic_reconciler.go | 56.88% <100.00%> (+4.24%) ⬆️ |
| pkg/3scale/amp/operator/memcached_reconciler.go | 66.66% <100.00%> (-6.67%) ⬇️ |

... and 8 more

... and 170 files with indirect coverage changes

