Skip to content

OCPBUGS-88732: Treat missing cloud config outages as transient#472

Open
RadekManak wants to merge 1 commit into
openshift:mainfrom
RadekManak:regression-ci-failure-investigation
Open

OCPBUGS-88732: Treat missing cloud config outages as transient#472
RadekManak wants to merge 1 commit into
openshift:mainfrom
RadekManak:regression-ci-failure-investigation

Conversation

@RadekManak

@RadekManak RadekManak commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Summary by CodeRabbit

  • Bug Fixes

    • Improved cloud config synchronization when the infrastructure source ConfigMap is missing: for managed platforms it now initializes a minimal, platform-appropriate default; for non-managed scenarios it now surfaces a terminal error and clears the related ClusterOperator status.
  • Tests

    • Updated reconciliation tests for better cleanup/verification.
    • Added an AWS-focused test covering: missing source ConfigMap → error and ClusterOperator removal, then recreation → successful resync and restored ClusterOperator health.
    • Adjusted BareMetal assertions to confirm the synced ConfigMap is absent when expected.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Walkthrough

In CloudConfigReconciler.Reconcile, the not-found path for the source cloud-config ConfigMap is split: if the platform is managed by CCCMO, sourceCM.Data is initialized with a minimal platform config; otherwise, a terminal error is returned. A new test covers the AWS case for this missing-ConfigMap path.

Changes

Missing source ConfigMap handling

Layer / File(s) Summary
Reconcile bifurcation for missing ConfigMap
pkg/controllers/cloud_config_sync_controller.go
When the source ConfigMap is absent from openshift-config, the controller bifurcates: if shouldManageManagedConfigMap is true, sourceCM.Data is initialized with minimal platform-specific config; otherwise, a terminal error is returned with the ConfigMap's namespace/name.
Test updates and coverage for bifurcation
pkg/controllers/cloud_config_sync_controller_test.go
Test setup cleanup is updated to directly delete pre-existing ConfigMaps in the managed namespace. A new AWS test verifies that deleting the infra source ConfigMap causes reconcile error and absent ClusterOperator, and that recreating it restores success with cloudConfigControllerDegraded=false and cloudConfigControllerAvailable=true. BareMetal test assertion changed to expect Get returning IsNotFound for the synced ConfigMap.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning The new test at lines 560-583 fails requirement #3 (Timeouts): it performs cluster operations (cl.Delete, cl.Create, cl.Get) without timeouts on Eventually/Consistently, creating potential race con... Add Eventually(...).Should(MatchError(apierrors.IsNotFound)) after each cl.Delete in BeforeEach, and wrap cl.Get/List operations in the new test in Eventually() calls with appropriate timeouts to prevent races.
Microshift Test Compatibility ⚠️ Warning New test "should treat a missing source configmap as transient until it reappears" (line 560-583) uses configv1.ClusterOperator (config.openshift.io/v1) without MicroShift protection markers. Add [apigroup:config.openshift.io] tag to test name or [Skipped:MicroShift] label to skip on MicroShift.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly refers to the main change: treating missing cloud config as transient failures rather than terminal errors, which is the core objective of this PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 46 Ginkgo test declarations in the modified test file use static, deterministic names with no dynamic content (no timestamps, UUIDs, generated suffixes, variables, fmt.Sprintf, or concatenation...
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR modifies unit tests in pkg/controllers/, not e2e tests. The custom check applies only to "new Ginkgo e2e tests," and this repository's e2e tests are in openshift-tests/, not pkg/controllers/...
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only controller Go code and tests, introduces no deployment manifests or scheduling constraints (affinity, nodeSelector, topology spread, PDBs, replica scaling).
Ote Binary Stdout Contract ✅ Passed PR modifications only affect controller unit tests (pkg/controllers/), not OTE binaries. No stdout writes found in process-level code (main, init, BeforeSuite) that would violate the OTE Binary Std...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The custom check targets Ginkgo e2e tests; however, the test file is a unit/integration test located in pkg/controllers/, not an e2e test. It uses controller-runtime testing (manager.New), not e2e...
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in the code changes.
Container-Privileges ✅ Passed PR modifies only Go source files (controller and test code) with no changes to container/K8s manifests or security configurations; no privileged settings introduced.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, API keys, PII, SSNs, credit cards, session IDs) is logged in the PR changes. Logging includes namespace/name info and error details, which are standard diagnos...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@RadekManak: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test fmt
/test images
/test lint
/test okd-scos-images
/test unit
/test vendor
/test verify-deps
/test vet

The following commands are available to trigger optional jobs:

/test e2e-aws-ovn-techpreview
/test e2e-azure-manual-oidc
/test e2e-azure-ovn
/test e2e-azure-ovn-upgrade
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-ibmcloud-ovn
/test e2e-nutanix-ovn
/test e2e-openstack-ovn
/test e2e-vsphere-ovn
/test level0-clusterinfra-azure-ipi-proxy-tests
/test okd-scos-e2e-aws-ovn
/test regression-clusterinfra-vsphere-ipi-ccm

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-cloud-controller-manager-operator-main-e2e-aws-ovn
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-e2e-aws-ovn-upgrade
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-fmt
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-images
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-level0-clusterinfra-azure-ipi-proxy-tests
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-lint
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-okd-scos-images
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-unit
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-vendor
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-verify-deps
pull-ci-openshift-cluster-cloud-controller-manager-operator-main-vet
Details

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot requested review from nrb and racheljpg June 16, 2026 13:10
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RadekManak RadekManak changed the title Treat missing cloud config outages as transient OCPBUGS-88732: Treat missing cloud config outages as transient Jun 16, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@RadekManak: This pull request references Jira Issue OCPBUGS-88732, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

Copy link
Copy Markdown

@RadekManak: This pull request references Jira Issue OCPBUGS-88732, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

  • keep missing source cloud configmaps on the transient retry path instead of immediately degrading CCCMO
  • preserve terminal behavior for real bad-key misconfigurations in infra.spec.cloudConfig
  • add a controller regression test that covers source config disappearance and recovery

Test plan

  • KUBEBUILDER_ASSETS=\"$(pwd)/$(go run ./vendor/sigs.k8s.io/controller-runtime/tools/setup-envtest use 1.33.2 -p path --bin-dir ./bin --index https://raw.githubusercontent.com/openshift/api/master/envtest-releases.yaml)\" go test ./pkg/controllers

Summary by CodeRabbit

  • Bug Fixes

  • Cloud config sync controller now cleanly handles missing source ConfigMap by either initializing with platform-specific defaults (for managed platforms) or returning appropriate error status via ClusterOperator.

  • Tests

  • Added test case verifying controller properly handles missing infrastructure cloud ConfigMap scenario with correct ClusterOperator status reporting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Keep brief source ConfigMap outages on the retry path so CCCMO does not immediately degrade during recovery, while preserving terminal behavior for real bad-key misconfigurations.
@RadekManak RadekManak force-pushed the regression-ci-failure-investigation branch from b53992f to 042c11c Compare June 17, 2026 11:26
@RadekManak

Copy link
Copy Markdown
Contributor Author

/testwith openshift/cloud-provider-aws/main/regression-clusterinfra-aws-ipi-ccm openshift/cloud-provider-aws#158

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controllers/cloud_config_sync_controller_test.go`:
- Around line 431-438: The code in the deletion loop deletes ConfigMaps but does
not wait for the deletions to be observed before proceeding, which can cause
race conditions with subsequent creates. After the deletion loop that calls
cl.Delete() on each ConfigMap in existingCMs.Items, add an Eventually block that
polls to verify all ConfigMaps have been deleted from the namespace by listing
the ConfigMaps again and confirming the list is empty. This ensures the
deletions are fully observed before the test continues, preventing flaky
AlreadyExists or conflict errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ddd7f818-5aa2-4d7a-8264-0e8a3d087a27

📥 Commits

Reviewing files that changed from the base of the PR and between b53992f and 042c11c.

📒 Files selected for processing (2)
  • pkg/controllers/cloud_config_sync_controller.go
  • pkg/controllers/cloud_config_sync_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/cloud_config_sync_controller.go

Comment on lines +431 to +438
deleteOptions := &client.DeleteOptions{
GracePeriodSeconds: ptr.To[int64](0),
}
existingCMs := &corev1.ConfigMapList{}
Expect(cl.List(ctx, existingCMs, &client.ListOptions{Namespace: targetNamespaceName})).To(Succeed())
for _, cm := range existingCMs.Items {
Expect(cl.Delete(ctx, cm.DeepCopy(), deleteOptions)).To(Succeed())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Wait for ConfigMap deletions to be observed before proceeding.

Line 431-438 deletes resources but doesn’t verify they’re gone; this can race with subsequent creates in the same namespace and cause flakes (AlreadyExists/conflicts).

As per coding guidelines, “Operations interacting with clusters must include timeouts” and tests should keep setup/cleanup reliable in BeforeEach/AfterEach.

Suggested fix
 		existingCMs := &corev1.ConfigMapList{}
 		Expect(cl.List(ctx, existingCMs, &client.ListOptions{Namespace: targetNamespaceName})).To(Succeed())
 		for _, cm := range existingCMs.Items {
 			Expect(cl.Delete(ctx, cm.DeepCopy(), deleteOptions)).To(Succeed())
+			Eventually(func() error {
+				return cl.Get(ctx, client.ObjectKeyFromObject(cm.DeepCopy()), &corev1.ConfigMap{})
+			}, timeout).Should(MatchError(apierrors.IsNotFound, "IsNotFound"))
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controllers/cloud_config_sync_controller_test.go` around lines 431 - 438,
The code in the deletion loop deletes ConfigMaps but does not wait for the
deletions to be observed before proceeding, which can cause race conditions with
subsequent creates. After the deletion loop that calls cl.Delete() on each
ConfigMap in existingCMs.Items, add an Eventually block that polls to verify all
ConfigMaps have been deleted from the namespace by listing the ConfigMaps again
and confirming the list is empty. This ensures the deletions are fully observed
before the test continues, preventing flaky AlreadyExists or conflict errors.

Source: Coding guidelines

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@RadekManak: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants