CNTRLPLANE-2549:test/library/controlplane: add WaitForControlPlaneRolloutAll helper #2137

Open
wangke19 wants to merge 1 commit into openshift:master from wangke19:add-control-plane-rollout-wait-helper

@wangke19 wangke19 commented Mar 5, 2026

Summary

Add a reusable test helper package test/library/controlplane for waiting on control-plane static-pod operator revision rollouts after disruptive cluster changes (e.g. CA rotation, cert regeneration).

Motivation

After a disruptive CA rotation, the three control-plane static-pod operators (kube-apiserver, kube-controller-manager, kube-scheduler) each trigger a node-by-node revision rollout. Tests that need to wait for the cluster to stabilize before asserting monitor health had no reusable utility for this in library-go.

New API: test/library/controlplane

// One-call shortcut: wait for all three operators sequentially.
WaitForControlPlaneRolloutAll(ctx, t, cfgClient, opClient) error

// Single-operator variant.
WaitForControlPlaneRollout(ctx, t, cfgClient, opClient, op) error

// Wait for a single ClusterOperator to reach Available=True, Progressing=False, Degraded=False.
WaitForClusterOperatorStable(ctx, t, cfgClient, name) error
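
The stability condition named in the last comment can be sketched as a pure function. This is an illustration only, not the code in this PR: `Condition` is a hypothetical stand-in for the ClusterOperator status condition type, and `isStable` is a hypothetical name. The sketch assumes that a missing condition counts as "not stable".

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a ClusterOperator
// status condition (only the two fields this check needs).
type Condition struct {
	Type   string
	Status string
}

// isStable reports whether conds contains Available=True, Progressing=False
// and Degraded=False. Any of the three being absent or different means
// "not stable" (an assumption of this sketch).
func isStable(conds []Condition) bool {
	want := map[string]string{
		"Available":   "True",
		"Progressing": "False",
		"Degraded":    "False",
	}
	matched := map[string]bool{}
	for _, c := range conds {
		expected, tracked := want[c.Type]
		if !tracked {
			continue // ignore unrelated conditions such as Upgradeable
		}
		if c.Status != expected {
			return false
		}
		matched[c.Type] = true
	}
	return len(matched) == len(want)
}

func main() {
	settled := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "False"},
		{Type: "Degraded", Status: "False"},
	}
	rolling := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True"},
		{Type: "Degraded", Status: "False"},
	}
	fmt.Println("settled:", isStable(settled))
	fmt.Println("rolling:", isStable(rolling))
}
```

Keeping the check separate from the polling loop makes it testable without a cluster.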

Design

  • Reads LatestAvailableRevision and nodeStatuses from the operator/v1 resource (KubeAPIServer, KubeControllerManager, KubeScheduler) — the canonical source for static pod rollout progress on OCP 4.x (no node annotations required).
  • Re-reads LatestAvailableRevision each poll interval so mid-rollout re-revisions are chased automatically.
  • Logs only on state transitions (not every poll tick) to keep test output readable.
  • Accepts context.Context for cancellation; callers control the deadline.
  • Uses library.LoggingT interface (compatible with both *testing.T and Ginkgo's GinkgoTB()).

Usage

ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
defer cancel()
if err := controlplane.WaitForControlPlaneRolloutAll(ctx, t, cfgClient, opClient); err != nil {
    t.Logf("WARNING: control plane did not stabilize: %v", err)
}

Test plan

  • Validated live against OCP 4.21 nightly cluster (GCP, 3-master) as part of service-ca-operator's refresh-CA e2e test
  • go vet ./test/library/controlplane/... passes

@openshift-ci openshift-ci bot requested review from bertinatto and deads2k March 5, 2026 10:04

openshift-ci bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wangke19
Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wangke19 added a commit to wangke19/service-ca-operator that referenced this pull request Mar 5, 2026
Migrate the refresh-CA e2e test to be compatible with the OTE
(openshift-tests-extension) framework while retaining the existing
go-test path for backwards compatibility.

Key changes:
- Extract testRefreshCA(testing.TB) for dual Ginkgo/go-test usage
- Add [Disruptive][Timeout:20m] labels so openshift-tests grants a
  20-minute per-test timeout (vs the default 15min) for this
  deliberately disruptive CA rotation test
- Add waitForControlPlaneRolloutAll() to wait for kube-apiserver,
  kube-controller-manager, and kube-scheduler static-pod revision
  rollouts to complete after CA rotation, reducing monitor test
  failures from expected transient disruption

The stabilization wait tracks rollout via operator/v1 nodeStatuses
(not node annotations, which are absent on OCP 4.21+). It re-reads
LatestAvailableRevision each poll to chase mid-rollout re-revisions
automatically. The wait is best-effort: if operators don't fully
stabilize, the CA rotation test still passes.

Relates to: openshift/library-go#2137
@wangke19 wangke19 force-pushed the add-control-plane-rollout-wait-helper branch from f01795a to 597220d Compare March 5, 2026 10:23
@wangke19 wangke19 marked this pull request as draft March 5, 2026 12:23
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Mar 5, 2026
Add WaitForControlPlaneRolloutAll, WaitForControlPlaneRollout, and
WaitForClusterOperatorStable to the test/library package for use by
operators that need to wait for control-plane stabilization after
disruptive cluster changes (e.g. CA rotation).

Tracks rollout via operator/v1 nodeStatuses[].currentRevision (not
node annotations, which are absent on OCP 4.21+). Re-reads
LatestAvailableRevision each poll so mid-rollout re-revisions are
chased automatically. Logs only on state transitions.
@wangke19 wangke19 force-pushed the add-control-plane-rollout-wait-helper branch from 597220d to 716944f Compare March 5, 2026 14:38
@wangke19 wangke19 marked this pull request as ready for review March 5, 2026 14:38
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Mar 5, 2026
@wangke19 wangke19 changed the title from "test/library/controlplane: add WaitForControlPlaneRolloutAll helper" to "CNTRLPLANE-2549:test/library/controlplane: add WaitForControlPlaneRolloutAll helper" Mar 5, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label Mar 5, 2026

openshift-ci-robot commented Mar 5, 2026

@wangke19: This pull request references CNTRLPLANE-2549 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


openshift-ci bot commented Mar 5, 2026

@wangke19: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants