Skip to content

OCPEDGE-1829: Add e2e tests for pacemaker out-of-service taint and annotation on two node fencing clusters#31304

Open
vimauro wants to merge 3 commits into
openshift:mainfrom
vimauro:taintint-test-suite
Open

OCPEDGE-1829: Add e2e tests for pacemaker out-of-service taint and annotation on two node fencing clusters#31304
vimauro wants to merge 3 commits into
openshift:mainfrom
vimauro:taintint-test-suite

Conversation

@vimauro

@vimauro vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • Tests
    • Added a dual-node end-to-end validation for the fencing taint lifecycle, including alert registration checks, disruption recovery observation, and post-recovery taint/annotation cleanup verification.
    • Added journal/service confirmation steps to ensure fencing and untaint actions are properly recorded.
  • Bug Fixes
    • Strengthened taint/untaint lifecycle assertions using more reliable evidence from node taints, annotations, and log markers.
  • Chores
    • Added shared utilities for detecting out-of-service taints/annotations, capturing time context, grepping journal output, and performing best-effort taint cleanup.
    • Added a version-gated test skip helper based on minimum supported cluster version.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f1ef4d53-b9da-4a0f-a6e1-2699b06df015

📥 Commits

Reviewing files that changed from the base of the PR and between b7582d4 and 04bb50e.

📒 Files selected for processing (2)
  • test/extended/edge_topologies/tnf_taint.go
  • test/extended/edge_topologies/utils/common.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/edge_topologies/tnf_taint.go

Walkthrough

Adds utility module test/extended/edge_topologies/utils/services/taint.go with TNF taint/annotation inspection predicates and debug-based command helpers, plus test suite test/extended/edge_topologies/tnf_taint.go with pacemaker alert validation and disruptive fencing taint lifecycle test using concurrent taint observer. Also adds version precondition check helper to utils/common.go.

Changes

TNF Fencing Taint Implementation

Layer / File(s) Summary
Taint utility constants, predicates, and debug/API helpers
test/extended/edge_topologies/utils/services/taint.go
Defines exported taint/annotation keys and values, pacemaker alert IDs, alert script paths, log tags, and systemd unit name templates; adds HasOutOfServiceTaint and HasOutOfServiceAnnotation node predicates; provides debug-context command helpers (JournalGrepViaDebug, GetTimestampViaDebug, PcsAlertConfigViaDebug, SystemdServiceJournalGrep) and Kubernetes API helpers (FetchNodeObject, RemoveTaintAndAnnotation) for node inspection and cleanup.
Test package infrastructure and taint observer
test/extended/edge_topologies/tnf_taint.go
Establishes test package constants for fencing and journal polling timeouts; implements checkJournalOnNodes for multi-node journal scanning within a time window; defines mutex-protected taintObserver struct with background goroutine that continuously polls both nodes at fixed intervals to detect and record the first out-of-service taint occurrence.
Pacemaker alert configuration validation test
test/extended/edge_topologies/tnf_taint.go
Non-disruptive Ginkgo test that retrieves a control-plane node, queries pacemaker alert configuration via debug, verifies both taint and untaint alert IDs are registered, and confirms both alert script paths exist on disk with executable permissions.
Disruptive fencing taint lifecycle test
test/extended/edge_topologies/tnf_taint.go
DualReplica-gated disruptive test that selects two nodes, initializes etcd client factory, verifies cluster health/topology/version preconditions, establishes deferred cleanup, captures baseline timestamp, starts taint observer, triggers network disruption, validates etcd recovery state, determines fenced vs. survived node via taint state and observer evidence, verifies survived node has no taint/annotation, confirms fencing alert and script success journaling (scoped by baseline), awaits learner rejoin and voting promotion, validates taint/annotation removal, confirms untaint journaling, and checks systemd service completion on recovered node.
Version precondition check helper
test/extended/edge_topologies/utils/common.go
Adds SkipIfVersionBelow utility that logs a precondition check, retrieves cluster version via exutil.GetCurrentVersion, parses it as semver, and skips the test when the parsed version is below the provided minimum major/minor (with graceful skip behavior on version retrieval or parse failures).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Test has 3 assertions missing meaningful failure messages (lines 229, 284, 350: error checks without messages), violating requirement #4 that assertions should include helpful diagnostic messages. Add messages to lines 229, 284, and 350: e.g., 'Expected to fetch node object', 'Expected to fetch survived node', 'Expected to fetch fenced node'.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly and specifically describes the main change: adding end-to-end tests for pacemaker out-of-service taint and annotation on two-node fencing clusters, which aligns with the file additions and test suite introduction.
Docstring Coverage ✅ Passed Docstring coverage is 91.67% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All Ginkgo test titles (Describe and It) use static, deterministic strings with no dynamic values like node names, timestamps, UUIDs, or IP addresses. The By() statements contain node names but are...
Microshift Test Compatibility ✅ Passed Both test suites have [apigroup:config.openshift.io] tags that will automatically skip on MicroShift CI. The tests reference config.openshift.io API, pacemaker, etcd operator, and require multi-nod...
Single Node Openshift (Sno) Test Compatibility ✅ Passed Both test suites in tnf_taint.go call SkipIfNotTopology(oc, DualReplicaTopologyMode) in BeforeEach, which properly skips tests on SNO (SingleReplicaTopologyMode) via runtime topology check.
Topology-Aware Scheduling Compatibility ✅ Passed PR adds only test code (Ginkgo test suites and test helper functions) in test/extended/, not deployment manifests, operator code, or controllers that would introduce scheduling constraints.
Ote Binary Stdout Contract ✅ Passed All new code uses framework.Logf for logging (correct Ginkgo pattern). No process-level stdout writes, no klog, fmt.Print/Println, or log output detected. No init/main/TestMain functions at module...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed Tests contain no hardcoded IPv4 addresses, IPv4-only URL construction, external connectivity requirements, or IP family assumptions. All operations are cluster-internal (journalctl, pcs, Kubernetes...
No-Weak-Crypto ✅ Passed PR contains only test infrastructure code for pacemaker fencing validation. No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto, or insecure secret comparisons found.
Container-Privileges ✅ Passed The PR adds test code only (tnf_taint.go, taint.go helper, SkipIfVersionBelow utility). No container/K8s manifests with privileged settings (privileged:true, hostPID, hostNetwork, hostIPC, SYS_ADMI...
No-Sensitive-Data-In-Logs ✅ Passed PR contains no sensitive data logging. All 17 Logf statements log only node names, error messages, timestamps, taint status, network IPs, pacemaker alert config, and systemd journal patterns - no p...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot

openshift-ci-robot commented Jun 15, 2026

Copy link
Copy Markdown

@vimauro: This pull request references OCPEDGE-1829 which is a valid jira issue.

Details

In response to this:

Summary by CodeRabbit

  • Tests
  • Added end-to-end test suite for dual-node fencing taint lifecycle validation, including network disruption scenarios, pacemaker alert verification, and node recovery confirmation.
  • Added utility functions for systemd journal inspection, node taint/annotation detection, and cluster state verification in edge topology scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 15, 2026
@vimauro

vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

/verified by @vimauro on a live TNF cluster with the code changes from those PRs included: openshift/machine-config-operator#6092 and openshift/cluster-etcd-operator#1625

@vimauro

vimauro commented Jun 15, 2026

Copy link
Copy Markdown
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_taint.go`:
- Around line 161-166: Add explicit validation of the node count after
retrieving nodes with utils.GetNodes. Before accessing nodes.Items with
rand.Intn() and the modulo arithmetic, verify that the slice contains at least 2
nodes using an expectation check (similar to the existing error check pattern).
This ensures a clear failure message rather than a panic if nodes.Items is
empty, and confirms there are enough nodes for the randomIndex and targetNode
selection logic that follows.
- Around line 95-97: The Stop() method in the taintObserver type will panic if
called more than once because it attempts to close t.stopCh without guarding
against multiple invocations. To fix this, protect the close operation using
sync.Once by adding a sync.Once field to the taintObserver struct and wrapping
the close(t.stopCh) call in the Stop() method with a once.Do() guard to ensure
it only executes on the first call.

In `@test/extended/edge_topologies/utils/services/taint.go`:
- Around line 167-169: The node Update call in the cleanup path is susceptible
to resourceVersion conflicts on this high-churn resource, and a single failed
update can leave taints/annotations behind and contaminate subsequent tests.
Wrap the oc.AdminKubeClient().CoreV1().Nodes().Update() call in a retry loop
that retries on conflict errors (check for ResourceVersion conflict scenarios),
fetching the latest node state before each retry attempt. This ensures the
cleanup operation is more robust and reduces the chance of test contamination
from failed cleanup attempts.
- Around line 120-123: Replace context.Background() with timeout-bounded
contexts in the Kubernetes API calls to prevent indefinite blocking during
control-plane degradation. In the FetchNodeObject function at lines 120-123,
replace the context.Background() passed to the Get() call with a context that
includes a timeout (such as context.WithTimeout). Additionally, apply the same
fix at lines 167-168 where an Update() call also uses context.Background(). Both
calls should use a reasonably bounded timeout context (typically a few seconds)
instead of the unbounded background context to ensure the test can complete
within acceptable timeframes.
- Around line 87-90: The JournalGrepViaDebug and SystemdServiceJournalGrep
functions construct bash commands by directly interpolating unvalidated
variables (tag, unitName, sinceTimestamp, pattern) into fmt.Sprintf calls that
are passed to bash -c commands, creating a potential shell injection vector.
Harden this by adding allow-list validation for each interpolated variable
before use in command construction. Define validation rules (e.g., regex
patterns or character allowlists) appropriate for each variable type to ensure
only expected formats are accepted, and reject any input that does not match the
allow-list. Apply this validation consistently at both the initial
JournalGrepViaDebug function (around the journalctl command construction at
lines 87-90) and the SystemdServiceJournalGrep function (around the systemctl
command construction at lines 128-130).
- Around line 99-100: The date command in the bash expression used by
exutil.DebugNodeRetryWithOptionsAndChroot currently returns a UTC timestamp
without timezone information (using '+%Y-%m-%d %H:%M:%S'), which can be
misinterpreted as local time by journalctl --since. Update the date format
string to include timezone qualification by appending '+00:00' or another
timezone marker to indicate UTC, ensuring journalctl correctly interprets the
timestamp regardless of the host's local timezone setting.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a1b9cc77-3a79-473d-8757-c387c9f66bca

📥 Commits

Reviewing files that changed from the base of the PR and between f0a8565 and 523df25.

📒 Files selected for processing (2)
  • test/extended/edge_topologies/tnf_taint.go
  • test/extended/edge_topologies/utils/services/taint.go

Comment thread test/extended/edge_topologies/tnf_taint.go
Comment thread test/extended/edge_topologies/tnf_taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go
Comment thread test/extended/edge_topologies/utils/services/taint.go Outdated
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci-robot

Copy link
Copy Markdown

@vimauro: This PR has been marked as verified by @vimauro on a live TNF cluster with the code changes from those PRs included: https://github.com/openshift/machine-config-operator/pull/6092 and https://github.com/openshift/cluster-etcd-operator/pull/1625.

Details

In response to this:

/verified by @vimauro on a live TNF cluster with the code changes from those PRs included: openshift/machine-config-operator#6092 and openshift/cluster-etcd-operator#1625

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 15, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 15, 2026

Copy link
Copy Markdown

@vimauro: This pull request references OCPEDGE-1829 which is a valid jira issue.

Details

In response to this:

Summary by CodeRabbit

  • Tests
  • Added an end-to-end test suite that validates the full fencing taint lifecycle in a dual-node scenario, including pacemaker fencing alerts, journal verification, disruptive network behavior, and post-recovery taint/annotation cleanup.
  • Bug Fixes
  • Improved taint/untaint lifecycle validation coverage by adding more robust journal and service completion checks across both nodes.
  • Chores
  • Added helper utilities to detect out-of-service taints/annotations and to query node journal/service logs for expected markers.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vimauro

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2026
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@vimauro

vimauro commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@openshift-trt

openshift-trt Bot commented Jun 22, 2026

Copy link
Copy Markdown

Job Failure Risk Analysis for sha: 04bb50e

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-metal-ovn-two-node-arbiter Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation UserDefinedNetwork CRD Controller for primary UDN without required namespace label should not be able to update the namespace and add the UDN label [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 2 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:major] in the last week.
---
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation a user defined primary network created using ClusterUserDefinedNetwork creates a networkStatus Annotation with UDN interface L3 primary UDN [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 2 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:major] in the last week.
---
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation a user defined primary network created using UserDefinedNetwork can perform east/west traffic between nodes two pods connected over a L3 primary UDN [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 2 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:major] in the last week.
---
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation a user defined primary network created using UserDefinedNetwork creates a networkStatus Annotation with UDN interface L3 primary UDN [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 2 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:major] in the last week.
---
Showing 4 of 10 test results

@vimauro

vimauro commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@vimauro: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-two-node-fencing 04bb50e link false /test e2e-metal-ovn-two-node-fencing
ci/prow/e2e-metal-ovn-two-node-fencing-recovery 04bb50e link false /test e2e-metal-ovn-two-node-fencing-recovery
ci/prow/e2e-vsphere-ovn-upi 04bb50e link true /test e2e-vsphere-ovn-upi

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants