OCPBUGS-88490: fix etcd operator deadlock when etcd-endpoints configmap is stale #1631

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from dpateriya:fix/etcd-stale-endpoints-fallback
Jun 21, 2026

@dpateriya dpateriya commented Jun 12, 2026

Contributor

Summary

Fixes OCPBUGS-88490

When the etcd-endpoints configmap contains stale IPs (e.g. after VM migration by vSphere DRS), the etcd client pool cannot reach any member, causing MemberList() to fail with context deadline exceeded. This creates a circular dependency: the operator cannot update the configmap without MemberList, but MemberList fails because the configmap has stale addresses.

Root Cause

After VMs migrate to new ESXi hosts, node IPs may change, but the etcd-endpoints configmap retains the old IPs. EtcdEndpointsController.syncConfigMap() calls MemberList() to discover current members, but the etcd client is initialized from the stale configmap and cannot connect. The controller returns an error and retries indefinitely with the same stale endpoints.

Fix

In EtcdEndpointsController.syncConfigMap(), when MemberList() fails, fall back to discovering control-plane node internal IPs via the node lister and network config (already available in the operator). This populates the configmap with reachable node IPs, allowing the etcd client to reconnect on the next cycle. Once connectivity is restored, MemberList() succeeds and overwrites the configmap with authoritative member data (member ID keys instead of node name keys).

Changes

  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go: Added nodeLister and networkLister fields; added endpointsFromNodeLister() method; modified syncConfigMap() to fall back to node IPs when MemberList fails.
  • pkg/operator/starter.go: Passed controlPlaneNodeLister and networkInformer.Lister() to NewEtcdEndpointsController.
  • pkg/etcdcli/helpers.go: Added WithMemberListError option to FakeEtcdClient for testing.
  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go: Added TestMemberListFallbackToNodeIPs with cases for successful fallback and dual failure.

Test Plan

  • go build ./... compiles cleanly
  • go vet ./... passes
  • All existing TestBootstrapAnnotationRemoval tests pass
  • New TestMemberListFallbackToNodeIPs tests pass
  • Reproduced on real OCP cluster: stale configmap leads to operator deadlock (before), self-healing via node IP fallback (after)

Summary by CodeRabbit

  • Bug Fixes
    • Etcd endpoint configuration now includes a fallback mechanism: when member discovery fails, the system uses control-plane node IP addresses to populate the endpoint configuration instead of failing completely.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

EtcdEndpointsController gains nodeLister and networkLister fields. When MemberList fails during syncConfigMap, the controller now logs a warning and falls back to populating the etcd-endpoints ConfigMap from control-plane node internal IPs via a new endpointsFromNodeLister helper. A WithMemberListError option is added to the fake etcd client for testing, and RunOperator is updated to wire the new listers.

Changes

MemberList Fallback to Node IPs

Layer | File(s) | Summary
Fake etcd client MemberListError option | pkg/etcdcli/helpers.go | Adds memberListError field to FakeClientOptions and WithMemberListError constructor option; fakeEtcdClient.MemberList returns early with the configured error when set.
Controller struct, constructor, syncConfigMap fallback, and endpointsFromNodeLister | pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go | Extends EtcdEndpointsController with nodeLister and networkLister; updates constructor to wire them; changes syncConfigMap to fall back to endpointsFromNodeLister on MemberList failure; adds endpointsFromNodeLister that resolves master node internal IPs from the cluster network.
Operator wiring and fallback tests | pkg/operator/starter.go, pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go | RunOperator passes controlPlaneNodeLister and networkInformer.Lister() to the controller constructor; TestMemberListFallbackToNodeIPs covers successful node-IP fallback and error cases when both member listing and node fallback fail.

Sequence Diagram(s)

sequenceDiagram
  participant syncConfigMap
  participant etcdClient
  participant endpointsFromNodeLister
  participant networkLister
  participant nodeLister

  syncConfigMap->>etcdClient: MemberList()
  alt MemberList succeeds
    etcdClient-->>syncConfigMap: members (filter learners/etcd-bootstrap)
  else MemberList fails
    etcdClient-->>syncConfigMap: error
    syncConfigMap->>syncConfigMap: log warning, clear endpoints
    syncConfigMap->>endpointsFromNodeLister: endpointsFromNodeLister()
    endpointsFromNodeLister->>networkLister: Get("cluster")
    networkLister-->>endpointsFromNodeLister: network config
    endpointsFromNodeLister->>nodeLister: List(node-role.kubernetes.io/master)
    nodeLister-->>endpointsFromNodeLister: control-plane nodes
    endpointsFromNodeLister-->>syncConfigMap: endpoint map keyed by node name
  end
  syncConfigMap->>syncConfigMap: apply updated etcd-endpoints ConfigMap

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility | ⚠️ Warning | The endpointsFromNodeLister() method explicitly filters for only node-role.kubernetes.io/master labeled nodes (line 180), missing arbiter nodes in Two-Node Arbiter (TNA) topology. The test only... | Update line 180 to include both master and arbiter nodes via a multi-selector approach or explicit query for both labels, consistent with TNA topology support (see review comment for suggested patch).
Test Structure And Quality | ❓ Inconclusive | Custom check specifies "Review Ginkgo test code" but the test added (TestMemberListFallbackToNodeIPs) uses standard Go testing.T with testify, not Ginkgo. Check scope unclear. | Clarify whether check applies to standard Go testing.T tests using testify/assert or only Ginkgo tests with Describe/It blocks. If standard Go tests are in scope: test lacks consistent assertion messages (some lack descriptions) and uses...
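The topology warning above asks for arbiter nodes to be considered alongside master nodes. A rough sketch of that selection, using plain structs instead of the Kubernetes API types, might look like the following. The `node` type and `controlPlaneEndpoints` function are hypothetical, and the arbiter label name `node-role.kubernetes.io/arbiter` is an assumption; only the master label appears in this PR.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a Kubernetes Node object.
type node struct {
	Name       string
	Labels     map[string]string
	InternalIP string
}

const (
	masterLabel = "node-role.kubernetes.io/master"
	// arbiterLabel is an assumed label name, used here for illustration only.
	arbiterLabel = "node-role.kubernetes.io/arbiter"
)

// controlPlaneEndpoints returns node name -> internal IP for every node
// carrying either the master or the arbiter role label.
func controlPlaneEndpoints(nodes []node) map[string]string {
	out := map[string]string{}
	for _, n := range nodes {
		_, isMaster := n.Labels[masterLabel]
		_, isArbiter := n.Labels[arbiterLabel]
		if (isMaster || isArbiter) && n.InternalIP != "" {
			out[n.Name] = n.InternalIP
		}
	}
	return out
}

func main() {
	nodes := []node{
		{Name: "master-0", Labels: map[string]string{masterLabel: ""}, InternalIP: "10.0.0.1"},
		{Name: "arbiter-0", Labels: map[string]string{arbiterLabel: ""}, InternalIP: "10.0.0.2"},
		{Name: "worker-0", Labels: map[string]string{"node-role.kubernetes.io/worker": ""}, InternalIP: "10.0.0.3"},
	}
	fmt.Println(controlPlaneEndpoints(nodes))
}
```

Checking label presence rather than label value matches how role labels are usually set, with an empty value.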
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: fixing an etcd operator deadlock by implementing fallback endpoint discovery when the ConfigMap is stale.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR uses standard Go testing framework (*testing.T), not Ginkgo. All test scenario names are static and deterministic with no dynamic identifiers, IP addresses, timestamps, node/pod names, or genera...
Microshift Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests. The new test TestMemberListFallbackToNodeIPs is a standard Go unit test using testing package, not Ginkgo. Custom check scope does not apply.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added. Tests added are standard Go unit tests (TestMemberListFallbackToNodeIPs, Test_IsBootstrapComplete variants) using testing.T, not Ginkgo framework.
Ote Binary Stdout Contract ✅ Passed PR modifies only library code and controller logic—no process-level code (main, init, TestMain, BeforeSuite) that could write to stdout. All klog calls are inside function bodies. OTE binary remain...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. Only standard Go unit tests (TestMemberListFallbackToNodeIPs, Test_IsBootstrapComplete) using testing.T framework were added. Check is not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography, custom crypto implementations, or insecure secret comparisons detected in the modified files (helpers.go, etcdendpointscontroller.go/.go/.test.go, starter.go).
Container-Privileges ✅ Passed No Kubernetes manifests or container configurations were modified; PR contains only Go source code changes without privilege-related settings.
No-Sensitive-Data-In-Logs ✅ Passed No sensitive data (passwords, tokens, keys, PII, session IDs, or customer data) is logged. New logging statements log only error types, endpoint counts, and node names without exposing actual IP ad...

@openshift-ci openshift-ci Bot requested review from benluddy and ironcladlou June 12, 2026 14:33
@dpateriya dpateriya changed the title from "Bug 88490: fix etcd operator deadlock when etcd-endpoints configmap is stale" to "OCPBUGS-88490: fix etcd operator deadlock when etcd-endpoints configmap is stale" Jun 12, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Jun 12, 2026
@openshift-ci-robot


@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fixes a deadlock in the etcd operator that occurs when all IPs in the etcd-endpoints configmap become unreachable (e.g. after VM migration). The operator cannot connect to etcd to update the configmap, and cannot get a working client because the configmap is its only endpoint source.

  • Fix 1 — Pool-level fallback to node-based endpoint discovery: After exhausting 3 retries with configmap-derived endpoints, the client pool creates a new client that dials control-plane node IPs directly via a dedicated newFuncWithEndpoints factory, bypassing the stale configmap entirely.
  • Fix 2 — Resilient IsBootstrapComplete: When the kube-system/bootstrap configmap says complete but MemberList fails, return (true, nil) with a warning instead of blocking recovery controllers.
  • Controller hardening: DefragController, EtcdMembersController, ClusterMemberController, EtcdCertSignerController, and ScriptController are hardened against transient etcd connectivity failures to prevent unnecessary degraded status reporting during recovery.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88490

Test plan

  • go build passes for all modified packages and the operator binary
  • go vet clean on pkg/etcdcli/... and pkg/operator/ceohelpers/...
  • New unit tests added and passing:
  • TestClientGetFallsBackToNodeEndpoints — primary newFunc fails (simulates stale configmap dial timeout), fallback newFuncWithEndpoints creates a working client with node-derived endpoints
  • TestClientGetFallbackAlsoFails — both primary and fallback paths fail, error message confirms fallback was attempted
  • TestClientGetNoFallbackFunc — no fallback configured, confirms existing behavior is preserved
  • Test_IsBootstrapComplete/bootstrap_complete,_etcd_unreachable — configmap says complete + etcd down → returns (true, nil)
  • Test_IsBootstrapComplete/bootstrap_incomplete,_etcd_unreachable — configmap says progressing + etcd down → returns (false, nil)
  • e2e: simulate stale configmap by patching etcd-endpoints to unreachable IPs, confirm operator self-heals

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 12, 2026
@dpateriya

Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 12, 2026
@openshift-ci-robot


@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (3)
pkg/etcdcli/etcdcli_pool.go (1)

237-239: 💤 Low value

Consider documenting the thread-safety contract for SetNewFuncWithEndpoints.

This setter modifies newFuncWithEndpoints without synchronization, while Get() reads it concurrently. The current usage in NewEtcdClient is safe because the setter is called immediately after construction before the pool is used. However, the comment should clarify that this method must only be called during initialization, before any concurrent Get() calls.

📝 Suggested documentation clarification
 // SetNewFuncWithEndpoints sets a client factory that dials a specific set of endpoints,
 // bypassing the default endpoint resolution. This must be set alongside fallbackEndpointsFunc
 // for the fallback path to create clients that connect to the node-derived endpoints directly.
+// This method is not thread-safe and must only be called during initialization, before any
+// concurrent Get() calls.
 func (p *EtcdClientPool) SetNewFuncWithEndpoints(fn func(endpoints []string) (*clientv3.Client, error)) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/etcdcli/etcdcli_pool.go` around lines 237 - 239, Document that
SetNewFuncWithEndpoints is not concurrency-safe and must only be called during
initialization before any concurrent use; update the comment on
SetNewFuncWithEndpoints to state the thread-safety contract, referencing the
field newFuncWithEndpoints and the concurrent reader Get(), and mention that
callers should set it in NewEtcdClient (or before the pool is shared) to avoid
data races.
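If documenting the init-only contract turns out not to be enough, one alternative is to guard the field with a lock. The sketch below shows that option under stated assumptions: the `pool` type, the `clientFactory` signature (returning a string instead of a real etcd client), and the `factory` accessor are all hypothetical, and nothing here reflects how EtcdClientPool is actually implemented.

```go
package main

import (
	"fmt"
	"sync"
)

// clientFactory is a hypothetical factory type; a string stands in for the
// real client so the example stays self-contained.
type clientFactory func(endpoints []string) (string, error)

// pool is a hypothetical pool whose factory field is protected by a mutex.
type pool struct {
	mu                   sync.RWMutex
	newFuncWithEndpoints clientFactory
}

// SetNewFuncWithEndpoints may be called at any time, even while readers run.
func (p *pool) SetNewFuncWithEndpoints(fn clientFactory) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.newFuncWithEndpoints = fn
}

// factory returns the current factory, or nil if none has been set.
func (p *pool) factory() clientFactory {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.newFuncWithEndpoints
}

func main() {
	p := &pool{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Readers tolerate a factory that has not been set yet.
			if f := p.factory(); f != nil {
				_, _ = f([]string{"https://10.0.25.110:2379"})
			}
		}()
	}
	p.SetNewFuncWithEndpoints(func(endpoints []string) (string, error) {
		return fmt.Sprintf("client for %v", endpoints), nil
	})
	wg.Wait()
	fmt.Println("factory set:", p.factory() != nil)
}
```

The lock costs little on a path that already dials the network, but the comment-only fix suggested in the review is equally valid when the setter is only ever called during construction.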
pkg/etcdcli/etcdcli.go (1)

619-629: ⚖️ Poor tradeoff

Consider using sentinel errors instead of string matching for lister sync detection.

The current string-based error detection is fragile—if the error messages in endpoints() or endpointsFromNodes() change, this function will silently fail to match. Since both the error sources and this detector are in the same package, using sentinel errors would be more robust.

However, given the errors are all local to this package and unlikely to change independently, this is acceptable for now.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/etcdcli/etcdcli.go` around lines 619 - 629, Replace fragile string
matching in IsListersNotSynced with sentinel errors: define package-level
variables (e.g., ErrNodeListerNotSynced, ErrConfigMapsListerNotSynced,
ErrNetworkListerNotSynced), return the appropriate sentinel from the error
producers (endpoints, endpointsFromNodes) instead of formatting plain strings,
and update IsListersNotSynced to check via errors.Is against those sentinels (or
compare error values) so detection is robust to message changes.
pkg/operator/scriptcontroller/scriptcontroller.go (1)

79-83: 💤 Low value

Error string matching is acceptable here but could be more precise.

The check strings.Contains(err.Error(), "missing env var values") will match the specific error from line 127, but it would also match any wrapped or aggregated error that happens to contain that substring. Since createScriptConfigMap can aggregate multiple errors (line 119), consider whether other errors could inadvertently match this pattern.

That said, given the simple error generation at line 127 and the transient nature of the condition (env vars not yet populated by listeners), this pattern is reasonable for now.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/scriptcontroller/scriptcontroller.go` around lines 79 - 83,
Replace the fragile substring match of err.Error() with a proper sentinel error
and errors.Is usage: define a package-level sentinel (e.g.
ErrMissingEnvVarValues) where the "missing env var values" error is created (in
createScriptConfigMap/GetEnvVars code path), return that sentinel when env vars
are missing, and here in scriptcontroller.go replace
strings.Contains(err.Error(), "missing env var values") with errors.Is(err,
ErrMissingEnvVarValues) so wrapped/aggregated errors are matched precisely.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2588780d-a2b0-4021-a8d8-884c21b25920

📥 Commits

Reviewing files that changed from the base of the PR and between 2abd78c and a60aad9.

📒 Files selected for processing (11)
  • pkg/etcdcli/etcdcli.go
  • pkg/etcdcli/etcdcli_pool.go
  • pkg/etcdcli/etcdcli_pool_test.go
  • pkg/etcdcli/helpers.go
  • pkg/operator/ceohelpers/bootstrap.go
  • pkg/operator/ceohelpers/bootstrap_test.go
  • pkg/operator/clustermembercontroller/clustermembercontroller.go
  • pkg/operator/defragcontroller/defragcontroller.go
  • pkg/operator/etcdcertsigner/etcdcertsignercontroller.go
  • pkg/operator/etcdmemberscontroller/etcdmemberscontroller.go
  • pkg/operator/scriptcontroller/scriptcontroller.go

@dpateriya dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from a60aad9 to 75bb511 June 12, 2026 14:51
@openshift-ci-robot


@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Added a fallback endpoint discovery and client-creation path to improve etcd connectivity and recovery when primary endpoints fail.

  • Bug Fixes

  • Avoids degrading operators for transient etcd/lister sync issues; skips sync cycles instead of erroring.

  • Treats certain bootstrap, cert, member, defrag and script errors as non-fatal so controllers can recover without unnecessary failure.

@dpateriya

Contributor Author

/retest-required

@dpateriya

Contributor Author

Reproduction Evidence: Before/After on OCP 4.21.9

Environment

  • Platform: AWS (us-east-2)
  • OCP Version: 4.21.9
  • Control Plane Nodes:
    • ip-10-0-25-110.us-east-2.compute.internal (IP: 10.0.25.110)
    • ip-10-0-38-32.us-east-2.compute.internal (IP: 10.0.38.32)
    • ip-10-0-54-109.us-east-2.compute.internal (IP: 10.0.54.109)
  • Jira: OCPBUGS-88490

Scenario

Simulates the real-world failure where VM migration causes all etcd member IPs in the etcd-endpoints configmap to become unreachable. The configmap is patched to replace all 3 real member IPs with unreachable addresses (10.255.255.x), then the etcd-operator pod is restarted to force it to re-read the stale endpoints.


BEFORE Fix (stock operator v4.21.9) — Deadlock

Step 1: Verify healthy baseline

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                             READY   STATUS    RESTARTS   AGE
etcd-ip-10-0-25-110.us-east-2.compute.internal   5/5     Running   30         13d
etcd-ip-10-0-38-32.us-east-2.compute.internal    5/5     Running   30         13d
etcd-ip-10-0-54-109.us-east-2.compute.internal   5/5     Running   30         13d

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
apiVersion: v1
data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110
kind: ConfigMap
metadata:
  name: etcd-endpoints
  namespace: openshift-etcd

All 3 etcd pods healthy, operator Available, configmap contains correct IPs.

Step 2: Backup and inject stale endpoints

$ oc get configmap etcd-endpoints -n openshift-etcd -o json > /tmp/etcd-endpoints-backup.json

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.255.255.1"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.255.255.2"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.255.255.3"}
]'
configmap/etcd-endpoints patched

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.255.255.1",
    "779ef227edbb656e": "10.255.255.2",
    "eccc1bd82f1a278c": "10.255.255.3"
}

All 3 entries now point to unreachable IPs.
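For repeat runs of this reproduction, the JSON Patch body passed to `oc patch --type json` could be generated rather than typed by hand. The sketch below is one way to do that, assuming the member IDs are already known; `patchOp` and `staleEndpointPatch` are hypothetical helpers, and the 10.255.255.x addresses simply mirror the ones used above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// staleEndpointPatch builds "replace" operations that point each member ID
// key under /data at an unreachable 10.255.255.x address. Member IDs are hex
// strings, so no JSON Pointer escaping of "/" or "~" is needed here.
func staleEndpointPatch(memberIDs []string) ([]byte, error) {
	ops := make([]patchOp, 0, len(memberIDs))
	for i, id := range memberIDs {
		ops = append(ops, patchOp{
			Op:    "replace",
			Path:  "/data/" + id,
			Value: fmt.Sprintf("10.255.255.%d", i+1),
		})
	}
	return json.Marshal(ops)
}

func main() {
	patch, err := staleEndpointPatch([]string{
		"53b334f36137e67e",
		"779ef227edbb656e",
		"eccc1bd82f1a278c",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
```

The printed array can be passed as the `-p` argument of the `oc patch` command shown in Step 2.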

Step 3: Restart operator and observe deadlock

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

$ oc logs -n openshift-etcd-operator -l app=etcd-operator --tail=200 2>&1 | \
    grep -E "context deadline|connection refused|giving up" | head -30

W0615 10:14:48.766546  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:14:48.766707  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:14:48.768581  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:15:05.768561  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:15:05.768565  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
(repeats indefinitely across all controllers)

The operator is stuck in an infinite retry loop. Every controller that needs an etcd client fails with context deadline exceeded because it keeps trying the stale configmap IPs. It cannot break out of this cycle on its own.

Step 4: Operator status — entered Progressing state

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 10

Step 5: Manual recovery required

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.0.38.32"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.0.54.109"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.0.25.110"}
]'
configmap/etcd-endpoints patched

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Recovery required multiple revision rolls over several minutes:

$ oc get co etcd   # immediately after restore
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 2 nodes are at revision 8;
  1 node is at revision 10; 0 nodes have achieved new revision 12

$ oc get co etcd   # ~2 minutes later
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 1 node is at revision 8;
  1 node is at revision 10; 1 node is at revision 12

$ oc get co etcd   # ~5 minutes later — finally stable
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion (Before): The stock operator cannot recover from stale etcd-endpoints configmap. Manual intervention is required to restore correct IPs. Recovery triggers a multi-revision rollout cascade (rev 8 → 10 → 12).


AFTER Fix (patched operator) — Self-Heal

Step 1: Verify healthy baseline (post-recovery from Before test)

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                             READY   STATUS    RESTARTS   AGE
etcd-ip-10-0-25-110.us-east-2.compute.internal   5/5     Running   0          3m21s
etcd-ip-10-0-38-32.us-east-2.compute.internal    5/5     Running   0          7m35s
etcd-ip-10-0-54-109.us-east-2.compute.internal   5/5     Running   0          5m40s

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.0.38.32",
    "779ef227edbb656e": "10.0.54.109",
    "eccc1bd82f1a278c": "10.0.25.110"
}

Step 2: Deploy patched operator image

$ oc patch clusterversion version --type merge -p '{
  "spec":{"overrides":[{
    "kind":"Deployment","name":"etcd-operator",
    "namespace":"openshift-etcd-operator","unmanaged":true,"group":"apps/v1"
  }]}}'
clusterversion.config.openshift.io/version patched

$ oc set image deployment/etcd-operator -n openshift-etcd-operator \
    etcd-operator=quay.io/rhn_support_dpateriy/cluster-etcd-operator:stale-fix
deployment.apps/etcd-operator image updated

$ oc get pods -n openshift-etcd-operator -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
etcd-operator-5cc4db9b69-z4jj7   1/1     Running   0          13s   10.128.0.40   ip-10-0-25-110.us-east-2.compute.internal

Patched operator pod running.

Step 3: Inject same stale configmap

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.255.255.1"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.255.255.2"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.255.255.3"}
]'
configmap/etcd-endpoints patched

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.255.255.1",
    "779ef227edbb656e": "10.255.255.2",
    "eccc1bd82f1a278c": "10.255.255.3"
}

Same stale IPs injected as in the Before test.

Step 4: Restart operator and observe self-heal

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator
pod "etcd-operator-5cc4db9b69-z4jj7" deleted

Step 5: Configmap auto-corrected — no manual intervention

After ~30 seconds, the configmap was automatically restored to the correct IPs by the operator's fallback endpoint discovery:

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.0.38.32",
    "779ef227edbb656e": "10.0.54.109",
    "eccc1bd82f1a278c": "10.0.25.110"
}

No oc patch or manual restore was performed. The operator discovered the real etcd member IPs by falling back to control-plane node internal IPs and auto-corrected the configmap.

Step 6: Operator returned to healthy state

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 3 nodes are at revision 12;
  0 nodes have achieved new revision 17

$ oc get co etcd   # ~2 minutes later
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 2 nodes are at revision 12;
  1 node is at revision 18

$ oc get co etcd   # ~4 minutes later — fully stable
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion (After): The patched operator automatically detected that all configmap endpoints were unreachable, fell back to node-based endpoint discovery, connected to etcd via the real control-plane node IPs, and corrected the configmap. Full recovery completed without any manual intervention.


Comparison Summary

| Aspect | Before (stock operator) | After (patched operator) |
|---|---|---|
| Operator behavior with stale configmap | Stuck in infinite context deadline exceeded retry loop | Falls back to node-based endpoint discovery, connects successfully |
| Configmap recovery | Stays stale indefinitely, requires manual oc patch | Auto-corrected to real IPs within ~30 seconds |
| Cluster operator status | Progressing=True, triggers multi-revision rollout cascade (rev 8→10→12) | Returns to Available=True, Progressing=False cleanly |
| Manual intervention required | Yes: correct IPs must be restored manually | None |
| Time to full recovery | ~5+ minutes (after manual fix) | ~4 minutes (fully automatic) |

@dpateriya

Contributor Author

/verified by me

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci-robot


@dpateriya: This PR has been marked as verified by me.

Details

In response to this:

/verified by me

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread pkg/etcdcli/etcdcli.go
@dpateriya dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from 75bb511 to c52400d Compare June 15, 2026 11:45
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Jun 15, 2026
@openshift-ci-robot


@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary

Fixes a deadlock in the etcd operator that occurs when all IPs in the etcd-endpoints configmap become unreachable (e.g. after VM migration). The operator cannot connect to etcd to update the configmap, and cannot get a working client because the configmap is its only endpoint source.

  • Fix 1 — Pool-level fallback to node-based endpoint discovery: After exhausting 3 retries with configmap-derived endpoints, the client pool creates a new client that dials control-plane node IPs directly via a dedicated newFuncWithEndpoints factory, bypassing the stale configmap entirely.
  • Fix 2 — Resilient IsBootstrapComplete: When the kube-system/bootstrap configmap says complete but MemberList fails, return (true, nil) with a warning instead of blocking recovery controllers.
  • Controller hardening: DefragController, EtcdMembersController, ClusterMemberController, EtcdCertSignerController, and ScriptController are hardened against transient etcd connectivity failures to prevent unnecessary degraded status reporting during recovery.
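The pool-level fallback described in Fix 1 can be sketched roughly as follows. This is a simplified illustration, not the operator's code: the `pool` and `client` types, the `get` method, and the helper functions are hypothetical stand-ins, and only the names `newFunc` and `newFuncWithEndpoints` and the three-try limit come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// client stands in for an etcd client; only the endpoints it dialed matter here.
type client struct{ endpoints []string }

// pool is a hypothetical, heavily simplified client pool.
type pool struct {
	maxTries             int
	newFunc              func() (*client, error)                   // dials configmap-derived endpoints
	newFuncWithEndpoints func(endpoints []string) (*client, error) // optional fallback factory
	nodeEndpoints        func() ([]string, error)                  // control-plane node endpoints
}

// get tries the primary factory maxTries times, then falls back to
// dialing node-derived endpoints if a fallback factory is configured.
func (p *pool) get() (*client, error) {
	var lastErr error
	for i := 0; i < p.maxTries; i++ {
		c, err := p.newFunc()
		if err == nil {
			return c, nil
		}
		lastErr = err
	}
	if p.newFuncWithEndpoints == nil || p.nodeEndpoints == nil {
		// No fallback configured: keep the pre-existing behavior.
		return nil, fmt.Errorf("giving up after %d tries: %w", p.maxTries, lastErr)
	}
	eps, err := p.nodeEndpoints()
	if err != nil {
		return nil, fmt.Errorf("fallback endpoint discovery failed: %w", err)
	}
	c, err := p.newFuncWithEndpoints(eps)
	if err != nil {
		return nil, fmt.Errorf("fallback to node endpoints also failed: %w", err)
	}
	return c, nil
}

// stalePrimary simulates a dial timeout against stale configmap endpoints.
func stalePrimary() (*client, error) {
	return nil, errors.New("context deadline exceeded")
}

// dialNodes simulates a successful dial against the given endpoints.
func dialNodes(endpoints []string) (*client, error) {
	if len(endpoints) == 0 {
		return nil, errors.New("no endpoints to dial")
	}
	return &client{endpoints: endpoints}, nil
}

// sampleNodeEndpoints returns illustrative control-plane node endpoints.
func sampleNodeEndpoints() ([]string, error) {
	return []string{
		"https://10.0.38.32:2379",
		"https://10.0.54.109:2379",
		"https://10.0.25.110:2379",
	}, nil
}

func main() {
	p := &pool{
		maxTries:             3,
		newFunc:              stalePrimary,
		newFuncWithEndpoints: dialNodes,
		nodeEndpoints:        sampleNodeEndpoints,
	}
	c, err := p.get()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connected via fallback endpoints:", c.endpoints)
}
```

Leaving `newFuncWithEndpoints` unset keeps the original give-up behavior, which corresponds to the TestClientGetNoFallbackFunc case in the test plan.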

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88490

Test plan

  • go build passes for all modified packages and the operator binary
  • go vet clean on pkg/etcdcli/... and pkg/operator/ceohelpers/...
  • New unit tests added and passing:
  • TestClientGetFallsBackToNodeEndpoints — primary newFunc fails (simulates stale configmap dial timeout), fallback newFuncWithEndpoints creates a working client with node-derived endpoints
  • TestClientGetFallbackAlsoFails — both primary and fallback paths fail, error message confirms fallback was attempted
  • TestClientGetNoFallbackFunc — no fallback configured, confirms existing behavior is preserved
  • Test_IsBootstrapComplete/bootstrap_complete,_etcd_unreachable — configmap says complete + etcd down → returns (true, nil)
  • Test_IsBootstrapComplete/bootstrap_incomplete,_etcd_unreachable — configmap says progressing + etcd down → returns (false, nil)
  • e2e: simulate stale configmap by patching etcd-endpoints to unreachable IPs, confirm operator self-heals
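The two Test_IsBootstrapComplete cases above imply a small decision table, sketched below. The function signature, the status strings, and the check for a member named `etcd-bootstrap` when etcd is reachable are assumptions made for illustration; only the `(true, nil)` and `(false, nil)` results for the unreachable cases come from the test plan.

```go
package main

import (
	"errors"
	"fmt"
)

// isBootstrapComplete is a hypothetical sketch of the resilient check.
// configMapStatus is the status recorded in the kube-system/bootstrap configmap.
func isBootstrapComplete(configMapStatus string, memberList func() ([]string, error)) (bool, error) {
	if configMapStatus != "complete" {
		// Bootstrap still in progress: etcd reachability is irrelevant.
		return false, nil
	}
	members, err := memberList()
	if err != nil {
		// etcd is unreachable, but the configmap says bootstrap finished.
		// Warn and report complete so recovery controllers are not blocked.
		fmt.Printf("warning: MemberList failed, trusting bootstrap configmap: %v\n", err)
		return true, nil
	}
	for _, name := range members {
		if name == "etcd-bootstrap" { // assumed bootstrap member name
			return false, nil
		}
	}
	return true, nil
}

// unreachableMemberList simulates an etcd cluster that cannot be contacted.
func unreachableMemberList() ([]string, error) {
	return nil, errors.New("context deadline exceeded")
}

func main() {
	done, err := isBootstrapComplete("complete", unreachableMemberList)
	fmt.Println(done, err)
	done, err = isBootstrapComplete("progressing", unreachableMemberList)
	fmt.Println(done, err)
}
```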

Made with Cursor

Summary by CodeRabbit

  • New Features

  • Added automatic fallback discovery of etcd endpoints from cluster nodes and improved client recovery when primary endpoints are unavailable.

  • Bug Fixes

  • Made multiple operator sync paths treat temporary etcd/lister sync and connectivity issues as non-fatal, reducing unnecessary degraded states.

  • Continued recovery when etcd is unreachable during bootstrap, and ensured certificate syncing proceeds when operator status can’t be read.

  • Treated “unhealthy cluster” learner add failures and missing script env values as transient.

  • Tests

  • Added integration coverage for fallback/retry behavior in the etcd client pool.

@dpateriya dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch 2 times, most recently from 421b3b4 to 65a1365 Compare June 15, 2026 14:23

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go`:
- Around line 180-183: The nodeLister.List call currently filters only nodes
with the node-role.kubernetes.io/master label, which can result in an incomplete
etcd-endpoints ConfigMap during recovery in arbiter topology. Modify the
labels.Set selector passed to c.nodeLister.List to include both the master label
and the arbiter label for control-plane nodes, ensuring the fallback endpoint
discovery includes arbiter-labeled nodes consistent with how endpoint fallback
is handled elsewhere in the codebase.
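The selection the review comment asks for amounts to "master OR arbiter". A minimal sketch of that filter is shown below using plain structs rather than the Kubernetes client types. The `node` type, the helper names, and the arbiter label key `node-role.kubernetes.io/arbiter` are assumptions for illustration; only the master label key appears in the comment above.

```go
package main

import "fmt"

const (
	masterLabel  = "node-role.kubernetes.io/master"
	arbiterLabel = "node-role.kubernetes.io/arbiter" // assumed label key
)

// node is a hypothetical stand-in for a Kubernetes Node object.
type node struct {
	name   string
	labels map[string]string
}

// controlPlaneNodes keeps nodes carrying either the master or the arbiter label.
func controlPlaneNodes(nodes []node) []node {
	var out []node
	for _, n := range nodes {
		_, isMaster := n.labels[masterLabel]
		_, isArbiter := n.labels[arbiterLabel]
		if isMaster || isArbiter {
			out = append(out, n)
		}
	}
	return out
}

// sampleNodes returns two control-plane nodes and one worker.
func sampleNodes() []node {
	return []node{
		{name: "master-0", labels: map[string]string{masterLabel: ""}},
		{name: "arbiter-0", labels: map[string]string{arbiterLabel: ""}},
		{name: "worker-0", labels: map[string]string{"node-role.kubernetes.io/worker": ""}},
	}
}

func main() {
	for _, n := range controlPlaneNodes(sampleNodes()) {
		fmt.Println(n.name)
	}
}
```

Because an equality-based selector built from a single `labels.Set` combines its entries with AND, one plausible way to get the OR semantics with a lister is to list once per label and merge the results, or to list everything and filter as above.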
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3d6948da-0dbb-4aeb-a7ac-13480a6daf7c

📥 Commits

Reviewing files that changed from the base of the PR and between 421b3b4 and 65a1365.

📒 Files selected for processing (4)
  • pkg/etcdcli/helpers.go
  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go
  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go
  • pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/etcdcli/helpers.go

Comment thread pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go Outdated
@dpateriya dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from 65a1365 to f5156e1 Compare June 15, 2026 14:38
@dpateriya

Contributor Author

Before/After Reproduction Evidence (v2 — EtcdEndpointsController approach)

Tested on OCP 4.21.9 cluster (3 control-plane nodes, AWS).

Before Fix (Stock Operator — Deadlock Confirmed)

Cluster healthy before test:

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Original configmap:

data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110

Injected stale IPs + restarted operator pod:

$ oc patch configmap etcd-endpoints -n openshift-etcd --type merge \
  -p '{"data":{"53b334f36137e67e":"192.0.2.99","779ef227edbb656e":"192.0.2.98","eccc1bd82f1a278c":"192.0.2.97"}}'
$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Result — operator stuck in deadlock, configmap stays stale:

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
data:
  53b334f36137e67e: 192.0.2.99
  779ef227edbb656e: 192.0.2.98
  eccc1bd82f1a278c: 192.0.2.97

$ oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=10
W0615 14:56:28.211537  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again. Err: failed to make etcd client for endpoints [https://192.0.2.97:2379 https://192.0.2.98:2379 https://192.0.2.99:2379]: context deadline exceeded
W0615 14:56:28.610854  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again. Err: failed to make etcd client for endpoints [https://192.0.2.97:2379 https://192.0.2.98:2379 https://192.0.2.99:2379]: context deadline exceeded
...

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 1 node is at revision 18; ...

Conclusion: The stock operator cannot self-heal. Configmap stays stale indefinitely.


After Fix (Patched Operator — Self-Healing Confirmed)

Deployed patched image:

$ oc set image deployment/etcd-operator -n openshift-etcd-operator \
  etcd-operator=quay.io/rhn_support_dpateriy/cluster-etcd-operator:stale-endpoints-fix-v2
$ oc rollout status deployment/etcd-operator -n openshift-etcd-operator
deployment "etcd-operator" successfully rolled out

Injected same stale IPs + restarted operator pod:

$ oc patch configmap etcd-endpoints -n openshift-etcd --type merge \
  -p '{"data":{"53b334f36137e67e":"192.0.2.99","779ef227edbb656e":"192.0.2.98","eccc1bd82f1a278c":"192.0.2.97"}}'
$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Result — operator self-healed via node IP fallback:

$ oc logs -n openshift-etcd-operator deployment/etcd-operator -f | grep -E "MemberList failed|populated etcd-endpoints"
W0615 15:07:39.217472  1 etcdendpointscontroller.go:127] EtcdEndpointsController: MemberList failed (giving up getting a cached client after 3 tries), falling back to control-plane node IPs
I0615 15:07:41.514850  1 etcdendpointscontroller.go:136] EtcdEndpointsController: populated etcd-endpoints configmap from node IPs (3 endpoints)

Configmap auto-corrected with real IPs:

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110

Cluster fully recovered:

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion: The patched EtcdEndpointsController detects MemberList failure, falls back to control-plane node IPs, populates the configmap, and restores full cluster health automatically — no manual intervention required.

@dpateriya

Contributor Author

/verified by me

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2026
@dpateriya

Contributor Author

/verfied by me

@dpateriya

Contributor Author

/verified by me

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 17, 2026
@openshift-ci-robot


@dpateriya: This PR has been marked as verified by me.

@dpateriya

Contributor Author

@tjungblu, once this is merged, should we backport this fix to 4.22, 4.21, and 4.20? The stale configmap deadlock can occur on any version where VM migration changes node IPs, and without this fix it requires manual intervention to recover.

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD f38807a and 2 for PR HEAD ac50724 in total

@dpateriya

Contributor Author

/retest-required

@dpateriya

Contributor Author

/test e2e-gcp-operator-disruptive

@dpateriya

Contributor Author

@tjungblu the CI job ci/prow/e2e-gcp-operator-disruptive is failing because of an invalid value for spec.backendQuotaGiB in the etcd.operator/cluster YAML:

Do we have someone looking into this?

image

@dpateriya

Contributor Author

/test e2e-gcp-operator-disruptive

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD 6dc19bf and 1 for PR HEAD ac50724 in total

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD f4d5750 and 0 for PR HEAD ac50724 in total

@openshift-merge-bot

Contributor

/hold

Revision ac50724 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2026
@dpateriya

Contributor Author

/test e2e-gcp-operator-disruptive

@dpateriya

Contributor Author

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 21, 2026
@dpateriya

Contributor Author

/cherry-pick release-4.22
/cherry-pick release-4.21

@openshift-cherrypick-robot


@dpateriya: once the present PR merges, I will cherry-pick it on top of release-4.21, release-4.22 in new PRs and assign them to you.

Details

In response to this:

/cherry-pick release-4.22
/cherry-pick release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 21, 2026

Contributor

@dpateriya: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit c028077 into openshift:main Jun 21, 2026
17 checks passed
@openshift-ci-robot


@dpateriya: Jira Issue Verification Checks: Jira Issue OCPBUGS-88490
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-88490 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Summary

Fixes OCPBUGS-88490

When the etcd-endpoints configmap contains stale IPs (e.g. after VM migration by vSphere DRS), the etcd client pool cannot reach any member, causing MemberList() to fail with context deadline exceeded. This creates a circular dependency: the operator cannot update the configmap without MemberList, but MemberList fails because the configmap has stale addresses.

Root Cause

After VMs migrate to new ESXi hosts, node IPs may change, but the etcd-endpoints configmap retains the old IPs. EtcdEndpointsController.syncConfigMap() calls MemberList() to discover current members, but the etcd client is initialized from the stale configmap and cannot connect. The controller returns an error and retries indefinitely with the same stale endpoints.

Fix

In EtcdEndpointsController.syncConfigMap(), when MemberList() fails, fall back to discovering control-plane node internal IPs via the node lister and network config (already available in the operator). This populates the configmap with reachable node IPs, allowing the etcd client to reconnect on the next cycle. Once connectivity is restored, MemberList() succeeds and overwrites the configmap with authoritative member data (member ID keys instead of node name keys).
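The control flow of this fix can be sketched as below. It is a minimal illustration under stated assumptions, not the merged implementation: `desiredEndpoints`, the `source` return value, the helper functions, and the node names used as keys are hypothetical. Only the order of operations (try MemberList first, fall back to control-plane node IPs, fail when both fail) follows the description.

```go
package main

import (
	"errors"
	"fmt"
)

// desiredEndpoints returns the data to write into the etcd-endpoints configmap.
// memberList yields member ID -> IP; nodeIPs yields node name -> internal IP.
func desiredEndpoints(memberList, nodeIPs func() (map[string]string, error)) (map[string]string, string, error) {
	eps, err := memberList()
	if err == nil {
		// Authoritative data: keyed by etcd member ID.
		return eps, "members", nil
	}
	fallback, ferr := nodeIPs()
	if ferr != nil {
		return nil, "", fmt.Errorf("MemberList failed (%v) and node fallback failed: %w", err, ferr)
	}
	if len(fallback) == 0 {
		return nil, "", fmt.Errorf("MemberList failed (%v) and no control-plane node IPs were found", err)
	}
	// Temporary data: keyed by node name until MemberList works again.
	return fallback, "nodes", nil
}

// unreachableMemberList simulates a client built from stale endpoints.
func unreachableMemberList() (map[string]string, error) {
	return nil, errors.New("context deadline exceeded")
}

// healthyNodeIPs returns illustrative control-plane node internal IPs.
func healthyNodeIPs() (map[string]string, error) {
	return map[string]string{
		"master-0": "10.0.38.32",
		"master-1": "10.0.54.109",
		"master-2": "10.0.25.110",
	}, nil
}

func main() {
	eps, source, err := desiredEndpoints(unreachableMemberList, healthyNodeIPs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("populating configmap from %s: %d endpoints\n", source, len(eps))
}
```

On the next sync, once the client can reach etcd through the node IPs, the first branch succeeds and the configmap is overwritten with member-ID keys, which matches the recovery sequence described above.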

Changes

  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go: Added nodeLister and networkLister fields; added endpointsFromNodeLister() method; modified syncConfigMap() to fall back to node IPs when MemberList fails.
  • pkg/operator/starter.go: Passed controlPlaneNodeLister and networkInformer.Lister() to NewEtcdEndpointsController.
  • pkg/etcdcli/helpers.go: Added WithMemberListError option to FakeEtcdClient for testing.
  • pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go: Added TestMemberListFallbackToNodeIPs with cases for successful fallback and dual failure.

Test Plan

  • go build ./... compiles cleanly
  • go vet ./... passes
  • All existing TestBootstrapAnnotationRemoval tests pass
  • New TestMemberListFallbackToNodeIPs tests pass
  • Reproduced on real OCP cluster: stale configmap leads to operator deadlock (before), self-healing via node IP fallback (after)

Summary by CodeRabbit

  • Bug Fixes
  • Etcd endpoint configuration now includes a fallback mechanism: when member discovery fails, the system uses control-plane node IP addresses to populate the endpoint configuration instead of failing completely.

@openshift-cherrypick-robot


@dpateriya: new pull request created: #1636

@openshift-cherrypick-robot


@dpateriya: new pull request created: #1637

@openshift-merge-robot

Contributor

Fix included in release 5.0.0-0.nightly-2026-06-21-181448

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria

5 participants