OCPBUGS-88490: fix etcd operator deadlock when etcd-endpoints configmap is stale by dpateriya · Pull Request #1631 · openshift/cluster-etcd-operator

dpateriya · 2026-06-12T14:32:49Z

Summary

When the etcd-endpoints configmap contains stale IPs (e.g. after VM migration by vSphere DRS), the etcd client pool cannot reach any member, causing MemberList() to fail with context deadline exceeded. This creates a circular dependency: the operator cannot update the configmap without MemberList, but MemberList fails because the configmap has stale addresses.

Root Cause

After VMs migrate to new ESXi hosts, node IPs may change, but the etcd-endpoints configmap retains the old IPs. EtcdEndpointsController.syncConfigMap() calls MemberList() to discover current members, but the etcd client is initialized from the stale configmap and cannot connect. The controller returns an error and retries indefinitely with the same stale endpoints.

Fix

In EtcdEndpointsController.syncConfigMap(), when MemberList() fails, fall back to discovering control-plane node internal IPs via the node lister and network config (already available in the operator). This populates the configmap with reachable node IPs, allowing the etcd client to reconnect on the next cycle. Once connectivity is restored, MemberList() succeeds and overwrites the configmap with authoritative member data (member ID keys instead of node name keys).

Changes

pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go: Added nodeLister and networkLister fields; added endpointsFromNodeLister() method; modified syncConfigMap() to fall back to node IPs when MemberList fails.
pkg/operator/starter.go: Passed controlPlaneNodeLister and networkInformer.Lister() to NewEtcdEndpointsController.
pkg/etcdcli/helpers.go: Added WithMemberListError option to FakeEtcdClient for testing.
pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go: Added TestMemberListFallbackToNodeIPs with cases for successful fallback and dual failure.

Test Plan

go build ./... compiles cleanly
go vet ./... passes
All existing TestBootstrapAnnotationRemoval tests pass
New TestMemberListFallbackToNodeIPs tests pass
Reproduced on a real OCP cluster: a stale configmap leads to operator deadlock (before); the operator self-heals via node IP fallback (after)

Summary by CodeRabbit

Bug Fixes
- Etcd endpoint configuration now includes a fallback mechanism: when member discovery fails, the system uses control-plane node IP addresses to populate the endpoint configuration instead of failing completely.

coderabbitai · 2026-06-12T14:33:04Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

EtcdEndpointsController gains nodeLister and networkLister fields. When MemberList fails during syncConfigMap, the controller now logs a warning and falls back to populating the etcd-endpoints ConfigMap from control-plane node internal IPs via a new endpointsFromNodeLister helper. A WithMemberListError option is added to the fake etcd client for testing, and RunOperator is updated to wire the new listers.

Changes

MemberList Fallback to Node IPs

Layer / File(s)	Summary
Fake etcd client MemberListError option `pkg/etcdcli/helpers.go`	Adds `memberListError` field to `FakeClientOptions` and `WithMemberListError` constructor option; `fakeEtcdClient.MemberList` returns early with the configured error when set.
Controller struct, constructor, syncConfigMap fallback, and endpointsFromNodeLister `pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go`	Extends `EtcdEndpointsController` with `nodeLister` and `networkLister`; updates constructor to wire them; changes `syncConfigMap` to fall back to `endpointsFromNodeLister` on `MemberList` failure; adds `endpointsFromNodeLister` that resolves master node internal IPs from the cluster network.
Operator wiring and fallback tests `pkg/operator/starter.go`, `pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go`	`RunOperator` passes `controlPlaneNodeLister` and `networkInformer.Lister()` to the controller constructor; `TestMemberListFallbackToNodeIPs` covers successful node-IP fallback and error cases when both member listing and node fallback fail.

Sequence Diagram(s)

sequenceDiagram
  participant syncConfigMap
  participant etcdClient
  participant endpointsFromNodeLister
  participant networkLister
  participant nodeLister

  syncConfigMap->>etcdClient: MemberList()
  alt MemberList succeeds
    etcdClient-->>syncConfigMap: members (filter learners/etcd-bootstrap)
  else MemberList fails
    etcdClient-->>syncConfigMap: error
    syncConfigMap->>syncConfigMap: log warning, clear endpoints
    syncConfigMap->>endpointsFromNodeLister: endpointsFromNodeLister()
    endpointsFromNodeLister->>networkLister: Get("cluster")
    networkLister-->>endpointsFromNodeLister: network config
    endpointsFromNodeLister->>nodeLister: List(node-role.kubernetes.io/master)
    nodeLister-->>endpointsFromNodeLister: control-plane nodes
    endpointsFromNodeLister-->>syncConfigMap: endpoint map keyed by node name
  end
  syncConfigMap->>syncConfigMap: apply updated etcd-endpoints ConfigMap
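
The node-selection step in the fallback branch can be approximated as below. This is a sketch under stated assumptions: `node`, `nodeAddress`, and `endpointsFromNodeList` are hypothetical simplified types rather than the Kubernetes API types, and the sketch takes the first `InternalIP` address instead of consulting the cluster network config as the diagram's `Get("cluster")` call does.

```go
package main

import (
	"errors"
	"fmt"
)

const masterLabel = "node-role.kubernetes.io/master"

// nodeAddress and node are hypothetical, trimmed-down stand-ins for the
// corresponding Kubernetes API types.
type nodeAddress struct {
	Type    string
	Address string
}

type node struct {
	Name      string
	Labels    map[string]string
	Addresses []nodeAddress
}

// endpointsFromNodeList maps control-plane node names to their first
// InternalIP address. Nodes without the master label are skipped.
func endpointsFromNodeList(nodes []node) (map[string]string, error) {
	out := map[string]string{}
	for _, n := range nodes {
		if _, ok := n.Labels[masterLabel]; !ok {
			continue
		}
		for _, a := range n.Addresses {
			if a.Type == "InternalIP" {
				out[n.Name] = a.Address
				break
			}
		}
	}
	if len(out) == 0 {
		return nil, errors.New("no control-plane node internal IPs found")
	}
	return out, nil
}

func main() {
	nodes := []node{
		{Name: "master-0", Labels: map[string]string{masterLabel: ""},
			Addresses: []nodeAddress{{Type: "InternalIP", Address: "10.0.25.110"}}},
		{Name: "worker-0", Labels: map[string]string{},
			Addresses: []nodeAddress{{Type: "InternalIP", Address: "10.0.70.5"}}},
	}
	eps, err := endpointsFromNodeList(nodes)
	fmt.Println(eps, err)
}
```

A label filter of this shape selects only master-labelled nodes, so nodes carrying a different control-plane role label would be skipped.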

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Topology-Aware Scheduling Compatibility	⚠️ Warning	The `endpointsFromNodeLister()` method explicitly filters for only `node-role.kubernetes.io/master` labeled nodes (line 180), missing arbiter nodes in Two-Node Arbiter (TNA) topology. The test only...	Update line 180 to include both master and arbiter nodes via a multi-selector approach or explicit query for both labels, consistent with TNA topology support (see review comment for suggested patch).
Test Structure And Quality	❓ Inconclusive	Custom check specifies "Review Ginkgo test code" but the test added (TestMemberListFallbackToNodeIPs) uses standard Go testing.T with testify, not Ginkgo. Check scope unclear.	Clarify whether check applies to standard Go testing.T tests using testify/assert or only Ginkgo tests with Describe/It blocks. If standard Go tests are in scope: test lacks consistent assertion messages (some lack descriptions) and uses...

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: fixing an etcd operator deadlock by implementing fallback endpoint discovery when the ConfigMap is stale.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR uses standard Go testing framework (*testing.T), not Ginkgo. All test scenario names are static and deterministic with no dynamic identifiers, IP addresses, timestamps, node/pod names, or genera...
Microshift Test Compatibility	✅ Passed	PR adds no Ginkgo e2e tests. The new test TestMemberListFallbackToNodeIPs is a standard Go unit test using testing package, not Ginkgo. Custom check scope does not apply.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No new Ginkgo e2e tests were added. Tests added are standard Go unit tests (TestMemberListFallbackToNodeIPs, Test_IsBootstrapComplete variants) using testing.T, not Ginkgo framework.
Ote Binary Stdout Contract	✅ Passed	PR modifies only library code and controller logic—no process-level code (main, init, TestMain, BeforeSuite) that could write to stdout. All klog calls are inside function bodies. OTE binary remain...
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests were added in this PR. Only standard Go unit tests (TestMemberListFallbackToNodeIPs, Test_IsBootstrapComplete) using testing.T framework were added. Check is not applicable.
No-Weak-Crypto	✅ Passed	No weak cryptography, custom crypto implementations, or insecure secret comparisons detected in the modified files (helpers.go, etcdendpointscontroller.go/.go/.test.go, starter.go).
Container-Privileges	✅ Passed	No Kubernetes manifests or container configurations were modified; PR contains only Go source code changes without privilege-related settings.
No-Sensitive-Data-In-Logs	✅ Passed	No sensitive data (passwords, tokens, keys, PII, session IDs, or customer data) is logged. New logging statements log only error types, endpoint counts, and node names without exposing actual IP ad...

openshift-ci-robot · 2026-06-12T14:34:03Z

@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fixes a deadlock in the etcd operator that occurs when all IPs in the etcd-endpoints configmap become unreachable (e.g. after VM migration). The operator cannot connect to etcd to update the configmap, and cannot get a working client because the configmap is its only endpoint source.

Fix 1 — Pool-level fallback to node-based endpoint discovery: After exhausting 3 retries with configmap-derived endpoints, the client pool creates a new client that dials control-plane node IPs directly via a dedicated newFuncWithEndpoints factory, bypassing the stale configmap entirely.

Fix 2 — Resilient IsBootstrapComplete: When the kube-system/bootstrap configmap says complete but MemberList fails, return (true, nil) with a warning instead of blocking recovery controllers.

Controller hardening: DefragController, EtcdMembersController, ClusterMemberController, EtcdCertSignerController, and ScriptController are hardened against transient etcd connectivity failures to prevent unnecessary degraded status reporting during recovery.

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88490

Test plan

go build passes for all modified packages and the operator binary

go vet clean on pkg/etcdcli/... and pkg/operator/ceohelpers/...

New unit tests added and passing:

TestClientGetFallsBackToNodeEndpoints — primary newFunc fails (simulates stale configmap dial timeout), fallback newFuncWithEndpoints creates a working client with node-derived endpoints

TestClientGetFallbackAlsoFails — both primary and fallback paths fail, error message confirms fallback was attempted

TestClientGetNoFallbackFunc — no fallback configured, confirms existing behavior is preserved

Test_IsBootstrapComplete/bootstrap_complete,_etcd_unreachable — configmap says complete + etcd down → returns (true, nil)

Test_IsBootstrapComplete/bootstrap_incomplete,_etcd_unreachable — configmap says progressing + etcd down → returns (false, nil)

e2e: simulate stale configmap by patching etcd-endpoints to unreachable IPs, confirm operator self-heals

Made with Cursor

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

dpateriya · 2026-06-12T14:40:52Z

/jira refresh

openshift-ci-robot · 2026-06-12T14:40:59Z

@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh

coderabbitai

🧹 Nitpick comments (3)

pkg/etcdcli/etcdcli_pool.go (1)
237-239: 💤 Low value

Consider documenting the thread-safety contract for SetNewFuncWithEndpoints.

This setter modifies newFuncWithEndpoints without synchronization, while Get() reads it concurrently. The current usage in NewEtcdClient is safe because the setter is called immediately after construction before the pool is used. However, the comment should clarify that this method must only be called during initialization, before any concurrent Get() calls.
📝 Suggested documentation clarification
 // SetNewFuncWithEndpoints sets a client factory that dials a specific set of endpoints,
 // bypassing the default endpoint resolution. This must be set alongside fallbackEndpointsFunc
 // for the fallback path to create clients that connect to the node-derived endpoints directly.
+// This method is not thread-safe and must only be called during initialization, before any
+// concurrent Get() calls.
 func (p *EtcdClientPool) SetNewFuncWithEndpoints(fn func(endpoints []string) (*clientv3.Client, error)) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/etcdcli/etcdcli_pool.go` around lines 237 - 239, Document that
SetNewFuncWithEndpoints is not concurrency-safe and must only be called during
initialization before any concurrent use; update the comment on
SetNewFuncWithEndpoints to state the thread-safety contract, referencing the
field newFuncWithEndpoints and the concurrent reader Get(), and mention that
callers should set it in NewEtcdClient (or before the pool is shared) to avoid
data races.
pkg/etcdcli/etcdcli.go (1)
619-629: ⚖️ Poor tradeoff

Consider using sentinel errors instead of string matching for lister sync detection.

The current string-based error detection is fragile—if the error messages in endpoints() or endpointsFromNodes() change, this function will silently fail to match. Since both the error sources and this detector are in the same package, using sentinel errors would be more robust.

However, given the errors are all local to this package and unlikely to change independently, this is acceptable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/etcdcli/etcdcli.go` around lines 619 - 629, Replace fragile string
matching in IsListersNotSynced with sentinel errors: define package-level
variables (e.g., ErrNodeListerNotSynced, ErrConfigMapsListerNotSynced,
ErrNetworkListerNotSynced), return the appropriate sentinel from the error
producers (endpoints, endpointsFromNodes) instead of formatting plain strings,
and update IsListersNotSynced to check via errors.Is against those sentinels (or
compare error values) so detection is robust to message changes.
pkg/operator/scriptcontroller/scriptcontroller.go (1)
79-83: 💤 Low value

Error string matching is acceptable here but could be more precise.

The check strings.Contains(err.Error(), "missing env var values") will match the specific error from line 127, but it would also match any wrapped or aggregated error that happens to contain that substring. Since createScriptConfigMap can aggregate multiple errors (line 119), consider whether other errors could inadvertently match this pattern.

That said, given the simple error generation at line 127 and the transient nature of the condition (env vars not yet populated by listeners), this pattern is reasonable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/scriptcontroller/scriptcontroller.go` around lines 79 - 83,
Replace the fragile substring match of err.Error() with a proper sentinel error
and errors.Is usage: define a package-level sentinel (e.g.
ErrMissingEnvVarValues) where the "missing env var values" error is created (in
createScriptConfigMap/GetEnvVars code path), return that sentinel when env vars
are missing, and here in scriptcontroller.go replace
strings.Contains(err.Error(), "missing env var values") with errors.Is(err,
ErrMissingEnvVarValues) so wrapped/aggregated errors are matched precisely.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2588780d-a2b0-4021-a8d8-884c21b25920

📥 Commits

Reviewing files that changed from the base of the PR and between 2abd78c and a60aad9.

📒 Files selected for processing (11)

pkg/etcdcli/etcdcli.go
pkg/etcdcli/etcdcli_pool.go
pkg/etcdcli/etcdcli_pool_test.go
pkg/etcdcli/helpers.go
pkg/operator/ceohelpers/bootstrap.go
pkg/operator/ceohelpers/bootstrap_test.go
pkg/operator/clustermembercontroller/clustermembercontroller.go
pkg/operator/defragcontroller/defragcontroller.go
pkg/operator/etcdcertsigner/etcdcertsignercontroller.go
pkg/operator/etcdmemberscontroller/etcdmemberscontroller.go
pkg/operator/scriptcontroller/scriptcontroller.go

openshift-ci-robot · 2026-06-12T14:53:24Z

@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary by CodeRabbit

New Features

Added a fallback endpoint discovery and client-creation path to improve etcd connectivity and recovery when primary endpoints fail.

Bug Fixes

Avoids degrading operators for transient etcd/lister sync issues; skips sync cycles instead of erroring.

Treats certain bootstrap, cert, member, defrag and script errors as non-fatal so controllers can recover without unnecessary failure.

dpateriya · 2026-06-15T08:52:50Z

/retest-required

dpateriya · 2026-06-15T11:11:26Z

Reproduction Evidence: Before/After on OCP 4.21.9

Environment

Platform: AWS (us-east-2)
OCP Version: 4.21.9
Control Plane Nodes:
- ip-10-0-25-110.us-east-2.compute.internal (IP: 10.0.25.110)
- ip-10-0-38-32.us-east-2.compute.internal (IP: 10.0.38.32)
- ip-10-0-54-109.us-east-2.compute.internal (IP: 10.0.54.109)
Jira: OCPBUGS-88490

Scenario

Simulates the real-world failure where VM migration causes all etcd member IPs in the etcd-endpoints configmap to become unreachable. The configmap is patched to replace all 3 real member IPs with unreachable addresses (10.255.255.x), then the etcd-operator pod is restarted to force it to re-read the stale endpoints.

BEFORE Fix (stock operator v4.21.9) — Deadlock

Step 1: Verify healthy baseline

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                             READY   STATUS    RESTARTS   AGE
etcd-ip-10-0-25-110.us-east-2.compute.internal   5/5     Running   30         13d
etcd-ip-10-0-38-32.us-east-2.compute.internal    5/5     Running   30         13d
etcd-ip-10-0-54-109.us-east-2.compute.internal   5/5     Running   30         13d

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
apiVersion: v1
data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110
kind: ConfigMap
metadata:
  name: etcd-endpoints
  namespace: openshift-etcd

All 3 etcd pods healthy, operator Available, configmap contains correct IPs.

Step 2: Backup and inject stale endpoints

$ oc get configmap etcd-endpoints -n openshift-etcd -o json > /tmp/etcd-endpoints-backup.json

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.255.255.1"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.255.255.2"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.255.255.3"}
]'
configmap/etcd-endpoints patched

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.255.255.1",
    "779ef227edbb656e": "10.255.255.2",
    "eccc1bd82f1a278c": "10.255.255.3"
}

All 3 entries now point to unreachable IPs.

Step 3: Restart operator and observe deadlock

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

$ oc logs -n openshift-etcd-operator -l app=etcd-operator --tail=200 2>&1 | \
    grep -E "context deadline|connection refused|giving up" | head -30

W0615 10:14:48.766546  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:14:48.766707  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:14:48.768581  1 etcdcli_pool.go:73] could not create a new cached client after 0 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:15:05.768561  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
W0615 10:15:05.768565  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again.
  Err: failed to make etcd client for endpoints [https://10.255.255.1:2379 https://10.255.255.2:2379
  https://10.255.255.3:2379]: context deadline exceeded
(repeats indefinitely across all controllers)

The operator is stuck in an infinite retry loop. Every controller that needs an etcd client fails with context deadline exceeded because it keeps trying the stale configmap IPs. It cannot break out of this cycle on its own.

Step 4: Operator status — entered Progressing state

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 3 nodes are at revision 8; 0 nodes have achieved new revision 10

Step 5: Manual recovery required

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.0.38.32"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.0.54.109"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.0.25.110"}
]'
configmap/etcd-endpoints patched

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Recovery required multiple revision rollouts over several minutes:

$ oc get co etcd   # immediately after restore
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 2 nodes are at revision 8;
  1 node is at revision 10; 0 nodes have achieved new revision 12

$ oc get co etcd   # ~2 minutes later
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 1 node is at revision 8;
  1 node is at revision 10; 1 node is at revision 12

$ oc get co etcd   # ~5 minutes later — finally stable
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion (Before): The stock operator cannot recover from a stale etcd-endpoints configmap. Manual intervention is required to restore the correct IPs. Recovery triggers a multi-revision rollout cascade (rev 8 → 10 → 12).

AFTER Fix (patched operator) — Self-Heal

Step 1: Verify healthy baseline (post-recovery from Before test)

$ oc get pods -n openshift-etcd -l app=etcd
NAME                                             READY   STATUS    RESTARTS   AGE
etcd-ip-10-0-25-110.us-east-2.compute.internal   5/5     Running   0          3m21s
etcd-ip-10-0-38-32.us-east-2.compute.internal    5/5     Running   0          7m35s
etcd-ip-10-0-54-109.us-east-2.compute.internal   5/5     Running   0          5m40s

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.0.38.32",
    "779ef227edbb656e": "10.0.54.109",
    "eccc1bd82f1a278c": "10.0.25.110"
}

Step 2: Deploy patched operator image

$ oc patch clusterversion version --type merge -p '{
  "spec":{"overrides":[{
    "kind":"Deployment","name":"etcd-operator",
    "namespace":"openshift-etcd-operator","unmanaged":true,"group":"apps/v1"
  }]}}'
clusterversion.config.openshift.io/version patched

$ oc set image deployment/etcd-operator -n openshift-etcd-operator \
    etcd-operator=quay.io/rhn_support_dpateriy/cluster-etcd-operator:stale-fix
deployment.apps/etcd-operator image updated

$ oc get pods -n openshift-etcd-operator -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
etcd-operator-5cc4db9b69-z4jj7   1/1     Running   0          13s   10.128.0.40
  ip-10-0-25-110.us-east-2.compute.internal

Patched operator pod running.

Step 3: Inject same stale configmap

$ oc patch configmap etcd-endpoints -n openshift-etcd --type json -p '[
  {"op":"replace","path":"/data/53b334f36137e67e","value":"10.255.255.1"},
  {"op":"replace","path":"/data/779ef227edbb656e","value":"10.255.255.2"},
  {"op":"replace","path":"/data/eccc1bd82f1a278c","value":"10.255.255.3"}
]'
configmap/etcd-endpoints patched

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.255.255.1",
    "779ef227edbb656e": "10.255.255.2",
    "eccc1bd82f1a278c": "10.255.255.3"
}

Same stale IPs injected as in the Before test.

Step 4: Restart operator and observe self-heal

$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator
pod "etcd-operator-5cc4db9b69-z4jj7" deleted

Step 5: Configmap auto-corrected — no manual intervention

After ~30 seconds, the configmap was automatically restored to the correct IPs by the operator's fallback endpoint discovery:

$ oc get configmap etcd-endpoints -n openshift-etcd -o jsonpath='{.data}' | python3 -m json.tool
{
    "53b334f36137e67e": "10.0.38.32",
    "779ef227edbb656e": "10.0.54.109",
    "eccc1bd82f1a278c": "10.0.25.110"
}

No oc patch or manual restore was performed. The operator discovered the real etcd member IPs by falling back to control-plane node internal IPs and auto-corrected the configmap.

Step 6: Operator returned to healthy state

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 3 nodes are at revision 12; 0 nodes have achieved new revision 17

$ oc get co etcd   # ~2 minutes later
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 2 nodes are at revision 12; 1 node is at revision 18

$ oc get co etcd   # ~4 minutes later — fully stable
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion (After): The patched operator automatically detected that all configmap endpoints were unreachable, fell back to node-based endpoint discovery, connected to etcd via the real control-plane node IPs, and corrected the configmap. Full recovery completed without any manual intervention.

Comparison Summary

| Aspect | Before (stock operator) | After (patched operator) |
| --- | --- | --- |
| Operator behavior with stale configmap | Stuck in infinite `context deadline exceeded` retry loop | Falls back to node-based endpoint discovery, connects successfully |
| Configmap recovery | Stays stale indefinitely, requires manual `oc patch` | Auto-corrected to real IPs within ~30 seconds |
| Cluster operator status | `Progressing=True`, triggers multi-revision rollout cascade (rev 8→10→12) | Returns to `Available=True, Progressing=False` cleanly |
| Manual intervention required | Yes: correct IPs must be restored manually | None |
| Time to full recovery | ~5+ minutes (after manual fix) | ~4 minutes (fully automatic) |

dpateriya · 2026-06-15T11:12:25Z

/verified by me

openshift-ci-robot · 2026-06-15T11:12:36Z

@dpateriya: This PR has been marked as verified by me.

Details

In response to this:

/verified by me

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-06-15T11:47:07Z

@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

Fixes a deadlock in the etcd operator that occurs when all IPs in the etcd-endpoints configmap become unreachable (e.g. after VM migration). The operator cannot connect to etcd to update the configmap, and cannot get a working client because the configmap is its only endpoint source.

Fix 1 — Pool-level fallback to node-based endpoint discovery: After exhausting 3 retries with configmap-derived endpoints, the client pool creates a new client that dials control-plane node IPs directly via a dedicated newFuncWithEndpoints factory, bypassing the stale configmap entirely.

Fix 2 — Resilient IsBootstrapComplete: When the kube-system/bootstrap configmap says complete but MemberList fails, return (true, nil) with a warning instead of blocking recovery controllers.

Controller hardening: DefragController, EtcdMembersController, ClusterMemberController, EtcdCertSignerController, and ScriptController are hardened against transient etcd connectivity failures to prevent unnecessary degraded status reporting during recovery.
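
The behavior described under Fix 2 can be sketched roughly as follows. This is a minimal illustration, not the operator's code: the `memberLister` interface, the `fakeEtcd` type, the `"complete"` status string, and the `etcd-bootstrap` member name are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// memberLister is a hypothetical stand-in for the operator's etcd client.
type memberLister interface {
	MemberList() ([]string, error)
}

// fakeEtcd returns canned member names or a canned error.
type fakeEtcd struct {
	members []string
	err     error
}

func (f fakeEtcd) MemberList() ([]string, error) { return f.members, f.err }

// errUnreachable simulates etcd being unreachable behind stale endpoints.
var errUnreachable = errors.New("context deadline exceeded")

// isBootstrapComplete takes the status read from the kube-system/bootstrap
// configmap. The "complete" value and the "etcd-bootstrap" member name are
// assumptions made for this sketch.
func isBootstrapComplete(configMapStatus string, etcd memberLister) (bool, error) {
	if configMapStatus != "complete" {
		return false, nil
	}
	members, err := etcd.MemberList()
	if err != nil {
		// Trust the configmap and warn instead of blocking recovery controllers.
		fmt.Printf("warning: bootstrap configmap says complete but MemberList failed: %v\n", err)
		return true, nil
	}
	for _, name := range members {
		if name == "etcd-bootstrap" {
			return false, nil // bootstrap member has not been removed yet
		}
	}
	return true, nil
}

func main() {
	down := fakeEtcd{err: errUnreachable}
	done, err := isBootstrapComplete("complete", down)
	fmt.Println(done, err)
	done, err = isBootstrapComplete("progressing", down)
	fmt.Println(done, err)
}
```

The two calls in `main` mirror the `Test_IsBootstrapComplete` cases listed in the test plan: configmap complete with etcd unreachable, and configmap progressing with etcd unreachable.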

Bug: https://redhat.atlassian.net/browse/OCPBUGS-88490

Test plan

go build passes for all modified packages and the operator binary

go vet clean on pkg/etcdcli/... and pkg/operator/ceohelpers/...

New unit tests added and passing:

TestClientGetFallsBackToNodeEndpoints — primary newFunc fails (simulates stale configmap dial timeout), fallback newFuncWithEndpoints creates a working client with node-derived endpoints

TestClientGetFallbackAlsoFails — both primary and fallback paths fail, error message confirms fallback was attempted

TestClientGetNoFallbackFunc — no fallback configured, confirms existing behavior is preserved

Test_IsBootstrapComplete/bootstrap_complete,_etcd_unreachable — configmap says complete + etcd down → returns (true, nil)

Test_IsBootstrapComplete/bootstrap_incomplete,_etcd_unreachable — configmap says progressing + etcd down → returns (false, nil)

e2e: simulate stale configmap by patching etcd-endpoints to unreachable IPs, confirm operator self-heals
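
The pool-level fallback that the three `TestClientGet*` cases above exercise could look roughly like this. It is a sketch under stated assumptions: `client`, `newClientFunc`, `getClient`, and the helper factories are hypothetical names, and only the three-try limit and the two-factory idea come from this version of the description.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a hypothetical placeholder for an etcd client handle.
type client struct{ endpoints []string }

// newClientFunc builds a client; the primary one dials configmap endpoints,
// the fallback one dials control-plane node IPs.
type newClientFunc func() (*client, error)

const maxTries = 3

// getClient tries the primary factory up to maxTries times, then the
// fallback factory once if one is configured.
func getClient(primary, fallback newClientFunc) (*client, error) {
	var lastErr error
	for i := 0; i < maxTries; i++ {
		c, err := primary()
		if err == nil {
			return c, nil
		}
		lastErr = err
	}
	if fallback == nil {
		// No fallback configured: keep the pre-existing behavior.
		return nil, fmt.Errorf("giving up getting a cached client after %d tries: %w", maxTries, lastErr)
	}
	c, err := fallback()
	if err != nil {
		return nil, fmt.Errorf("fallback to node endpoints also failed: %w (primary error: %v)", err, lastErr)
	}
	return c, nil
}

// stalePrimary simulates a dial timeout against stale configmap endpoints.
func stalePrimary() (*client, error) {
	return nil, errors.New("context deadline exceeded")
}

// nodeFallback simulates a client built from a reachable node IP.
func nodeFallback() (*client, error) {
	return &client{endpoints: []string{"https://10.0.38.32:2379"}}, nil
}

// brokenFallback simulates the fallback path failing as well.
func brokenFallback() (*client, error) {
	return nil, errors.New("no control-plane nodes found")
}

func main() {
	c, err := getClient(stalePrimary, nodeFallback)
	if err == nil {
		fmt.Println("connected via", c.endpoints)
	}
	_, err = getClient(stalePrimary, brokenFallback)
	fmt.Println(err)
}
```

Passing `nil` as the fallback reproduces the no-fallback case, where the caller sees the same "giving up" error as before.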

Made with Cursor

Summary by CodeRabbit

New Features

Added automatic fallback discovery of etcd endpoints from cluster nodes and improved client recovery when primary endpoints are unavailable.

Bug Fixes

Made multiple operator sync paths treat temporary etcd/lister sync and connectivity issues as non-fatal, reducing unnecessary degraded states.

Continued recovery when etcd is unreachable during bootstrap, and ensured certificate syncing proceeds when operator status can’t be read.

Treated “unhealthy cluster” learner add failures and missing script env values as transient.

Tests

Added integration coverage for fallback/retry behavior in the etcd client pool.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go`:
- Around line 180-183: The nodeLister.List call currently filters only nodes
with the node-role.kubernetes.io/master label, which can result in an incomplete
etcd-endpoints ConfigMap during recovery in arbiter topology. Modify the
labels.Set selector passed to c.nodeLister.List to include both the master label
and the arbiter label for control-plane nodes, ensuring the fallback endpoint
discovery includes arbiter-labeled nodes consistent with how endpoint fallback
is handled elsewhere in the codebase.
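
One way to read this suggestion: a single `labels.Set` selector requires every listed label to be present, so putting both labels in one set would match only nodes that carry both. Selecting the union therefore needs either two list calls or a filter like the sketch below. The `node` type and `controlPlaneNodes` helper are hypothetical, and the arbiter label key is an assumption, since the comment does not spell it out.

```go
package main

import "fmt"

const (
	masterLabel = "node-role.kubernetes.io/master"
	// arbiterLabel is an assumed key; the review comment does not name it.
	arbiterLabel = "node-role.kubernetes.io/arbiter"
)

// node is a hypothetical, trimmed-down stand-in for a Kubernetes Node object.
type node struct {
	name   string
	labels map[string]string
}

// controlPlaneNodes keeps nodes that carry the master label OR the arbiter label.
func controlPlaneNodes(nodes []node) []node {
	var out []node
	for _, n := range nodes {
		_, isMaster := n.labels[masterLabel]
		_, isArbiter := n.labels[arbiterLabel]
		if isMaster || isArbiter {
			out = append(out, n)
		}
	}
	return out
}

// sampleNodes returns one master, one arbiter, and one worker.
func sampleNodes() []node {
	return []node{
		{name: "master-0", labels: map[string]string{masterLabel: ""}},
		{name: "arbiter-0", labels: map[string]string{arbiterLabel: ""}},
		{name: "worker-0", labels: map[string]string{"node-role.kubernetes.io/worker": ""}},
	}
}

func main() {
	for _, n := range controlPlaneNodes(sampleNodes()) {
		fmt.Println(n.name)
	}
}
```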

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3d6948da-0dbb-4aeb-a7ac-13480a6daf7c

📥 Commits

Reviewing files that changed from the base of the PR and between 421b3b4 and 65a1365.

📒 Files selected for processing (4)

pkg/etcdcli/helpers.go
pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go
pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go
pkg/operator/starter.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/etcdcli/helpers.go

dpateriya · 2026-06-15T15:17:59Z

Before/After Reproduction Evidence (v2 — EtcdEndpointsController approach)

Tested on OCP 4.21.9 cluster (3 control-plane nodes, AWS).

Before Fix (Stock Operator — Deadlock Confirmed)

Cluster healthy before test:

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Original configmap:

data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110

Injected stale IPs + restarted operator pod:

$ oc patch configmap etcd-endpoints -n openshift-etcd --type merge \
  -p '{"data":{"53b334f36137e67e":"192.0.2.99","779ef227edbb656e":"192.0.2.98","eccc1bd82f1a278c":"192.0.2.97"}}'
$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Result — operator stuck in deadlock, configmap stays stale:

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
data:
  53b334f36137e67e: 192.0.2.99
  779ef227edbb656e: 192.0.2.98
  eccc1bd82f1a278c: 192.0.2.97

$ oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=10
W0615 14:56:28.211537  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again. Err: failed to make etcd client for endpoints [https://192.0.2.97:2379 https://192.0.2.98:2379 https://192.0.2.99:2379]: context deadline exceeded
W0615 14:56:28.610854  1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again. Err: failed to make etcd client for endpoints [https://192.0.2.97:2379 https://192.0.2.98:2379 https://192.0.2.99:2379]: context deadline exceeded
...

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        True          False      13d     NodeInstallerProgressing: 1 node is at revision 18; ...

Conclusion: The stock operator cannot self-heal. Configmap stays stale indefinitely.

After Fix (Patched Operator — Self-Healing Confirmed)

Deployed patched image:

$ oc set image deployment/etcd-operator -n openshift-etcd-operator \
  etcd-operator=quay.io/rhn_support_dpateriy/cluster-etcd-operator:stale-endpoints-fix-v2
$ oc rollout status deployment/etcd-operator -n openshift-etcd-operator
deployment "etcd-operator" successfully rolled out

Injected same stale IPs + restarted operator pod:

$ oc patch configmap etcd-endpoints -n openshift-etcd --type merge \
  -p '{"data":{"53b334f36137e67e":"192.0.2.99","779ef227edbb656e":"192.0.2.98","eccc1bd82f1a278c":"192.0.2.97"}}'
$ oc delete pod -n openshift-etcd-operator -l app=etcd-operator

Result — operator self-healed via node IP fallback:

$ oc logs -n openshift-etcd-operator deployment/etcd-operator -f | grep -E "MemberList failed|populated etcd-endpoints"
W0615 15:07:39.217472  1 etcdendpointscontroller.go:127] EtcdEndpointsController: MemberList failed (giving up getting a cached client after 3 tries), falling back to control-plane node IPs
I0615 15:07:41.514850  1 etcdendpointscontroller.go:136] EtcdEndpointsController: populated etcd-endpoints configmap from node IPs (3 endpoints)

Configmap auto-corrected with real IPs:

$ oc get configmap etcd-endpoints -n openshift-etcd -o yaml
data:
  53b334f36137e67e: 10.0.38.32
  779ef227edbb656e: 10.0.54.109
  eccc1bd82f1a278c: 10.0.25.110

Cluster fully recovered:

$ oc get co etcd
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
etcd   4.21.9    True        False         False      13d

Conclusion: The patched EtcdEndpointsController detects MemberList failure, falls back to control-plane node IPs, populates the configmap, and restores full cluster health automatically — no manual intervention required.
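
The fallback path confirmed above can be sketched as a small decision function. This is an illustration only, not the code in `etcdendpointscontroller.go`: `member`, `cpNode`, `desiredEndpoints`, and the sample node names are hypothetical, while the key scheme (member IDs on the normal path, node names on the fallback path) follows the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// member and cpNode are hypothetical, trimmed-down stand-ins for an etcd
// member and a control-plane node.
type member struct{ id, ip string }
type cpNode struct{ name, internalIP string }

// desiredEndpoints returns the data for the etcd-endpoints configmap.
// On the normal path the keys are member IDs; on the fallback path they
// are node names, to be overwritten once MemberList works again.
func desiredEndpoints(
	memberList func() ([]member, error),
	listNodes func() ([]cpNode, error),
) (map[string]string, error) {
	members, err := memberList()
	if err == nil {
		out := make(map[string]string, len(members))
		for _, m := range members {
			out[m.id] = m.ip
		}
		return out, nil
	}
	// MemberList failed: fall back to control-plane node internal IPs.
	nodes, nodeErr := listNodes()
	if nodeErr != nil {
		return nil, fmt.Errorf("MemberList failed (%v) and node fallback failed: %w", err, nodeErr)
	}
	if len(nodes) == 0 {
		return nil, fmt.Errorf("MemberList failed (%v) and no control-plane nodes were found", err)
	}
	out := make(map[string]string, len(nodes))
	for _, n := range nodes {
		out[n.name] = n.internalIP
	}
	return out, nil
}

func failingMemberList() ([]member, error) {
	return nil, errors.New("context deadline exceeded")
}

func workingMemberList() ([]member, error) {
	return []member{{id: "53b334f36137e67e", ip: "10.0.38.32"}}, nil
}

func sampleNodes() ([]cpNode, error) {
	return []cpNode{
		{name: "master-0", internalIP: "10.0.38.32"},
		{name: "master-1", internalIP: "10.0.54.109"},
		{name: "master-2", internalIP: "10.0.25.110"},
	}, nil
}

func noNodes() ([]cpNode, error) {
	return nil, errors.New("node lister not synced")
}

func main() {
	eps, err := desiredEndpoints(failingMemberList, sampleNodes)
	fmt.Println(len(eps), err)
	_, err = desiredEndpoints(failingMemberList, noNodes)
	fmt.Println(err)
}
```

Returning an error when both sources fail corresponds to the dual-failure case covered by `TestMemberListFallbackToNodeIPs`.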

dpateriya · 2026-06-15T15:19:08Z

/verified by me

openshift-ci · 2026-06-17T11:36:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dpateriya · 2026-06-17T11:39:44Z

/verfied by me

dpateriya · 2026-06-17T11:55:57Z

/verified by me

openshift-ci-robot · 2026-06-17T11:56:09Z

@dpateriya: This PR has been marked as verified by me.

Details

In response to this:

/verified by me

dpateriya · 2026-06-17T14:21:13Z

@tjungblu, once this is merged, should we backport this fix to 4.22, 4.21, and 4.20? The stale configmap deadlock can occur on any version where VM migration changes node IPs, and without this fix it requires manual intervention to recover.

openshift-merge-bot · 2026-06-17T16:27:22Z

/retest-required

Remaining retests: 0 against base HEAD f38807a and 2 for PR HEAD ac50724 in total

dpateriya · 2026-06-18T08:48:54Z

/retest-required

dpateriya · 2026-06-18T13:23:55Z

/retest-required

dpateriya · 2026-06-18T16:54:29Z

/test e2e-gcp-operator-disruptive

dpateriya · 2026-06-18T20:28:43Z

@tjungblu the CI job ci/prow/e2e-gcp-operator-disruptive is failing because of an invalid value for spec.backendQuotaGiB in the etcd.operator/cluster YAML:

Do we have someone looking into this?

dpateriya · 2026-06-19T05:05:00Z

/test e2e-gcp-operator-disruptive

openshift-merge-bot · 2026-06-19T12:26:50Z

/retest-required

Remaining retests: 0 against base HEAD 6dc19bf and 1 for PR HEAD ac50724 in total

openshift-merge-bot · 2026-06-19T23:26:54Z

/retest-required

Remaining retests: 0 against base HEAD f4d5750 and 0 for PR HEAD ac50724 in total

openshift-merge-bot · 2026-06-20T07:26:41Z

/hold

Revision ac50724 was retested 3 times: holding

dpateriya · 2026-06-20T18:49:14Z

/test e2e-gcp-operator-disruptive

dpateriya · 2026-06-21T07:03:05Z

/unhold

dpateriya · 2026-06-21T07:06:48Z

/cherry-pick release-4.22
/cherry-pick release-4.21

openshift-cherrypick-robot · 2026-06-21T07:06:51Z

@dpateriya: once the present PR merges, I will cherry-pick it on top of release-4.21, release-4.22 in new PRs and assign them to you.

Details

In response to this:

/cherry-pick release-4.22
/cherry-pick release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-06-21T10:48:20Z

@dpateriya: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-06-21T10:51:30Z

@dpateriya: Jira Issue Verification Checks: Jira Issue OCPBUGS-88490
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-88490 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Summary

Fixes OCPBUGS-88490

When the etcd-endpoints configmap contains stale IPs (e.g. after VM migration by vSphere DRS), the etcd client pool cannot reach any member, causing MemberList() to fail with context deadline exceeded. This creates a circular dependency: the operator cannot update the configmap without MemberList, but MemberList fails because the configmap has stale addresses.

Root Cause

After VMs migrate to new ESXi hosts, node IPs may change, but the etcd-endpoints configmap retains the old IPs. EtcdEndpointsController.syncConfigMap() calls MemberList() to discover current members, but the etcd client is initialized from the stale configmap and cannot connect. The controller returns an error and retries indefinitely with the same stale endpoints.

Fix

In EtcdEndpointsController.syncConfigMap(), when MemberList() fails, fall back to discovering control-plane node internal IPs via the node lister and network config (already available in the operator). This populates the configmap with reachable node IPs, allowing the etcd client to reconnect on the next cycle. Once connectivity is restored, MemberList() succeeds and overwrites the configmap with authoritative member data (member ID keys instead of node name keys).

Changes

pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go: Added nodeLister and networkLister fields; added endpointsFromNodeLister() method; modified syncConfigMap() to fall back to node IPs when MemberList fails.

pkg/operator/starter.go: Passed controlPlaneNodeLister and networkInformer.Lister() to NewEtcdEndpointsController.

pkg/etcdcli/helpers.go: Added WithMemberListError option to FakeEtcdClient for testing.

pkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.go: Added TestMemberListFallbackToNodeIPs with cases for successful fallback and dual failure.

Test Plan

go build ./... compiles cleanly

go vet ./... passes

All existing TestBootstrapAnnotationRemoval tests pass

New TestMemberListFallbackToNodeIPs tests pass

Reproduced on real OCP cluster: stale configmap leads to operator deadlock (before), self-healing via node IP fallback (after)

Summary by CodeRabbit

Bug Fixes

Etcd endpoint configuration now includes a fallback mechanism: when member discovery fails, the system uses control-plane node IP addresses to populate the endpoint configuration instead of failing completely.

openshift-cherrypick-robot · 2026-06-21T10:52:27Z

@dpateriya: new pull request created: #1636

Details

In response to this:

/cherry-pick release-4.22
/cherry-pick release-4.21

openshift-cherrypick-robot · 2026-06-21T10:53:08Z

@dpateriya: new pull request created: #1637

Details

In response to this:

/cherry-pick release-4.22
/cherry-pick release-4.21

openshift-merge-robot · 2026-06-22T00:14:38Z

Fix included in release 5.0.0-0.nightly-2026-06-21-181448

openshift-ci Bot requested review from benluddy and ironcladlou June 12, 2026 14:33

dpateriya changed the title ~~Bug 88490: fix etcd operator deadlock when etcd-endpoints configmap is stale~~ OCPBUGS-88490: fix etcd operator deadlock when etcd-endpoints configmap is stale Jun 12, 2026

openshift-ci-robot added the jira/severity-moderate and jira/valid-reference labels Jun 12, 2026

openshift-ci-robot added the jira/invalid-bug label Jun 12, 2026

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from a60aad9 to 75bb511 Compare June 12, 2026 14:51

openshift-ci-robot added the verified label Jun 15, 2026

tjungblu reviewed Jun 15, 2026

Comment thread pkg/etcdcli/etcdcli.go

dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from 75bb511 to c52400d Compare June 15, 2026 11:45

openshift-ci-robot removed the verified label Jun 15, 2026

dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch 2 times, most recently from 421b3b4 to 65a1365 Compare June 15, 2026 14:23

coderabbitai Bot reviewed Jun 15, 2026

Comment thread pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go Outdated

dpateriya force-pushed the fix/etcd-stale-endpoints-fallback branch from 65a1365 to f5156e1 Compare June 15, 2026 14:38

openshift-ci Bot added the approved label Jun 17, 2026

openshift-ci-robot added the verified label Jun 17, 2026

openshift-ci Bot added the do-not-merge/hold label Jun 20, 2026

openshift-ci Bot removed the do-not-merge/hold label Jun 21, 2026

openshift-merge-bot Bot merged commit c028077 into openshift:main Jun 21, 2026
17 checks passed

openshift-cherrypick-robot mentioned this pull request Jun 21, 2026

[release-4.22] OCPBUGS-90542: fix etcd operator deadlock when etcd-endpoints configmap is stale #1636

Merged

openshift-cherrypick-robot mentioned this pull request Jun 21, 2026

[release-4.21] OCPBUGS-90543: fix etcd operator deadlock when etcd-endpoints configmap is stale #1637

Merged
