OCPBUGS-88490: fix etcd operator deadlock when etcd-endpoints configmap is stale#1631
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
Walkthrough
ChangesMemberList Fallback to Node IPs
Sequence Diagram(s)sequenceDiagram
participant syncConfigMap
participant etcdClient
participant endpointsFromNodeLister
participant networkLister
participant nodeLister
syncConfigMap->>etcdClient: MemberList()
alt MemberList succeeds
etcdClient-->>syncConfigMap: members (filter learners/etcd-bootstrap)
else MemberList fails
etcdClient-->>syncConfigMap: error
syncConfigMap->>syncConfigMap: log warning, clear endpoints
syncConfigMap->>endpointsFromNodeLister: endpointsFromNodeLister()
endpointsFromNodeLister->>networkLister: Get("cluster")
networkLister-->>endpointsFromNodeLister: network config
endpointsFromNodeLister->>nodeLister: List(node-role.kubernetes.io/master)
nodeLister-->>endpointsFromNodeLister: control-plane nodes
endpointsFromNodeLister-->>syncConfigMap: endpoint map keyed by node name
end
syncConfigMap->>syncConfigMap: apply updated etcd-endpoints ConfigMap
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/etcdcli/etcdcli_pool.go (1)
237-239: 💤 Low valueConsider documenting the thread-safety contract for
SetNewFuncWithEndpoints.This setter modifies
newFuncWithEndpointswithout synchronization, whileGet()reads it concurrently. The current usage inNewEtcdClientis safe because the setter is called immediately after construction before the pool is used. However, the comment should clarify that this method must only be called during initialization, before any concurrentGet()calls.📝 Suggested documentation clarification
// SetNewFuncWithEndpoints sets a client factory that dials a specific set of endpoints, // bypassing the default endpoint resolution. This must be set alongside fallbackEndpointsFunc // for the fallback path to create clients that connect to the node-derived endpoints directly. +// This method is not thread-safe and must only be called during initialization, before any +// concurrent Get() calls. func (p *EtcdClientPool) SetNewFuncWithEndpoints(fn func(endpoints []string) (*clientv3.Client, error)) {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/etcdcli/etcdcli_pool.go` around lines 237 - 239, Document that SetNewFuncWithEndpoints is not concurrency-safe and must only be called during initialization before any concurrent use; update the comment on SetNewFuncWithEndpoints to state the thread-safety contract, referencing the field newFuncWithEndpoints and the concurrent reader Get(), and mention that callers should set it in NewEtcdClient (or before the pool is shared) to avoid data races.pkg/etcdcli/etcdcli.go (1)
619-629: ⚖️ Poor tradeoffConsider using sentinel errors instead of string matching for lister sync detection.
The current string-based error detection is fragile—if the error messages in
endpoints()orendpointsFromNodes()change, this function will silently fail to match. Since both the error sources and this detector are in the same package, using sentinel errors would be more robust.However, given the errors are all local to this package and unlikely to change independently, this is acceptable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/etcdcli/etcdcli.go` around lines 619 - 629, Replace fragile string matching in IsListersNotSynced with sentinel errors: define package-level variables (e.g., ErrNodeListerNotSynced, ErrConfigMapsListerNotSynced, ErrNetworkListerNotSynced), return the appropriate sentinel from the error producers (endpoints, endpointsFromNodes) instead of formatting plain strings, and update IsListersNotSynced to check via errors.Is against those sentinels (or compare error values) so detection is robust to message changes.pkg/operator/scriptcontroller/scriptcontroller.go (1)
79-83: 💤 Low valueError string matching is acceptable here but could be more precise.
The check
strings.Contains(err.Error(), "missing env var values")will match the specific error from line 127, but it would also match any wrapped or aggregated error that happens to contain that substring. SincecreateScriptConfigMapcan aggregate multiple errors (line 119), consider whether other errors could inadvertently match this pattern.That said, given the simple error generation at line 127 and the transient nature of the condition (env vars not yet populated by listeners), this pattern is reasonable for now.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/scriptcontroller/scriptcontroller.go` around lines 79 - 83, Replace the fragile substring match of err.Error() with a proper sentinel error and errors.Is usage: define a package-level sentinel (e.g. ErrMissingEnvVarValues) where the "missing env var values" error is created (in createScriptConfigMap/GetEnvVars code path), return that sentinel when env vars are missing, and here in scriptcontroller.go replace strings.Contains(err.Error(), "missing env var values") with errors.Is(err, ErrMissingEnvVarValues) so wrapped/aggregated errors are matched precisely.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/etcdcli/etcdcli_pool.go`:
- Around line 237-239: Document that SetNewFuncWithEndpoints is not
concurrency-safe and must only be called during initialization before any
concurrent use; update the comment on SetNewFuncWithEndpoints to state the
thread-safety contract, referencing the field newFuncWithEndpoints and the
concurrent reader Get(), and mention that callers should set it in NewEtcdClient
(or before the pool is shared) to avoid data races.
In `@pkg/etcdcli/etcdcli.go`:
- Around line 619-629: Replace fragile string matching in IsListersNotSynced
with sentinel errors: define package-level variables (e.g.,
ErrNodeListerNotSynced, ErrConfigMapsListerNotSynced,
ErrNetworkListerNotSynced), return the appropriate sentinel from the error
producers (endpoints, endpointsFromNodes) instead of formatting plain strings,
and update IsListersNotSynced to check via errors.Is against those sentinels (or
compare error values) so detection is robust to message changes.
In `@pkg/operator/scriptcontroller/scriptcontroller.go`:
- Around line 79-83: Replace the fragile substring match of err.Error() with a
proper sentinel error and errors.Is usage: define a package-level sentinel (e.g.
ErrMissingEnvVarValues) where the "missing env var values" error is created (in
createScriptConfigMap/GetEnvVars code path), return that sentinel when env vars
are missing, and here in scriptcontroller.go replace
strings.Contains(err.Error(), "missing env var values") with errors.Is(err,
ErrMissingEnvVarValues) so wrapped/aggregated errors are matched precisely.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2588780d-a2b0-4021-a8d8-884c21b25920
📒 Files selected for processing (11)
pkg/etcdcli/etcdcli.gopkg/etcdcli/etcdcli_pool.gopkg/etcdcli/etcdcli_pool_test.gopkg/etcdcli/helpers.gopkg/operator/ceohelpers/bootstrap.gopkg/operator/ceohelpers/bootstrap_test.gopkg/operator/clustermembercontroller/clustermembercontroller.gopkg/operator/defragcontroller/defragcontroller.gopkg/operator/etcdcertsigner/etcdcertsignercontroller.gopkg/operator/etcdmemberscontroller/etcdmemberscontroller.gopkg/operator/scriptcontroller/scriptcontroller.go
a60aad9 to
75bb511
Compare
|
@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
Reproduction Evidence: Before/After on OCP 4.21.9Environment
ScenarioSimulates the real-world failure where VM migration causes all etcd member IPs in the BEFORE Fix (stock operator v4.21.9) — DeadlockStep 1: Verify healthy baselineAll 3 etcd pods healthy, operator Available, configmap contains correct IPs. Step 2: Backup and inject stale endpointsAll 3 entries now point to unreachable IPs. Step 3: Restart operator and observe deadlockThe operator is stuck in an infinite retry loop. Every controller that needs an etcd client fails with Step 4: Operator status — entered Progressing stateStep 5: Manual recovery requiredRecovery required multiple revision rolls over several minutes: Conclusion (Before): The stock operator cannot recover from stale AFTER Fix (patched operator) — Self-HealStep 1: Verify healthy baseline (post-recovery from Before test)Step 2: Deploy patched operator imagePatched operator pod running. Step 3: Inject same stale configmapSame stale IPs injected as in the Before test. Step 4: Restart operator and observe self-healStep 5: Configmap auto-corrected — no manual interventionAfter ~30 seconds, the configmap was automatically restored to the correct IPs by the operator's fallback endpoint discovery: No Step 6: Operator returned to healthy stateConclusion (After): The patched operator automatically detected that all configmap endpoints were unreachable, fell back to node-based endpoint discovery, connected to etcd via the real control-plane node IPs, and corrected the configmap. Full recovery completed without any manual intervention. Comparison Summary
|
|
/verified by me |
|
@dpateriya: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
75bb511 to
c52400d
Compare
|
@dpateriya: This pull request references Jira Issue OCPBUGS-88490, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
421b3b4 to
65a1365
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go`:
- Around line 180-183: The nodeLister.List call currently filters only nodes
with the node-role.kubernetes.io/master label, which can result in an incomplete
etcd-endpoints ConfigMap during recovery in arbiter topology. Modify the
labels.Set selector passed to c.nodeLister.List to include both the master label
and the arbiter label for control-plane nodes, ensuring the fallback endpoint
discovery includes arbiter-labeled nodes consistent with how endpoint fallback
is handled elsewhere in the codebase.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3d6948da-0dbb-4aeb-a7ac-13480a6daf7c
📒 Files selected for processing (4)
pkg/etcdcli/helpers.gopkg/operator/etcdendpointscontroller/etcdendpointscontroller.gopkg/operator/etcdendpointscontroller/etcdendpointscontroller_test.gopkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/etcdcli/helpers.go
65a1365 to
f5156e1
Compare
Before/After Reproduction Evidence (v2 — EtcdEndpointsController approach)Tested on OCP 4.21.9 cluster (3 control-plane nodes, AWS). Before Fix (Stock Operator — Deadlock Confirmed)Cluster healthy before test: Original configmap: Injected stale IPs + restarted operator pod: Result — operator stuck in deadlock, configmap stays stale: Conclusion: The stock operator cannot self-heal. Configmap stays stale indefinitely. After Fix (Patched Operator — Self-Healing Confirmed)Deployed patched image: Injected same stale IPs + restarted operator pod: Result — operator self-healed via node IP fallback: Configmap auto-corrected with real IPs: Cluster fully recovered: Conclusion: The patched |
|
/verified by me |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tjungblu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verfied by me |
|
/verified by me |
|
@dpateriya: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@tjungblu, Once this is merged, should we backport this fix to 4.22, 4.21, and 4.20? The stale configmap deadlock can occur on any version where VM migration changes node IPs, and it requires manual intervention to recover without this fix. |
|
/retest-required |
1 similar comment
|
/retest-required |
|
/test e2e-gcp-operator-disruptive |
|
@tjungblu the CI JOB Do we have someone looking into this?
|
|
/test e2e-gcp-operator-disruptive |
|
/hold Revision ac50724 was retested 3 times: holding |
|
/test e2e-gcp-operator-disruptive |
|
/unhold |
|
/cherry-pick release-4.22 |
|
@dpateriya: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@dpateriya: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@dpateriya: Jira Issue Verification Checks: Jira Issue OCPBUGS-88490 Jira Issue OCPBUGS-88490 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@dpateriya: new pull request created: #1636 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@dpateriya: new pull request created: #1637 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-21-181448 |

Summary
Fixes OCPBUGS-88490
When the etcd-endpoints configmap contains stale IPs (e.g. after VM migration by vSphere DRS), the etcd client pool cannot reach any member, causing MemberList() to fail with context deadline exceeded. This creates a circular dependency: the operator cannot update the configmap without MemberList, but MemberList fails because the configmap has stale addresses.
Root Cause
After VMs migrate to new ESXi hosts, node IPs may change, but the etcd-endpoints configmap retains the old IPs. EtcdEndpointsController.syncConfigMap() calls MemberList() to discover current members, but the etcd client is initialized from the stale configmap and cannot connect. The controller returns an error and retries indefinitely with the same stale endpoints.
Fix
In EtcdEndpointsController.syncConfigMap(), when MemberList() fails, fall back to discovering control-plane node internal IPs via the node lister and network config (already available in the operator). This populates the configmap with reachable node IPs, allowing the etcd client to reconnect on the next cycle. Once connectivity is restored, MemberList() succeeds and overwrites the configmap with authoritative member data (member ID keys instead of node name keys).
Changes
Test Plan
Summary by CodeRabbit