OCPBUGS-88318: fix cluster-restore-tnf.sh IP auto-detection when hostname diverges from node name#1633
Conversation
…name diverges from node name get_etcd_advertise_ip() and get_peer_node_name() construct etcd.env variable names from `hostname` output, but etcd.env keys entries by the Kubernetes node name (via envVarSafe() in etcd_env.go). When the system hostname diverges from the k8s node name (e.g. hostname set to FQDN after deployment), the variable lookup returns empty and the restore aborts with "could not determine etcd advertise IP address". The same mismatch causes get_peer_node_name() to return both nodes instead of just the peer, and crm_attribute --node calls to target the wrong Pacemaker identity. Fix by adding resolve_k8s_node_name() which matches local IPs against NODE_*_IP entries sourced from etcd.env, then reads the k8s node name from the corresponding NODE_*_ETCD_NAME variable. This makes node identity resolution robust regardless of hostname configuration. Also improve get_peer_node_name() exclusion to match the full variable name pattern rather than a substring, preventing false matches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughA new ChangesNODENAME Auto-Resolution in cluster-restore-tnf.sh
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 13 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (13 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/jira refresh |
|
@Neilhamza: This pull request references Jira Issue OCPBUGS-88318. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
fonta-rh
left a comment
There was a problem hiding this comment.
IPv6 bracket mismatch breaks resolve_k8s_node_name() on IPv6/dual-stack clusters
NODE_*_IP values in etcd.env are set via GetEscapedPreferredInternalIPAddressForNodeName (pkg/dnshelpers/util.go:20-21), which wraps IPv6 addresses in brackets (e.g. [fd00::1]). However, ip -o addr show scope global returns bare addresses (fd00::1).
The grep -qxF "$var_value" exact-match comparison will never succeed on an IPv6 cluster, causing resolve_k8s_node_name() to silently return empty and fall back to hostname — which is the exact bug this PR is meant to fix.
Fix: Strip brackets from var_value before comparing. Add before line 47:
var_value="${var_value//[\[\]]/}"Everything else in this PR looks good — the approach is sound, grep patterns are correctly tightened, and the fallback cascade is well-structured.
| while IFS='=' read -r var_name var_value; do | ||
| if [[ "$var_name" =~ ^NODE_(.+)_IP$ ]]; then | ||
| node_prefix="${BASH_REMATCH[1]}" | ||
| if echo "$local_ips" | grep -qxF "$var_value"; then |
There was a problem hiding this comment.
NODE_*_IP stores bracket-escaped IPv6 via GetEscapedPreferredInternalIPAddressForNodeName (e.g. [fd00::1]), but ip -o addr show returns bare fd00::1. This grep -qxF exact match will never succeed on IPv6 clusters.
Suggested fix — strip brackets before comparing:
| if echo "$local_ips" | grep -qxF "$var_value"; then | |
| if echo "$local_ips" | grep -qxF "${var_value//[\[\]]/}"; then |
There was a problem hiding this comment.
applied suggestion
NODE_*_IP values in etcd.env are set via GetEscapedPreferredInternalIPAddressForNodeName which wraps IPv6 addresses in brackets (e.g. [fd00::1]), but ip addr returns bare addresses. Strip brackets before comparing to fix IPv6/dual-stack clusters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fonta-rh
left a comment
There was a problem hiding this comment.
Looking good from my side
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fonta-rh, tjungblu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/verified by ci |
|
@Neilhamza: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
/retest |
|
/hold Revision 0935788 was retested 3 times: holding |
|
/retest |
|
/unhold |
|
/retest |
|
@Neilhamza: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@Neilhamza: Jira Issue Verification Checks: Jira Issue OCPBUGS-88318 Jira Issue OCPBUGS-88318 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Summary
cluster-restore-tnf.shfails to auto-detect the etcd advertise IP when the systemhostnamedoes not match the Kubernetes node name used to generateetcd.env. This commonly occurs on bare metal clusters where the hostname is set to an FQDN after deployment.Root cause:
get_etcd_advertise_ip()constructs an env var name fromhostnameoutput (e.g.NODE_master_0_test_example_com_IP), butetcd.envcontains entries keyed by the Kubernetes node name (e.g.NODE_master_0_IP). When these diverge, the indirect variable expansion returns empty and the restore aborts.The same mismatch also causes:
get_peer_node_name()to return both nodes instead of just the peercrm_attribute --nodecalls to target the wrong Pacemaker identityFix: Add
resolve_k8s_node_name()which matches local IPs (fromip addr) againstNODE_*_IPentries sourced frometcd.env, then reads the k8s node name from the correspondingNODE_*_ETCD_NAMEvariable. This makes node identity resolution robust regardless of hostname configuration, with a hostname fallback for backward compatibility.Also improve
get_peer_node_name()exclusion to match the full variable name pattern (NODE_<safe>_ETCD_NAME=) rather than a raw substring, preventing false matches when one node name is a prefix of another.Test plan
Tested on a live TNF cluster (OCP 4.22, two masters) without redeploying:
resolve_k8s_node_name()returns correct k8s node name (master-0,master-1) on both nodesmaster-0.test.example.com):master-0via IP matchingget_peer_node_name()correctly excludes only the current node (old code returned both nodes with diverged hostname)crm_node -l🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
NODENAMEhandling to auto-resolve only when unset, warn and fall back tohostnameif detection fails, and error if still empty.