
OCPBUGS-88318: fix cluster-restore-tnf.sh IP auto-detection when hostname diverges from node name #1633

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from Neilhamza:OCPBUGS-88318 on Jun 22, 2026

Conversation

@Neilhamza commented Jun 17, 2026

Contributor

Summary

cluster-restore-tnf.sh fails to auto-detect the etcd advertise IP when the system hostname does not match the Kubernetes node name used to generate etcd.env. This commonly occurs on bare metal clusters where the hostname is set to an FQDN after deployment.

Root cause: get_etcd_advertise_ip() constructs an env var name from hostname output (e.g. NODE_master_0_test_example_com_IP), but etcd.env contains entries keyed by the Kubernetes node name (e.g. NODE_master_0_IP). When these diverge, the indirect variable expansion returns empty and the restore aborts.

The same mismatch also causes:

  • get_peer_node_name() to return both nodes instead of just the peer
  • crm_attribute --node calls to target the wrong Pacemaker identity
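The lookup failure described above can be reproduced in isolation. The following is a minimal sketch, not the script's actual code: `env_var_safe` and `lookup_node_ip` are hypothetical helpers, the sanitization rule (every character outside `[A-Za-z0-9_]` becomes `_`) is an assumption inferred from the example variable names, and the address is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assumed sanitization rule, inferred from the example
# names above (every character outside [A-Za-z0-9_] becomes "_").
env_var_safe() {
    local name="$1"
    printf '%s\n' "${name//[^A-Za-z0-9_]/_}"
}

# Hypothetical helper: look up NODE_<safe>_IP through indirect expansion.
lookup_node_ip() {
    local safe var
    safe="$(env_var_safe "$1")"
    var="NODE_${safe}_IP"
    printf '%s\n' "${!var:-}"
}

# Stand-in for an entry sourced from etcd.env (illustrative address).
NODE_master_0_IP="192.0.2.10"

lookup_node_ip "master-0"                    # prints 192.0.2.10
lookup_node_ip "master-0.test.example.com"   # prints an empty line
```

With the FQDN, the constructed name is `NODE_master_0_test_example_com_IP`, which is not set, so the indirect expansion yields an empty string.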

Fix: Add resolve_k8s_node_name() which matches local IPs (from ip addr) against NODE_*_IP entries sourced from etcd.env, then reads the k8s node name from the corresponding NODE_*_ETCD_NAME variable. This makes node identity resolution robust regardless of hostname configuration, with a hostname fallback for backward compatibility.
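The IP-matching approach can be sketched roughly as follows. This is an illustration under stated assumptions rather than the merged implementation: `resolve_node_name_sketch` is a hypothetical name, the local IP list is passed in as an argument (instead of being read from `ip addr`) so the logic can be exercised anywhere, and the exported variables are stand-ins for what etcd.env would provide.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical function; local IPs arrive as a newline-separated
# string in $1 instead of being collected from `ip addr`.
resolve_node_name_sketch() {
    local local_ips="$1" var_name var_value name_var
    while IFS='=' read -r var_name var_value; do
        if [[ "$var_name" =~ ^NODE_(.+)_IP$ ]]; then
            # -x and -F: whole-line, fixed-string match, so 192.0.2.1
            # does not match 192.0.2.10.
            if grep -qxF -- "$var_value" <<<"$local_ips"; then
                name_var="NODE_${BASH_REMATCH[1]}_ETCD_NAME"
                printf '%s\n' "${!name_var:-}"
                return 0
            fi
        fi
    done < <(env)
    return 1
}

# Illustrative stand-ins for values sourced from etcd.env.
export NODE_master_0_IP="192.0.2.10" NODE_master_0_ETCD_NAME="master-0"
export NODE_master_1_IP="192.0.2.11" NODE_master_1_ETCD_NAME="master-1"

resolve_node_name_sketch "192.0.2.10"   # prints master-0
```

On a real node, the argument could plausibly be gathered with something like `ip -o addr show scope global | awk '{print $4}' | cut -d/ -f1`. That pipeline is an assumption about how the addresses are extracted, not a quote from the script.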

Also improve get_peer_node_name() exclusion to match the full variable name pattern (NODE_<safe>_ETCD_NAME=) rather than a raw substring, preventing false matches when one node name is a prefix of another.
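To see why anchoring on the full variable name matters, consider two nodes where one name is a prefix of the other. The sketch below is illustrative only: the entries, the function names, and the exact form of the substring filter are assumptions, not the script's code.

```shell
#!/usr/bin/env bash
# Illustrative entries; the node names are chosen so that one is a prefix
# of the other.
entries='NODE_master_1_ETCD_NAME=master-1
NODE_master_10_ETCD_NAME=master-10'

NODENAME="master-1"
safe_name="${NODENAME//[^A-Za-z0-9_]/_}"   # assumed sanitization rule

# Raw substring exclusion (assumed form): drops both lines, because
# "master-1" also occurs inside the master-10 entry.
peers_by_substring() {
    grep -v -- "$NODENAME" <<<"$entries" | cut -d= -f2
}

# Full variable-name exclusion: drops only the current node's own entry.
peers_by_var_name() {
    grep -v -- "^NODE_${safe_name}_ETCD_NAME=" <<<"$entries" | cut -d= -f2
}

peers_by_substring    # prints nothing
peers_by_var_name     # prints master-10
```

The trailing `=` in the anchored pattern is what prevents `NODE_master_1_ETCD_NAME` from also matching `NODE_master_10_ETCD_NAME`.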

Test plan

Tested on a live TNF cluster (OCP 4.22, two masters) without redeploying:

  • resolve_k8s_node_name() returns correct k8s node name (master-0, master-1) on both nodes
  • With simulated FQDN hostname divergence (master-0.test.example.com):
    • Old code: IP lookup returns empty (would abort)
    • New code: correctly resolves to master-0 via IP matching
  • get_peer_node_name() correctly excludes only the current node (old code returned both nodes with diverged hostname)
  • Pacemaker node names confirmed matching k8s node names via crm_node -l
  • shellcheck clean (only pre-existing SC2064 warning on unrelated line)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added automatic node name detection during cluster restoration by matching local network IPs to configured node IP/name entries.
  • Improvements

    • Improved peer node selection by using a clearer, safer exclusion approach based on the resolved node name.
    • Refined NODENAME handling to auto-resolve only when unset, warn and fall back to hostname if detection fails, and error if still empty.

…name diverges from node name

get_etcd_advertise_ip() and get_peer_node_name() construct etcd.env
variable names from `hostname` output, but etcd.env keys entries by
the Kubernetes node name (via envVarSafe() in etcd_env.go). When the
system hostname diverges from the k8s node name (e.g. hostname set to
FQDN after deployment), the variable lookup returns empty and the
restore aborts with "could not determine etcd advertise IP address".

The same mismatch causes get_peer_node_name() to return both nodes
instead of just the peer, and crm_attribute --node calls to target the
wrong Pacemaker identity.

Fix by adding resolve_k8s_node_name() which matches local IPs against
NODE_*_IP entries sourced from etcd.env, then reads the k8s node name
from the corresponding NODE_*_ETCD_NAME variable. This makes node
identity resolution robust regardless of hostname configuration.

Also improve get_peer_node_name() exclusion to match the full variable
name pattern rather than a substring, preventing false matches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels Jun 17, 2026
@openshift-ci-robot


@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@coderabbitai

coderabbitai Bot commented Jun 17, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9b055827-c399-4127-a43c-4d5179aae951

📥 Commits

Reviewing files that changed from the base of the PR and between 0c6ca0d and 0935788.

📒 Files selected for processing (1)
  • bindata/etcd/cluster-restore-tnf.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • bindata/etcd/cluster-restore-tnf.sh

Walkthrough

A new resolve_k8s_node_name helper is added to cluster-restore-tnf.sh that auto-detects NODENAME by matching local global-scope IPs against NODE_*_IP entries from etcd.env. setup_pacemaker_restore calls this resolver when NODENAME is unset, falls back to hostname with a warning on detection failure, and fails fast if NODENAME remains empty. get_peer_node_name is updated to sanitize NODENAME into safe_name and exclude the current node's entry using explicit environment-variable filtering.
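The fallback order described in the walkthrough (use NODENAME if set, otherwise try the resolver, otherwise fall back to hostname with a warning, and fail if the result is still empty) might look roughly like this. It is a sketch under assumptions: `pick_nodename` is a hypothetical wrapper, the resolver and hostname commands are passed by name so they can be stubbed, and the log messages are placeholders rather than the script's actual wording.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: $1 is a preset NODENAME (may be empty), $2 and $3
# name the resolver and hostname commands so stubs can be substituted.
pick_nodename() {
    local nodename="$1" resolver="$2" hostname_cmd="$3"
    if [[ -z "$nodename" ]]; then
        nodename="$("$resolver" || true)"
        if [[ -z "$nodename" ]]; then
            # Placeholder message, not the script's actual text.
            echo "WARNING: node name not resolved from etcd.env, using hostname" >&2
            nodename="$("$hostname_cmd" || true)"
        fi
    fi
    if [[ -z "$nodename" ]]; then
        echo "ERROR: NODENAME is empty" >&2   # placeholder message
        return 1
    fi
    printf '%s\n' "$nodename"
}

# Stubs standing in for the real resolver and for `hostname`.
resolver_ok()   { echo "master-0"; }
resolver_fail() { return 1; }
fake_hostname() { echo "master-0.test.example.com"; }

pick_nodename "" resolver_ok fake_hostname          # prints master-0
pick_nodename "" resolver_fail fake_hostname        # warns, prints master-0.test.example.com
pick_nodename "preset" resolver_fail fake_hostname  # prints preset
```

Resolving only when the value is unset keeps an explicitly provided NODENAME authoritative, which matches the "auto-resolve only when unset" behavior described in the summary.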

Changes

NODENAME Auto-Resolution in cluster-restore-tnf.sh

Layer / File(s) | Summary
New node name resolution helper (bindata/etcd/cluster-restore-tnf.sh) | Adds the resolve_k8s_node_name function that enumerates local global-scope IPs via ip addr, iterates NODE_*_IP variables sourced from etcd.env, and returns the matching NODE_<prefix>_ETCD_NAME value; returns empty when no match is found or IP enumeration yields nothing.
Integration into restore setup and peer listing (bindata/etcd/cluster-restore-tnf.sh) | Updates setup_pacemaker_restore to call resolve_k8s_node_name when NODENAME is empty, fall back to hostname with a warning on failure, and error out if NODENAME remains unset. Updates get_peer_node_name to compute safe_name from NODENAME and exclude the current node's NODE_<safe_name>_ETCD_NAME entry using an explicit env | grep filter.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality | ⚠️ Warning | Ginkgo e2e tests lack meaningful failure messages in assertions; network_policy.go has 20+ assertions without diagnostic messages (e.g., o.Expect(err).NotTo(o.HaveOccurred())), violating requirem... | Add diagnostic messages to all Ginkgo assertions in e2e tests (e.g., o.Expect(err).NotTo(o.HaveOccurred(), "failed to create kubeconfig")).

✅ Passed checks (13 passed)

Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and specifically summarizes the main fix: auto-detection logic for IP matching when hostname diverges from Kubernetes node name.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test files or test definitions found in PR. Changes are only to shell scripts and configuration files, not to test code.
Microshift Test Compatibility | ✅ Passed | This PR modifies only a bash shell script (cluster-restore-tnf.sh) and adds no Ginkgo e2e tests; the MicroShift test compatibility check applies only to new e2e tests and is not applicable here.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR modifies shell scripts (bindata/etcd/cluster-restore-tnf.sh), not Ginkgo e2e tests. The custom check applies only when new Ginkgo tests are added.
Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies only a bash shell script for etcd cluster restoration, not deployment manifests, operators, or controllers with pod scheduling constraints. No topology-unaware scheduling constraints ar...
Ote Binary Stdout Contract | ✅ Passed | PR modifies only shell script (bindata/etcd/cluster-restore-tnf.sh); custom check applies to OTE binaries (Go code), so check is not applicable.
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. The PR only modifies a bash shell script (cluster-restore-tnf.sh) for etcd cluster restoration, which is outside the scope of the IPv6/disconnected network...
No-Weak-Crypto | ✅ Passed | The script modification introduces node name resolution and cluster restore logic with no weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or...
Container-Privileges | ✅ Passed | The PR modifies only a shell script (bindata/etcd/cluster-restore-tnf.sh) and does not introduce or modify any Kubernetes manifests or container security configurations, making the container-privil...
No-Sensitive-Data-In-Logs | ✅ Passed | The script logs only informational and error messages without exposing passwords, tokens, API keys, PII, session IDs, or customer data.


@openshift-ci openshift-ci Bot requested review from hasbro17 and tjungblu June 17, 2026 12:52
@Neilhamza

Contributor Author

/jira refresh

@Neilhamza Neilhamza closed this Jun 17, 2026
@openshift-ci-robot


@Neilhamza: This pull request references Jira Issue OCPBUGS-88318. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 17, 2026
@openshift-ci-robot


@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 17, 2026
@Neilhamza Neilhamza reopened this Jun 17, 2026
@openshift-ci-robot


@Neilhamza: This pull request references Jira Issue OCPBUGS-88318, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@fonta-rh fonta-rh left a comment

Contributor

IPv6 bracket mismatch breaks resolve_k8s_node_name() on IPv6/dual-stack clusters

NODE_*_IP values in etcd.env are set via GetEscapedPreferredInternalIPAddressForNodeName (pkg/dnshelpers/util.go:20-21), which wraps IPv6 addresses in brackets (e.g. [fd00::1]). However, ip -o addr show scope global returns bare addresses (fd00::1).

The grep -qxF "$var_value" exact-match comparison will never succeed on an IPv6 cluster, causing resolve_k8s_node_name() to silently return empty and fall back to hostname — which is the exact bug this PR is meant to fix.

Fix: Strip brackets from var_value before comparing. Add before line 47:

var_value="${var_value//[\[\]]/}"

Everything else in this PR looks good — the approach is sound, grep patterns are correctly tightened, and the fallback cascade is well-structured.
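The effect of the bracket-stripping expansion suggested in this review can be checked on its own. This is a small sketch: `strip_brackets` and `ip_is_local` are hypothetical helpers wrapping the suggested expansion, and the addresses are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the expansion from the suggestion above:
# removes every "[" and "]" from the value.
strip_brackets() {
    local value="$1"
    printf '%s\n' "${value//[\[\]]/}"
}

# Illustrative bare addresses, in the form `ip addr` would report them.
local_ips=$'192.0.2.10\nfd00::1'

# Hypothetical helper: whole-line, fixed-string match after stripping.
ip_is_local() {
    grep -qxF -- "$(strip_brackets "$1")" <<<"$local_ips"
}

strip_brackets "[fd00::1]"     # prints fd00::1
strip_brackets "192.0.2.10"    # prints 192.0.2.10
```

IPv4 values pass through unchanged, so the same comparison works for both address families.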

Comment thread bindata/etcd/cluster-restore-tnf.sh Outdated
while IFS='=' read -r var_name var_value; do
    if [[ "$var_name" =~ ^NODE_(.+)_IP$ ]]; then
        node_prefix="${BASH_REMATCH[1]}"
        if echo "$local_ips" | grep -qxF "$var_value"; then

Contributor

NODE_*_IP stores bracket-escaped IPv6 via GetEscapedPreferredInternalIPAddressForNodeName (e.g. [fd00::1]), but ip -o addr show returns bare fd00::1. This grep -qxF exact match will never succeed on IPv6 clusters.

Suggested fix — strip brackets before comparing:

Suggested change
- if echo "$local_ips" | grep -qxF "$var_value"; then
+ if echo "$local_ips" | grep -qxF "${var_value//[\[\]]/}"; then

Contributor Author

applied suggestion

NODE_*_IP values in etcd.env are set via
GetEscapedPreferredInternalIPAddressForNodeName which wraps IPv6
addresses in brackets (e.g. [fd00::1]), but ip addr returns bare
addresses. Strip brackets before comparing to fix IPv6/dual-stack
clusters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Neilhamza Neilhamza requested a review from fonta-rh June 18, 2026 09:44

@fonta-rh fonta-rh left a comment

Contributor

Looking good from my side

@fonta-rh

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2026
@tjungblu

Contributor

/approve

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fonta-rh, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2026
@Neilhamza

Contributor Author

/verified by ci

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jun 18, 2026
@openshift-ci-robot


@Neilhamza: This PR has been marked as verified by ci.


@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD f38807a and 2 for PR HEAD 0935788 in total

@Neilhamza

Contributor Author

/retest

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD 6dc19bf and 1 for PR HEAD 0935788 in total

@Neilhamza

Contributor Author

/retest

@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD f4d5750 and 0 for PR HEAD 0935788 in total

@openshift-merge-bot

Contributor

/hold

Revision 0935788 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2026
@Neilhamza

Contributor Author

/retest

@Neilhamza

Contributor Author

/unhold

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 21, 2026
@openshift-merge-bot

Contributor

/retest-required

Remaining retests: 0 against base HEAD c028077 and 2 for PR HEAD 0935788 in total

@Neilhamza

Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Contributor

@Neilhamza: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-merge-bot openshift-merge-bot Bot merged commit 54e8d0f into openshift:main Jun 22, 2026
17 checks passed
@openshift-ci-robot


@Neilhamza: Jira Issue Verification Checks: Jira Issue OCPBUGS-88318
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-88318 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria


4 participants