ETCD-704: cluster-restore.sh: move extra /var/lib/etcd files to backup#1628
Conversation
WalkthroughThe restore scripts now back up any remaining entries from ETCD_DATA_DIR into ETCD_DATA_DIR_BACKUP/extra-data-dir-contents via a new helper, and both restore initialization paths invoke that helper before wiping or requiring an empty data directory. Changesetcd Restore Data Cleanup
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 12 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (12 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
ETCD-704-VERIFICATION-OUTPUTS.txt |
|
CI analysis for failed required jobs:
Manual verification on OCP 4.22.0-rc.4 (3-node HA): legacy restore fails with extra files in /var/lib/etcd; patched script moves files to /var/lib/etcd-backup and completes SNAPSHOT RESTORE. Full HA restore verified (ETCD-704). /retest required |
|
2/3 required jobs now green. Remaining failure is TestRetentionBySize /test e2e-gcp-operator-disruptive |
|
Latest e2e-gcp-operator-disruptive run: all operator e2e tests passed (47m), Previous failures were TestRetentionBySize / TestPeriodicBackupHappyPath flakes. /test e2e-gcp-operator-disruptive |
|
/label merge-review-needed |
|
@apurvanisal5: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/label ready-for-human-review |
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bindata/etcd/cluster-restore.sh`:
- Around line 70-75: The script currently does rm -rf
"${ETCD_DATA_DIR_BACKUP:?}/${base}" before mv "${entry}"
"${ETCD_DATA_DIR_BACKUP}/", which can clobber previously staged restore
artifacts (snapshot.db, revision.json, member/); instead, stop deleting the
target backup root entry—if "${ETCD_DATA_DIR_BACKUP}/${base}" already exists,
either skip moving the stray "${entry}", or move it to a conflict-safe name
(e.g., append a timestamp or ".orig") so existing staged files are preserved;
apply the same behavior to the separate moves that write snapshot.db and
revision.json (refer to ETCD_DATA_DIR_BACKUP, base, entry, snapshot.db,
revision.json, and member/) so nothing in the backup root gets overwritten.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5b980737-38e5-4d48-b0aa-3efe28c29c3d
📒 Files selected for processing (1)
bindata/etcd/cluster-restore.sh
|
@apurvanisal5 can you re-target this change to the main branch please? And we can backport if necessary |
|
/test e2e-gcp-operator-disruptive |
|
/test e2e-metal-ipi-ovn-ipv6 |
1 similar comment
|
/test e2e-metal-ipi-ovn-ipv6 |
|
you can save yourself the testing @apurvanisal5 - you're targeting the wrong branch. Please follow the backport process in https://docs.google.com/document/d/1FCL6HIUGKhelHKc6dHtZ8AvpFYjmE4_aRz8NOmPsQl8/edit?tab=t.k517pl1uza05#heading=h.wnk379sj5pq7 |
| done | ||
| } | ||
|
|
||
| function backup_remaining_etcd_data_dir_contents() { |
There was a problem hiding this comment.
shouldn't this go into common and be shared with cluster-restore-tnf.sh?
| mkdir -p "${extras_dir}" | ||
|
|
||
| shopt -s nullglob dotglob | ||
| for entry in "${ETCD_DATA_DIR}"/*; do |
There was a problem hiding this comment.
doesn't that move the revision.json out of this folder again?
There was a problem hiding this comment.
for revion.json
109 # Copy snapshot to backupdir
110 cp -p "${SNAPSHOT_FILE}" "${ETCD_DATA_DIR_BACKUP}"/snapshot.db
111 # Move the revision.json when it exists
112 [ ! -f "${ETCD_REV_JSON}" ] || mv -f "${ETCD_REV_JSON}" "${ETCD_DATA_DIR_BACKUP}"/revision.json <===== this line already moves revision.json before backup_remaining_etcd_data_dir_contents is called
113 # Move any remaining files (fio perf artifacts, stray snapshots, etc.) out of the data dir.
114 # The restore pod requires /var/lib/etcd to be empty before it runs.
115 backup_remaining_etcd_data_dir_contents
When cluster-restore.sh runs the restore-pod path, it moves member/ and revision.json to /var/lib/etcd-backup, deletes etcd_perf*, then exits if anything remains in /var/lib/etcd. Extra files (perf artifacts, stray snapshots, etc.) cause DR restore to fail before the restore pod starts. Add backup_remaining_etcd_data_dir_contents() to move all remaining top-level entries to /var/lib/etcd-backup instead of failing. Fixes: ETCD-704 Related: https://access.redhat.com/solutions/6958920
Move leftover /var/lib/etcd entries to extra-data-dir-contents/ so staged snapshot.db, revision.json, and member/ are not overwritten. Co-authored-by: Cursor <cursoragent@cursor.com>
Share backup_remaining_etcd_data_dir_contents via etcd-common-tools per review feedback. Behavior unchanged.
367a077 to
4b715f1
Compare
|
Re-targeted to main. Moved the helper to etcd-common-tools per @tjungblu review. Same fix as before nothing else changed. I already tested this on a real 3-node cluster (4.22) as mentioned in #1628 (comment) Will backport to 4.22 after merge if needed. |
|
/retest-required |
|
/test e2e-gcp-operator-disruptive |
|
@apurvanisal5: This pull request references ETCD-704 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@apurvanisal5: This pull request references ETCD-704 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/verified by @apurvanisal5 Manual 3-node HA etcd restore on OCP 4.21.18 (cluster anisal-test-m):
|
|
@apurvanisal5: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test e2e-metal-ovn-two-node-fencing |
|
/retest-required |
|
/test e2e-gcp-operator-disruptive |
3 similar comments
|
/test e2e-gcp-operator-disruptive |
|
/test e2e-gcp-operator-disruptive |
|
/test e2e-gcp-operator-disruptive |
|
/retest-required |
2 similar comments
|
/retest-required |
|
/retest-required |
|
@apurvanisal5: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/override ci/prow/e2e-gcp-operator-disruptive |
|
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-gcp-operator-disruptive DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-4.22 |
|
@apurvanisal5: #1628 failed to apply on top of branch "release-4.17": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@apurvanisal5: #1628 failed to apply on top of branch "release-4.18": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@apurvanisal5: #1628 failed to apply on top of branch "release-4.19": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@apurvanisal5: #1628 failed to apply on top of branch "release-4.20": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@apurvanisal5: #1628 failed to apply on top of branch "release-4.21": DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@apurvanisal5: new pull request created: #1632 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
cluster-restore.shfails withfolder /var/lib/etcd is not emptywhen extra files exist under/var/lib/etcdaftermember/is moved.backup_remaining_etcd_data_dir_contents()to move remaining top-level files to/var/lib/etcd-backupinstead of exiting.Jira
Fixes: ETCD-704
Verification
OCP 4.22.0-rc.4, AWS IPI 3-node HA:
/var/lib/etcd/var/lib/etcd-backupand completesSNAPSHOT RESTORE COMPLETEDtesting-seed-projectrestored from backupTest plan
/var/lib/etcd/var/lib/etcd-backupSummary by CodeRabbit