Fix data loss when adding a node under write load. #392
ibrarahmad wants to merge 4 commits into main from
Conversation
When a new node is added to a running cluster while the existing nodes are handling writes, the new node can end up permanently missing rows or accumulating stale updates.

Root cause: in the fallback path (Case B) of create_slot_with_progress(), the resume LSN (P_snap) was derived from ros_remote_lsn read after the slot-creation retry loop. By that time the position could have advanced past the actual COPY boundary, creating a gap of permanently lost rows or allowing duplicate transactions to replay and corrupt balances.

Fix: capture v_pre_ros_lsn just before each slot-creation attempt and use it as the preferred P_snap fallback. This value is guaranteed to be at or before the COPY boundary. Add a MAX-by-LSN guard in spock_group_progress_force_set_list() to prevent overwriting a live apply worker's higher LSN with a stale P_snap value.

Also add the create_slot_with_progress() function to the upgrade SQL, update zodan.sql procedures, and adjust the TAP test timing.
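A minimal sketch of the intended ordering, using the identifiers from the description above (`v_pre_ros_lsn`, `rec.grp_remote_commit_lsn`) plus a plain `pg_replication_origin_status` lookup; the actual procedure body differs:

```sql
-- Illustrative sketch only; names mirror the PR description, not verified source.
-- 1) Capture the resume lower bound BEFORE each slot-creation attempt.
SELECT remote_lsn INTO v_pre_ros_lsn
FROM pg_replication_origin_status
WHERE external_id = v_origin_name;

-- 2) ... slot-creation attempt, possibly retried ...

-- 3) Case B fallback: prefer group progress, then the pre-attempt snapshot.
--    Never a ros_remote_lsn value read after the retry loop, which may have
--    advanced past the COPY boundary.
remote_commit_lsn := COALESCE(rec.grp_remote_commit_lsn,
                              v_pre_ros_lsn,
                              '0/0'::pg_lsn);

-- 4) MAX-by-LSN guard when force-setting progress: never regress a live
--    apply worker's position with a stale P_snap.
new_progress_lsn := GREATEST(existing_progress_lsn, remote_commit_lsn);
```

The key invariant is that the value chosen for P_snap is a lower bound on the COPY boundary, so at worst some transactions replay twice (and are filtered), never silently skipped.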
📝 Walkthrough
Adds atomic slot-creation-with-progress and a force-set progress API, refactors sync flows to use the new slot/snapshot/progress mechanism, updates SQL migrations and sample sync scripts (new catchup helper and signature tweaks), and extends tests/CI timeouts with additional stabilization waits.
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (1)
tests/tap/t/011_zodan_sync_third.pl (1)
114-115: Consider replacing fixed sleeps with state-based waits. There are multiple unconditional long sleeps here; they add significant runtime and still rely on timing assumptions. A bounded poll on replication state (or a helper that waits for an explicit sync condition) will usually be faster and less flaky than fixed delays.
Also applies to: 129-130, 140-141, 241-242, 273-274, 348-349, 359-360, 370-370
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/tap/t/011_zodan_sync_third.pl` around lines 114 - 115, Replace the unconditional long sleep(60) and other fixed sleeps with a state-based poll/wait helper (e.g., implement and call wait_for_sync() or poll_replication_state()) that repeatedly checks the replication/sync condition used by this test until it is met or a timeout elapses; update the calls at the locations around sleep(60) and the other listed occurrences (near lines with sleep calls) to call that helper (or inline loop) instead of sleeping, ensuring the check uses the same replication status function the test already uses so the waits are bounded and deterministic.
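One way to implement such a bounded wait is a server-side helper the TAP test can poll with a single psql call. This is a sketch only; the function name and the catch-up condition are assumptions, not existing spock API:

```sql
-- Hypothetical helper: poll replication status until caught up or timeout.
CREATE OR REPLACE FUNCTION wait_for_replay(p_timeout interval DEFAULT '60 seconds')
RETURNS boolean LANGUAGE plpgsql AS $$
DECLARE
    started timestamptz := clock_timestamp();
BEGIN
    LOOP
        -- Done when every walsender has replayed up to the sent position.
        IF NOT EXISTS (SELECT 1 FROM pg_stat_replication
                       WHERE replay_lsn IS DISTINCT FROM sent_lsn) THEN
            RETURN true;
        END IF;
        IF clock_timestamp() - started > p_timeout THEN
            RETURN false;              -- bounded: the caller decides to fail
        END IF;
        PERFORM pg_sleep(0.5);
    END LOOP;
END $$;
```

A test would then replace each fixed `sleep(60)` with one `safe_psql` call on this function and fail fast when it returns false, making the wait both bounded and deterministic.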
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/zodan_sync.yml:
- Line 20: The workflow uses an invalid GitHub runner label "ubuntu-24.04-x64"
in the runs-on setting; update the runs-on value to the supported label
"ubuntu-24.04" (i.e., replace the string "ubuntu-24.04-x64" with "ubuntu-24.04")
so the job will run on a valid GitHub-hosted runner.
In `@samples/Z0DAN/zodan.sql`:
- Around line 1501-1503: The PERFORM dblink(...) calls that invoke
spock.wait_for_sync_event discard its boolean return (it returns false on
timeout); change each PERFORM dblink call (e.g., the one using new_node_dsn +
format('CALL spock.wait_for_sync_event(...)', src_node_name, sync_lsn,
timeout_ms)) to SELECT the result INTO a local boolean (e.g., wait_ok) from
dblink(...) AS t(result boolean) and then IF NOT wait_ok THEN RAISE EXCEPTION
'wait_for_sync_event timed out for %', origin_name; END IF; so the function
detects timeouts instead of silently proceeding; apply the same change to the
other occurrences noted (around lines shown).
- Around line 1377-1379: The timeout and missing-progress branches should fail
closed so P_snap preconditions cannot be bypassed: replace the soft RAISE
WARNING + EXIT behavior when wait_iters >= max_wait_iters and any code that
substitutes missing spock.progress LSN with pg_current_wal_lsn() with a hard
failure (RAISE EXCEPTION) that aborts the operation, and ensure the routines
that inspect src_progress_lsn and spock.progress (the catch-up loop and the LSN
substitution logic) return an error instead of continuing; update the handling
in the same blocks that reference wait_iters, max_wait_iters, src_progress_lsn
and spock.progress so the function aborts on timeout or missing progress LSN
rather than proceeding.
In `@sql/spock--5.0.6--6.0.0-devel.sql`:
- Around line 199-205: The ELSE branch still prefers the post-slot
rec.ros_remote_lsn causing the unsafe "read after retry loop" gap; update the
COALESCE that sets remote_commit_lsn to drop rec.ros_remote_lsn and instead use
the safe lower-bound fallback order (rec.grp_remote_commit_lsn, v_pre_ros_lsn,
'0/0'::pg_lsn). Modify the assignment to remote_commit_lsn so it no longer
references rec.ros_remote_lsn but mirrors the fresh-install logic (use
rec.grp_remote_commit_lsn, then v_pre_ros_lsn, then '0/0'::pg_lsn).
In `@src/spock_group.c`:
- Around line 746-755: The code mutates the live shared-memory progress
(entry->progress) via init_progress_fields() before calling
spock_apply_progress_add_to_wal(sap), exposing SHM to regressions if the WAL
write fails; instead, create a local temporary progress structure, initialize it
(use the same logic as init_progress_fields on the temp), call
spock_apply_progress_add_to_wal(sap) using that staged data, and only after the
WAL write returns success copy the temp into entry->progress (or call
progress_update_struct on entry->progress with the staged values); update
references to progress_update_struct(&entry->progress, sap) to use the staged
data until the WAL commit is confirmed.
In `@tests/tap/t/011_zodan_sync_third.pl`:
- Around line 113-115: The printed "warming up" and "done warmup" duration
messages are inconsistent with the actual sleep(60) calls; update each print
STDERR message (e.g., the "warming up pgbench for 5s\n" and similar prints
around the sleep calls) to reflect the real duration (60s) or alternatively
reduce the sleep to match the message—ensure all occurrences (the ones around
the sleep calls at the shown diffs and the other instances referenced) are
updated so the logged duration matches the actual sleep value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dc40892c-480a-4657-98e2-dc68c6f86c76
📒 Files selected for processing (8)
.github/workflows/zodan_sync.yml, include/spock_group.h, samples/Z0DAN/zodan.sql, sql/spock--5.0.6--6.0.0-devel.sql, sql/spock--6.0.0-devel.sql, src/spock_group.c, src/spock_sync.c, tests/tap/t/011_zodan_sync_third.pl
- Check the boolean returned by wait_for_sync_event() instead of discarding it with PERFORM. The procedure signals timeout by returning false, not by throwing, so the old code silently continued past failed sync waits. All four dblink call sites now capture the result and raise an exception when the wait did not succeed.
- In the 5.0.6 → 6.0.0-devel upgrade migration, Case B was still preferring the post-slot ros.remote_lsn, which can be higher than P_snap. Align with the fresh-install logic: use v_pre_ros_lsn (captured before slot creation) as the primary fallback, matching the safe lower-bound semantics already in spock--6.0.0-devel.sql.
- Move init_progress_fields() in spock_group_progress_force_set_list() to after the WAL write so that a failure in spock_apply_progress_add_to_wal() does not leave shared memory in a regressed state with no durable record to replay from.
- Fix misleading sleep-duration messages in 011_zodan_sync_third.pl that still said 5s/10s/20s after the actual sleeps were changed to 60s.
- Increase spockbench TAP test timeout from 20 to 45 minutes.
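The first change amounts to the following pattern (a sketch; the exact `wait_for_sync_event` call shape and the `wait_ok` variable are illustrative, taken from the review suggestion rather than the shipped code):

```sql
-- Sketch: capture the boolean instead of discarding it with PERFORM.
SELECT t.result INTO wait_ok
FROM dblink(new_node_dsn,
            format('SELECT spock.wait_for_sync_event(%L, %L, %s)',
                   src_node_name, sync_lsn, timeout_ms)
     ) AS t(result boolean);
IF NOT wait_ok THEN
    RAISE EXCEPTION 'wait_for_sync_event timed out for %', origin_name;
END IF;
```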
♻️ Duplicate comments (1)
samples/Z0DAN/zodan.sql (1)
1377-1379: ⚠️ Potential issue | 🔴 Critical
Fail closed when `P_snap` safety preconditions are missing.
Lines 1377, 1393, 1842, and 1988 still allow timeout or missing-progress paths to continue. That can invalidate `P_snap` guarantees and reintroduce the skip-window risk this PR is fixing.
Suggested fix

```diff
- IF wait_iters >= max_wait_iters THEN
-     RAISE WARNING '   Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-     EXIT;
- END IF;
+ IF wait_iters >= max_wait_iters THEN
+     RAISE EXCEPTION 'Timeout waiting for source node % commit catchup from % (last seen %, need >= %)',
+         src_node_name, rec.node_name, src_progress_lsn, _commit_lsn;
+ END IF;
...
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
  END;
...
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   - sub_wait_for_sync(%) failed: %; proceeding anyway', v_sub_name, SQLERRM;
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'sub_wait_for_sync(%) failed: %', v_sub_name, SQLERRM;
  END;
...
- IF target_lsn IS NULL THEN
-     RAISE NOTICE '   WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-     remotesql := 'SELECT pg_current_wal_lsn()';
-     SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
- END IF;
+ IF target_lsn IS NULL THEN
+     RAISE EXCEPTION 'No spock.progress entry for %, refusing to advance slot % without validated P_snap',
+         rec.node_name, slot_name;
+ END IF;
```
Also applies to: 1391-1394, 1841-1843, 1987-1991
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 1377 - 1379, The timeout branches currently RAISE WARNING and EXIT (e.g., the IF wait_iters >= max_wait_iters THEN block referencing wait_iters, max_wait_iters and src_progress_lsn) which allows execution to continue and can violate the P_snap safety preconditions; change these to fail-closed by raising an error (use RAISE EXCEPTION with a clear message including src_progress_lsn and relevant context) instead of warning+exit, and apply the same replacement at the other occurrences mentioned (the blocks around the symbols at lines noted that perform missing-progress/timeouts) so the procedure aborts when P_snap preconditions are not met.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1707107a-0f0e-4446-9e76-21a4e747da86
📒 Files selected for processing (5)
.github/workflows/spockbench.yml, samples/Z0DAN/zodan.sql, sql/spock--5.0.6--6.0.0-devel.sql, src/spock_group.c, tests/tap/t/011_zodan_sync_third.pl
♻️ Duplicate comments (3)
samples/Z0DAN/zodan.sql (3)
1928-1951: ⚠️ Potential issue | 🔴 Critical
Do not continue when src→new subscription never reaches READY.
Lines 1928, 1947, and 1957 downgrade readiness failures to warnings. This is a safety gate for reliable progress state; continuing here can leave downstream slot/origin advancement based on incomplete state.
Suggested fix

```diff
- IF v_subid IS NULL THEN
-     RAISE WARNING '   - Subscription % not found on new node; continuing', v_sub_name;
+ IF v_subid IS NULL THEN
+     RAISE EXCEPTION 'Subscription % not found on new node', v_sub_name;
@@
- IF clock_timestamp() - v_wait_started > interval '3 minutes' THEN
-     RAISE WARNING '   - Timed out waiting for % to become READY (pending rows: %, status: %); continuing',
-         v_sub_name, v_pending_sync, coalesce(v_sub_status, '<unknown>');
-     EXIT;
- END IF;
+ IF clock_timestamp() - v_wait_started > interval '3 minutes' THEN
+     RAISE EXCEPTION 'Timed out waiting for % to become READY (pending rows: %, status: %)',
+         v_sub_name, v_pending_sync, coalesce(v_sub_status, '<unknown>');
+ END IF;
@@
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   - READY wait for subscription % failed: %; proceeding anyway', v_sub_name, SQLERRM;
-         RAISE WARNING '   - Current subscription status snapshot: %',
-             coalesce((SELECT string_agg(subscription_name || ':' || status, ', ')
-                       FROM spock.sub_show_status()), '<none>');
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'READY wait for subscription % failed: %', v_sub_name, SQLERRM;
  END;
```
Also applies to: 1957-1962
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 1928 - 1951, The code currently downgrades failure-to-become-READY into RAISE WARNING and continues, which can leave downstream replication state inconsistent; update the two places that check readiness timeout (the block that measures clock_timestamp() - v_wait_started and the other similar block later) to abort instead of continue by replacing RAISE WARNING with a terminating error (e.g., RAISE EXCEPTION or RAISE ERROR) so the function stops execution when v_pending_sync does not reach 0 for v_sub_name; keep the diagnostic interpolation (v_sub_name, v_pending_sync, coalesce(v_sub_status,'<unknown>')) and ensure you target the branches that reference v_subid, v_wait_started, spock.local_sync_status and spock.sub_show_status() so the failure becomes fatal rather than a warning.
1459-1462: ⚠️ Potential issue | 🔴 Critical
Fail closed when source catchup verification times out or errors.
Lines 1459 and 1473 currently warn and continue. That allows Phase 7 to proceed without proving `src_progress_lsn >= _commit_lsn`, which reopens the snapshot-gap risk this PR is trying to eliminate.
Suggested fix

```diff
- IF clock_timestamp() - wait_started > wait_timeout THEN
-     RAISE WARNING '   Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-     EXIT;
- END IF;
+ IF clock_timestamp() - wait_started > wait_timeout THEN
+     RAISE EXCEPTION 'Timeout waiting for source node % to commit % up to % (last seen: %)',
+         src_node_name, rec.node_name, _commit_lsn, COALESCE(src_progress_lsn::text, '<null>');
+ END IF;
@@
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
  END;
```
Also applies to: 1473-1475
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 1459 - 1462, The timeout/error handlers that currently RAISE WARNING and EXIT must fail closed so Phase 7 cannot continue without proving src_progress_lsn >= _commit_lsn; replace the RAISE WARNING in the timeout branch (the IF using clock_timestamp() - wait_started > wait_timeout) and the similar error-handling branch later with RAISE EXCEPTION (including a descriptive message and the src_progress_lsn/_commit_lsn context) so the function aborts instead of continuing; update both occurrences referenced by the symbols src_progress_lsn, _commit_lsn, wait_started, and wait_timeout.
2106-2110: ⚠️ Potential issue | 🔴 Critical
Remove unsafe `pg_current_wal_lsn()` fallback for missing `spock.progress`.
Line 2106 falls back to provider current WAL when progress is missing. That can advance the slot past the copy boundary and permanently skip unapplied rows.
Suggested fix

```diff
- IF target_lsn IS NULL THEN
-     RAISE NOTICE '   WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-     remotesql := 'SELECT pg_current_wal_lsn()';
-     SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
- END IF;
+ IF target_lsn IS NULL THEN
+     RAISE EXCEPTION 'No spock.progress entry for node %; refusing unsafe slot advance', rec.node_name;
+ END IF;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 2106 - 2110, The current fallback that sets remotesql = 'SELECT pg_current_wal_lsn()' and queries it into target_lsn (using dblink and pg_current_wal_lsn()) is unsafe because it can advance the replication slot past the copy boundary; remove that fallback and instead fail loudly when spock.progress is missing for the provider: replace the fallback block so that when target_lsn IS NULL you RAISE EXCEPTION (or RETURN/EXIT with an explicit error) including rec.node_name to force the caller to fix the missing spock.progress entry rather than using pg_current_wal_lsn(); do not call pg_current_wal_lsn() or perform the dblink SELECT into target_lsn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5133dd44-cff6-41c0-b43f-328ecbec06b1
📒 Files selected for processing (1)
samples/Z0DAN/zodan.sql
Actionable comments posted: 1
♻️ Duplicate comments (2)
samples/Z0DAN/zodan.sql (2)
1418-1475: ⚠️ Potential issue | 🔴 Critical
Don't fail open on the source catch-up prerequisite.
This wait is what guarantees the Phase 5 snapshot already contains `rec.node_name` changes through `_commit_lsn`. Warning and continuing on timeout/error can reopen the `[resume_lsn, _commit_lsn)` gap.
Suggested fix

```diff
- IF clock_timestamp() - wait_started > wait_timeout THEN
-     RAISE WARNING '   Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-     EXIT;
- END IF;
+ IF clock_timestamp() - wait_started > wait_timeout THEN
+     RAISE EXCEPTION 'Timeout waiting for source node % to commit % changes through % (last seen: %)',
+         src_node_name, rec.node_name, _commit_lsn, src_progress_lsn;
+ END IF;
@@
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
  END;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 1418 - 1475, The source-catchup block (the anonymous BEGIN ... DECLARE containing src_progress_lsn, wait_started, wait_timeout, progress_sql and the dblink probe) currently logs warnings and continues on timeout or errors, which allows proceeding without guaranteeing rec.node_name reached _commit_lsn; change the behavior to fail-fast by replacing the non-fatal RAISE WARNINGs with RAISE EXCEPTION (including useful context: rec.node_name and _commit_lsn and last seen src_progress_lsn) so the whole operation aborts when the loop times out or the outer EXCEPTION handler catches OTHERS; ensure the dblink probe loop and the outer block both raise exceptions instead of just warnings to prevent reopening the [resume_lsn, _commit_lsn) gap.
2099-2110: ⚠️ Potential issue | 🔴 Critical
Never substitute `pg_current_wal_lsn()` for missing progress.
If `spock.progress` has no row yet, advancing the slot/origin to current WAL can jump past rows that were not present in the COPY snapshot.
Suggested fix

```diff
- IF target_lsn IS NULL THEN
-     RAISE NOTICE '   WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-     remotesql := 'SELECT pg_current_wal_lsn()';
-     SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
- END IF;
+ IF target_lsn IS NULL THEN
+     RAISE EXCEPTION 'No spock.progress entry found for % on the new node; refusing to advance % blindly',
+         rec.node_name, slot_name;
+ END IF;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@samples/Z0DAN/zodan.sql` around lines 2099 - 2110, The code currently falls back to selecting pg_current_wal_lsn() into target_lsn when no spock.progress row exists (variables/functions: target_lsn, spock.progress, rec.node_name, remotesql, dblink(rec.dsn)), which can advance the slot past rows missing from the COPY snapshot; remove the fallback that assigns pg_current_wal_lsn() (do not call pg_current_wal_lsn() here), instead leave target_lsn NULL (or set an explicit sentinel) and emit a clear NOTICE about missing progress so the caller logic can skip advancing the slot/origin for that node; ensure downstream code checks target_lsn for NULL/sentinel and avoids moving the replication slot/origin when progress is absent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1c3b1dd9-0a29-4093-99e5-656edf8908c0
📒 Files selected for processing (4)
samples/Z0DAN/zodan.sql, sql/spock--5.0.6--6.0.0-devel.sql, sql/spock--6.0.0-devel.sql, src/spock_group.c
```sql
-- Drop stale replication origin from a previous add_node cycle
-- so create_sub starts fresh at 0/0 (avoids stale-LSN data loss).
BEGIN
    PERFORM dblink_exec(
        new_node_dsn,
        format($dsql$
            DO $x$
            BEGIN
                IF EXISTS (SELECT 1 FROM pg_replication_origin
                           WHERE roname = %L) THEN
                    PERFORM pg_replication_origin_drop(%L);
                END IF;
            END $x$
        $dsql$,
        slot_name, slot_name)
    );
    RAISE NOTICE '   OK: Dropped stale origin % on new node (if existed)',
        slot_name;
EXCEPTION
    WHEN OTHERS THEN
        RAISE WARNING '   Could not drop stale origin % on new node: %',
            slot_name, SQLERRM;
```
Abort when stale-origin cleanup fails.
This block exists specifically to force the new subscription to start from 0/0. Turning cleanup failure into a warning lets create_sub reuse stale origin state and skip data.
Suggested fix
```diff
- EXCEPTION
-     WHEN OTHERS THEN
-         RAISE WARNING '   Could not drop stale origin % on new node: %',
-             slot_name, SQLERRM;
+ EXCEPTION
+     WHEN OTHERS THEN
+         RAISE EXCEPTION 'Could not drop stale origin % on new node: %',
+             slot_name, SQLERRM;
  END;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@samples/Z0DAN/zodan.sql` around lines 1480 - 1501, The stale-origin cleanup
currently swallows failures (the EXCEPTION WHEN OTHERS block that does PERFORM
dblink_exec for the DROP and then RAISE WARNING) which allows create_sub to
proceed with stale origin state; change the handler so failures abort the
operation: in the BEGIN...EXCEPTION block that calls dblink_exec (the block that
formats and executes the DO $x$ that checks pg_replication_origin for
slot_name), replace the EXCEPTION WHEN OTHERS THEN RAISE WARNING ... with a
statement that raises an error (e.g., RAISE EXCEPTION including slot_name and
SQLERRM or re-raise the caught error) so the process stops on cleanup failure
and create_sub cannot reuse stale LSN state.