
Fix data loss when adding a node under write load.#392

Open
ibrarahmad wants to merge 4 commits into main from SPOC-311-V2

Conversation

@ibrarahmad
Contributor

When a new node is added to a running cluster while the existing nodes are handling writes, the new node can end up permanently missing rows or accumulating stale updates.

Root cause: in the fallback path (Case B) of create_slot_with_progress(), the resume LSN (P_snap) was derived from ros_remote_lsn read after the slot-creation retry loop. By that time the position could have advanced past the actual COPY boundary, creating a gap of permanently lost rows or allowing duplicate transactions to replay and corrupt balances.

Fix: capture v_pre_ros_lsn just before each slot-creation attempt and use it as the preferred P_snap fallback. This value is guaranteed to be at or before the COPY boundary. Add a MAX-by-LSN guard in spock_group_progress_force_set_list() to prevent overwriting a live apply worker's higher LSN with a stale P_snap value.

Also add the create_slot_with_progress() function to the upgrade SQL, update zodan.sql procedures, and adjust the TAP test timing.
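The fallback order described above can be sketched as a minimal model (Python stand-in for the PL/pgSQL logic; names follow the PR description, and LSNs are modeled as plain integers rather than pg_lsn):

```python
def choose_resume_lsn(grp_remote_commit_lsn, pre_ros_lsn):
    """Pick P_snap, the LSN from which streaming resumes after COPY.

    pre_ros_lsn is captured just BEFORE the slot-creation attempt, so it
    is at or before the COPY boundary. Reading the position after the
    retry loop instead could return a value past the boundary, opening a
    gap of permanently lost rows.
    """
    ZERO_LSN = 0  # stands in for '0/0'::pg_lsn
    if grp_remote_commit_lsn is not None:
        return grp_remote_commit_lsn
    if pre_ros_lsn is not None:
        return pre_ros_lsn
    return ZERO_LSN
```

Because every candidate in this order is a lower bound on the COPY boundary, the worst case is replaying already-applied (idempotent) changes rather than skipping rows.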

@coderabbitai

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Walkthrough

Adds atomic slot-creation-with-progress and a force-set progress API, refactors sync flows to use the new slot/snapshot/progress mechanism, updates SQL migrations and sample sync scripts (new catchup helper and signature tweaks), and extends tests/CI timeouts with additional stabilization waits.

Changes

Cohort / File(s) Summary
CI Workflows
/.github/workflows/zodan_sync.yml, /.github/workflows/spockbench.yml
zodan_sync.yml: removed OS matrix and hard-coded runs-on: ubuntu-24.04-x64. spockbench.yml: increased TAP test step timeout from 20 to 45 minutes.
Public API Header
include/spock_group.h
Added declaration: extern void spock_group_progress_force_set_list(List *lst);.
Group Progress Implementation
src/spock_group.c
Added spock_group_progress_force_set_list(List *lst) to force-set per-peer progress into shared hash with MAX-by-LSN guards, WAL logging, and handling for new/existing entries and full-hash conditions.
Sync Core Changes
src/spock_sync.c
Added spock_create_slot_and_get_progress and spock_release_slot_snapshot; extended copy_replication_sets_data to accept progress_out; updated sync paths to use the new slot/snapshot/progress flow and call spock_group_progress_force_set_list.
Migration SQL (5.x → 6.x)
sql/spock--5.0.6--6.0.0-devel.sql, sql/spock--6.0.0-devel.sql
Added spock.create_slot_with_progress(p_slot_name, p_provider_node_id, p_subscriber_node_id) returning snapshot/lsn plus per-peer progress rows; implements atomic slot creation, snapshot export, and aggregated progress emission.
Sample Sync Script & PL/pgSQL APIs
samples/Z0DAN/zodan.sql
Added spock.wait_for_replication_catchup_with_dblink(...); extended spock.wait_for_sync_event signature (extra boolean arg); introduced bounded wait loops, timeouts, origin/slot recreation logic, more granular logging, and defensive add_node/remove_node flows.
Tests
tests/tap/t/011_zodan_sync_third.pl
Test count decreased (34→30); increased warm-up/stabilization sleeps (added 60s delays), enabled transdiscard on N3, drained backlog before add_node, extended wait_for_sync_event timeout (600s→1200s) and added extra boolean arg to calls, and updated node-target references/messages.
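The MAX-by-LSN guard summarized above for spock_group_progress_force_set_list() can be modeled with a short sketch (Python stand-in for the C implementation; a dict in place of the shared hash, integer LSNs, and no WAL logging):

```python
def progress_force_set_list(shared_progress, updates):
    """Force-set per-peer progress with a MAX-by-LSN guard.

    Each update only moves a peer's stored LSN forward. A live apply
    worker may already have advanced past a stale P_snap value, and the
    guard prevents regressing it; genuinely new peers are inserted as-is.
    """
    for peer, lsn in updates:
        current = shared_progress.get(peer)
        if current is None or lsn > current:
            shared_progress[peer] = lsn  # new entry or genuine advance
        # else: stale P_snap; keep the worker's higher LSN
    return shared_progress
```

The guard makes the force-set idempotent and safe to call concurrently with a running apply worker, at the cost of silently ignoring regressions rather than reporting them.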

Poem

🐇 I dug a tiny slot so deep,

I took a snapshot for peers to keep,
I nudged their LSNs so they’d align,
I watched the replication hop in time,
Hooray — the cluster’s feeling fine!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check: ✅ Passed. The title clearly and concisely summarizes the main fix: preventing data loss when adding a node to a cluster under write load.
Description check: ✅ Passed. The description is clearly related to the changeset, explaining the root cause of data loss, the specific fix implemented, and listing all major changes across multiple files.



@ibrarahmad ibrarahmad requested a review from mason-sharp March 16, 2026 15:26

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
tests/tap/t/011_zodan_sync_third.pl (1)

114-115: Consider replacing fixed sleeps with state-based waits.

There are multiple unconditional long sleeps here; they add significant runtime and still rely on timing assumptions. A bounded poll on replication state (or a helper that waits for an explicit sync condition) will usually be faster and less flaky than fixed delays.

Also applies to: 129-130, 140-141, 241-242, 273-274, 348-349, 359-360, 370-370

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tap/t/011_zodan_sync_third.pl` around lines 114 - 115, Replace the
unconditional long sleep(60) and other fixed sleeps with a state-based poll/wait
helper (e.g., implement and call wait_for_sync() or poll_replication_state())
that repeatedly checks the replication/sync condition used by this test until it
is met or a timeout elapses; update the calls at the locations around sleep(60)
and the other listed occurrences (near lines with sleep calls) to call that
helper (or inline loop) instead of sleeping, ensuring the check uses the same
replication status function the test already uses so the waits are bounded and
deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/zodan_sync.yml:
- Line 20: The workflow uses an invalid GitHub runner label "ubuntu-24.04-x64"
in the runs-on setting; update the runs-on value to the supported label
"ubuntu-24.04" (i.e., replace the string "ubuntu-24.04-x64" with "ubuntu-24.04")
so the job will run on a valid GitHub-hosted runner.

In `@samples/Z0DAN/zodan.sql`:
- Around line 1501-1503: The PERFORM dblink(...) calls that invoke
spock.wait_for_sync_event discard its boolean return (it returns false on
timeout); change each PERFORM dblink call (e.g., the one using new_node_dsn +
format('CALL spock.wait_for_sync_event(...)', src_node_name, sync_lsn,
timeout_ms)) to SELECT the result INTO a local boolean (e.g., wait_ok) from
dblink(...) AS t(result boolean) and then IF NOT wait_ok THEN RAISE EXCEPTION
'wait_for_sync_event timed out for %', origin_name; END IF; so the function
detects timeouts instead of silently proceeding; apply the same change to the
other occurrences noted (around lines shown).
- Around line 1377-1379: The timeout and missing-progress branches should fail
closed so P_snap preconditions cannot be bypassed: replace the soft RAISE
WARNING + EXIT behavior when wait_iters >= max_wait_iters and any code that
substitutes missing spock.progress LSN with pg_current_wal_lsn() with a hard
failure (RAISE EXCEPTION) that aborts the operation, and ensure the routines
that inspect src_progress_lsn and spock.progress (the catch-up loop and the LSN
substitution logic) return an error instead of continuing; update the handling
in the same blocks that reference wait_iters, max_wait_iters, src_progress_lsn
and spock.progress so the function aborts on timeout or missing progress LSN
rather than proceeding.

In `@sql/spock--5.0.6--6.0.0-devel.sql`:
- Around line 199-205: The ELSE branch still prefers the post-slot
rec.ros_remote_lsn causing the unsafe "read after retry loop" gap; update the
COALESCE that sets remote_commit_lsn to drop rec.ros_remote_lsn and instead use
the safe lower-bound fallback order (rec.grp_remote_commit_lsn, v_pre_ros_lsn,
'0/0'::pg_lsn). Modify the assignment to remote_commit_lsn so it no longer
references rec.ros_remote_lsn but mirrors the fresh-install logic (use
rec.grp_remote_commit_lsn, then v_pre_ros_lsn, then '0/0'::pg_lsn).

In `@src/spock_group.c`:
- Around line 746-755: The code mutates the live shared-memory progress
(entry->progress) via init_progress_fields() before calling
spock_apply_progress_add_to_wal(sap), exposing SHM to regressions if the WAL
write fails; instead, create a local temporary progress structure, initialize it
(use the same logic as init_progress_fields on the temp), call
spock_apply_progress_add_to_wal(sap) using that staged data, and only after the
WAL write returns success copy the temp into entry->progress (or call
progress_update_struct on entry->progress with the staged values); update
references to progress_update_struct(&entry->progress, sap) to use the staged
data until the WAL commit is confirmed.
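The write-ahead ordering the comment asks for can be modeled in a few lines (illustrative Python, not the actual C API; entry is a dict standing in for the shared-hash entry, and wal_append stands in for spock_apply_progress_add_to_wal):

```python
def force_set_entry(entry, staged_values, wal_append):
    """Stage progress locally, persist it durably, then publish.

    The new progress is built in a local copy and the WAL append is
    attempted first; the live entry is updated only after the append
    succeeds. If wal_append raises, shared state is left untouched, so
    it never runs ahead of its durable record.
    """
    staged = dict(entry["progress"])
    staged.update(staged_values)
    wal_append(staged)            # may raise; nothing published yet
    entry["progress"] = staged    # publish only after the durable write
```

This is the standard stage/commit/publish pattern: mutating shared memory before the durable write succeeds would leave it in a regressed state with no record to replay from.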

In `@tests/tap/t/011_zodan_sync_third.pl`:
- Around line 113-115: The printed "warming up" and "done warmup" duration
messages are inconsistent with the actual sleep(60) calls; update each print
STDERR message (e.g., the "warming up pgbench for 5s\n" and similar prints
around the sleep calls) to reflect the real duration (60s) or alternatively
reduce the sleep to match the message—ensure all occurrences (the ones around
the sleep calls at the shown diffs and the other instances referenced) are
updated so the logged duration matches the actual sleep value.

---

Nitpick comments:
In `@tests/tap/t/011_zodan_sync_third.pl`:
- Around line 114-115: Replace the unconditional long sleep(60) and other fixed
sleeps with a state-based poll/wait helper (e.g., implement and call
wait_for_sync() or poll_replication_state()) that repeatedly checks the
replication/sync condition used by this test until it is met or a timeout
elapses; update the calls at the locations around sleep(60) and the other listed
occurrences (near lines with sleep calls) to call that helper (or inline loop)
instead of sleeping, ensuring the check uses the same replication status
function the test already uses so the waits are bounded and deterministic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc40892c-480a-4657-98e2-dc68c6f86c76

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae4205 and 6cdac3c.

📒 Files selected for processing (8)
  • .github/workflows/zodan_sync.yml
  • include/spock_group.h
  • samples/Z0DAN/zodan.sql
  • sql/spock--5.0.6--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
  • src/spock_group.c
  • src/spock_sync.c
  • tests/tap/t/011_zodan_sync_third.pl

Check the boolean returned by wait_for_sync_event() instead of
discarding it with PERFORM.  The procedure signals timeout by returning
false, not by throwing, so the old code silently continued past failed
sync waits.  All four dblink call sites now capture the result and raise
an exception when the wait did not succeed.

In the 5.0.6 → 6.0.0-devel upgrade migration, Case B was still
preferring the post-slot ros.remote_lsn, which can be higher than
P_snap.  Align with the fresh-install logic: use v_pre_ros_lsn
(captured before slot creation) as the primary fallback, matching the
safe lower-bound semantics already in spock--6.0.0-devel.sql.

Move init_progress_fields() in spock_group_progress_force_set_list()
to after the WAL write so that a failure in
spock_apply_progress_add_to_wal() does not leave shared memory in a
regressed state with no durable record to replay from.

Fix misleading sleep-duration messages in 011_zodan_sync_third.pl that
still said 5s/10s/20s after the actual sleeps were changed to 60s.

Increase spockbench TAP test timeout from 20 to 45 minutes.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
samples/Z0DAN/zodan.sql (1)

1377-1379: ⚠️ Potential issue | 🔴 Critical

Fail closed when P_snap safety preconditions are missing.

Line 1377/Line 1393/Line 1842/Line 1988 still allow timeout or missing-progress paths to continue. That can invalidate P_snap guarantees and reintroduce the skip-window risk this PR is fixing.

Suggested fix
-                    IF wait_iters >= max_wait_iters THEN
-                        RAISE WARNING '    Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-                        EXIT;
-                    END IF;
+                    IF wait_iters >= max_wait_iters THEN
+                        RAISE EXCEPTION 'Timeout waiting for source node % commit catchup from % (last seen %, need >= %)',
+                            src_node_name, rec.node_name, src_progress_lsn, _commit_lsn;
+                    END IF;
...
-        EXCEPTION
-            WHEN OTHERS THEN
-                RAISE WARNING '    Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+        EXCEPTION
+            WHEN OTHERS THEN
+                RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
         END;
...
-    EXCEPTION
-        WHEN OTHERS THEN
-            RAISE WARNING '    - sub_wait_for_sync(%) failed: %; proceeding anyway', v_sub_name, SQLERRM;
+    EXCEPTION
+        WHEN OTHERS THEN
+            RAISE EXCEPTION 'sub_wait_for_sync(%) failed: %', v_sub_name, SQLERRM;
     END;
...
-                IF target_lsn IS NULL THEN
-                    RAISE NOTICE '    WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-                    remotesql := 'SELECT pg_current_wal_lsn()';
-                    SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
-                END IF;
+                IF target_lsn IS NULL THEN
+                    RAISE EXCEPTION 'No spock.progress entry for %, refusing to advance slot % without validated P_snap',
+                        rec.node_name, slot_name;
+                END IF;

Also applies to: 1391-1394, 1841-1843, 1987-1991

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 1377 - 1379, The timeout branches
currently RAISE WARNING and EXIT (e.g., the IF wait_iters >= max_wait_iters THEN
block referencing wait_iters, max_wait_iters and src_progress_lsn) which allows
execution to continue and can violate the P_snap safety preconditions; change
these to fail-closed by raising an error (use RAISE EXCEPTION with a clear
message including src_progress_lsn and relevant context) instead of
warning+exit, and apply the same replacement at the other occurrences mentioned
(the blocks around the symbols at lines noted that perform
missing-progress/timeouts) so the procedure aborts when P_snap preconditions are
not met.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@samples/Z0DAN/zodan.sql`:
- Around line 1377-1379: The timeout branches currently RAISE WARNING and EXIT
(e.g., the IF wait_iters >= max_wait_iters THEN block referencing wait_iters,
max_wait_iters and src_progress_lsn) which allows execution to continue and can
violate the P_snap safety preconditions; change these to fail-closed by raising
an error (use RAISE EXCEPTION with a clear message including src_progress_lsn
and relevant context) instead of warning+exit, and apply the same replacement at
the other occurrences mentioned (the blocks around the symbols at lines noted
that perform missing-progress/timeouts) so the procedure aborts when P_snap
preconditions are not met.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1707107a-0f0e-4446-9e76-21a4e747da86

📥 Commits

Reviewing files that changed from the base of the PR and between 6cdac3c and e730de4.

📒 Files selected for processing (5)
  • .github/workflows/spockbench.yml
  • samples/Z0DAN/zodan.sql
  • sql/spock--5.0.6--6.0.0-devel.sql
  • src/spock_group.c
  • tests/tap/t/011_zodan_sync_third.pl


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
samples/Z0DAN/zodan.sql (3)

1928-1951: ⚠️ Potential issue | 🔴 Critical

Do not continue when src→new subscription never reaches READY.

Line 1928, Line 1947, and Line 1957 downgrade readiness failures to warnings. This is a safety gate for reliable progress state; continuing here can leave downstream slot/origin advancement based on incomplete state.

Suggested fix
-        IF v_subid IS NULL THEN
-            RAISE WARNING '    - Subscription % not found on new node; continuing', v_sub_name;
+        IF v_subid IS NULL THEN
+            RAISE EXCEPTION 'Subscription % not found on new node', v_sub_name;
@@
-                IF clock_timestamp() - v_wait_started > interval '3 minutes' THEN
-                    RAISE WARNING '    - Timed out waiting for % to become READY (pending rows: %, status: %); continuing',
-                        v_sub_name, v_pending_sync, coalesce(v_sub_status, '<unknown>');
-                    EXIT;
-                END IF;
+                IF clock_timestamp() - v_wait_started > interval '3 minutes' THEN
+                    RAISE EXCEPTION 'Timed out waiting for % to become READY (pending rows: %, status: %)',
+                        v_sub_name, v_pending_sync, coalesce(v_sub_status, '<unknown>');
+                END IF;
@@
-    EXCEPTION
-        WHEN OTHERS THEN
-            RAISE WARNING '    - READY wait for subscription % failed: %; proceeding anyway', v_sub_name, SQLERRM;
-            RAISE WARNING '    - Current subscription status snapshot: %',
-                coalesce((SELECT string_agg(subscription_name || ':' || status, ', ')
-                          FROM spock.sub_show_status()), '<none>');
+    EXCEPTION
+        WHEN OTHERS THEN
+            RAISE EXCEPTION 'READY wait for subscription % failed: %', v_sub_name, SQLERRM;
     END;

Also applies to: 1957-1962

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 1928 - 1951, The code currently
downgrades failure-to-become-READY into RAISE WARNING and continues, which can
leave downstream replication state inconsistent; update the two places that
check readiness timeout (the block that measures clock_timestamp() -
v_wait_started and the other similar block later) to abort instead of continue
by replacing RAISE WARNING with a terminating error (e.g., RAISE EXCEPTION or
RAISE ERROR) so the function stops execution when v_pending_sync does not reach
0 for v_sub_name; keep the diagnostic interpolation (v_sub_name, v_pending_sync,
coalesce(v_sub_status,'<unknown>')) and ensure you target the branches that
reference v_subid, v_wait_started, spock.local_sync_status and
spock.sub_show_status() so the failure becomes fatal rather than a warning.

1459-1462: ⚠️ Potential issue | 🔴 Critical

Fail closed when source catchup verification times out or errors.

Line 1459 and Line 1473 currently warn and continue. That allows Phase 7 to proceed without proving src_progress_lsn >= _commit_lsn, which reopens the snapshot-gap risk this PR is trying to eliminate.

Suggested fix
-                    IF clock_timestamp() - wait_started > wait_timeout THEN
-                        RAISE WARNING '    Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-                        EXIT;
-                    END IF;
+                    IF clock_timestamp() - wait_started > wait_timeout THEN
+                        RAISE EXCEPTION 'Timeout waiting for source node % to commit % up to % (last seen: %)',
+                            src_node_name, rec.node_name, _commit_lsn, COALESCE(src_progress_lsn::text, '<null>');
+                    END IF;
@@
-        EXCEPTION
-            WHEN OTHERS THEN
-                RAISE WARNING '    Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+        EXCEPTION
+            WHEN OTHERS THEN
+                RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
         END;

Also applies to: 1473-1475

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 1459 - 1462, The timeout/error handlers
that currently RAISE WARNING and EXIT must fail closed so Phase 7 cannot
continue without proving src_progress_lsn >= _commit_lsn; replace the RAISE
WARNING in the timeout branch (the IF using clock_timestamp() - wait_started >
wait_timeout) and the similar error-handling branch later with RAISE EXCEPTION
(including a descriptive message and the src_progress_lsn/_commit_lsn context)
so the function aborts instead of continuing; update both occurrences referenced
by the symbols src_progress_lsn, _commit_lsn, wait_started, and wait_timeout.

2106-2110: ⚠️ Potential issue | 🔴 Critical

Remove unsafe pg_current_wal_lsn() fallback for missing spock.progress.

Line 2106 falls back to provider current WAL when progress is missing. That can advance the slot past the copy boundary and permanently skip unapplied rows.

Suggested fix
-                IF target_lsn IS NULL THEN
-                    RAISE NOTICE '    WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-                    remotesql := 'SELECT pg_current_wal_lsn()';
-                    SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
-                END IF;
+                IF target_lsn IS NULL THEN
+                    RAISE EXCEPTION 'No spock.progress entry for node %; refusing unsafe slot advance', rec.node_name;
+                END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 2106 - 2110, The current fallback that
sets remotesql = 'SELECT pg_current_wal_lsn()' and queries it into target_lsn
(using dblink and pg_current_wal_lsn()) is unsafe because it can advance the
replication slot past the copy boundary; remove that fallback and instead fail
loudly when spock.progress is missing for the provider: replace the fallback
block so that when target_lsn IS NULL you RAISE EXCEPTION (or RETURN/EXIT with
an explicit error) including rec.node_name to force the caller to fix the
missing spock.progress entry rather than using pg_current_wal_lsn(); do not call
pg_current_wal_lsn() or perform the dblink SELECT into target_lsn.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@samples/Z0DAN/zodan.sql`:
- Around line 1928-1951: The code currently downgrades failure-to-become-READY
into RAISE WARNING and continues, which can leave downstream replication state
inconsistent; update the two places that check readiness timeout (the block that
measures clock_timestamp() - v_wait_started and the other similar block later)
to abort instead of continue by replacing RAISE WARNING with a terminating error
(e.g., RAISE EXCEPTION or RAISE ERROR) so the function stops execution when
v_pending_sync does not reach 0 for v_sub_name; keep the diagnostic
interpolation (v_sub_name, v_pending_sync, coalesce(v_sub_status,'<unknown>'))
and ensure you target the branches that reference v_subid, v_wait_started,
spock.local_sync_status and spock.sub_show_status() so the failure becomes fatal
rather than a warning.
- Around line 1459-1462: The timeout/error handlers that currently RAISE WARNING
and EXIT must fail closed so Phase 7 cannot continue without proving
src_progress_lsn >= _commit_lsn; replace the RAISE WARNING in the timeout branch
(the IF using clock_timestamp() - wait_started > wait_timeout) and the similar
error-handling branch later with RAISE EXCEPTION (including a descriptive
message and the src_progress_lsn/_commit_lsn context) so the function aborts
instead of continuing; update both occurrences referenced by the symbols
src_progress_lsn, _commit_lsn, wait_started, and wait_timeout.
- Around line 2106-2110: The current fallback that sets remotesql = 'SELECT
pg_current_wal_lsn()' and queries it into target_lsn (using dblink and
pg_current_wal_lsn()) is unsafe because it can advance the replication slot past
the copy boundary; remove that fallback and instead fail loudly when
spock.progress is missing for the provider: replace the fallback block so that
when target_lsn IS NULL you RAISE EXCEPTION (or RETURN/EXIT with an explicit
error) including rec.node_name to force the caller to fix the missing
spock.progress entry rather than using pg_current_wal_lsn(); do not call
pg_current_wal_lsn() or perform the dblink SELECT into target_lsn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5133dd44-cff6-41c0-b43f-328ecbec06b1

📥 Commits

Reviewing files that changed from the base of the PR and between e730de4 and 3a5974b.

📒 Files selected for processing (1)
  • samples/Z0DAN/zodan.sql


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
samples/Z0DAN/zodan.sql (2)

1418-1475: ⚠️ Potential issue | 🔴 Critical

Don’t fail open on the source catch-up prerequisite.

This wait is what guarantees the Phase 5 snapshot already contains rec.node_name changes through _commit_lsn. Warning and continuing on timeout/error can reopen the [resume_lsn, _commit_lsn) gap.

Suggested fix
-                    IF clock_timestamp() - wait_started > wait_timeout THEN
-                        RAISE WARNING '    Timeout waiting for source node commit catchup (last seen: %)', src_progress_lsn;
-                        EXIT;
-                    END IF;
+                    IF clock_timestamp() - wait_started > wait_timeout THEN
+                        RAISE EXCEPTION 'Timeout waiting for source node % to commit % changes through % (last seen: %)',
+                            src_node_name, rec.node_name, _commit_lsn, src_progress_lsn;
+                    END IF;
@@
-        EXCEPTION
-            WHEN OTHERS THEN
-                RAISE WARNING '    Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
+        EXCEPTION
+            WHEN OTHERS THEN
+                RAISE EXCEPTION 'Could not verify source commit catchup for %: %', rec.node_name, SQLERRM;
         END;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 1418 - 1475, The source-catchup block
(the anonymous BEGIN ... DECLARE containing src_progress_lsn, wait_started,
wait_timeout, progress_sql and the dblink probe) currently logs warnings and
continues on timeout or errors, which allows proceeding without guaranteeing
rec.node_name reached _commit_lsn; change the behavior to fail-fast by replacing
the non-fatal RAISE WARNINGs with RAISE EXCEPTION (including useful context:
rec.node_name and _commit_lsn and last seen src_progress_lsn) so the whole
operation aborts when the loop times out or the outer EXCEPTION handler catches
OTHERS; ensure the dblink probe loop and the outer block both raise exceptions
instead of just warnings to prevent reopening the [resume_lsn, _commit_lsn) gap.

2099-2110: ⚠️ Potential issue | 🔴 Critical

Never substitute pg_current_wal_lsn() for missing progress.

If spock.progress has no row yet, advancing the slot/origin to current WAL can jump past rows that were not present in the COPY snapshot.

Suggested fix
-                IF target_lsn IS NULL THEN
-                    RAISE NOTICE '    WARNING: No spock.progress entry for %, falling back to pg_current_wal_lsn()', rec.node_name;
-                    remotesql := 'SELECT pg_current_wal_lsn()';
-                    SELECT * FROM dblink(rec.dsn, remotesql) AS t(lsn pg_lsn) INTO target_lsn;
-                END IF;
+                IF target_lsn IS NULL THEN
+                    RAISE EXCEPTION 'No spock.progress entry found for % on the new node; refusing to advance % blindly',
+                        rec.node_name, slot_name;
+                END IF;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@samples/Z0DAN/zodan.sql` around lines 2099 - 2110, The code currently falls
back to selecting pg_current_wal_lsn() into target_lsn when no spock.progress
row exists (variables/functions: target_lsn, spock.progress, rec.node_name,
remotesql, dblink(rec.dsn)), which can advance the slot past rows missing from
the COPY snapshot; remove the fallback that assigns pg_current_wal_lsn() (do not
call pg_current_wal_lsn() here), instead leave target_lsn NULL (or set an
explicit sentinel) and emit a clear NOTICE about missing progress so the caller
logic can skip advancing the slot/origin for that node; ensure downstream code
checks target_lsn for NULL/sentinel and avoids moving the replication
slot/origin when progress is absent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@samples/Z0DAN/zodan.sql`:
- Around line 1480-1501: The stale-origin cleanup currently swallows failures
(the EXCEPTION WHEN OTHERS block that does PERFORM dblink_exec for the DROP and
then RAISE WARNING) which allows create_sub to proceed with stale origin state;
change the handler so failures abort the operation: in the BEGIN...EXCEPTION
block that calls dblink_exec (the block that formats and executes the DO $x$
that checks pg_replication_origin for slot_name), replace the EXCEPTION WHEN
OTHERS THEN RAISE WARNING ... with a statement that raises an error (e.g., RAISE
EXCEPTION including slot_name and SQLERRM or re-raise the caught error) so the
process stops on cleanup failure and create_sub cannot reuse stale LSN state.

---

Duplicate comments:
In `@samples/Z0DAN/zodan.sql`:
- Around line 1418-1475: The source-catchup block (the anonymous BEGIN ...
DECLARE containing src_progress_lsn, wait_started, wait_timeout, progress_sql
and the dblink probe) currently logs warnings and continues on timeout or
errors, which allows proceeding without guaranteeing rec.node_name reached
_commit_lsn; change the behavior to fail-fast by replacing the non-fatal RAISE
WARNINGs with RAISE EXCEPTION (including useful context: rec.node_name and
_commit_lsn and last seen src_progress_lsn) so the whole operation aborts when
the loop times out or the outer EXCEPTION handler catches OTHERS; ensure the
dblink probe loop and the outer block both raise exceptions instead of just
warnings to prevent reopening the [resume_lsn, _commit_lsn) gap.
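
A sketch of the fail-fast catch-up loop this comment asks for, using the variable names it mentions (`src_progress_lsn`, `wait_started`, `wait_timeout`, `progress_sql`); the exact probe query and surrounding block are assumptions, since they are not shown in this excerpt:

```sql
-- Sketch only: abort (rather than warn and continue) if the source
-- node never confirms it has replayed up to _commit_lsn.
wait_started := clock_timestamp();
LOOP
    SELECT lsn INTO src_progress_lsn
      FROM dblink(rec.dsn, progress_sql) AS t(lsn pg_lsn);
    EXIT WHEN src_progress_lsn IS NOT NULL
          AND src_progress_lsn >= _commit_lsn;
    IF clock_timestamp() - wait_started > wait_timeout THEN
        RAISE EXCEPTION
            'Source % did not reach commit LSN % (last seen %)',
            rec.node_name, _commit_lsn, src_progress_lsn;
    END IF;
    PERFORM pg_sleep(1);
END LOOP;
```

Raising here (instead of `RAISE WARNING`) aborts the whole operation, so the `[resume_lsn, _commit_lsn)` gap can never be reopened by proceeding on a timed-out wait.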

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1c3b1dd9-0a29-4093-99e5-656edf8908c0

📥 Commits

Reviewing files that changed from the base of the PR and between 3a5974b and 6aee965.

📒 Files selected for processing (4)
  • samples/Z0DAN/zodan.sql
  • sql/spock--5.0.6--6.0.0-devel.sql
  • sql/spock--6.0.0-devel.sql
  • src/spock_group.c

Comment on lines +1480 to +1501
            -- Drop stale replication origin from a previous add_node cycle
            -- so create_sub starts fresh at 0/0 (avoids stale-LSN data loss).
            BEGIN
                PERFORM dblink_exec(
                    new_node_dsn,
                    format($dsql$
                        DO $x$
                        BEGIN
                            IF EXISTS (SELECT 1 FROM pg_replication_origin
                                       WHERE roname = %L) THEN
                                PERFORM pg_replication_origin_drop(%L);
                            END IF;
                        END $x$
                    $dsql$,
                    slot_name, slot_name)
                );
                RAISE NOTICE '    OK: Dropped stale origin % on new node (if existed)',
                    slot_name;
            EXCEPTION
                WHEN OTHERS THEN
                    RAISE WARNING '    Could not drop stale origin % on new node: %',
                        slot_name, SQLERRM;

⚠️ Potential issue | 🔴 Critical

Abort when stale-origin cleanup fails.

This block exists specifically to force the new subscription to start from 0/0. Turning cleanup failure into a warning lets create_sub reuse stale origin state and skip data.

Suggested fix
-            EXCEPTION
-                WHEN OTHERS THEN
-                    RAISE WARNING '    Could not drop stale origin % on new node: %',
-                        slot_name, SQLERRM;
+            EXCEPTION
+                WHEN OTHERS THEN
+                    RAISE EXCEPTION 'Could not drop stale origin % on new node: %',
+                        slot_name, SQLERRM;
             END;

