test(e2e): repro reth-ahead-of-consensus cross-store desync (audit #703/#704) by keanji-x · Pull Request #745 · Galxe/gravity-sdk

keanji-x · 2026-06-10T14:15:30Z

Summary

Adds a deterministic single-node e2e suite (gravity_e2e/cluster_test_cases/cross_store_desync_703_704/) that reproduces the non-atomic cross-store commit bug and the recovery-path crash it triggers — findings F1/F2 in galxe/RESTART_RECOVERY_FINDINGS.md.

reth (execution rocksdb) and the aptos consensus DB persist on independent stores with no write-ordering barrier. A crash in the commit window can leave reth at block N while consensus is at N-1. On restart, recovery anchors on reth's height (bin/gravity_node/src/main.rs:123 recover_block_number) but the consensus index no longer contains it, so recovery cannot find the root and the node dies — there is no reconcile / state-sync fallback.
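To make the commit window concrete, here is a toy model of two independently persisted stores. It is only a sketch: `Store`, `commit_block`, and `simulate` are hypothetical names, and the reth-first write order is an assumption chosen to match the observed "reth ahead of consensus" state, not a statement about the real commit path.

```python
class Store:
    """Toy stand-in for one independently persisted store."""

    def __init__(self, name):
        self.name = name
        self.height = 0

    def commit(self, height):
        self.height = height


class CrashInCommitWindow(Exception):
    """Simulated process death between the two writes."""


def commit_block(reth, consensus, height, crash_between=False):
    # No barrier ties the two writes together. The reth-first order is
    # an assumption made for illustration only.
    reth.commit(height)
    if crash_between:
        raise CrashInCommitWindow(height)
    consensus.commit(height)


def simulate(last_good, crash_at):
    """Commit 1..last_good cleanly, then crash mid-commit at crash_at."""
    reth, consensus = Store("reth"), Store("consensus")
    for h in range(1, last_good + 1):
        commit_block(reth, consensus, h)
    try:
        commit_block(reth, consensus, crash_at, crash_between=True)
    except CrashInCommitWindow:
        pass
    return reth.height, consensus.height


reth_height, consensus_height = simulate(last_good=9, crash_at=10)
print(reth_height, consensus_height)
```

After the simulated crash, the first store has persisted the new block and the second has not. That is the state recovery anchors on after restart.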

Reproduction (no power-loss race needed)

1. Bring the node live, commit past height H, then node.stop() cleanly.
2. gravity_cli unwind the consensus DB back to ~H/2, leaving reth at H.
   - Gotcha encoded in the test: --consensus-db-path must be the parent <data> dir, because ConsensusDB::new joins consensus_db internally (aptos-core/consensus/src/consensusdb/mod.rs:112). Passing <data>/consensus_db operates on an empty nested DB and silently no-ops.
3. node.start() and observe recovery.
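The path gotcha in step 2 is easy to reintroduce, so a small guard can help. The helper below is a hypothetical sketch, not code from the suite: it only normalizes the value passed to --consensus-db-path, and the example directory layout under the base dir is assumed.

```python
from pathlib import Path

# Directory name that ConsensusDB::new joins internally, per the description above.
CONSENSUS_DB_DIRNAME = "consensus_db"


def consensus_db_path_arg(path):
    """Return the directory to pass as --consensus-db-path.

    If the caller hands in <data>/consensus_db, step up to <data> so the
    tool does not open an empty nested <data>/consensus_db/consensus_db.
    """
    p = Path(path)
    return p.parent if p.name == CONSENSUS_DB_DIRNAME else p


# Hypothetical layout under the suite's base dir.
print(consensus_db_path_arg("/tmp/gravity-cluster-desync/node1/data/consensus_db"))
print(consensus_db_path_arg("/tmp/gravity-cluster-desync/node1/data"))
```

Both calls resolve to the same parent directory, which is the value the unwind step needs.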

Observed failure (current code)

ERROR aptos-core/consensus/src/persistent_liveness_storage.rs:549
  Failed to construct recovery data {"error":"unable to find root: 80dce1e8 ...
  LedgerRecoveryData::find_root ... RecoveryData::new ..."}
panicked at aptos-core/consensus/src/persistent_liveness_storage.rs:550:17
  (panic!(""))   -> node process dead (crash)

This is F1 (#703): the PartialRecoveryData / state-sync fallback at :551 is commented out and recovery_manager.rs:104 is todo!(), so any inconsistent recovery data ends in panic!(""). The sibling F2 (#704) silent-stall path is also asserted as a recovery failure: the early return at crates/block-buffer-manager/src/block_buffer_manager.rs:343 leaves the buffer Uninitialized, so get_ordered_blocks awaits ready_notifier forever.
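One way a test could recognise the F1 signature is to match the two log markers quoted above and combine them with process liveness. This is a sketch under assumptions: `classify_f1` and its return labels are hypothetical, and the real suite may collect evidence differently.

```python
import re

# Markers taken from the observed failure output above.
ROOT_ERROR = "unable to find root"
PANIC_RE = re.compile(r"panicked at \S*persistent_liveness_storage\.rs:\d+")


def classify_f1(log_text, process_alive):
    """Classify a restart attempt from its log text and process liveness."""
    saw_root_error = ROOT_ERROR in log_text
    saw_panic = PANIC_RE.search(log_text) is not None
    if saw_root_error and saw_panic and not process_alive:
        return "f1_crash"    # recovery panicked on the missing root
    if not process_alive:
        return "dead_other"  # died, but not with the F1 signature
    return "alive"           # still running; F2 stall needs a separate check


sample_log = (
    "ERROR aptos-core/consensus/src/persistent_liveness_storage.rs:549\n"
    '  Failed to construct recovery data {"error":"unable to find root: 80dce1e8 ...\n'
    "panicked at aptos-core/consensus/src/persistent_liveness_storage.rs:550:17\n"
)
print(classify_f1(sample_log, process_alive=False))
```

A live process is deliberately not treated as success here, because the F2 path keeps the process up while block production never resumes.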

Test status

The test asserts the correct post-fix behaviour (node detects reth_height > consensus_index, reconciles, and resumes producing blocks), so it is @pytest.mark.xfail(strict=True) on current code:

1 xfailed in 6.48s

Remove the xfail once startup auto-reconciles instead of panicking/stalling.
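The "resumes producing blocks" half of the assertion can be expressed as a bounded poll on block height, which also turns the F2 stall into a timeout instead of a hang. A minimal sketch: `resumes_producing_blocks` and the `get_height` callable are hypothetical, and how the suite actually reads the height is not shown in this PR description.

```python
import time


def resumes_producing_blocks(get_height, baseline, timeout_s=30.0, poll_s=0.5,
                             sleep=time.sleep, clock=time.monotonic):
    """Return True once get_height() exceeds baseline, False on timeout.

    get_height is a caller-supplied callable (hypothetical here); sleep and
    clock are injectable so the helper can be exercised without a node.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_height() > baseline:
            return True
        sleep(poll_s)
    return False


# Usage with a fake height source: two stalled reads, then progress.
heights = iter([5, 5, 6])
print(resumes_producing_blocks(lambda: next(heights), baseline=5,
                               timeout_s=5.0, poll_s=0.0))
```

A stalled node yields False after the timeout, so the test fails with a clear reason rather than blocking the run.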

Notes

- Test-only; no production code changed.
- Unique base_dir=/tmp/gravity-cluster-desync and unique ports (rpc 8945, validator 6480, vfn 6490, authrpc 8953, reth_p2p 12324, metrics 9050, inspection 10300), so there is no collision with existing suites.
- Cross-repo refs: Galxe/gravity-audit#703, Galxe/gravity-audit#704.

🤖 Generated with Claude Code

#704)

Adds a single-node e2e suite that deterministically reproduces the non-atomic cross-store commit bug (reth execution rocksdb ahead of the aptos consensus DB) and the recovery-path panic it triggers.

Repro (no power-loss race needed):
1. bring node live, commit past H, node.stop() cleanly
2. gravity_cli unwind the CONSENSUS DB back to ~H/2, leaving reth at H (NOTE: --consensus-db-path must be the parent <data> dir, the tool joins "consensus_db" internally; passing <data>/consensus_db no-ops)
3. node.start(): recovery anchors on reth's height (recover_block_number), the unwound consensus index lacks it, RecoveryData::new -> find_root fails "unable to find root" -> persistent_liveness_storage.rs:550 panic!("") and the node CRASHES (no PartialRecoveryData/state-sync fallback; recovery_manager.rs:104 is todo!()).

Observed evidence captured by the test: ERROR persistent_liveness_storage.rs:549 Failed to construct recovery data {"error":"unable to find root: ..."} ; panicked at persistent_liveness_storage.rs:550 ; node process dead.

The sibling silent-stall path (block_buffer_manager.rs:343 early-return leaving the buffer Uninitialized -> get_ordered_blocks awaits ready_notifier forever) is also asserted as a recovery failure.

The test asserts the CORRECT post-fix behaviour (node reconciles reth_height > consensus_index and resumes), so it is xfail(strict) on current code. Remove the xfail once startup auto-reconciles instead of panicking/stalling.

Test-only; no production code changed.

Refs Galxe/gravity-audit#703, Galxe/gravity-audit#704

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

keanji-x wants to merge 1 commit into main from test/e2e-cross-store-desync-703-704
