perf(vector/hnsw): fold the .hnsw checksum into the Eager load pass (#789) #811

Merged: mosuka merged 1 commit into main from perf/789-hnsw-checksum-fold on Jun 22, 2026

@mosuka mosuka commented Jun 21, 2026

Summary

Removes the redundant full read of .hnsw segments on the Eager load path while keeping the #786 corruption guarantee.

#786 verified the CRC-32 footer with verify_checksum_footer — an independent full pass over the content before the structural parse. In Eager mode the parse then reads the same content again, so a footer-carrying segment was read ~twice on every reader creation (every searcher-cache miss, i.e. after each commit()).

This PR folds the CRC into the single Eager structural pass and verifies the footer after parse — no extra read. The Lazy/OnDemand path (which seeks over the vector payload) keeps a dedicated up-front verification pass; legacy footer-less segments skip verification.
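The per-mode decision described above can be sketched as a small pure function. This is only an illustration: `LoadMode`, `VerifyPlan`, and `plan_verification` are hypothetical names chosen for this sketch, not identifiers from the laurus codebase.

```rust
/// Hypothetical load modes, named after the modes mentioned in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LoadMode {
    Eager,
    Lazy,
    OnDemand,
}

/// Hypothetical description of how the CRC footer gets verified.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum VerifyPlan {
    /// Accumulate the CRC during the structural parse, compare afterwards.
    FoldIntoParse,
    /// Run a dedicated full pass over the content before parsing.
    DedicatedPass,
    /// Legacy footer-less segment: nothing to verify.
    Skip,
}

/// Choose a verification plan from the load mode and footer presence.
pub fn plan_verification(mode: LoadMode, footer_present: bool) -> VerifyPlan {
    if !footer_present {
        return VerifyPlan::Skip;
    }
    match mode {
        // Eager parses every content byte anyway, so the CRC can ride along.
        LoadMode::Eager => VerifyPlan::FoldIntoParse,
        // Lazy/OnDemand seek over the vector payload, so a folded CRC
        // would miss bytes; they keep the up-front pass.
        LoadMode::Lazy | LoadMode::OnDemand => VerifyPlan::DedicatedPass,
    }
}
```

Only the `Eager` case with a footer takes the folded path; every other combination behaves as it did after #786.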

Changes

  • storage/checksum.rs: new ChecksumTrackingInput, a StorageInput wrapper that accumulates a CRC over sequential reads. A real seek clears is_sequential; the no-op Current(0) seek used by stream_position() is served from a byte counter so it does not break tracking; track=false degrades the wrapper to a thin position-tracking pass-through; absorb_to(len) covers any residual bytes up to len; clone_input() returns the unwrapped inner input, so OnDemand clones inherit no running-CRC state.
  • vector/index/hnsw/reader.rs: verify_checksum_footer is split into read_footer_crc (8-byte footer probe only) and verify_footer_content (the dedicated Lazy pass). load() sets fold = Eager && footer_present, runs the structural parse through ChecksumTrackingInput, then calls absorb_to(content_len) and compares the result against the stored CRC, falling back to verify_footer_content if is_sequential() is unexpectedly false.
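A minimal sketch of the tracking-wrapper idea, written against `std::io::Read` and `Seek` rather than the project's `StorageInput` trait. `TrackingReader`, `crc32_update`, and `crc32` are hypothetical names for this illustration, the CRC is a plain bitwise CRC-32 (IEEE), and the sketch assumes the inner reader starts at offset 0. The actual `ChecksumTrackingInput` may differ in all of these respects.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320) update step.
/// The running state starts at 0xFFFF_FFFF and is inverted at the end.
fn crc32_update(mut state: u32, bytes: &[u8]) -> u32 {
    for &b in bytes {
        state ^= b as u32;
        for _ in 0..8 {
            let mask = (state & 1).wrapping_neg();
            state = (state >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    state
}

/// One-shot CRC-32 of a byte slice, for comparison.
pub fn crc32(bytes: &[u8]) -> u32 {
    !crc32_update(0xFFFF_FFFF, bytes)
}

/// Hypothetical wrapper: accumulates a CRC while reads stay sequential.
pub struct TrackingReader<R> {
    inner: R,
    state: u32,
    position: u64, // assumes `inner` starts at offset 0
    sequential: bool,
    track: bool,
}

impl<R: Read + Seek> TrackingReader<R> {
    pub fn new(inner: R, track: bool) -> Self {
        Self { inner, state: 0xFFFF_FFFF, position: 0, sequential: true, track }
    }

    pub fn is_sequential(&self) -> bool {
        self.sequential
    }

    pub fn crc(&self) -> u32 {
        !self.state
    }

    /// Read (and checksum) any bytes between the current position and `len`.
    pub fn absorb_to(&mut self, len: u64) -> io::Result<()> {
        let mut buf = [0u8; 4096];
        while self.position < len {
            let want = (len - self.position).min(buf.len() as u64) as usize;
            if self.read(&mut buf[..want])? == 0 {
                return Err(io::ErrorKind::UnexpectedEof.into());
            }
        }
        Ok(())
    }
}

impl<R: Read + Seek> Read for TrackingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        if self.track && self.sequential {
            self.state = crc32_update(self.state, &buf[..n]);
        }
        self.position += n as u64;
        Ok(n)
    }
}

impl<R: Read + Seek> Seek for TrackingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        // Position query: answer from the byte counter, keep tracking intact.
        if let SeekFrom::Current(0) = pos {
            return Ok(self.position);
        }
        // A real seek breaks the sequential assumption for the running CRC.
        self.sequential = false;
        self.position = self.inner.seek(pos)?;
        Ok(self.position)
    }
}
```

After the parse, a caller would run `absorb_to(content_len)`, check `is_sequential()`, and compare `crc()` against the stored footer value; if `is_sequential()` is false the running CRC is not trustworthy and a dedicated pass is needed.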

Tests

  • ChecksumTrackingInput unit tests (sequential CRC, stream_position keeps sequential, real seek clears it, absorb_to, track=false, clone_input unwraps inner).
  • Corruption rejection on Eager (Scalar8Bit), Lazy/mmap, and pq-fastscan (feature-gated) segments.
  • Legacy footer-less load on both Eager and Lazy/mmap paths.
  • Read-count measurement (eager_load_reads_hnsw_segment_exactly_once): a CountingStorage asserts that the Eager load reads exactly file_size bytes of the segment (footer probe + one folded pass). The pre-#789 double read was ~2 * content_len + 8, so this is a deterministic regression guard for the latency-restored criterion.
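The read-count arithmetic can be illustrated with a byte-counting reader. This is a sketch only: `CountingReader`, `double_read_cost`, and `folded_read_cost` are hypothetical stand-ins built on `std::io`, not the `CountingStorage` used in the actual test, and the only layout assumed is an 8-byte footer after the content.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

const FOOTER_LEN: u64 = 8;

/// Hypothetical stand-in for a counting storage: tallies bytes returned by reads.
pub struct CountingReader<R> {
    inner: R,
    bytes_read: u64,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.inner.seek(pos)
    }
}

fn content_len_of(segment: &[u8]) -> io::Result<u64> {
    (segment.len() as u64).checked_sub(FOOTER_LEN).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "segment shorter than footer")
    })
}

/// Probe the 8-byte footer at the end of the content.
fn read_footer<R: Read + Seek>(r: &mut R, content_len: u64) -> io::Result<[u8; 8]> {
    r.seek(SeekFrom::Start(content_len))?;
    let mut footer = [0u8; 8];
    r.read_exact(&mut footer)?;
    Ok(footer)
}

/// Read the whole content region once, discarding the bytes.
fn full_pass<R: Read + Seek>(r: &mut R, content_len: u64) -> io::Result<()> {
    r.seek(SeekFrom::Start(0))?;
    let mut limited = r.by_ref().take(content_len);
    io::copy(&mut limited, &mut io::sink())?;
    Ok(())
}

/// Pre-#789 shape: footer probe, verification pass, then parse pass.
pub fn double_read_cost(segment: &[u8]) -> io::Result<u64> {
    let content_len = content_len_of(segment)?;
    let mut r = CountingReader { inner: Cursor::new(segment), bytes_read: 0 };
    read_footer(&mut r, content_len)?;
    full_pass(&mut r, content_len)?; // dedicated verification pass
    full_pass(&mut r, content_len)?; // structural parse
    Ok(r.bytes_read)
}

/// Folded shape: footer probe plus a single pass that parses and checksums.
pub fn folded_read_cost(segment: &[u8]) -> io::Result<u64> {
    let content_len = content_len_of(segment)?;
    let mut r = CountingReader { inner: Cursor::new(segment), bytes_read: 0 };
    read_footer(&mut r, content_len)?;
    full_pass(&mut r, content_len)?;
    Ok(r.bytes_read)
}
```

For a segment with 1000 content bytes and an 8-byte footer, the double-read shape counts 2 * 1000 + 8 bytes while the folded shape counts exactly the file size, which is the property the regression guard relies on.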

Verification

  • cargo test -p laurus --test vector_hnsw_checksum_test → 5 passed
  • cargo test -p laurus --lib checksum → 9 passed
  • cargo test -p laurus --lib vector::index::hnsw → 37 passed (default) / 38 passed (--features pq-fastscan)
  • cargo fmt --check clean; cargo clippy -p laurus --all-targets -- -D warnings clean (default and --features pq-fastscan)
  • cargo build -p laurus-wasm --target wasm32-unknown-unknown → success

Notes

  • HnswIndexReader::load signature is unchanged → no language-binding impact.
  • An adversarial multi-agent review (fold-correctness, position-tracking, lazy/ondemand, back-compat, test-adequacy) found no correctness defects; the actionable coverage gaps it surfaced are included in the tests above.

Closes #789

…789)

The #786 footer verification ran as a separate full read before the
structural parse, ~doubling read I/O on every Eager reader creation.
Fold the CRC into the single Eager structural pass via a new
ChecksumTrackingInput wrapper and verify the footer after parse; the
Lazy/OnDemand path keeps a dedicated up-front pass and legacy
footer-less segments skip verification.

A read-counting test asserts the Eager load now reads the segment
exactly once (file_size bytes) instead of ~2x content_len.

Closes #789
@mosuka mosuka merged commit d49c419 into main Jun 22, 2026
22 checks passed
@mosuka mosuka deleted the perf/789-hnsw-checksum-fold branch June 22, 2026 13:40
