fix(vector/index): bound reader/writer allocations sized from unverified on-disk header counts#808

Merged
mosuka merged 1 commit into main from fix/reader-alloc-bounds on Jun 15, 2026

@mosuka mosuka commented Jun 15, 2026

Summary

Generalizes the #791 file-size bound to the HNSW / Flat / IVF vector segment loaders. A corrupt or hostile on-disk header count (num_vectors, n_clusters, the HNSW graph's node/layer/neighbor counts) or per-record length (field_name_len, PQ codes) could otherwise drive a multi-GiB with_capacity / vec![0u8; len] that aborts the process via handle_alloc_error (OOM) before any corruption check runs.

A new shared helper laurus/src/vector/index/alloc_bounds.rs rejects, before allocating, any count or length larger than what StorageInput::size() shows the file can hold, so a flipped byte surfaces a clean LaurusError::Index ("corrupted segment") instead of an OOM abort.

Approach

  • checked_capacity(count, min_stride, available, what) — each element is ≥ min_stride bytes on disk, so available bytes hold at most available / min_stride of them; a larger count is corruption.
  • checked_len(len, available, what) — a per-record buffer can't exceed the bytes left in the file.
  • Each loader captures file_size = input.size()? once and the bytes remaining per section once; per-record checks reuse those two values, so no extra syscall is added in the hot loops (pure arithmetic on the default mmap backend, and the FileInput BufReader is never thrashed).
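
The two helpers described above can be sketched roughly as follows. This is a minimal illustration, not the code in alloc_bounds.rs: the integer widths, the return type, and the `CorruptSegment` error (a hypothetical stand-in for LaurusError::Index) are all assumptions made so the example is self-contained.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the crate's error; the PR reports
/// `LaurusError::Index`, whose exact shape is not shown in the description.
#[derive(Debug, PartialEq)]
pub struct CorruptSegment(pub String);

/// Reject `count` if `available` bytes cannot hold that many elements of at
/// least `min_stride` bytes each. Signature and types are assumed.
pub fn checked_capacity(
    count: u64,
    min_stride: u64,
    available: u64,
    what: &str,
) -> Result<usize, CorruptSegment> {
    // Treat a zero stride as one byte so the division is always defined.
    let max_count = available / min_stride.max(1);
    if count > max_count {
        return Err(CorruptSegment(format!(
            "corrupted segment: {what} = {count}, but {available} bytes hold at most {max_count}"
        )));
    }
    // The count is now bounded by the file size; this only fails on targets
    // where usize is narrower than the bounded value.
    usize::try_from(count)
        .map_err(|_| CorruptSegment(format!("corrupted segment: {what} does not fit in usize")))
}

/// Reject a per-record buffer length larger than the bytes left in the file.
pub fn checked_len(len: u64, available: u64, what: &str) -> Result<usize, CorruptSegment> {
    if len > available {
        return Err(CorruptSegment(format!(
            "corrupted segment: {what} = {len} exceeds the {available} bytes remaining"
        )));
    }
    usize::try_from(len)
        .map_err(|_| CorruptSegment(format!("corrupted segment: {what} does not fit in usize")))
}
```

Under these assumptions a loader would write something like `Vec::with_capacity(checked_capacity(num_vectors, stride, remaining, "num_vectors")?)`, so the allocation is only reached once the count has been bounded by the file size.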

Scope

Applied to both the reader load paths and the writer reload paths (used on append/merge). The writer paths run no checksum footer verification at all, so they reach counts unverified even for footer-carrying segments — a larger exposure than the readers. This also covers legacy footer-less .hnsw segments, the residual exposure left by the #786 footer (footer-carrying .hnsw is already CRC-verified before the structural parse).

Changes

  • vector/index/alloc_bounds.rs (new helper + unit tests), registered in vector/index.rs
  • vector/index/hnsw/{reader,writer}.rs, flat/{reader,writer}.rs, ivf/{reader,writer}.rs — bounded every header-count with_capacity and per-record buffer read
  • storage.rs — documented the StorageInput::size() contract loaders rely on
  • docs/src/concepts/indexing/vector_indexing.md — user-facing "Bounded allocations from on-disk header counts" section

No on-disk format change and no public API signature change.

Tests

Per-reader alloc_bound_tests build a hand-crafted segment whose count is corrupted to a huge value and assert load() returns a clean error rather than aborting:

  • Flat: oversized num_vectors
  • IVF: oversized n_clusters (centroids)
  • HNSW: oversized num_vectors and oversized graph node_count, both on footer-less segments
  • alloc_bounds helper unit tests (6)
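
The shape of such a test can be sketched against a toy loader. Everything here is hypothetical: the byte layout, `load_toy_flat`, and `build_segment` are invented for illustration and do not reflect the real Flat segment format or the crate's test helpers. The point is only the pattern of capturing the file size once, bounding the header count, and then allocating.

```rust
use std::convert::TryInto;

// Hypothetical toy layout, NOT the real Laurus format:
//   [num_vectors: u32 LE][dimension: u32 LE][num_vectors * dimension f32 LE]
const HEADER_LEN: usize = 8;

/// Read a little-endian u32 at `at`, failing cleanly on a short file.
fn read_u32(bytes: &[u8], at: usize) -> Result<u32, String> {
    let raw: [u8; 4] = bytes
        .get(at..at + 4)
        .ok_or_else(|| "corrupted segment: truncated data".to_string())?
        .try_into()
        .map_err(|_| "corrupted segment: truncated data".to_string())?;
    Ok(u32::from_le_bytes(raw))
}

/// Toy loader: bounds the header count by the file size before allocating.
fn load_toy_flat(bytes: &[u8]) -> Result<Vec<Vec<f32>>, String> {
    // Captured once, mirroring `file_size = input.size()?` in the description.
    let file_size = bytes.len() as u64;
    let num_vectors = u64::from(read_u32(bytes, 0)?);
    let dimension = u64::from(read_u32(bytes, 4)?);
    // Both header reads succeeded, so the file has at least HEADER_LEN bytes.
    let available = file_size - HEADER_LEN as u64;

    // Each vector needs at least dimension * 4 bytes (never less than 1).
    let min_stride = (dimension * 4).max(1);
    if num_vectors > available / min_stride {
        return Err(format!(
            "corrupted segment: num_vectors = {num_vectors} cannot fit in {available} bytes"
        ));
    }

    // Both capacities are now bounded by the file length.
    let mut vectors = Vec::with_capacity(num_vectors as usize);
    let mut offset = HEADER_LEN;
    for _ in 0..num_vectors {
        let mut vector = Vec::with_capacity(dimension as usize);
        for _ in 0..dimension {
            vector.push(f32::from_bits(read_u32(bytes, offset)?));
            offset += 4;
        }
        vectors.push(vector);
    }
    Ok(vectors)
}

/// Build a toy segment by hand, as the per-reader tests do for real segments.
fn build_segment(num_vectors: u32, dimension: u32, values: &[f32]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&num_vectors.to_le_bytes());
    out.extend_from_slice(&dimension.to_le_bytes());
    for value in values {
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}
```

A test in this style builds a valid segment with `build_segment`, overwrites the first four bytes with `u32::MAX.to_le_bytes()`, and asserts that `load_toy_flat` returns an `Err` mentioning "corrupted segment" instead of attempting the allocation.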

Verification

  • cargo clippy -p laurus --all-targets --features embeddings-all -- -D warnings — 0 warnings
  • cargo clippy -p laurus --all-targets --features pq-fastscan -- -D warnings — 0 warnings
  • cargo test -p laurus --lib — 1142 passed / 0 failed (10 new tests)
  • Integration: vector_recall_test (9), vector_segment_test (1)
  • markdownlint-cli2 on the edited doc — 0 errors
  • cargo build -p laurus-wasm --target wasm32-unknown-unknown — builds

Closes #806

…ied on-disk header counts

Generalize the #791 file-size bound to the HNSW/Flat/IVF segment loaders.
A corrupt or hostile header count (num_vectors, n_clusters, the HNSW
graph's node/layer/neighbor counts) or per-record length (field_name_len,
PQ codes) could otherwise drive a multi-GiB with_capacity / vec![0u8; len]
that aborts the process via handle_alloc_error (OOM) before any corruption
check runs.

Add a shared alloc_bounds helper (checked_capacity / checked_len) that
rejects, before allocating, any count or length larger than what
StorageInput::size() shows the file can hold. The loaders capture
file_size once and the bytes remaining per section once, so per-record
checks add no extra syscall (free on the default mmap backend).

Applied to both the reader load paths and the writer reload paths; the
latter run no checksum footer verification at all, so they reach counts
unverified even for footer-carrying segments. This also covers legacy
footer-less .hnsw segments, the residual exposure left by the #786 footer.

Document the StorageInput::size() contract that loaders rely on for these
bounds, and add oversized-count rejection tests for each reader (HNSW
exercises both a footer-less num_vectors and a footer-less graph
node_count).

Closes #806
@mosuka mosuka merged commit 826e859 into main Jun 15, 2026
22 checks passed
@mosuka mosuka deleted the fix/reader-alloc-bounds branch June 15, 2026 04:49

Development

reliability(vector/index): bound reader allocations sized from unverified on-disk header counts (generalize #791)
