fix(vector/index): bound reader/writer allocations sized from unverified on-disk header counts#808
Merged
Merged
Conversation
…ied on-disk header counts Generalize the #791 file-size bound to the HNSW/Flat/IVF segment loaders. A corrupt or hostile header count (num_vectors, n_clusters, the HNSW graph's node/layer/neighbor counts) or per-record length (field_name_len, PQ codes) could otherwise drive a multi-GiB with_capacity / vec![0u8; len] that aborts the process via handle_alloc_error (OOM) before any corruption check runs. Add a shared alloc_bounds helper (checked_capacity / checked_len) that rejects, before allocating, any count or length larger than what StorageInput::size() shows the file can hold. The loaders capture file_size once and the bytes remaining per section once, so per-record checks add no extra syscall (free on the default mmap backend). Applied to both the reader load paths and the writer reload paths; the latter run no checksum footer verification at all, so they reach counts unverified even for footer-carrying segments. This also covers legacy footer-less .hnsw segments, the residual exposure left by the #786 footer. Document the StorageInput::size() contract that loaders rely on for these bounds, and add oversized-count rejection tests for each reader (HNSW exercises both a footer-less num_vectors and a footer-less graph node_count). Closes #806
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Generalizes the #791 file-size bound to the HNSW / Flat / IVF vector segment loaders. A corrupt or hostile on-disk header count (
num_vectors,n_clusters, the HNSW graph'snode/layer/neighborcounts) or per-record length (field_name_len, PQcodes) could otherwise drive a multi-GiBwith_capacity/vec![0u8; len]that aborts the process viahandle_alloc_error(OOM) before any corruption check runs.A new shared helper
laurus/src/vector/index/alloc_bounds.rsrejects, before allocating, any count or length larger than whatStorageInput::size()shows the file can hold, so a flipped byte surfaces a cleanLaurusError::Index("corrupted segment") instead of an OOM abort.Approach
checked_capacity(count, min_stride, available, what)— each element is ≥min_stridebytes on disk, soavailablebytes hold at mostavailable / min_strideof them; a larger count is corruption.checked_len(len, available, what)— a per-record buffer can't exceed the bytes left in the file.file_size = input.size()?once and the bytes remaining per section once; per-record checks reuse that value, so no extra syscall is added in the hot loops (pure arithmetic on the default mmap backend, and theFileInputBufReader is never thrashed).Scope
Applied to both the reader load paths and the writer reload paths (used on append/merge). The writer paths run no checksum footer verification at all, so they reach counts unverified even for footer-carrying segments — a larger exposure than the readers. This also covers legacy footer-less
.hnswsegments, the residual exposure left by the #786 footer (footer-carrying.hnswis already CRC-verified before the structural parse).Changes
vector/index/alloc_bounds.rs(new helper + unit tests), registered invector/index.rsvector/index/hnsw/{reader,writer}.rs,flat/{reader,writer}.rs,ivf/{reader,writer}.rs— bounded every header-countwith_capacityand per-record buffer readstorage.rs— documented theStorageInput::size()contract loaders rely ondocs/src/concepts/indexing/vector_indexing.md— user-facing "Bounded allocations from on-disk header counts" sectionNo on-disk format change and no public API signature change.
Tests
Per-reader
alloc_bound_testsbuild a hand-crafted segment whose count is corrupted to a huge value and assertload()returns a clean error rather than aborting:num_vectorsn_clusters(centroids)num_vectorsand oversized graphnode_count, both on footer-less segmentsalloc_boundshelper unit tests (6)Verification
cargo clippy -p laurus --all-targets --features embeddings-all -- -D warnings— 0 warningscargo clippy -p laurus --all-targets --features pq-fastscan -- -D warnings— 0 warningscargo test -p laurus --lib— 1142 passed / 0 failed (10 new tests)vector_recall_test(9),vector_segment_test(1)markdownlint-cli2on the edited doc — 0 errorscargo build -p laurus-wasm --target wasm32-unknown-unknown— buildsCloses #806