feat(data): coalesce position deletes into range inserts #645
Merged
3a182b2 to efb1db3
wgtmac
reviewed
May 22, 2026
Add ForEachPositionDelete (the C++ equivalent of Java's PositionDeleteRangeConsumer) and route DeleteLoader through it, replacing the per-position PositionDeleteIndex::Delete(pos) call. The function sniffs a 1024-position prefix and dispatches to either run coalescing (CRoaring addRange) or bulk addMany grouped by high-32-bit key.
Also rework DeleteLoader::LoadPositionDelete to read Arrow batches via nanoarrow's ArrowArrayView directly. When the delete file's referenced_data_file matches the target (V2 writer hint), positions are passed as a zero-copy span; otherwise a per-batch staging vector filters by path.
Local microbenchmarks: 2.2x-10.6x for ForEachPositionDelete and 2.1x-2.5x end-to-end through LoadPositionDeletes. Equivalent of apache/iceberg#16052.
Adds an integration test that exercises the loader's referenced_data_file fast path with enough rows (128) to clear the consumer's 64-element sniff threshold, and an assertion on the existing mixed-paths test that locks in the filter-path routing. Documents which branch each test covers so a future refactor of PositionDeleteWriter or the loader can't silently take the wrong path.
…rKey to span
Address review feedback on PR apache#645:
- AddManyForKey / BulkAddForKey now take std::span<const uint32_t> instead of pointer+length, matching the rest of the PR's style.
- ForEachPositionDelete takes a caller-owned scratch vector instead of a thread_local, removing the re-entrancy hazard documented on the prior API and giving the caller full control over buffer lifetime.
676e1b3 to 72f9eb6
Contributor
Author
Thanks for the look @wgtmac, I have addressed your comments! It extended the implementation quite a bit across more files, and I am not sure I like it, so let me know if you would prefer something less invasive or if you find anything else!
wgtmac
approved these changes
May 27, 2026
Member
wgtmac
left a comment
Nice improvement! Thanks @Baunsgaard!
'data/position_delete_writer.cc',
'data/writer.cc',
'deletes/position_delete_index.cc',
'deletes/position_delete_range_consumer.cc',
Member
We need to install deletes/position_delete_range_consumer.h in the meson.build.