Skip to content

feat(data): coalesce position deletes into range inserts#645

Merged
wgtmac merged 3 commits into
apache:mainfrom
Baunsgaard:coalesce-position-deletes-loader
May 27, 2026
Merged

feat(data): coalesce position deletes into range inserts#645
wgtmac merged 3 commits into
apache:mainfrom
Baunsgaard:coalesce-position-deletes-loader

Conversation

@Baunsgaard
Copy link
Copy Markdown
Contributor

Add ForEachPositionDelete (the C++ equivalent of Java's PositionDeleteRangeConsumer) and route DeleteLoader through it, replacing the per-position PositionDeleteIndex::Delete(pos) call. The function sniffs a 1024-position prefix and dispatches to either run coalescing (CRoaring addRange) or bulk addMany grouped by high-32-bit key.

Also rework DeleteLoader::LoadPositionDelete to read Arrow batches via nanoarrow's ArrowArrayView directly. When the delete file's referenced_data_file matches the target (V2 writer hint), positions are passed as a zero-copy span; otherwise a per-batch staging vector filters by path.

Local microbenchmarks: 2.2x-10.6x for ForEachPositionDelete and 2.1x-2.5x end-to-end through LoadPositionDeletes. Equivalent of apache/iceberg#16052.

@Baunsgaard Baunsgaard force-pushed the coalesce-position-deletes-loader branch 4 times, most recently from 3a182b2 to efb1db3 Compare May 13, 2026 08:20
Comment thread src/iceberg/deletes/roaring_position_bitmap.h Outdated
Comment thread src/iceberg/deletes/position_delete_range_consumer.cc Outdated
@wgtmac wgtmac added the awaiting author action Author needs to address comments, resolve conflicts, answer questions, etc. label May 25, 2026
Add ForEachPositionDelete (the C++ equivalent of Java's
PositionDeleteRangeConsumer) and route DeleteLoader through it,
replacing the per-position PositionDeleteIndex::Delete(pos) call. The
function sniffs a 1024-position prefix and dispatches to either run
coalescing (CRoaring addRange) or bulk addMany grouped by
high-32-bit key.

Also rework DeleteLoader::LoadPositionDelete to read Arrow batches via
nanoarrow's ArrowArrayView directly. When the delete file's
referenced_data_file matches the target (V2 writer hint), positions
are passed as a zero-copy span; otherwise a per-batch staging vector
filters by path.

Local microbenchmarks: 2.2x-10.6x for ForEachPositionDelete and
2.1x-2.5x end-to-end through LoadPositionDeletes. Equivalent of
apache/iceberg#16052.
Adds an integration test that exercises the loader's referenced_data_file
fast path with enough rows (128) to clear the consumer's 64-element sniff
threshold, and an assertion on the existing mixed-paths test that locks
in the filter-path routing. Documents which branch each test covers so a
future refactor of PositionDeleteWriter or the loader can't silently take
the wrong path.
…rKey to span

Address review feedback on PR apache#645:

- AddManyForKey / BulkAddForKey now take std::span<const uint32_t>
  instead of pointer+length, matching the rest of the PR's style.
- ForEachPositionDelete takes a caller-owned scratch vector instead of
  a thread_local, removing the re-entrancy hazard documented on the
  prior API and giving the caller full control over buffer lifetime.
@Baunsgaard Baunsgaard force-pushed the coalesce-position-deletes-loader branch from 676e1b3 to 72f9eb6 Compare May 26, 2026 13:25
@Baunsgaard
Copy link
Copy Markdown
Contributor Author

Baunsgaard commented May 26, 2026

Thanks for the Look @wgtmac, I have addressed your comments!

It extended the implementation quite a bit across more files, I am not sure i like it, but let me know if you would prefer something less invasive, and if you find something else!

@Baunsgaard Baunsgaard requested a review from wgtmac May 26, 2026 15:06
@wgtmac wgtmac removed the awaiting author action Author needs to address comments, resolve conflicts, answer questions, etc. label May 26, 2026
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement! Thanks @Baunsgaard!

Comment thread src/iceberg/meson.build
'data/position_delete_writer.cc',
'data/writer.cc',
'deletes/position_delete_index.cc',
'deletes/position_delete_range_consumer.cc',
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to install deletes/position_delete_range_consumer.h in the meson.build.

@wgtmac wgtmac merged commit e599010 into apache:main May 27, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants