Skip to content

feat(reader): Add read_with_metrics() for scan I/O metrics#2349

Open
mbutrovich wants to merge 8 commits intoapache:mainfrom
mbutrovich:bytes_read
Open

feat(reader): Add read_with_metrics() for scan I/O metrics#2349
mbutrovich wants to merge 8 commits intoapache:mainfrom
mbutrovich:bytes_read

Conversation

@mbutrovich
Copy link
Copy Markdown
Collaborator

@mbutrovich mbutrovich commented Apr 20, 2026

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Add always-on per-scan I/O metrics to ArrowReader.

Motivation: Downstream engines need per-scan byte counts for their UIs. For example, DataFusion Comet uses this to populate bytes_scanned on its Iceberg scan operator, which flows through to Spark UI via TaskMetrics.inputMetrics.setBytesRead(). This must be per-scan, not global. Concurrent scans against the same FileIO need independent counters. The approach matches DataFusion's pattern of wrapping AsyncFileReader with a counting layer and is storage-backend agnostic.

ArrowReader::read() now returns ScanResult

  • ScanResult wraps the record batch stream and ScanMetrics. Accessors: stream(), metrics(), into_parts().
  • Metrics are always collected. One fetch_add(Relaxed) per I/O request, negligible overhead.
  • Counter is created fresh per read() call, so cloned readers get independent metrics.

New file: crates/iceberg/src/arrow/scan_metrics.rs

  • CountingFileRead<F: FileRead>: generic wrapper that increments a shared AtomicU64 on each read().
  • ScanMetrics: public handle exposing bytes_read().
  • ScanResult: public struct returned by ArrowReader::read().

FileRead blanket impl for Box<dyn FileRead>

  • Enables generic CountingFileRead<F> to wrap the boxed reader returned by FileIO::reader().

Single open_parquet_file with counting

  • All Parquet opens (data files and delete files) go through the same open_parquet_file wrapped with CountingFileRead, so bytes_read reflects total scan I/O.
  • build_parquet_reader(): shared internals for reader construction and metadata loading.

FileScanTaskReader struct (refactor)

  • Extracted process_file_scan_task's parameters into a Clone struct with a process(self, task) method, resolving a clippy::too_many_arguments violation. Struct and impl are co-located.

Re-exports

  • ScanMetrics and ScanResult re-exported from iceberg::arrow and iceberg::scan.

Are these changes tested?

test_scan_metrics_bytes_read in reader.rs: asserts bytes_read() == 0 before stream consumption (the stream is lazy) and bytes_read() > 0 after. test_scan_metrics_includes_delete_file_bytes: reads the same data file with and without a positional delete file and asserts bytes_read is strictly greater when deletes are present. All existing reader and scan tests pass (updated to use ScanResult::stream()).

@mbutrovich
Copy link
Copy Markdown
Collaborator Author

@blackmwk I'm trying really hard not to add more to reader.rs. Let me know if you have any suggestions.

@mbutrovich mbutrovich requested review from CTTY and blackmwk April 20, 2026 22:15
Copy link
Copy Markdown
Collaborator

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just took a quick pass, love the direction!

Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbutrovich for this pr!

Comment thread crates/iceberg/src/arrow/scan_metrics.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
Comment thread crates/iceberg/src/arrow/reader.rs Outdated
@mbutrovich
Copy link
Copy Markdown
Collaborator Author

Thanks for the feedback @CTTY and @blackmwk! I will address the comments tomorrow.

@mbutrovich mbutrovich requested review from CTTY and blackmwk April 22, 2026 17:19
@mbutrovich
Copy link
Copy Markdown
Collaborator Author

I believe I addressed all of the feedback. Thanks for the first pass @CTTY @blackmwk, please take another look whenever you get a spare moment. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants