Skip to content

Fix deprecate MutableArrayData::extend and MutableArrayData::extend_nulls in favour of fallible try_extend / try_extend_nulls#9710

Open
HawaiianSpork wants to merge 3 commits intoapache:mainfrom
relativityone:safe_array_extends
Open

Fix deprecate MutableArrayData::extend and MutableArrayData::extend_nulls in favour of fallible try_extend / try_extend_nulls#9710
HawaiianSpork wants to merge 3 commits intoapache:mainfrom
relativityone:safe_array_extends

Conversation

@HawaiianSpork
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

MutableArrayData::extend and extend_nulls panic at runtime when offset arithmetic overflows the underlying integer type (e.g. accumulating more than 2 GiB of data in a StringArray) or when the run-end counter overflows in a RunEndEncoded array. Because these methods return () there is no way for callers to recover from or even detect the failure. This makes it impossible to build robust, panic-free pipelines on top of MutableArrayData.

What changes are included in this PR?

Adds two new methods to MutableArrayData in arrow-data:

  • try_extend(index, start, end) -> Result<(), ArrowError> — returns an error instead of panicking on offset overflow. On error, internal buffers are rolled back to their pre-call lengths so the builder is left in a consistent, valid state.
  • try_extend_nulls(len) -> Result<(), ArrowError> — returns an error instead of panicking when the run-end counter overflows in a RunEndEncoded array.

The original extend and extend_nulls methods are deprecated with a note pointing to the new alternatives. Their implementations delegate to the new methods and expect on the result, preserving existing behaviour for code that has not yet migrated.

All deprecated call sites within the workspace are updated:

File Changes
arrow-cast/src/cast/list.rs extendtry_extend, extend_nullstry_extend_nulls; errors mapped to ArrowError::CastError
arrow-json/src/reader/run_end_array.rs extendtry_extend; errors mapped to ArrowError::JsonError
arrow-select/src/concat.rs extendtry_extend
arrow-select/src/filter.rs extendtry_extend; for_each closures converted to for loops to allow ?
arrow-select/src/interleave.rs extendtry_extend
arrow-select/src/merge.rs extendtry_extend, extend_nullstry_extend_nulls; for_each closure converted to for loop
arrow-select/src/zip.rs extendtry_extend; for_each closure converted to for loop
parquet/src/arrow/array_reader/fixed_size_list_array.rs extendtry_extend, extend_nullstry_extend_nulls; errors mapped via general_err!
parquet/src/arrow/array_reader/list_array.rs extendtry_extend; errors mapped via general_err!

At each call site errors are surfaced in the type idiomatic for that crate rather than leaking an InvalidArgumentError from the transform layer — cast functions return CastError, JSON reader functions return JsonError, and parquet reader functions convert to ParquetError.

Are these changes tested?

The changes are covered by the existing test suites for each affected crate. The new try_extend and try_extend_nulls methods are exercised indirectly through all existing tests that exercise extend and extend_nulls (since the deprecated methods now delegate to them), as well as through all the call sites updated in this PR.

Are there any user-facing changes?

MutableArrayData::extend and MutableArrayData::extend_nulls are deprecated. Callers should migrate to try_extend and try_extend_nulls respectively and handle the returned Result. The deprecated methods continue to compile and behave identically to before (panicking on overflow), so there are no breaking changes.

Notes

An LLM was used to make some of the code modifications. All code has been reviewed by a human.

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Apr 14, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Apr 14, 2026

Have you run any benchmarks against this PR?

Also, are there any public API changes?

@HawaiianSpork
Copy link
Copy Markdown
Contributor Author

run benchmark arrow

@adriangbot
Copy link
Copy Markdown

@HawaiianSpork
Copy link
Copy Markdown
Contributor Author

HawaiianSpork commented Apr 18, 2026

@alamb yes, this change adds the public method try_extend and try_extend_nulls while deprecating the public methods extend and extend_nulls

I ran the benchmarks and they show improvement, but I don't know if I believe them. Maybe getting rid of the panic improves the code but I think it is more likely just discrepancies running from my development machine.

Verification

cargo bench -p arrow --bench mutable_array --features test_utils -- --noplot
cargo bench -p arrow --bench concatenate_kernel --features test_utils -- --noplot
cargo bench -p arrow --bench filter_kernels --features test_utils -- --noplot
cargo bench -p arrow --bench interleave_kernels --features test_utils -- --noplot

Benchmark Results

Measured against a clean origin/main worktree (adf930815). The figures below compare representative median times from the baseline and this branch.

Mutable Array

Benchmark main Branch Change
mutable str 1024 2.8719 ms 2.8386 ms −1.2% (noise)
mutable str nulls 1024 1.7997 ms 1.4445 ms −19.7%

Concatenate Kernel (selected highlights)

Benchmark main Branch Change
concat i32 1024 213.63 ns 211.08 ns −8.7%
concat boolean 8192 over 100 arrays 3.7913 µs 3.7798 µs −16.1%
concat boolean nulls 8192 over 100 arrays 8.3339 µs 8.2480 µs −17.1%
concat str nulls 1024 3.1109 µs 3.1091 µs −19.0%
concat utf8_view all_inline len=12 null=0.2 13.756 µs 13.273 µs −12.7%
concat utf8_view len=128 null=0.2 34.638 µs 32.861 µs −13.2%
concat str 8192 over 100 arrays 2.8813 ms 2.9033 ms ~0% (no change)

All 25 concatenate cases improved 5–19%. Only the largest string concat case is neutral.

Filter Kernels (selected highlights)

Benchmark main Branch Change
filter u8 high selectivity 2.0530 µs 2.0463 µs −29.1%
filter context u8 w NULLs high selectivity 2.6031 µs 2.6806 µs −27.0%
filter context i32 high selectivity 5.4515 µs 5.6285 µs −19.6%
filter context string dictionary high selectivity 6.0332 µs 5.7601 µs −18.7%
filter context i32 (kept 1/2) 7.8298 µs 9.1505 µs +3.2% ⚠️

Nearly all 45+ filter cases improved 10–29%. The sole regression is filter context i32 (kept 1/2) at +3.2% (~1.3 µs absolute), which is within normal measurement noise.

Interleave Kernels (selected highlights)

Benchmark main Branch Change
interleave dict_sparse(20,0.0) 100 1.6187 µs 1.3896 µs −34.8%
interleave dict_sparse(20,0.0) 400 1.6071 µs 1.6247 µs −33.1%
interleave dict_sparse(20,0.0) 1024 2.1438 µs 2.1552 µs −28.9%
interleave struct(i32,i32) 1024 [extended] 1.3436 µs 1.2786 µs −15.4%
interleave i32(0.5) 1024 1.7791 µs 1.7334 µs −14.5%

All interleave cases improved 7–35% with no regressions. The dict_sparse cases benefit most directly from the MutableArrayData changes, showing 23–35% gains.


Overall Summary

Benchmark Suite Avg Improvement Notable
mutable_array ~10% str nulls −20%
concatenate_kernel ~8–17% Boolean & utf8_view see largest gains
filter_kernels ~10–29% Nearly all cases improved; 1 minor noise regression (+3.2%)
interleave_kernels ~10–35% dict_sparse sees 23–35% gains

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deprecate MutableArrayData::extend and MutableArrayData::extend_nulls in favour of fallible try_extend / try_extend_nulls

3 participants