[SPARK-57645][PYTHON][TESTS] Add ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF by Yicong-Huang · Pull Request #56730 · apache/spark

Yicong-Huang · 2026-06-24T07:11:42Z

What changes were proposed in this pull request?

Add an ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF to python/benchmarks/bench_eval_type.py, parallel to the existing GroupedAggArrowIterUDFTimeBench. New classes: _GroupedAggPandasIterBenchMixin, GroupedAggPandasIterUDFTimeBench, and GroupedAggPandasIterUDFPeakmemBench. The mixin reuses _write_scenario/_build_scenario/_scenario_configs from the non-iterator Pandas sibling and only overrides the eval type and the iterator-style UDFs (sum_udf, mean_multi_udf) that consume an Iterator[pd.Series].
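For readers unfamiliar with the iterator style, here is a minimal sketch of what UDF bodies of this shape could look like. It is not the code in bench_eval_type.py: the function bodies are illustrative, and the assumption that the multi-column variant receives tuples of Series is ours, not stated in the description. Each call handles one group and sees that group's data as a stream of batches, so it keeps running state instead of materializing the whole group.

```python
from typing import Iterator, Tuple

import pandas as pd


def sum_udf(batches: Iterator[pd.Series]) -> float:
    """Sum one group's values, consuming its batches one at a time."""
    total = 0.0
    for batch in batches:
        total += float(batch.sum())
    return total


def mean_multi_udf(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> float:
    """Mean of (a + b) over one group, using a running sum and row count.

    Assumption: the multi-column variant yields one tuple of Series per batch.
    """
    total, count = 0.0, 0
    for a, b in batches:
        total += float((a + b).sum())
        count += len(a)
    return total / count if count else float("nan")


# Illustrative input: one group split across two batches.
group_total = sum_udf(iter([pd.Series([1.0, 2.0]), pd.Series([3.0])]))
```

In Spark these functions would be wrapped with a pandas UDF decorator and dispatched by eval type; the sketch leaves that out so it runs with pandas alone.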

Why are the changes needed?

SQL_GROUPED_AGG_PANDAS_ITER_UDF had no worker-level microbenchmark. This fills the coverage gap and provides a before/after baseline for an upcoming serializer refactor of this eval type.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. Benchmark-only addition. The worker output of the new iterator bench was verified to be byte-identical to the non-iterator Pandas grouped-agg bench across all scenario/UDF combinations (only the trailing timing telemetry differs).

ASV results (COLUMNS=120 asv run --bench GroupedAggPandasIterUDFTimeBench -a repeat=3 --python=same):

bench_eval_type.GroupedAggPandasIterUDFTimeBench.time_worker
================ ============ ================
--                            udf
---------------- -----------------------------
    scenario       sum_udf     mean_multi_udf
================ ============ ================
 few_groups_sm    40.6+-0.3ms    46.9+-0.4ms
 few_groups_lg    66.5+-0.4ms    74.7+-0.4ms
 many_groups_sm   1.53+-0.01s    1.77+-0.01s
 many_groups_lg    428+-3ms       487+-1ms
   wide_cols      397+-0.9ms      420+-2ms
================ ============ ================

Numbers are stable across two local runs (deltas < 3%).
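A rough sketch of how a run-to-run delta like this could be computed from ASV's text output. The helper names are hypothetical and this is not necessarily how the figure above was obtained. It assumes cells of the form value+-spread followed by an ASCII unit (ns, us, ms, s), as in the table.

```python
import re

# Assumed unit spellings for ASCII-mode ASV output.
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_CELL = re.compile(r"^\s*([\d.]+)\+-([\d.]+)(ns|us|ms|s)\s*$")


def parse_asv_cell(cell: str) -> float:
    """Return the central value of a cell such as '40.6+-0.3ms', in seconds."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized ASV cell: {cell!r}")
    value, _spread, unit = match.groups()
    return float(value) * _UNITS[unit]


def relative_delta(run_a: str, run_b: str) -> float:
    """Relative difference of run_b against run_a (0.03 means 3%)."""
    a, b = parse_asv_cell(run_a), parse_asv_cell(run_b)
    return abs(a - b) / a


# A cell from the table above, converted to seconds.
few_groups_sm_sum = parse_asv_cell("40.6+-0.3ms")
```

The second run's numbers are not included in this PR, so any comparison would need both runs' cells supplied by the reader.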

Was this patch authored or co-authored using generative AI tooling?

No.

uros-b · 2026-06-24T10:58:46Z

> The worker output of the new iterator bench was verified to be byte-identical to the non-iterator Pandas grouped-agg bench

Minor note on the PR description, please confirm: in worker.py, the non-iterator SQL_GROUPED_AGG_PANDAS_UDF writes via ArrowStreamGroupSerializer(write_start_stream=True), while the ITER variant uses ArrowStreamAggPandasUDFSerializer. These are genuinely different output serializers/markers, so the byte streams are not identical. Please update the description to avoid misleading a future reader.
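One way to settle a byte-identity question like this is to capture both worker outputs and locate the first point of divergence. The helper below is a hypothetical sketch and not part of the PR. Capturing the two outputs, and stripping any trailing timing telemetry, is assumed to happen elsewhere.

```python
from typing import Optional


def first_diff_offset(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One stream is a strict prefix of the other: they diverge where it ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

A result of None means the two captures match exactly. An early offset would be consistent with different stream-start markers, while a late one would point at trailing data.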

uros-b

Thank you @Yicong-Huang!

test: add ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF

be36e52

uros-b approved these changes Jun 24, 2026

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-57645/bench/grouped-agg-pandas-iter
