Skip to content

[SPARK-57645][PYTHON][TESTS] Add ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF#56730

Open
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:SPARK-57645/bench/grouped-agg-pandas-iter
Open

[SPARK-57645][PYTHON][TESTS] Add ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF#56730
Yicong-Huang wants to merge 1 commit into
apache:masterfrom
Yicong-Huang:SPARK-57645/bench/grouped-agg-pandas-iter

Conversation

@Yicong-Huang

@Yicong-Huang Yicong-Huang commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Add an ASV microbenchmark for SQL_GROUPED_AGG_PANDAS_ITER_UDF to python/benchmarks/bench_eval_type.py, parallel to the existing GroupedAggArrowIterUDFTimeBench. New classes: _GroupedAggPandasIterBenchMixin, GroupedAggPandasIterUDFTimeBench, and GroupedAggPandasIterUDFPeakmemBench. The mixin reuses _write_scenario/_build_scenario/_scenario_configs from the non-iterator Pandas sibling and only overrides the eval type and the iterator-style UDFs (sum_udf, mean_multi_udf) that consume an Iterator[pd.Series].

Why are the changes needed?

SQL_GROUPED_AGG_PANDAS_ITER_UDF had no worker-level microbenchmark. This fills the coverage gap and provides a before/after baseline for an upcoming serializer refactor of this eval type.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. Benchmark-only addition. The worker output of the new iterator bench was verified to be byte-identical to the non-iterator Pandas grouped-agg bench across all scenario/UDF combinations (only the trailing timing telemetry differs).

ASV results (COLUMNS=120 asv run --bench GroupedAggPandasIterUDFTimeBench -a repeat=3 --python=same):

bench_eval_type.GroupedAggPandasIterUDFTimeBench.time_worker
================ ============ ================
--                            udf
---------------- -----------------------------
    scenario       sum_udf     mean_multi_udf
================ ============ ================
 few_groups_sm    40.6+-0.3ms    46.9+-0.4ms
 few_groups_lg    66.5+-0.4ms    74.7+-0.4ms
 many_groups_sm   1.53+-0.01s    1.77+-0.01s
 many_groups_lg    428+-3ms       487+-1ms
   wide_cols      397+-0.9ms      420+-2ms
================ ============ ================

Numbers are stable across two local runs (deltas < 3%).

Was this patch authored or co-authored using generative AI tooling?

No.

@uros-b

uros-b commented Jun 24, 2026

Copy link
Copy Markdown
Member

The worker output of the new iterator bench was verified to be byte-identical to the non-iterator Pandas grouped-agg bench

Minor note regarding the PR description, please confirm - in worker.py: the non-iterator SQL_GROUPED_AGG_PANDAS_UDF writes via ArrowStreamGroupSerializer(write_start_stream=True) while the ITER variant uses ArrowStreamAggPandasUDFSerializer; genuinely different output serializers/markers, so the byte streams are not identical. Please update in order to avoid misleading a future reader.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Yicong-Huang!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants