[SPARK-57648][SQL] Spread unmatchable left outer join rows across shuffle partitions by sunchao · Pull Request #56719 · apache/spark

sunchao · 2026-06-24T04:45:01Z

Why are the changes needed?

A shuffled LEFT OUTER equi-join can retain a residual ON predicate that references only the preserved left input:

SELECT *
FROM left_table l
LEFT OUTER JOIN right_table r
  ON l.k = r.k AND l.eligible

Rows where l.eligible is false or null cannot match any right-side row, but the join must still emit them as unmatched rows. Spark currently shuffles only by k, so a common non-null key can funnel all such rows into one reducer even though they do not need to be co-located.

This addresses SPARK-57648.

What changes were proposed in this PR?

For shuffled left outer joins, when spark.sql.shuffle.spreadNullJoinKeys.enabled is enabled, this PR recognizes a conservative subset of deterministic residual predicates that reference only the left input.

For eligible joins it appends a physical guard key:

left: IF(residual_condition, TRUE, NULL)
right: TRUE

Rows for which the residual is true retain normal hash co-location. Rows for which it is false or null receive a null guard and use the existing NullAwareHashPartitioning path, allowing them to spread across reducers. The original residual remains on the join.

A tag on the synthetic guard makes the shuffled join require every physical join key, preventing an existing partitioning on only the original equi-join keys from bypassing the guard. Ordinary null-aware joins keep their existing distribution behavior.

How was this PR tested?

./build/sbt "sql/testOnly org.apache.spark.sql.JoinSuite -- -z SPARK-57648" — 3 tests passed.
./build/sbt "sql/testOnly org.apache.spark.sql.connector.KeyGroupedPartitioningSuite -- -z SPARK-42038" — 9 tests passed.
./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.OuterJoinSuite -- -z ordinary" — 8 tests passed.
./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.ExistenceJoinSuite -- -z ordinary" — 3 tests passed.
./dev/lint-scala — Scalastyle and Scalafmt passed.
git diff --check

Does this PR introduce any user-facing change?

Yes, on master only when spark.sql.shuffle.spreadNullJoinKeys.enabled is enabled. Eligible shuffled left outer joins may distribute provably unmatchable rows across multiple shuffle partitions. Query results are unchanged, and the configuration remains disabled by default.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

[SPARK-57648][SQL] Spread unmatchable left outer join rows

04a0d94

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57648][SQL] Spread unmatchable left outer join rows across shuffle partitions#56719

[SPARK-57648][SQL] Spread unmatchable left outer join rows across shuffle partitions#56719
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:codex/SPARK-57648-spread-unmatchable-left-outer

sunchao commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sunchao commented Jun 24, 2026

Why are the changes needed?

What changes were proposed in this PR?

How was this PR tested?

Does this PR introduce any user-facing change?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant