Skip to content

[SPARK-57648][SQL] Spread unmatchable left outer join rows across shuffle partitions#56719

Draft
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:codex/SPARK-57648-spread-unmatchable-left-outer
Draft

[SPARK-57648][SQL] Spread unmatchable left outer join rows across shuffle partitions#56719
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:codex/SPARK-57648-spread-unmatchable-left-outer

Conversation

@sunchao

@sunchao sunchao commented Jun 24, 2026

Copy link
Copy Markdown
Member

Why are the changes needed?

A shuffled LEFT OUTER equi-join can retain a residual ON predicate that references only the preserved left input:

SELECT *
FROM left_table l
LEFT OUTER JOIN right_table r
  ON l.k = r.k AND l.eligible

Rows where l.eligible is false or null cannot match any right-side row, but the join must still emit them as unmatched rows. Spark currently shuffles only by k, so a common non-null key can funnel all such rows into one reducer even though they do not need to be co-located.

This addresses SPARK-57648.

What changes were proposed in this PR?

For shuffled left outer joins, when spark.sql.shuffle.spreadNullJoinKeys.enabled is enabled, this PR recognizes a conservative subset of deterministic residual predicates that reference only the left input.

For eligible joins it appends a physical guard key:

  • left: IF(residual_condition, TRUE, NULL)
  • right: TRUE

Rows for which the residual is true retain normal hash co-location. Rows for which it is false or null receive a null guard and use the existing NullAwareHashPartitioning path, allowing them to spread across reducers. The original residual remains on the join.

A tag on the synthetic guard makes the shuffled join require every physical join key, preventing an existing partitioning on only the original equi-join keys from bypassing the guard. Ordinary null-aware joins keep their existing distribution behavior.

How was this PR tested?

  • ./build/sbt "sql/testOnly org.apache.spark.sql.JoinSuite -- -z SPARK-57648" — 3 tests passed.
  • ./build/sbt "sql/testOnly org.apache.spark.sql.connector.KeyGroupedPartitioningSuite -- -z SPARK-42038" — 9 tests passed.
  • ./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.OuterJoinSuite -- -z ordinary" — 8 tests passed.
  • ./build/sbt "sql/testOnly org.apache.spark.sql.execution.joins.ExistenceJoinSuite -- -z ordinary" — 3 tests passed.
  • ./dev/lint-scala — Scalastyle and Scalafmt passed.
  • git diff --check

Does this PR introduce any user-facing change?

Yes, on master only when spark.sql.shuffle.spreadNullJoinKeys.enabled is enabled. Eligible shuffled left outer joins may distribute provably unmatchable rows across multiple shuffle partitions. Query results are unchanged, and the configuration remains disabled by default.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant