Broadcast hash join and sort-merge join can produce incorrect results on non-default collated keys [Spark 4] #4051

@andygrove

Describe the bug

Follow-up from #4035, which rejects non-default collated strings in shuffle, sort, and aggregate but does not cover the join paths.

CometHashJoin.doConvert (used by both CometHashJoinExec and CometBroadcastHashJoinExec) and CometSortMergeJoinExec.supportedSortMergeJoinEqualType both accept any StringType as a join key without checking for a non-default collation. Comet's native join compares keys byte by byte, so rows whose keys are equal under a collation (e.g., 'a' and 'A' under utf8_lcase) do not match.
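
To make the mismatch concrete, here is a minimal Python sketch of an inner hash join run once with byte-level keys and once with case-folded keys. It is an illustrative model only, not Comet or Spark code: `hash_join` and `lcase_key` are hypothetical names, and `str.lower()` is a rough stand-in for utf8_lcase that is adequate only for the ASCII values used here.

```python
# Illustrative model only; this is not Comet or Spark code.

def hash_join(left, right, key=lambda s: s.encode("utf-8")):
    """Inner hash join on single-column rows, matching on key(value)."""
    build = {}
    for r in right:
        # Build side: group right-hand rows by their key.
        build.setdefault(key(r), []).append(r)
    # Probe side: emit one pair per matching build row.
    return [(l, r) for l in left for r in build.get(key(l), [])]

def lcase_key(s):
    # Assumption: lowercasing approximates utf8_lcase for ASCII input only.
    return s.lower()

left = ["a", "B"]
right = ["A", "b"]

byte_level = hash_join(left, right)               # default key is the raw UTF-8 bytes
collated = hash_join(left, right, key=lcase_key)  # key is the case-folded string

print(byte_level)  # []
print(collated)    # [('a', 'A'), ('B', 'b')]
```

The byte-level join finds no matches, while the case-folded join returns both pairs. That is the gap between a byte-comparing native join and a collation-aware one.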

For sort-merge join, this is masked today by the shuffle-side fallback added in #4035: the shuffle on a collated key falls back to Spark, which forces the whole stage to run in Spark. Broadcast hash join has no shuffle to protect it, so incorrect results can reach the user.

Steps to reproduce

With Spark 4.0.1:

SET spark.comet.exec.broadcastHashJoin.enabled = true;

CREATE OR REPLACE TEMP VIEW l AS SELECT * FROM (VALUES ('a'), ('B')) AS t(c);
CREATE OR REPLACE TEMP VIEW r AS SELECT * FROM (VALUES ('A'), ('b')) AS t(c);

SELECT l.c, r.c
FROM l JOIN r
ON l.c COLLATE utf8_lcase = r.c COLLATE utf8_lcase;

Spark returns two matching rows, ('a', 'A') and ('B', 'b'). Comet (with broadcast hash join enabled) returns none, because the byte-level equality in the native join does not match 'a' with 'A' or 'B' with 'b'.

Expected behavior

Comet should fall back to Spark for hash joins and sort-merge joins whose equality keys have a non-default collation. This could be done either by rejecting such key types in supportedSortMergeJoinEqualType and in a new equivalent check in CometHashJoin.doConvert, or by funneling both paths through the existing isStringCollationType predicate on CometTypeShim.
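
The fallback decision described above can be sketched as a small Python model. Everything here is a hypothetical stand-in, not Comet's or Spark's actual API: the `StringType` and `IntegerType` classes, `is_string_collation_type`, and `can_run_join_natively` are invented for illustration. The sketch assumes only that the default collation is UTF8_BINARY.

```python
from dataclasses import dataclass

# Assumption: the default (binary) collation is named UTF8_BINARY.
DEFAULT_COLLATION = "UTF8_BINARY"

@dataclass(frozen=True)
class StringType:
    # Hypothetical stand-in for a string type that carries a collation.
    collation: str = DEFAULT_COLLATION

@dataclass(frozen=True)
class IntegerType:
    # Hypothetical stand-in for any non-string key type.
    pass

def is_string_collation_type(dt):
    """True only for strings whose collation is not the default."""
    return isinstance(dt, StringType) and dt.collation != DEFAULT_COLLATION

def can_run_join_natively(key_types):
    """Reject the native join if any equality key is a collated string."""
    return not any(is_string_collation_type(dt) for dt in key_types)

print(can_run_join_natively([StringType(), IntegerType()]))  # True
print(can_run_join_natively([StringType("UTF8_LCASE")]))     # False
```

Sharing one predicate between the hash join and sort-merge join checks would keep the two paths from drifting apart, which is the appeal of the second option.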

Labels

bug, correctness, priority:critical, spark 4