Describe the bug
Follow-up from #4035, which rejects non-default collated strings in shuffle, sort, and aggregate but does not cover the join paths.
CometHashJoin.doConvert (used by both CometHashJoinExec and CometBroadcastHashJoinExec) and CometSortMergeJoinExec.supportedSortMergeJoinEqualType both accept any StringType as a join key without checking for non-default collation. Comet's native join performs byte-level key equality, so rows that compare equal under a collation (e.g., 'a' vs 'A' under utf8_lcase) will not match.
For sort-merge join this is masked today by the shuffle-side fallback added in #4035: the shuffle on a collated key falls back to Spark, which forces the whole stage to Spark. Broadcast hash join has no shuffle to protect it, so incorrect results can escape.
Steps to reproduce
With Spark 4.0.1:
SET spark.comet.exec.broadcastHashJoin.enabled = true;
CREATE OR REPLACE TEMP VIEW l AS SELECT * FROM (VALUES ('a'), ('B')) AS t(c);
CREATE OR REPLACE TEMP VIEW r AS SELECT * FROM (VALUES ('A'), ('b')) AS t(c);
SELECT l.c, r.c
FROM l JOIN r
ON l.c COLLATE utf8_lcase = r.c COLLATE utf8_lcase;
Spark returns two matching rows; Comet (with broadcast hash join enabled) returns zero because the byte-level equality in the native join does not match 'a' with 'A'.
Expected behavior
Comet should fall back to Spark for hash joins and sort-merge joins whose equality keys have a non-default collation, either by rejecting the key types in supportedSortMergeJoinEqualType / a new equivalent check in CometHashJoin.doConvert, or by funneling through the existing isStringCollationType predicate on CometTypeShim.
Additional context
Describe the bug
Follow-up from #4035, which rejects non-default collated strings in shuffle, sort, and aggregate but does not cover the join paths.
CometHashJoin.doConvert(used by bothCometHashJoinExecandCometBroadcastHashJoinExec) andCometSortMergeJoinExec.supportedSortMergeJoinEqualTypeboth accept anyStringTypeas a join key without checking for non-default collation. Comet's native join performs byte-level key equality, so rows that compare equal under a collation (e.g.,'a'vs'A'underutf8_lcase) will not match.For sort-merge join this is masked today by the shuffle-side fallback added in #4035: the shuffle on a collated key falls back to Spark, which forces the whole stage to Spark. Broadcast hash join has no shuffle to protect it, so incorrect results can escape.
Steps to reproduce
With Spark 4.0.1:
Spark returns two matching rows; Comet (with broadcast hash join enabled) returns zero because the byte-level equality in the native join does not match
'a'with'A'.Expected behavior
Comet should fall back to Spark for hash joins and sort-merge joins whose equality keys have a non-default collation, either by rejecting the key types in
supportedSortMergeJoinEqualType/ a new equivalent check inCometHashJoin.doConvert, or by funneling through the existingisStringCollationTypepredicate onCometTypeShim.Additional context
spark/src/main/scala/org/apache/spark/sql/comet/operators.scalaCometHashJoin.doConvert(around line 1697)spark/src/main/scala/org/apache/spark/sql/comet/operators.scalasupportedSortMergeJoinEqualType(around line 2138)common/src/main/spark-4.0/org/apache/comet/shims/CometTypeShim.scalaisStringCollationType