Tracking issue for the four remaining clusters of test failures on Spark 4.1 (4.1.1) once the profile, shims, diff, and SQL-test workflow entry are in place. Context PRs: #4093 (Spark 4.1.1 enablement) and #4097 (spark-4.1 profile + shims prep, no tests).
## Status

Two earlier clusters are already cleared on the branch (commit 5a60be22d):

- `CometNativeWriteExec.newTaskTempFile`: the `String` overload became abstract-throwing in 4.1; switched to the `FileNameSpec` overload. Cleared 17 parquet-write failures.
- `remainder function` test expected `[DIVIDE_BY_ZERO]`; Spark 4.1 introduced `[REMAINDER_BY_ZERO]`. Branched the expected message on `isSpark41Plus`.
## 1. `OneRowRelationExec` not transformed by Comet
**Where:** ~30 failures in `Spark 4.1, JDK 17/auto [expressions]`, all `sql-file:` tests such as `expressions/cast/cast.sql`, `expressions/datetime/*`, `expressions/struct/create_named_struct.sql`, etc.

**Symptom:**

```
Expected only Comet native operators, but found Project.
plan: Project
+- Scan OneRowRelation [COMET: Scan OneRowRelation is not supported]
```
**Root cause:** Spark 4.1 added a new `OneRowRelationExec` physical leaf and no longer folds `SELECT cast(literal)` queries down to `LocalRelation` via `ConvertToLocalRelation`. In 4.0 those queries became `LocalTableScanExec`, which Comet wraps as `CometLocalTableScanExec`. In 4.1 they remain `Project + OneRowRelationExec`, and Comet's `CometExecRule` falls back the whole subtree to Spark.
**Fix options (decision needed):**

- (a) Add `CometOneRowRelationExec`, analogous to `CometLocalTableScanExec`. The real fix, but the biggest scope: it needs a Rust-side serde for an empty-row scan.
- (b) Pre-rewrite `Project + OneRowRelationExec` into a `LocalTableScanExec` with a single empty row in a Comet planner rule.
- (c) Test-only allowlist (masks the fallback; not recommended).
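Option (b) is essentially a bottom-up pattern rewrite over the physical plan. A toy sketch of the shape of that rewrite, using plain Python tuples in place of `SparkPlan` nodes (the real rule would be a Scala `Rule[SparkPlan]` using `transformUp`; all names here are illustrative only):

```python
# Toy sketch of fix option (b): rewrite Project + OneRowRelationExec into a
# LocalTableScanExec carrying one empty row, so the existing
# CometLocalTableScanExec path can take over. Plan nodes are modeled as
# (name, children, payload) tuples purely for illustration.

def rewrite(node):
    """Bottom-up rewrite, analogous to SparkPlan.transformUp."""
    name, children, payload = node
    children = [rewrite(c) for c in children]
    if name == "Project" and len(children) == 1 and children[0][0] == "OneRowRelationExec":
        # Replace the one-row leaf with a local scan over a single empty row.
        children = [("LocalTableScanExec", [], {"rows": [()]})]
    return (name, children, payload)

plan = ("Project", [("OneRowRelationExec", [], {})], {"exprs": ["cast(1 as string)"]})
rewritten = rewrite(plan)
print(rewritten[1][0][0])  # LocalTableScanExec
```

The point of the sketch is that the Project itself is untouched; only its one-row child is swapped for a leaf Comet already knows how to wrap.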
## 2. Native parquet reader: user-defined struct schema mismatch
**Where:** `native reader - select struct field with user defined schema - native_datafusion` and `- native_iceberg_compat`, in both `Spark 4.1, JDK 17/auto [parquet]` and `macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]`.

**Symptom:** "Results do not match for query"; the schema is `c0: struct<y:int,x:string>` over a parquet relation. Comet's native reader returns different rows than Spark.

**Suspected root cause:** Spark 4.1 changed how user-supplied struct schemas are reconciled with the on-disk Parquet field order, or field pruning behaves differently. Compare Spark 4.0 vs 4.1 planning output for this query and check whether user-schema field-name-vs-position behavior changed in `ParquetReadSupport` or `ParquetSchemaConverter`.
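To see why name-vs-position reconciliation matters for exactly this schema: if the file stores the struct fields in the order `(x, y)` but the user schema asks for `struct<y:int,x:string>`, a by-name reader and a by-position reader return different rows. A minimal illustration with plain Python values (no Parquet involved; the field order assumed for the file is hypothetical):

```python
# Illustrates the name-vs-position hazard suspected above. The on-disk struct
# field order (x, y) differs from the user-supplied schema order (y, x);
# reconciling by name vs. by position yields different rows.

file_fields = ["x", "y"]          # assumed physical field order in the file
file_row = {"x": "a", "y": 1}     # one stored struct value
user_schema = ["y", "x"]          # user-defined schema: struct<y:int,x:string>

# By-name reconciliation: look each requested field up by its name.
by_name = tuple(file_row[f] for f in user_schema)

# By-position reconciliation: position i of the file feeds field i of the
# user schema, so values come out in file order but mislabelled.
by_position = tuple(file_row[f] for f in file_fields)

print(by_name)      # (1, 'a')  -> y, then x, as the user schema requests
print(by_position)  # ('a', 1)  -> file order, mislabelled as (y, x)
```

If Spark 4.1 moved from one strategy to the other (or changed when each applies), the native reader would need the matching change to agree row-for-row.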
## 3. Bloom filter result mismatch
**Where:** `test BloomFilterMightContain from random input` and `bloom_filter_agg` in `Spark 4.1, JDK 17/auto [exec]`.

**Symptom:** Comet and Spark produce different `might_contain` results for the same input.

**Suspected root cause:** Spark 4.1 likely changed the bloom filter binary layout, hash seed, or default false-positive probability. Diff `BloomFilterImpl` / `BloomFilterAggregate` between 4.0 and 4.1, then mirror the change in Comet's bloom filter code in `native/spark-expr`.
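The failure mode is easy to reproduce in miniature: two implementations that share the same bit array but hash probes with different seeds will disagree on `might_contain`. A minimal sketch (this is NOT Spark's `BloomFilterImpl`; it only demonstrates why a seed change alone breaks cross-implementation compatibility):

```python
# Minimal bloom filter: same serialized bits, different probe hashing.
import hashlib

class TinyBloom:
    def __init__(self, num_bits, num_hashes, seed):
        self.bits = 0
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.seed = seed

    def _positions(self, item):
        # Derive each probe position from (seed, hash index, item).
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{self.seed}:{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def put(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

writer = TinyBloom(256, 3, seed=0)   # stand-in for the Spark-built filter
reader = TinyBloom(256, 3, seed=1)   # stand-in for a reader with a changed seed
for v in range(20):
    writer.put(v)
reader.bits = writer.bits  # identical bit layout exchanged between the two

disagreements = [v for v in range(20)
                 if writer.might_contain(v) != reader.might_contain(v)]
print(len(disagreements) > 0)
```

The same disagreement pattern appears if the layout or the number of hash functions changes instead of the seed, which is why diffing the 4.0 vs 4.1 serialization is the first step.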
## 4. `bytesRead` task metric off by 6 to 14 times
**Where:** `native_datafusion scan reports task-level input metrics matching Spark`, `input metrics aggregate across multiple native scans in a join`, and `... in a union` in `Spark 4.1, JDK 17/auto [exec]` (`CometTaskMetricsSuite`).

**Symptom:**

```
9.6 was greater than or equal to 0.7, but 9.6 was not less than or equal to 1.3
bytesRead ratio out of range: comet=90498, spark=9427, ratio=9.6
```

Two more failures show similar ratios of 6.4 and 13.9.
**Suspected root cause:** Spark 4.1 changed what `inputMetrics.bytesRead` accounts for; most likely it now reports a smaller subset (e.g. only bytes actually read into row buffers, versus the full Parquet footer plus row group). Compare `ParquetFileReader` / `PartitionedFile` accounting between 4.0 and 4.1 and adjust Comet's metric source accordingly.
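For reference while bisecting, the suite's assertion shape (ratio within [0.7, 1.3]) can be captured as a small helper. A sketch, with a hypothetical helper name; the bounds and sample values are taken from the failure output above:

```python
# Sketch of the tolerance check behind these failures: Comet's bytesRead is
# expected to fall within 0.7x..1.3x of Spark's. Helper name is hypothetical.

def bytes_read_ratio_ok(comet_bytes, spark_bytes, lo=0.7, hi=1.3):
    ratio = comet_bytes / spark_bytes
    return lo <= ratio <= hi, round(ratio, 1)

# The reported failing case: comet=90498, spark=9427 -> ratio 9.6, out of range.
ok, ratio = bytes_read_ratio_ok(90498, 9427)
print(ok, ratio)
```

A ratio that large (Comet reporting ~10x more bytes) is consistent with Comet still counting full footer-plus-row-group reads while Spark's denominator shrank, rather than with a scan actually reading more data.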