Description
When the `native_datafusion` scan reads a Parquet column whose physical type is `BINARY` (`STRING`) under a requested read schema of `INT`, it silently reinterprets the `BINARY` bytes as raw `INT32` bytes and returns garbage values. Spark's vectorized reader throws on this mismatch on all supported versions, so this is a correctness gap (returns wrong answers without an error) rather than a strict-mode parity gap.
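To make the failure mode concrete, here is a small illustration of why reinterpreting UTF-8 bytes as little-endian `INT32` produces meaningless numbers. This is not Comet's actual buffer handling (the real garbage values also depend on dictionary/offset buffer layout, since the repro strings are only one byte each); it just demonstrates the mechanism.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object ReinterpretSketch {
  // Read the first 4 bytes of a buffer as a little-endian Int,
  // the way a reader would if it wrongly treated BINARY data as INT32.
  def asInt32LE(bytes: Array[Byte]): Int =
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt

  def main(args: Array[String]): Unit = {
    // "abcd" = 0x61 0x62 0x63 0x64 -> 0x64636261 = 1684234849
    println(asInt32LE("abcd".getBytes("UTF-8")))
  }
}
```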
Reproduction
```scala
withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq("a", "b", "c").toDF("c").write.parquet(path)
    val df = spark.read.schema("c int").parquet(path)
    df.show() // returns 3 rows of meaningless integers; should throw
  }
}
```
`native_iceberg_compat` correctly throws `SparkException` for this case (matches Spark).
Affected versions
All supported Spark profiles (3.4, 3.5, 4.0). Reproduced on Comet `main` while building #4087.
Expected behavior
The native reader should detect that the requested type (`INT`) is not byte-compatible with the physical column type (`BINARY`/`UTF8`) and raise an exception, matching Spark's `SchemaColumnConvertNotSupportedException`.
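The shape of such a check can be sketched as follows. This is a hypothetical illustration, not Comet's actual code (the real fix would live in the native Rust reader, and the type model here is simplified to the two types involved in this issue): before wiring a column to an output vector, compare the Parquet physical type against the requested Spark type and fail fast on an incompatible pair.

```scala
// Hypothetical, simplified model of the compatibility check.
sealed trait PhysicalType
case object Int32Phys extends PhysicalType
case object BinaryPhys extends PhysicalType

object SchemaCheckSketch {
  // Throw (as Spark's vectorized reader does with
  // SchemaColumnConvertNotSupportedException) instead of silently
  // reinterpreting bytes when the types are not byte-compatible.
  def checkConvertible(physical: PhysicalType, requested: String): Unit =
    (physical, requested) match {
      case (Int32Phys, "int") | (BinaryPhys, "string") => () // byte-compatible
      case _ =>
        throw new UnsupportedOperationException(
          s"Cannot read Parquet physical type $physical as requested type $requested")
    }
}
```

With this in place, the repro above would surface an error at scan time rather than returning garbage rows.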
Test coverage
Documented in `ParquetSchemaMismatchSuite` (added in #4087) under the test name `string read as int: native_datafusion`. The test currently asserts the buggy behavior, so a future fix will need to update the assertion (and the matrix in the file header) when this is resolved.
Parent issue
Split from #3720.