native_datafusion: STRING column read as INT silently returns garbage values #4088

@andygrove

Description

When the native_datafusion scan reads a Parquet column whose physical type is BINARY (STRING) under a requested read schema of INT, it silently reinterprets the BINARY bytes as raw INT32 bytes and returns garbage values. Spark's vectorized reader throws on this mismatch on all supported versions, so this is a correctness gap (returns wrong answers without an error) rather than a strict-mode parity gap.
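To make the failure mode concrete, here is a minimal sketch (illustrative only, not Comet's actual reader code) of what reinterpreting UTF-8 string bytes as little-endian INT32 values looks like, which is why the returned integers bear no relation to the stored data:

```scala
import java.nio.{ByteBuffer, ByteOrder}

object ReinterpretDemo {
  // Reinterpret the raw UTF-8 bytes of a string as a little-endian INT32,
  // the way a reader that skips the type-compatibility check would.
  def bytesAsInt(s: String): Int = {
    val bytes = s.getBytes("UTF-8")
    // Pad to 4 bytes so short strings still "decode" to something.
    val padded = bytes.padTo(4, 0.toByte)
    ByteBuffer.wrap(padded).order(ByteOrder.LITTLE_ENDIAN).getInt
  }

  def main(args: Array[String]): Unit = {
    // "a" is the single byte 0x61; zero-padded and read little-endian
    // it decodes to 97 -- a meaningless value, not an error.
    println(bytesAsInt("a"))
  }
}
```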

Reproduction

withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq("a", "b", "c").toDF("c").write.parquet(path)
    val df = spark.read.schema("c int").parquet(path)
    df.show() // returns 3 rows of meaningless integers; should throw
  }
}

native_iceberg_compat correctly throws SparkException for this case (matches Spark).

Affected versions

All supported Spark profiles (3.4, 3.5, 4.0). Reproduced on Comet main while building #4087.

Expected behavior

The native reader should detect that the requested type (INT) is not byte-compatible with the physical column type (BINARY/UTF8) and raise an exception, matching Spark's SchemaColumnConvertNotSupportedException.
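A fix could gate column-reader setup on an explicit compatibility check between the requested type and the physical type. The sketch below is hypothetical (the type names and `checkConvertible` helper are illustrative, not Comet's actual API) but shows the intended shape: fail loudly on mismatch rather than reinterpret bytes.

```scala
object SchemaCheck {
  // Simplified stand-in for Parquet physical types (illustrative only).
  sealed trait PhysicalType
  case object Int32 extends PhysicalType
  case object Binary extends PhysicalType

  // Throw when the requested Spark type is not byte-compatible with the
  // Parquet physical type, mirroring the behavior Spark signals via
  // SchemaColumnConvertNotSupportedException.
  def checkConvertible(physical: PhysicalType, requested: String): Unit =
    (physical, requested) match {
      case (Int32, "int") | (Binary, "string") => () // compatible, proceed
      case _ =>
        throw new UnsupportedOperationException(
          s"Parquet physical type $physical cannot be read as $requested")
    }
}
```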

Test coverage

Documented in ParquetSchemaMismatchSuite (added in #4087) under the test name `string read as int: native_datafusion`. The test currently asserts the buggy behavior, so fixing this issue will require updating that assertion (and the matrix in the file header).

Parent issue

Split from #3720.
