[SPARK-57644][SQL] Support generated column values on V2 table writes #56712

Open
szehon-ho wants to merge 1 commit into apache:master from szehon-ho:generated-column-write

@szehon-ho

Member

What changes were proposed in this pull request?

Add support for auto-filling generated column values and enforcing generated column constraints during V2 table writes (INSERT).

When a catalog declares SUPPORT_GENERATED_COLUMN_ON_WRITE:

  • Missing generated columns are auto-filled using the generation expression via a Project node added during analysis
  • User-provided generated column values are validated against the generation expression via CheckInvariant(EqualNullSafe(col, genExpr))
  • MERGE, UPDATE, and streaming writes with generated columns are blocked until support is implemented
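
The fill-or-validate behavior described above can be modeled in a few lines of plain Python. This is only an illustrative sketch of the semantics, not the Spark implementation: `fill_or_validate`, `null_safe_equal`, `area`, and the `CheckConstraintViolation` class are hypothetical names, rows are modeled as dicts, and SQL NULL is modeled as `None`. The null-safe comparison mirrors `EqualNullSafe`, which treats two NULLs as equal rather than yielding NULL.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
GenExpr = Callable[[Row], Any]


class CheckConstraintViolation(Exception):
    """Hypothetical stand-in for the CHECK_CONSTRAINT_VIOLATION error."""


def null_safe_equal(left: Any, right: Any) -> bool:
    # Models EqualNullSafe: NULL vs NULL is true, NULL vs a value is false.
    if left is None or right is None:
        return left is None and right is None
    return left == right


def fill_or_validate(row: Row, generated: Dict[str, GenExpr]) -> Row:
    """Auto-fill omitted generated columns; validate user-provided ones."""
    out = dict(row)
    for name, gen_expr in generated.items():
        expected = gen_expr(row)
        if name not in row:
            # Column omitted by the user: compute it from the expression.
            out[name] = expected
        elif not null_safe_equal(row[name], expected):
            # Column supplied by the user: it must match the expression.
            raise CheckConstraintViolation(
                f"{name}: got {row[name]!r}, expected {expected!r}"
            )
    return out


def area(row: Row) -> Any:
    # Example generation expression: width * height, NULL if either is NULL.
    w, h = row.get("width"), row.get("height")
    return None if w is None or h is None else w * h
```

In this model, `fill_or_validate({"width": 2, "height": 3}, {"area": area})` adds `area`, while passing an explicit `area` that disagrees with `width * height` raises the error. Only user-provided values are checked, which matches the intent that constraints are scoped to user-provided columns.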

Key changes:

  • TableCatalogCapability: New SUPPORT_GENERATED_COLUMN_ON_WRITE capability
  • CatalogV2Util: Encode generation expression in V2-to-StructField metadata roundtrip
  • TableOutputResolver: Fill missing generated columns in by-name and by-position write paths (checking generation expression before null defaults)
  • ResolveTableConstraints: Add generated column constraints using V2 expression with SQL parser fallback, scoped to user-provided columns only
  • Analyzer (ResolveOutputRelation): Gate on capability; strip generation expression metadata from auto-filled columns so constraints are scoped correctly
  • RewriteRowLevelCommand: Block MERGE/UPDATE with generated columns
  • ResolveWriteToStream: Block streaming writes with generated columns
  • SQLConf: New spark.sql.generatedColumn.allowNullableIngest.enabled config
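
As a rough mental model of how the capability gate and the blocked operations fit together, the decision can be sketched as below. This is a plain-Python illustration under stated assumptions, not code from the patch: `plan_write`, the operation strings, the return values, and the `UnsupportedTableOperation` class are all hypothetical, and the "passthrough" branch assumes Spark leaves writes untouched when the catalog has not opted in.

```python
from typing import Set

# Capability name taken from the PR description; modeled here as a string.
SUPPORT_GENERATED_COLUMN_ON_WRITE = "SUPPORT_GENERATED_COLUMN_ON_WRITE"

# Operations the PR blocks for tables with generated columns (labels are hypothetical).
BLOCKED_OPERATIONS = {"MERGE", "UPDATE", "STREAMING_WRITE"}


class UnsupportedTableOperation(Exception):
    """Hypothetical stand-in for UNSUPPORTED_FEATURE.TABLE_OPERATION."""


def plan_write(operation: str, capabilities: Set[str], has_generated_columns: bool) -> str:
    """Decide how a write is handled in this simplified model."""
    opted_in = SUPPORT_GENERATED_COLUMN_ON_WRITE in capabilities
    if not opted_in or not has_generated_columns:
        # Assumption: without the capability, the connector handles everything.
        return "passthrough"
    if operation in BLOCKED_OPERATIONS:
        raise UnsupportedTableOperation(operation)
    # INSERT-style writes get auto-fill plus validation.
    return "fill_and_validate"
```

For example, `plan_write("INSERT", {SUPPORT_GENERATED_COLUMN_ON_WRITE}, True)` returns `"fill_and_validate"`, whereas the same call with `"MERGE"` raises.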

Why are the changes needed?

Generated columns (defined with GENERATED ALWAYS AS (expr)) are currently supported only in DDL: the generation expression is stored but never evaluated during writes. Connectors must therefore compute generated column values themselves (e.g., Delta Lake does this in its own execution layer).

This PR adds Spark-native support so that any V2 catalog can opt in to having Spark auto-compute generated column values during INSERT. This brings generated columns closer to parity with default column values, which Spark already handles during writes.

Does this PR introduce any user-facing change?

Yes. For catalogs that declare SUPPORT_GENERATED_COLUMN_ON_WRITE:

  • Users can omit generated columns from INSERT statements and the values are auto-computed
  • Users who explicitly provide generated column values will get a CHECK_CONSTRAINT_VIOLATION error if the value doesn't match the generation expression
  • MERGE, UPDATE, and streaming writes to tables with generated columns will fail with UNSUPPORTED_FEATURE.TABLE_OPERATION
  • New config spark.sql.generatedColumn.allowNullableIngest.enabled (default true) controls whether nullable non-generated columns can be omitted when writing to tables with generated columns
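
One way to read the nullable-ingest config is sketched below in plain Python. This is an interpretation of the description above, not the actual resolution logic: `Column`, `columns_to_null_fill`, and `MissingColumnError` are hypothetical names, and the sketch assumes that an omitted non-nullable, non-generated column is always an error regardless of the flag.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Column:
    name: str
    nullable: bool = True
    generated: bool = False


class MissingColumnError(Exception):
    """Hypothetical error for a column that may not be omitted."""


def columns_to_null_fill(
    schema: List[Column],
    provided: Set[str],
    allow_nullable_ingest: bool = True,  # models the config's default of true
) -> List[str]:
    """Return the omitted non-generated columns that would be filled with NULL."""
    null_filled = []
    for col in schema:
        if col.name in provided or col.generated:
            # Provided columns need no fill; generated ones use their expression.
            continue
        if col.nullable and allow_nullable_ingest:
            null_filled.append(col.name)
        else:
            raise MissingColumnError(col.name)
    return null_filled
```

With the flag enabled, omitting a nullable regular column yields a NULL fill; with it disabled, the same omission is rejected in this model.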

How was this patch tested?

New GeneratedColumnWriteSuite with 30 tests covering:

  • Auto-fill by name and by position
  • Constraint validation (matching/non-matching/NULL values)
  • Multiple generated columns, multi-column expressions, complex expressions (CAST)
  • Type coercion, column reordering, case-insensitive matching
  • INSERT OVERWRITE, INSERT SELECT, CTAS, DataFrame API
  • Generated column in middle of schema, non-trailing by position (error)
  • Generated partition columns
  • Config for nullable ingest
  • Combined with table CHECK constraints
  • Plan inspection (CheckInvariant present for user-provided, absent for auto-filled)
  • Streaming write blocking
  • Mix of auto-filled and user-provided generated columns

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-6)

Add support for auto-filling generated column values and enforcing
generated column constraints during V2 table writes (INSERT).

When a catalog declares SUPPORT_GENERATED_COLUMN_ON_WRITE:
- Missing generated columns are auto-filled using the generation expression
- User-provided generated column values are validated against the generation
  expression via CheckInvariant(EqualNullSafe(col, genExpr))
- MERGE, UPDATE, and streaming writes with generated columns are blocked
  until support is implemented for those operations

Key changes:
- TableCatalogCapability: Add SUPPORT_GENERATED_COLUMN_ON_WRITE
- CatalogV2Util: Encode generation expression in V2-to-StructField roundtrip
- TableOutputResolver: Fill missing generated columns in by-name and
  by-position write paths, checking generation expression before null defaults
- ResolveTableConstraints: Add generated column constraints using V2 expression
  with SQL parser fallback, only for user-provided columns
- Analyzer (ResolveOutputRelation): Gate on capability, strip generation
  expression metadata from auto-filled columns so constraints are scoped
  correctly
- RewriteRowLevelCommand: Block MERGE/UPDATE with generated columns
- ResolveWriteToStream: Block streaming writes with generated columns
- SQLConf: Add generatedColumn.allowNullableIngest.enabled config

Co-authored-by: Isaac