[SPARK-57644][SQL] Support generated column values on V2 table writes #56712
Open
szehon-ho wants to merge 1 commit into
Add support for auto-filling generated column values and enforcing generated column constraints during V2 table writes (INSERT).

When a catalog declares SUPPORT_GENERATED_COLUMN_ON_WRITE:
- Missing generated columns are auto-filled using the generation expression
- User-provided generated column values are validated against the generation expression via CheckInvariant(EqualNullSafe(col, genExpr))
- MERGE, UPDATE, and streaming writes with generated columns are blocked until support is implemented for those operations

Key changes:
- TableCatalogCapability: Add SUPPORT_GENERATED_COLUMN_ON_WRITE
- CatalogV2Util: Encode generation expression in V2-to-StructField roundtrip
- TableOutputResolver: Fill missing generated columns in by-name and by-position write paths, checking generation expression before null defaults
- ResolveTableConstraints: Add generated column constraints using V2 expression with SQL parser fallback, only for user-provided columns
- Analyzer (ResolveOutputRelation): Gate on capability, strip generation expression metadata from auto-filled columns so constraints are scoped correctly
- RewriteRowLevelCommand: Block MERGE/UPDATE with generated columns
- ResolveWriteToStream: Block streaming writes with generated columns
- SQLConf: Add generatedColumn.allowNullableIngest.enabled config

Co-authored-by: Isaac
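The fill-and-validate behavior described above can be pictured with a small, self-contained model. This is a minimal sketch in plain Python, not Spark code: `resolve_write`, `null_safe_equal`, `CheckConstraintViolation`, and the `area` column are hypothetical names chosen for illustration, and rows are modeled as dictionaries rather than as an analyzed logical plan.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
GenExpr = Callable[[Row], Any]


class CheckConstraintViolation(Exception):
    """Hypothetical stand-in for the CHECK_CONSTRAINT_VIOLATION error."""


def null_safe_equal(a: Any, b: Any) -> bool:
    # Null-safe equality: two NULLs compare equal, NULL vs. a value does not.
    if a is None or b is None:
        return a is None and b is None
    return a == b


def resolve_write(row: Row, generated: Dict[str, GenExpr]) -> Row:
    """Fill omitted generated columns; validate the ones the user supplied."""
    out = dict(row)
    for name, expr in generated.items():
        expected = expr(row)
        if name not in row:
            # Column omitted by the writer: compute it from the expression.
            out[name] = expected
        elif not null_safe_equal(row[name], expected):
            # Column supplied by the writer: it must match the expression.
            raise CheckConstraintViolation(
                f"{name}: got {row[name]!r}, expected {expected!r}"
            )
    return out


# Assumed example table: area is generated as w * h (NULL if either is NULL).
GENERATED: Dict[str, GenExpr] = {
    "area": lambda r: None
    if r.get("w") is None or r.get("h") is None
    else r["w"] * r["h"]
}

filled = resolve_write({"w": 2, "h": 3}, GENERATED)
```

Null-safe equality matters for the validation step: a plain equality check would yield NULL rather than true when both the supplied value and the generated value are NULL, so a legitimately NULL generated value could not be confirmed as matching.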
What changes were proposed in this pull request?
Add support for auto-filling generated column values and enforcing generated column constraints during V2 table writes (INSERT).
When a catalog declares `SUPPORT_GENERATED_COLUMN_ON_WRITE`:
- Missing generated columns are auto-filled using the generation expression, via a `Project` node added during analysis
- User-provided generated column values are validated against the generation expression via `CheckInvariant(EqualNullSafe(col, genExpr))`
- MERGE, UPDATE, and streaming writes with generated columns are blocked until support is implemented for those operations

Key changes:
- `TableCatalogCapability`: New `SUPPORT_GENERATED_COLUMN_ON_WRITE` capability
- `CatalogV2Util`: Encode generation expression in V2-to-StructField metadata roundtrip
- `TableOutputResolver`: Fill missing generated columns in by-name and by-position write paths (checking generation expression before null defaults)
- `ResolveTableConstraints`: Add generated column constraints using V2 expression with SQL parser fallback, scoped to user-provided columns only
- `Analyzer` (`ResolveOutputRelation`): Gate on capability; strip generation expression metadata from auto-filled columns so constraints are scoped correctly
- `RewriteRowLevelCommand`: Block MERGE/UPDATE with generated columns
- `ResolveWriteToStream`: Block streaming writes with generated columns
- `SQLConf`: New `spark.sql.generatedColumn.allowNullableIngest.enabled` config

Why are the changes needed?
Generated columns (defined with `GENERATED ALWAYS AS (expr)`) currently only support DDL: the generation expression is stored but never evaluated during writes. This means connectors must handle generated column values themselves (e.g., Delta Lake does this in its own execution layer).

This PR adds Spark-native support so that any V2 catalog can opt in to having Spark auto-compute generated column values during INSERT. This brings generated columns closer to parity with default column values, which Spark already handles during writes.
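The opt-in model can be illustrated with a toy capability gate. This is a sketch under stated assumptions, written in plain Python rather than against Spark's API: `plan_write`, `UnsupportedTableOperation`, the operation strings, and the returned labels are all hypothetical. Only the capability name and the set of blocked operations (MERGE, UPDATE, streaming writes) come from this PR.

```python
from enum import Enum, auto
from typing import FrozenSet


class Capability(Enum):
    # Capability name taken from this PR; the enum itself is illustrative.
    SUPPORT_GENERATED_COLUMN_ON_WRITE = auto()


class UnsupportedTableOperation(Exception):
    """Hypothetical stand-in for UNSUPPORTED_FEATURE.TABLE_OPERATION."""


# Operations the PR blocks on tables with generated columns.
BLOCKED_OPERATIONS: FrozenSet[str] = frozenset(
    {"MERGE", "UPDATE", "STREAMING_WRITE"}
)


def plan_write(
    operation: str,
    capabilities: FrozenSet[Capability],
    has_generated_columns: bool,
) -> str:
    """Decide who is responsible for generated column values on a write."""
    opted_in = Capability.SUPPORT_GENERATED_COLUMN_ON_WRITE in capabilities
    if not has_generated_columns or not opted_in:
        # No opt-in: the engine does nothing extra and the connector
        # remains responsible for generated values.
        return "connector-handled"
    if operation in BLOCKED_OPERATIONS:
        raise UnsupportedTableOperation(operation)
    # Assumption: any remaining operation here is an INSERT.
    return "engine-computed"


opted_in_caps = frozenset({Capability.SUPPORT_GENERATED_COLUMN_ON_WRITE})
decision = plan_write("INSERT", opted_in_caps, has_generated_columns=True)
```

Keeping the gate on the catalog side means existing connectors that compute generated values themselves are unaffected unless they declare the capability.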
Does this PR introduce any user-facing change?
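Yes. For catalogs that declare `SUPPORT_GENERATED_COLUMN_ON_WRITE`:
- INSERT auto-fills generated columns that are missing from the write
- User-provided generated column values raise a `CHECK_CONSTRAINT_VIOLATION` error if the value doesn't match the generation expression
- MERGE, UPDATE, and streaming writes with generated columns fail with `UNSUPPORTED_FEATURE.TABLE_OPERATION`
- New config `spark.sql.generatedColumn.allowNullableIngest.enabled` (default `true`) controls whether nullable non-generated columns can be omitted when writing to tables with generated columns

How was this patch tested?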
New `GeneratedColumnWriteSuite` with 30 tests.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-6)