Skip to content

[SPARK-56750] default path config#55717

Open
srielau wants to merge 10 commits intoapache:masterfrom
srielau:SPARK-56750-default-path
Open

[SPARK-56750] default path config#55717
srielau wants to merge 10 commits intoapache:masterfrom
srielau:SPARK-56750-default-path

Conversation

@srielau
Copy link
Copy Markdown
Contributor

@srielau srielau commented May 6, 2026

What changes were proposed in this pull request?

This PR implements SPARK-56750 by adding a default SQL PATH (parallel to the default catalog) and wiring path-driven semantics through function and namespace resolution when spark.sql.path.enabled is true.

Core feature

  • New config spark.sql.defaultPath (SQLConf.DEFAULT_PATH):

    • String session conf (default empty), accepts the full SET PATH grammar.
    • When PATH is enabled and no explicit SET PATH was issued, this value becomes the effective SQL PATH (mirroring how currentCatalog falls back to spark.sql.defaultCatalog).
    • SET PATH = DEFAULT_PATH expands to this configured default.
    • An inner DEFAULT_PATH token inside the conf value resolves to the Spark built-in default ordering (cycle-safe expansion).
    • The conf is validated for syntax at set time; redundant entries are tolerated at lookup time (first-match resolution makes them harmless).
  • Shared SET PATH grammar: A new singlePathElementList rule and parsePathElements parser entry let spark.sql.defaultPath reuse the exact SET PATH grammar without re-implementation.

  • CatalogManager PATH materialization: A single sessionPathEntries resolves stored SET PATH, parsed DEFAULT_PATH, and the seeded defaultPathOrder fallback into one effective path. CURRENT_PATH() and unqualified function/relation resolution all read from this single source of truth.

Path-driven function resolution

  • Fast-path kinds derived from the live PATH: SessionCatalog's unqualified-lookup fast path receives an ordered list of system.builtin / system.session kinds from the effective PATH (in path order) via a path-driven provider. Replaces the previous reliance on spark.sql.functionResolution.sessionOrder (which is now used solely as a seed for defaultPathOrder and may eventually be retired).

  • The path is the path: current_path is treated like current_catalog and current_schema — we don't distinguish whether it was set via SET PATH, via the spark.sql.defaultPath conf, or by the seeded default. Whatever is on the path is searched as written. Placing system.session after a user catalog is the user's authorization for unqualified temp functions to resolve through that position.

  • Security check via path order: Creating a temp function with a builtin's name is blocked when system.session is searched before system.builtin. The "no injection of temp over persistent" property is enforced by path order itself: when a user catalog precedes system.session, persistent resolution wins for unqualified names.

  • COUNT(*) → COUNT(1) rewrite gate: The analyzer rewrite skips when a temp count would shadow the builtin under the live PATH. Shares the same canonical predicate (FunctionResolution.isSessionBeforeBuiltinInPath) as the security check.

Layering

  • Resolution-order logic lives at the Strategy layer (FunctionResolution), not Coordination (CatalogManager). CatalogManager provides path materialization (sqlResolutionPathEntries) and pure data helpers (systemFunctionKindsFromPath) — it doesn't decide resolution order. SessionCatalog's default kinds provider derives from conf.systemPathOrder (the seeded default path) and never branches on the legacy resolution-order conf.

Deadlock-safe SessionCatalog lookups

  • lookupBuiltinOrTempTableFunction and lookupFunctionInfo are intentionally not synchronized on SessionCatalog. The path-driven kinds provider may acquire CatalogManager.synchronized; another thread holding that lock can call into SessionCatalog (e.g. via setCurrentNamespace), so synchronizing the public lookup methods would introduce lock-order inversion. Tighter synchronized blocks are kept around the temp-view-context post-processing where they're actually needed.

Duplicate validation

  • SET PATH rejects static duplicates (literal-vs-literal, repeated current_schema / current_database) at statement time as a typo guard.
  • A literal that happens to match the live current_schema is not flagged: the duplicate is USE SCHEMA-state-dependent and harmless under first-match resolution.
  • spark.sql.defaultPath runs no semantic duplicate check at lookup time — the previous lookup-time validate could wedge a session by throwing on every unqualified resolution after USE SCHEMA collided a literal with current_schema.

DROP TEMPORARY FUNCTION correctness

  • Unqualified DROP TEMPORARY FUNCTION <builtin> without IF EXISTS raises FORBIDDEN_OPERATION against the builtin name as captured.
  • IF EXISTS remains a no-op when no temp function exists.
  • Qualified temporary namespaces (session.foo, system.session.foo) always target the temp registry.

PathElement AST

  • New PathElement AST in catalyst (org.apache.spark.sql.connector.catalog.PathElement, private[sql]) with expand and validateNoStaticDuplicates helpers shared between SetPathCommand and CatalogManager.confDefaultPathEntries.

Performance

  • CatalogManager.confDefaultPathEntries caches expanded entries by (trimmed conf value, sessionFunctionResolutionOrder); on conf-stable sessions the lookup hot path is one AtomicReference.get. CurrentSchemaEntry markers are preserved unresolved so the cache stays valid across USE SCHEMA.

Tests

  • SetPathSuite (47 tests) covers: lazy fallback, explicit override, SET PATH = DEFAULT_PATH, cycle break, invalid value, PATH disabled, path-driven security, user-only path, duplicate tolerance for DEFAULT_PATH, static duplicate rejection at SET PATH, the explicit-interleaved path case (system.builtin, <user>, system.session resolving a temp), and the leading-system-kinds else-branch.
  • FunctionQualificationSuite and RelationQualificationSuite continue to pass unchanged (130/130 across the three suites).

Why are the changes needed?

Users need an OSS way to define a default SQL PATH before any explicit SET PATH, mirroring how spark.sql.defaultCatalog provides a session default for the current catalog. The change also fixes a semantic gap where path-driven function-resolution shortcuts (the COUNT(*) rewrite gate, the temp-vs-builtin security check) ignored SET PATH and only consulted the legacy spark.sql.functionResolution.sessionOrder proxy — so SET PATH = system.session, system.builtin, ... did not block creating a temp count until the order conf was also flipped.

Does this PR introduce any user-facing change?

Yes.

  • New conf: spark.sql.defaultPath.
  • With spark.sql.path.enabled=true, sessions without explicit SET PATH resolve using the configured default path.
  • SET PATH = DEFAULT_PATH and CURRENT_PATH() reflect the configured default.
  • Function resolution order for unqualified names follows the effective PATH (post-SET PATH, with DEFAULT_PATH and defaultPathOrder fallbacks). The temp-vs-builtin security check and the COUNT(*) → COUNT(1) rewrite gate now reflect SET PATH changes.
  • Subtle behavior change in the previously-impossible "temp-only function in sessionFunctionResolutionOrder=last" case: the unqualified call now resolves to the temp instead of UNRESOLVED_ROUTINE. This matches the documented intent of "builtin → persistent → session". The "no temp injection over persistent" property is preserved by path order.

How was this patch tested?

  • build/sbt "sql/testOnly org.apache.spark.sql.SetPathSuite"
  • build/sbt "sql/testOnly *FunctionQualificationSuite *RelationQualificationSuite"
  • build/sbt "sql/testOnly *DDLSuite -- -z \"drop built-in function\""
  • build/sbt "catalyst/Compile/compile" "sql/Compile/compile" "sql/Test/compile"

Was this patch authored or co-authored using generative AI tooling?

Yes. Some implementation and tests were iterated with coding-assistant tooling; the author reviewed and owns the final patch.

@srielau srielau force-pushed the SPARK-56750-default-path branch from 8d33747 to 83dcf21 Compare May 6, 2026 17:59
Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stale debug review — see the AI-assisted self-review further down for the actual content)

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stale debug review — see the AI-assisted self-review further down for the actual content)

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stale debug review — see the AI-assisted self-review further down for the actual content)

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stale debug review — see the AI-assisted self-review further down for the actual content)

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stale debug review — see the AI-assisted self-review further down for the actual content)

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI-assisted self-review. The high-level direction (mirror spark.sql.defaultCatalog for PATH; align routine-resolution shortcuts with the live PATH) is sound. Findings below are mostly correctness/perf concerns and a few naming/docs nits — I've flagged the deadlock + test-default + last-mode behavior change as the higher-priority items.

Summary

  • Prior state: SET PATH exists, but with no admin-side default — every session must SET PATH itself, and SET PATH = DEFAULT_PATH only resolves to a fixed defaultPathOrder. Function-resolution shortcuts (1-part lookup, the security gate that blocks temp-shadowing-builtin, the count(*)→count(1) gate) read spark.sql.legacy.sessionFunctionResolutionOrder directly rather than the live PATH, so a SET PATH = system.session, system.builtin, ... doesn't actually flip those gates.
  • Design approach: new spark.sql.defaultPath mirrors spark.sql.defaultCatalog — session-level conf, accepts the full SET PATH grammar via a reused entry rule (singlePathElementList + parseSqlPathElements), with an isWorkspaceDefaultExpansion cycle break for inner DEFAULT_PATH tokens. Routine-resolution shortcuts now read CatalogManager.leadingSystemFunctionKinds / isSessionBeforeBuiltinInPath, both derived from the effective PATH.
  • Key decisions: (1) parse the conf on every read, no caching (see A3); (2) path-derive the system-kinds shortcut order rather than keep the legacy switch (see A4 for the "last" mode consequence); (3) wire kinds via a var callback set from the CatalogManager constructor body (see A1/A2); (4) validate SchemaInPath arity at expansion time, not parse time (see A8).
  • Implementation sketch: parser adds singlePathElementList + parseSqlPathElements; AST PathElement moves into catalyst (connector.catalog) so CatalogManager can call PathElement.expand from its conf-fallback path while SetPathCommand still calls it from run(); CatalogManager grows workspaceDefaultPathEntries, leadingSystemFunctionKinds, isSessionBeforeBuiltinInPath; SessionCatalog exposes sessionFunctionKindsProvider (var) wired post-construction.

Inline comments are split across this review and the next (was hitting an API issue with a single 12-comment payload).

Copy link
Copy Markdown
Contributor Author

@srielau srielau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(continuation of the AI-assisted self-review — see the parent review above for the full summary; this entry hosts inline comments 7-12 because the GitHub API rejected the 12-comment payload as a single review)

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated
@srielau
Copy link
Copy Markdown
Contributor Author

srielau commented May 6, 2026

Review follow-up (SPARK-56750 / #55717)

Thanks for the detailed review — here is how the latest commits address it.

Correctness / threading

  • Deadlock (SC ⇄ CM): Removed synchronized from SessionCatalog.lookupBuiltinOrTempTableFunction so the temp metadata path does not hold SessionCatalog while invoking sessionFunctionKindsProviderCatalogManager (re-entrant synchronized on CM is still possible; the AB-BA lock order with setCurrentNamespace is avoided).
  • sessionFunctionKindsProvider default: Restored the SESSION_FUNCTION_RESOLUTION_ORDER-based default when no CatalogManager wiring is present (direct SessionCatalog in tests).
  • @volatile on the provider var for safe publication after CatalogManager ctor wiring.

PATH → builtin/session “shortcut” semantics

  • \"last\" mode / DESCRIBE / quick-check: System kinds are taken from the prefix before the first user-catalog entry when that prefix contains any system.* entries (legacy builtin-only shortcut preserved).
  • If that prefix has no system entries but the full path does (e.g. user schema before system.builtin), kinds come from the full path (no magic default list).
  • Removed the implicit Seq(Builtin, Temp) fallback when no system segments match; added a test that a user-only PATH does not resolve unqualified builtins (UNRESOLVED_ROUTINE for abs).
  • isSystemBuiltinPathEntry / isSystemSessionPathEntry: Now case-insensitive so stored paths like System.Builtin still classify as system entries (fixes empty kinds + current_path() for case-preserved literals).

Performance / naming / validation

  • spark.sql.defaultPath: Renamed workspace*confDefaultPath* / isConfDefaultExpansion; parseSqlPathElementsparsePathElements on AbstractSqlParser.
  • Parse cache: Cached parsed PathElement list by conf string (ANTLR only); expand + duplicate validation still run on each materialization so CURRENT_SCHEMA / live catalog stay correct.
  • Duplicates: Shared PathElement.validateNoDuplicateResolvedPath for SET PATH and conf default materialization (same rules as command run).
  • Single-part schema in PATH: Validated in AstBuilder.visitPathElement (fail at parse/analysis with path position), not on every expand.
  • DEFAULT_PATH: .checkValue parses via parsePathElements; doc string fixed (no broken [[...]] in user-facing text; references spark.sql.functionResolution.sessionOrder by key).

DROP TEMPORARY FUNCTION IF EXISTS session.count (cleanup in tests)

  • This was not the analyzer binding session.count to the builtin. DropFunctionCommand stripped to count and then threw cannotDropBuiltinFuncError when no temp existed but a builtin collided — before calling dropTempFunction, which made IF EXISTS cleanup fail incorrectly.
  • Fix: Removed that guard; dropTempFunction only touches the temp registry and already honors ignoreIfNotExists. Tests use SQL DROP TEMPORARY FUNCTION IF EXISTS session.count again.

Non-action / follow-up

  • “Coordination vs strategy” note on where leadingSystemFunctionKinds lives: acknowledged; left as-is for this PR to avoid a larger refactor (happy to move toward FunctionResolution in a follow-up if reviewers want that split explicitly).

If anything above should be tweaked before merge, call it out inline and I will adjust.

@srielau srielau force-pushed the SPARK-56750-default-path branch from 78e29c0 to 8c57bb6 Compare May 7, 2026 14:51
@srielau
Copy link
Copy Markdown
Contributor Author

srielau commented May 7, 2026

Review (SPARK-56750 / default PATH conf)

Overall

Strong direction: spark.sql.defaultPath mirrors default-catalog style admin defaults; shared PATH grammar for conf vs SET PATH avoids drift; DEFAULT_PATH cycle break, duplicate validation shared with SET PATH, and case-insensitive system.* path entries are all solid.

Wiring unqualified function shortcuts (COUNT rewrite, temp-vs-builtin security) to the live effective PATH via CatalogManager fixes a real semantic gap vs spark.sql.functionResolution.sessionOrder alone.

Deadlock / locking

Removing synchronized from the outer lookupBuiltinOrTempTableFunction / lookupFunctionInfo while keeping narrower sync for temp/view-context paths is a reasonable fix for SessionCatalog ⇄ CatalogManager lock order. Follow-up suggestion: a one-line Scaladoc note that a single lookup is best-effort consistent if PATH or temp registry changes mid-flight—sets expectations for reviewers and future readers.

Other follow-ups (non-blocking)

  • confDefaultPathElementCache: confirm any path that mutates spark.sql.defaultPath without a full catalog reset still invalidates or replaces the cache as needed (session SET likely OK).
  • SQLConf.DEFAULT_PATH doc: double-check the empty-default story matches code (defaultPathOrder / sessionFunctionResolutionOrder vs sessionPathEntries when PATH is on and both stored path and conf are empty).
  • MiMa / API: if PathElement moved packages, note any connector/binary compatibility expectations (if it was previously only private[sql], probably fine).

DDL fix

Scoping DROP TEMPORARY FUNCTION builtin collision to unqualified + !ifExists and delegating to dropTempFunction for IF EXISTS matches the described bug and looks correct.

Tests

SetPathSuite coverage for lazy fallback, overrides, SET PATH = DEFAULT_PATH, cycle break, invalid conf, PATH off, PATH-driven security, user-only PATH (no implicit builtins), and duplicates is appropriate.

No blockers from this pass; approve with minor doc/concurrency nits above if useful.

srielau added 8 commits May 7, 2026 17:02
- Make PathElement and parsePathElements private[sql].
- Document the interleaved-PATH limitation in leadingSystemFunctionKinds.
- Add SetPathSuite coverage for the leadingSystemFunctionKinds else branch.
- Canonicalize isSessionBeforeBuiltinInPath via leadingSystemFunctionKinds
  so the analyzer rewrite gate and SessionCatalog's security check share
  one primitive (Finding 4).
- Document syntactic-only DEFAULT_PATH validation and USE-induced duplicate
  surfacing in spark.sql.defaultPath doc (Findings 6 + 7).
- Replace @volatile var sessionFunctionKindsProvider with private field
  plus setSessionFunctionKindsProvider, making the single-writer wiring
  point grep-able (Finding 9).
First-match resolution at lookup time makes redundant entries dead code,
not errors. The previous validateNoDuplicateResolvedPath running on every
unqualified function lookup served only as typo detection AND could wedge
a session: a benign DEFAULT_PATH containing CURRENT_SCHEMA + a literal
would throw on every lookup once USE SCHEMA collided them.

- PathElement.validateNoStaticDuplicates replaces validateNoDuplicateResolvedPath:
  rejects literal-vs-literal duplicates and repeated CurrentSchemaEntry markers
  (cross-alias case), but no longer compares LiteralPathEntry against
  CurrentSchemaEntry. The `current_schema` collision is USE-state-dependent
  and harmless under first-match resolution.
- SetPathCommand uses the static check (still rejects user-input typos).
- CatalogManager.confDefaultPathEntries skips the duplicate check entirely
  and caches expanded entries by (trimmed conf, sessionFunctionResolutionOrder)
  so the lookup hot path is one AtomicReference.get on conf-stable sessions.
- Update SetPathSuite tests for the new error parameter on cross-alias
  duplicates ("current_schema" instead of the resolved schema name) and
  for the now-tolerated literal + CurrentSchemaEntry collision.
- Update SQLConf.DEFAULT_PATH doc to reflect the new tolerance.

Addresses the Finding 6 USE-induced wedge and Finding 8 hot-path overhead
in one go.
Two related changes that align fast-path kinds resolution with the user's
authorization model and remove a stray dependency on the legacy
sessionFunctionResolutionOrder conf:

CatalogManager.leadingSystemFunctionKinds now distinguishes explicit vs
implicit paths. When SET PATH or DEFAULT_PATH is in effect, every
`system.builtin` / `system.session` entry on the resolved path is honored
as written -- placing `system.session` (even after a user catalog) is the
user's authorization for unqualified temp functions to resolve. The
prefix-before-user-catalog pruning still applies for implicit paths
(seeded by defaultPathOrder) so the no-SET-PATH form preserves the
security property that temp functions stay unreachable for unqualified
names.

This fixes a regression where `SET PATH = system.builtin, <user>,
system.session` falsely threw UNRESOLVED_ROUTINE for a temp function
that was demonstrably on the path. New SetPathSuite test exercises this
case.

SessionCatalog.sessionFunctionKindsProvider default no longer switches
on `conf.sessionFunctionResolutionOrder`. The conf is meant solely as
a seed for `defaultPathOrder`; downstream code should not branch on its
value. The default now derives kinds from `conf.systemPathOrder` (which
IS the seeded default path), so the order conf can eventually be retired
without touching SC.
The PATH is the path. We don't distinguish whether `current_catalog`
came from `USE CATALOG` or from `defaultCatalog`; we shouldn't
distinguish how PATH was set either. `pathPrefixBeforeFirstUserCatalog`
and the explicit/implicit branch were workarounds for a "last"-mode
behavior that itself was an artifact of the old conf-driven kinds.

Kinds are now just the `system.builtin` / `system.session` entries on
the effective resolved PATH, in path order, regardless of source.

This subtly changes "last"-mode behavior in the previously-impossible
case of a temp-only function: before, an unqualified call returned
UNRESOLVED_ROUTINE because the fast path saw kinds=[Builtin] and the
slow path filtered out `system.session.*`. After, it resolves to the
temp -- which matches the documented intent of "builtin -> persistent
-> session". The "no injection" property is preserved by path order:
persistent precedes session in last mode, so the strategy layer
resolves persistent first when both exist. Existing 13d/13e tests
(SECTION 13d/13e in FunctionQualificationSuite) keep passing.
Per Spark's name-resolution layer model, CatalogManager (Coordination)
should manage catalogs and provide context, not decide resolution order.
isSessionBeforeBuiltinInPath was a Strategy predicate sitting on CM;
move it to FunctionResolution where it belongs.

- FunctionResolution.isSessionBeforeBuiltinInPath: instance method that
  reads the live PATH from CM and applies the kinds-extraction helper.
  Used by Analyzer's count(*) rewrite gate (replacing the call to
  catalogManager.isSessionBeforeBuiltinInPath).
- CatalogManager.systemFunctionKindsFromPath (companion): pure data
  conversion -- given a path, extract `system.builtin` / `system.session`
  entries as SessionFunctionKind. No decision-making; callers decide
  whether and how to use the list.
- CatalogManager's wiring lambda calls the companion helper directly;
  the instance no longer has its own systemFunctionKinds() or
  isSessionBeforeBuiltinInPath methods.
- SessionCatalog's default kinds provider also calls the companion
  helper, deduplicating the inline kinds extraction.

No behavior change. Pure refactor: same code, owned by the right layer.
The JIRA reference belongs in the PR description, not in source comments
(rots over time and is redundant once the PR lands). Keep the why -- the
deadlock rationale -- intact.
- Revert cosmetic funcName -> identifier.funcName in DropFunctionCommand
  (both branches of the surrounding if-else assign identifier.funcName to
  funcName, so the strings are identical; the change had no effect).
- Drop the comment that just restated dropTempFunction's behavior.
- Replace stale "workspace default conf" wording on sqlResolutionPathEntries
  Scaladoc with a precise reference to SQLConf.DEFAULT_PATH.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant