[SPARK-56750] default path config#55717
Conversation
8d33747 to
83dcf21
Compare
There was a problem hiding this comment.
AI-assisted self-review. The high-level direction (mirror spark.sql.defaultCatalog for PATH; align routine-resolution shortcuts with the live PATH) is sound. Findings below are mostly correctness/perf concerns and a few naming/docs nits — I've flagged the deadlock + test-default + last-mode behavior change as the higher-priority items.
Summary
- Prior state:
SET PATHexists, but with no admin-side default — every session mustSET PATHitself, andSET PATH = DEFAULT_PATHonly resolves to a fixeddefaultPathOrder. Function-resolution shortcuts (1-part lookup, the security gate that blocks temp-shadowing-builtin, thecount(*)→count(1)gate) readspark.sql.legacy.sessionFunctionResolutionOrderdirectly rather than the live PATH, so aSET PATH = system.session, system.builtin, ...doesn't actually flip those gates. - Design approach: new
spark.sql.defaultPathmirrorsspark.sql.defaultCatalog— session-level conf, accepts the fullSET PATHgrammar via a reused entry rule (singlePathElementList+parseSqlPathElements), with anisWorkspaceDefaultExpansioncycle break for innerDEFAULT_PATHtokens. Routine-resolution shortcuts now readCatalogManager.leadingSystemFunctionKinds/isSessionBeforeBuiltinInPath, both derived from the effective PATH. - Key decisions: (1) parse the conf on every read, no caching (see A3); (2) path-derive the system-kinds shortcut order rather than keep the legacy switch (see A4 for the "last" mode consequence); (3) wire kinds via a
varcallback set from theCatalogManagerconstructor body (see A1/A2); (4) validateSchemaInPatharity at expansion time, not parse time (see A8). - Implementation sketch: parser adds
singlePathElementList+parseSqlPathElements; ASTPathElementmoves into catalyst (connector.catalog) soCatalogManagercan callPathElement.expandfrom its conf-fallback path whileSetPathCommandstill calls it fromrun();CatalogManagergrowsworkspaceDefaultPathEntries,leadingSystemFunctionKinds,isSessionBeforeBuiltinInPath;SessionCatalogexposessessionFunctionKindsProvider(var) wired post-construction.
Inline comments are split across this review and the next (was hitting an API issue with a single 12-comment payload).
Review follow-up (SPARK-56750 / #55717)Thanks for the detailed review — here is how the latest commits address it. Correctness / threading
PATH → builtin/session “shortcut” semantics
Performance / naming / validation
|
78e29c0 to
8c57bb6
Compare
Review (SPARK-56750 / default PATH conf)OverallStrong direction: Wiring unqualified function shortcuts (COUNT rewrite, temp-vs-builtin security) to the live effective PATH via Deadlock / lockingRemoving Other follow-ups (non-blocking)
DDL fixScoping Tests
No blockers from this pass; approve with minor doc/concurrency nits above if useful. |
- Make PathElement and parsePathElements private[sql]. - Document the interleaved-PATH limitation in leadingSystemFunctionKinds. - Add SetPathSuite coverage for the leadingSystemFunctionKinds else branch.
- Canonicalize isSessionBeforeBuiltinInPath via leadingSystemFunctionKinds so the analyzer rewrite gate and SessionCatalog's security check share one primitive (Finding 4). - Document syntactic-only DEFAULT_PATH validation and USE-induced duplicate surfacing in spark.sql.defaultPath doc (Findings 6 + 7). - Replace @volatile var sessionFunctionKindsProvider with private field plus setSessionFunctionKindsProvider, making the single-writer wiring point grep-able (Finding 9).
First-match resolution at lookup time makes redundant entries dead code,
not errors. The previous validateNoDuplicateResolvedPath running on every
unqualified function lookup served only as typo detection AND could wedge
a session: a benign DEFAULT_PATH containing CURRENT_SCHEMA + a literal
would throw on every lookup once USE SCHEMA collided them.
- PathElement.validateNoStaticDuplicates replaces validateNoDuplicateResolvedPath:
rejects literal-vs-literal duplicates and repeated CurrentSchemaEntry markers
(cross-alias case), but no longer compares LiteralPathEntry against
CurrentSchemaEntry. The `current_schema` collision is USE-state-dependent
and harmless under first-match resolution.
- SetPathCommand uses the static check (still rejects user-input typos).
- CatalogManager.confDefaultPathEntries skips the duplicate check entirely
and caches expanded entries by (trimmed conf, sessionFunctionResolutionOrder)
so the lookup hot path is one AtomicReference.get on conf-stable sessions.
- Update SetPathSuite tests for the new error parameter on cross-alias
duplicates ("current_schema" instead of the resolved schema name) and
for the now-tolerated literal + CurrentSchemaEntry collision.
- Update SQLConf.DEFAULT_PATH doc to reflect the new tolerance.
Addresses the Finding 6 USE-induced wedge and Finding 8 hot-path overhead
in one go.
Two related changes that align fast-path kinds resolution with the user's authorization model and remove a stray dependency on the legacy sessionFunctionResolutionOrder conf: CatalogManager.leadingSystemFunctionKinds now distinguishes explicit vs implicit paths. When SET PATH or DEFAULT_PATH is in effect, every `system.builtin` / `system.session` entry on the resolved path is honored as written -- placing `system.session` (even after a user catalog) is the user's authorization for unqualified temp functions to resolve. The prefix-before-user-catalog pruning still applies for implicit paths (seeded by defaultPathOrder) so the no-SET-PATH form preserves the security property that temp functions stay unreachable for unqualified names. This fixes a regression where `SET PATH = system.builtin, <user>, system.session` falsely threw UNRESOLVED_ROUTINE for a temp function that was demonstrably on the path. New SetPathSuite test exercises this case. SessionCatalog.sessionFunctionKindsProvider default no longer switches on `conf.sessionFunctionResolutionOrder`. The conf is meant solely as a seed for `defaultPathOrder`; downstream code should not branch on its value. The default now derives kinds from `conf.systemPathOrder` (which IS the seeded default path), so the order conf can eventually be retired without touching SC.
The PATH is the path. We don't distinguish whether `current_catalog` came from `USE CATALOG` or from `defaultCatalog`; we shouldn't distinguish how PATH was set either. `pathPrefixBeforeFirstUserCatalog` and the explicit/implicit branch were workarounds for a "last"-mode behavior that itself was an artifact of the old conf-driven kinds. Kinds are now just the `system.builtin` / `system.session` entries on the effective resolved PATH, in path order, regardless of source. This subtly changes "last"-mode behavior in the previously-impossible case of a temp-only function: before, an unqualified call returned UNRESOLVED_ROUTINE because the fast path saw kinds=[Builtin] and the slow path filtered out `system.session.*`. After, it resolves to the temp -- which matches the documented intent of "builtin -> persistent -> session". The "no injection" property is preserved by path order: persistent precedes session in last mode, so the strategy layer resolves persistent first when both exist. Existing 13d/13e tests (SECTION 13d/13e in FunctionQualificationSuite) keep passing.
Per Spark's name-resolution layer model, CatalogManager (Coordination) should manage catalogs and provide context, not decide resolution order. isSessionBeforeBuiltinInPath was a Strategy predicate sitting on CM; move it to FunctionResolution where it belongs. - FunctionResolution.isSessionBeforeBuiltinInPath: instance method that reads the live PATH from CM and applies the kinds-extraction helper. Used by Analyzer's count(*) rewrite gate (replacing the call to catalogManager.isSessionBeforeBuiltinInPath). - CatalogManager.systemFunctionKindsFromPath (companion): pure data conversion -- given a path, extract `system.builtin` / `system.session` entries as SessionFunctionKind. No decision-making; callers decide whether and how to use the list. - CatalogManager's wiring lambda calls the companion helper directly; the instance no longer has its own systemFunctionKinds() or isSessionBeforeBuiltinInPath methods. - SessionCatalog's default kinds provider also calls the companion helper, deduplicating the inline kinds extraction. No behavior change. Pure refactor: same code, owned by the right layer.
The JIRA reference belongs in the PR description, not in source comments (rots over time and is redundant once the PR lands). Keep the why -- the deadlock rationale -- intact.
- Revert cosmetic funcName -> identifier.funcName in DropFunctionCommand (both branches of the surrounding if-else assign identifier.funcName to funcName, so the strings are identical; the change had no effect). - Drop the comment that just restated dropTempFunction's behavior. - Replace stale "workspace default conf" wording on sqlResolutionPathEntries Scaladoc with a precise reference to SQLConf.DEFAULT_PATH.
What changes were proposed in this pull request?
This PR implements SPARK-56750 by adding a default SQL PATH (parallel to the default catalog) and wiring path-driven semantics through function and namespace resolution when
spark.sql.path.enabledis true.Core feature
New config
spark.sql.defaultPath(SQLConf.DEFAULT_PATH):SET PATHwas issued, this value becomes the effective SQL PATH (mirroring howcurrentCatalogfalls back tospark.sql.defaultCatalog).SET PATH = DEFAULT_PATHexpands to this configured default.DEFAULT_PATHtoken inside the conf value resolves to the Spark built-in default ordering (cycle-safe expansion).Shared SET PATH grammar: A new
singlePathElementListrule andparsePathElementsparser entry letspark.sql.defaultPathreuse the exactSET PATHgrammar without re-implementation.CatalogManagerPATH materialization: A singlesessionPathEntriesresolves storedSET PATH, parsedDEFAULT_PATH, and the seededdefaultPathOrderfallback into one effective path.CURRENT_PATH()and unqualified function/relation resolution all read from this single source of truth.Path-driven function resolution
Fast-path kinds derived from the live PATH:
SessionCatalog's unqualified-lookup fast path receives an ordered list ofsystem.builtin/system.sessionkinds from the effective PATH (in path order) via a path-driven provider. Replaces the previous reliance onspark.sql.functionResolution.sessionOrder(which is now used solely as a seed fordefaultPathOrderand may eventually be retired).The path is the path:
current_pathis treated likecurrent_catalogandcurrent_schema— we don't distinguish whether it was set viaSET PATH, via thespark.sql.defaultPathconf, or by the seeded default. Whatever is on the path is searched as written. Placingsystem.sessionafter a user catalog is the user's authorization for unqualified temp functions to resolve through that position.Security check via path order: Creating a temp function with a builtin's name is blocked when
system.sessionis searched beforesystem.builtin. The "no injection of temp over persistent" property is enforced by path order itself: when a user catalog precedessystem.session, persistent resolution wins for unqualified names.COUNT(*) → COUNT(1)rewrite gate: The analyzer rewrite skips when a tempcountwould shadow the builtin under the live PATH. Shares the same canonical predicate (FunctionResolution.isSessionBeforeBuiltinInPath) as the security check.Layering
FunctionResolution), not Coordination (CatalogManager).CatalogManagerprovides path materialization (sqlResolutionPathEntries) and pure data helpers (systemFunctionKindsFromPath) — it doesn't decide resolution order.SessionCatalog's default kinds provider derives fromconf.systemPathOrder(the seeded default path) and never branches on the legacy resolution-order conf.Deadlock-safe
SessionCataloglookupslookupBuiltinOrTempTableFunctionandlookupFunctionInfoare intentionally not synchronized onSessionCatalog. The path-driven kinds provider may acquireCatalogManager.synchronized; another thread holding that lock can call intoSessionCatalog(e.g. viasetCurrentNamespace), so synchronizing the public lookup methods would introduce lock-order inversion. Tightersynchronizedblocks are kept around the temp-view-context post-processing where they're actually needed.Duplicate validation
SET PATHrejects static duplicates (literal-vs-literal, repeatedcurrent_schema/current_database) at statement time as a typo guard.current_schemais not flagged: the duplicate isUSE SCHEMA-state-dependent and harmless under first-match resolution.spark.sql.defaultPathruns no semantic duplicate check at lookup time — the previous lookup-time validate could wedge a session by throwing on every unqualified resolution afterUSE SCHEMAcollided a literal withcurrent_schema.DROP TEMPORARY FUNCTIONcorrectnessDROP TEMPORARY FUNCTION <builtin>withoutIF EXISTSraisesFORBIDDEN_OPERATIONagainst the builtin name as captured.IF EXISTSremains a no-op when no temp function exists.session.foo,system.session.foo) always target the temp registry.PathElementASTPathElementAST in catalyst (org.apache.spark.sql.connector.catalog.PathElement,private[sql]) withexpandandvalidateNoStaticDuplicateshelpers shared betweenSetPathCommandandCatalogManager.confDefaultPathEntries.Performance
CatalogManager.confDefaultPathEntriescaches expanded entries by(trimmed conf value, sessionFunctionResolutionOrder); on conf-stable sessions the lookup hot path is oneAtomicReference.get.CurrentSchemaEntrymarkers are preserved unresolved so the cache stays valid acrossUSE SCHEMA.Tests
SetPathSuite(47 tests) covers: lazy fallback, explicit override,SET PATH = DEFAULT_PATH, cycle break, invalid value, PATH disabled, path-driven security, user-only path, duplicate tolerance forDEFAULT_PATH, static duplicate rejection at SET PATH, the explicit-interleaved path case (system.builtin, <user>, system.sessionresolving a temp), and the leading-system-kinds else-branch.FunctionQualificationSuiteandRelationQualificationSuitecontinue to pass unchanged (130/130 across the three suites).Why are the changes needed?
Users need an OSS way to define a default SQL PATH before any explicit
SET PATH, mirroring howspark.sql.defaultCatalogprovides a session default for the current catalog. The change also fixes a semantic gap where path-driven function-resolution shortcuts (theCOUNT(*)rewrite gate, the temp-vs-builtin security check) ignoredSET PATHand only consulted the legacyspark.sql.functionResolution.sessionOrderproxy — soSET PATH = system.session, system.builtin, ...did not block creating a tempcountuntil the order conf was also flipped.Does this PR introduce any user-facing change?
Yes.
spark.sql.defaultPath.spark.sql.path.enabled=true, sessions without explicitSET PATHresolve using the configured default path.SET PATH = DEFAULT_PATHandCURRENT_PATH()reflect the configured default.SET PATH, withDEFAULT_PATHanddefaultPathOrderfallbacks). The temp-vs-builtin security check and theCOUNT(*) → COUNT(1)rewrite gate now reflectSET PATHchanges.sessionFunctionResolutionOrder=last" case: the unqualified call now resolves to the temp instead ofUNRESOLVED_ROUTINE. This matches the documented intent of "builtin → persistent → session". The "no temp injection over persistent" property is preserved by path order.How was this patch tested?
build/sbt "sql/testOnly org.apache.spark.sql.SetPathSuite"build/sbt "sql/testOnly *FunctionQualificationSuite *RelationQualificationSuite"build/sbt "sql/testOnly *DDLSuite -- -z \"drop built-in function\""build/sbt "catalyst/Compile/compile" "sql/Compile/compile" "sql/Test/compile"Was this patch authored or co-authored using generative AI tooling?
Yes. Some implementation and tests were iterated with coding-assistant tooling; the author reviewed and owns the final patch.