[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725
[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725HyukjinKwon wants to merge 1 commit into
Conversation
…e cleanup RPC hang ### What changes were proposed in this pull request? Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a WARNING is logged instead of issuing a second blocking RPC. ### Why are the changes needed? A rare CI hang (e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s) traces to a re-entrant ML-cache RPC: while a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` -> `del_remote_cache` -> `_delete_ml_cache`) on the same thread, issuing a second blocking RPC that deadlocks the channel until the process/test timeout. The nested call is redundant (the in-flight RPC is already releasing server state, also evicted on session end), so skipping it is safe. The WARNING turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The underlying hang is a rare, timing-dependent flake that cannot be reproduced on demand, so this cannot be proven to eliminate it. It is a no-regression safety net: in normal operation no ML-cache RPC is in flight when another is issued. Verified on a fork by building Spark Connect and running `test_parity_clustering` 15x (each run actually executed, ~30s; not skipped): all 15 passed. ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code Co-authored-by: Isaac
|
Automated code review (posted as a comment, not an approval) — via Verdict: no blocking findings. Small, targeted, no-regression guard; the diagnosis (faulthandler dump showing same-thread finalizer re-entrancy) is sound and the fix matches it. Design (Pass A): The shared Notes (non-blocking):
Verification caveat (already stated in the PR): the underlying hang is a rare (~1/2-months), timing-dependent flake that can't be reproduced on demand, so this can't be proven to eliminate it — it's a no-regression safety net plus a |
|
Merged to master and branch-4.x. |
…e cleanup RPC hang ### What changes were proposed in this pull request? Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a `WARNING` is logged instead of issuing a second blocking RPC. ### Why are the changes needed? A rare CI hang — e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s — traces to a **re-entrant ML-cache RPC**. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` → `del_remote_cache` → `_delete_ml_cache`) **on the same thread**, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and the `WARNING` turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this **cannot be proven to eliminate it** — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running `test_parity_clustering` **15×**, each run actually executing (~30s, not skipped) — all 15 passed. - ❌ Before (450s hang, scheduled `Build / Non-ANSI (branch-4.x, ...)`, module `pyspark-ml-connect`): https://github.com/apache/spark/actions/runs/28004040195 - ✅ After (this fix, `test_parity_clustering` ×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467 ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code This pull request and its description were written by Isaac. Closes #56725 from HyukjinKwon/SPARK-57655. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit 424330a) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
What changes were proposed in this pull request?
Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs
SparkConnectClient._cleanup_ml_cache/_delete_ml_cache. If one is already in flight on the current thread, the nested call is skipped and aWARNINGis logged instead of issuing a second blocking RPC.Why are the changes needed?
A rare CI hang — e.g.
pyspark.ml.tests.connect.test_parity_clusteringtiming out at 450s — traces to a re-entrant ML-cache RPC. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pendingRemoteModelReffinalizer (__del__→del_remote_cache→_delete_ml_cache) on the same thread, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and theWARNINGturns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs.Does this PR introduce any user-facing change?
No.
How was this patch tested?
The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this cannot be proven to eliminate it — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running
test_parity_clustering15×, each run actually executing (~30s, not skipped) — all 15 passed.Build / Non-ANSI (branch-4.x, ...), modulepyspark-ml-connect): https://github.com/apache/spark/actions/runs/28004040195test_parity_clustering×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code
This pull request and its description were written by Isaac.