[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang by HyukjinKwon · Pull Request #56725 · apache/spark

HyukjinKwon · 2026-06-24T05:24:49Z

What changes were proposed in this pull request?

Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs SparkConnectClient._cleanup_ml_cache / _delete_ml_cache. If one is already in flight on the current thread, the nested call is skipped and a WARNING is logged instead of issuing a second blocking RPC.

Why are the changes needed?

A rare CI hang — e.g. pyspark.ml.tests.connect.test_parity_clustering timing out at 450s — traces to a re-entrant ML-cache RPC. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending RemoteModelRef finalizer (__del__ → del_remote_cache → _delete_ml_cache) on the same thread, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and the WARNING turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this cannot be proven to eliminate it — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running test_parity_clustering 15×, each run actually executing (~30s, not skipped) — all 15 passed.

❌ Before (450s hang, scheduled Build / Non-ANSI (branch-4.x, ...), module pyspark-ml-connect): https://github.com/apache/spark/actions/runs/28004040195
✅ After (this fix, test_parity_clustering ×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

…e cleanup RPC hang ### What changes were proposed in this pull request? Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a WARNING is logged instead of issuing a second blocking RPC. ### Why are the changes needed? A rare CI hang (e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s) traces to a re-entrant ML-cache RPC: while a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` -> `del_remote_cache` -> `_delete_ml_cache`) on the same thread, issuing a second blocking RPC that deadlocks the channel until the process/test timeout. The nested call is redundant (the in-flight RPC is already releasing server state, also evicted on session end), so skipping it is safe. The WARNING turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The underlying hang is a rare, timing-dependent flake that cannot be reproduced on demand, so this cannot be proven to eliminate it. It is a no-regression safety net: in normal operation no ML-cache RPC is in flight when another is issued. Verified on a fork by building Spark Connect and running `test_parity_clustering` 15x (each run actually executed, ~30s; not skipped): all 15 passed. ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code Co-authored-by: Isaac

HyukjinKwon · 2026-06-24T06:02:50Z

Automated code review (posted as a comment, not an approval) — via spark-dev:code-review.

Verdict: no blocking findings. Small, targeted, no-regression guard; the diagnosis (faulthandler dump showing same-thread finalizer re-entrancy) is sound and the fix matches it.

Design (Pass A): The shared _ml_cache_rpc_thread slot is declared as a class attribute used as a default — writes go through self. so each client instance gets its own value; there's no cross-instance leakage. try/finally correctly clears it. threading is imported (line 32). The guard is scoped exactly to the failure mode (same-thread re-entrancy via threading.get_ident()), so it won't false-skip a legitimately concurrent call from another thread.

Notes (non-blocking):

Skip isn't always strictly redundant. The PR text frames the nested delete as redundant, which is exactly true when the in-flight RPC is the full clean_cache (the observed test_parity_clustering case). But the guard also fires when a targeted _delete_ml_cache(refsA) is in flight and a finalizer requests _delete_ml_cache(refsB) — those refsB are then not proactively released and linger until session-end eviction. That's a safe tradeoff (avoiding a multi-minute hang >> proactive cache release) and the WARNING records it, but worth being aware of if stricter eviction is ever needed.
Single-slot flag, not concurrency-general. If one client is driven by multiple threads issuing ML-cache RPCs concurrently, the flag is a single slot — the failure mode is only "the guard may not fire" (no false skip, no incorrect result), since the target is same-thread finalizer re-entrancy. Fine for the intended scenario.

Verification caveat (already stated in the PR): the underlying hang is a rare (~1/2-months), timing-dependent flake that can't be reproduced on demand, so this can't be proven to eliminate it — it's a no-regression safety net plus a WARNING diagnostic that makes a recurrence in scheduled jobs observable. Confirmed no-regression by running test_parity_clustering 15× (each actually executing, ~30s) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28075822467

HyukjinKwon · 2026-06-24T10:42:58Z

Merged to master and branch-4.x.

…e cleanup RPC hang ### What changes were proposed in this pull request? Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a `WARNING` is logged instead of issuing a second blocking RPC. ### Why are the changes needed? A rare CI hang — e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s — traces to a **re-entrant ML-cache RPC**. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` → `del_remote_cache` → `_delete_ml_cache`) **on the same thread**, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and the `WARNING` turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this **cannot be proven to eliminate it** — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running `test_parity_clustering` **15×**, each run actually executing (~30s, not skipped) — all 15 passed. - ❌ Before (450s hang, scheduled `Build / Non-ANSI (branch-4.x, ...)`, module `pyspark-ml-connect`): https://github.com/apache/spark/actions/runs/28004040195 - ✅ After (this fix, `test_parity_clustering` ×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467 ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code This pull request and its description were written by Isaac. Closes #56725 from HyukjinKwon/SPARK-57655. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit 424330a) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

dongjoon-hyun approved these changes Jun 24, 2026

View reviewed changes

HyukjinKwon closed this in 424330a Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725

[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57655

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

HyukjinKwon commented Jun 24, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants