Skip to content

[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57655
Closed

[SPARK-57655][CONNECT][PYTHON] Avoid re-entrant Spark Connect ML cache cleanup RPC hang#56725
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57655

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs SparkConnectClient._cleanup_ml_cache / _delete_ml_cache. If one is already in flight on the current thread, the nested call is skipped and a WARNING is logged instead of issuing a second blocking RPC.

Why are the changes needed?

A rare CI hang — e.g. pyspark.ml.tests.connect.test_parity_clustering timing out at 450s — traces to a re-entrant ML-cache RPC. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending RemoteModelRef finalizer (__del__del_remote_cache_delete_ml_cache) on the same thread, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and the WARNING turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this cannot be proven to eliminate it — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running test_parity_clustering 15×, each run actually executing (~30s, not skipped) — all 15 passed.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

…e cleanup RPC hang

### What changes were proposed in this pull request?
Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs
`SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on
the current thread, the nested call is skipped and a WARNING is logged instead of issuing a
second blocking RPC.

### Why are the changes needed?
A rare CI hang (e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s)
traces to a re-entrant ML-cache RPC: while a cleanup/delete RPC is blocked in gRPC with the GIL
released, CPython runs a pending `RemoteModelRef` finalizer
(`__del__` -> `del_remote_cache` -> `_delete_ml_cache`) on the same thread, issuing a second
blocking RPC that deadlocks the channel until the process/test timeout. The nested call is
redundant (the in-flight RPC is already releasing server state, also evicted on session end), so
skipping it is safe. The WARNING turns a silent multi-minute hang into an observable, attributable
signal if it recurs in scheduled jobs.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
The underlying hang is a rare, timing-dependent flake that cannot be reproduced on demand, so this
cannot be proven to eliminate it. It is a no-regression safety net: in normal operation no ML-cache
RPC is in flight when another is issued. Verified on a fork by building Spark Connect and running
`test_parity_clustering` 15x (each run actually executed, ~30s; not skipped): all 15 passed.

### Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code

Co-authored-by: Isaac
@HyukjinKwon

Copy link
Copy Markdown
Member Author

Automated code review (posted as a comment, not an approval) — via spark-dev:code-review.

Verdict: no blocking findings. Small, targeted, no-regression guard; the diagnosis (faulthandler dump showing same-thread finalizer re-entrancy) is sound and the fix matches it.

Design (Pass A): The shared _ml_cache_rpc_thread slot is declared as a class attribute used as a default — writes go through self. so each client instance gets its own value; there's no cross-instance leakage. try/finally correctly clears it. threading is imported (line 32). The guard is scoped exactly to the failure mode (same-thread re-entrancy via threading.get_ident()), so it won't false-skip a legitimately concurrent call from another thread.

Notes (non-blocking):

  1. Skip isn't always strictly redundant. The PR text frames the nested delete as redundant, which is exactly true when the in-flight RPC is the full clean_cache (the observed test_parity_clustering case). But the guard also fires when a targeted _delete_ml_cache(refsA) is in flight and a finalizer requests _delete_ml_cache(refsB) — those refsB are then not proactively released and linger until session-end eviction. That's a safe tradeoff (avoiding a multi-minute hang >> proactive cache release) and the WARNING records it, but worth being aware of if stricter eviction is ever needed.
  2. Single-slot flag, not concurrency-general. If one client is driven by multiple threads issuing ML-cache RPCs concurrently, the flag is a single slot — the failure mode is only "the guard may not fire" (no false skip, no incorrect result), since the target is same-thread finalizer re-entrancy. Fine for the intended scenario.

Verification caveat (already stated in the PR): the underlying hang is a rare (~1/2-months), timing-dependent flake that can't be reproduced on demand, so this can't be proven to eliminate it — it's a no-regression safety net plus a WARNING diagnostic that makes a recurrence in scheduled jobs observable. Confirmed no-regression by running test_parity_clustering 15× (each actually executing, ~30s) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28075822467

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Merged to master and branch-4.x.

HyukjinKwon added a commit that referenced this pull request Jun 24, 2026
…e cleanup RPC hang

### What changes were proposed in this pull request?
Add a same-thread re-entrancy guard around the best-effort ML-cache RPCs `SparkConnectClient._cleanup_ml_cache` / `_delete_ml_cache`. If one is already in flight on the current thread, the nested call is skipped and a `WARNING` is logged instead of issuing a second blocking RPC.

### Why are the changes needed?
A rare CI hang — e.g. `pyspark.ml.tests.connect.test_parity_clustering` timing out at 450s — traces to a **re-entrant ML-cache RPC**. While a cleanup/delete RPC is blocked in gRPC with the GIL released, CPython runs a pending `RemoteModelRef` finalizer (`__del__` → `del_remote_cache` → `_delete_ml_cache`) **on the same thread**, issuing a second blocking RPC that deadlocks the channel until the process/test timeout (faulthandler dump confirmed the re-entrant stack). The nested call is redundant — the in-flight RPC is already releasing server-side state, which is also evicted on session end — so skipping it is safe, and the `WARNING` turns a silent multi-minute hang into an observable, attributable signal if it recurs in scheduled jobs.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
The underlying hang is a rare, timing-dependent flake (observed ~once in two months) that cannot be reproduced on demand, so this **cannot be proven to eliminate it** — it is a no-regression safety net plus a diagnostic. In normal operation no ML-cache RPC is in flight when another is issued, so behavior is unchanged. Verified on a fork by building Spark Connect and running `test_parity_clustering` **15×**, each run actually executing (~30s, not skipped) — all 15 passed.

- ❌ Before (450s hang, scheduled `Build / Non-ANSI (branch-4.x, ...)`, module `pyspark-ml-connect`): https://github.com/apache/spark/actions/runs/28004040195
- ✅ After (this fix, `test_parity_clustering` ×15 actually executing, all green): https://github.com/HyukjinKwon/spark/actions/runs/28075822467

### Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

Closes #56725 from HyukjinKwon/SPARK-57655.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit 424330a)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants