[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests by HyukjinKwon · Pull Request #56728 · apache/spark

HyukjinKwon · 2026-06-24T05:32:21Z

What changes were proposed in this pull request?

Make the two SparkSessionE2ESuite "interrupt all" tests robust against two flakiness sources:

Class-fetch race. Run each long-running typed map query through a single call site and warm
it up once (sleep=0) before any interrupt, so the closure/TypeTag artifact classes are already
loaded on the executor. Otherwise an interruptAll() landing during the executor's first-time
remote fetch of $$typecreatorNN surfaces as RemoteClassLoaderError instead of
OPERATION_CANCELED.
Leaked interruptor / cascade. Wrap the foreground-interrupt test body in
try/finally { finished = true }. Previously, if an assertion failed, the background interruptor
Future kept calling interruptAll() and canceled the operations of subsequent tests — turning
one failure into a cascade of OPERATION_CANCELED failures across the whole suite.

Why are the changes needed?

SparkSessionE2ESuite intermittently fails in CI. Confirmed flaky (the same module group passes on
other runs of the same commit).

Before (failing in apache/spark CI): one RemoteClassLoaderError cascading into 7 failures —
https://github.com/apache/spark/actions/runs/28037082466/job/82997066297

After (this change, validated on a fork): full connect module Total 1614, Failed 0, and
SparkSessionE2ESuite re-run 8x with 0 failures —
https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and
https://github.com/HyukjinKwon/spark/actions/runs/28074772169

Does this PR introduce any user-facing change?

No, test-only.

How was this patch tested?

Re-ran SparkSessionE2ESuite 8x plus the full connect module on CI (links above); all green.

Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code.

…terrupt-all tests ### What changes were proposed in this pull request? Make the two `SparkSessionE2ESuite` "interrupt all" tests robust: 1. Run each long-running typed `map` query through a single call site and warm it up once (sleep=0) before any interrupt, so the closure/`TypeTag` artifact classes are already loaded on the executor. Otherwise an interrupt landing during that first-time remote class fetch surfaces as `RemoteClassLoaderError` (`...SparkSessionE2ESuite$$typecreatorNN$1.class`) instead of `OPERATION_CANCELED`. 2. Wrap the foreground-interrupt test body in `try/finally { finished = true }` so a failed assertion cannot leave the background `interruptor` Future running and canceling the operations of subsequent tests (which previously cascaded into ~7 failures across the suite). ### Why are the changes needed? `SparkSessionE2ESuite` intermittently fails in master push Build (SBT) and Maven (Scala 2.13, JDK 21/25). Confirmed flaky (the same module group passes on other runs of the same commit). ### Does this PR introduce any user-facing change? No, test-only. ### How was this patch tested? Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI; all green. ### Was this patch authored or co-authored using generative AI tooling? Yes, drafted with Claude Code. Co-authored-by: Isaac

HyukjinKwon

Automated review (Claude Code, /spark-dev:review) — posted as a comment, not an approval.

Verdict: looks correct; test-only, low risk. No blocking findings.

Design

The flakiness has two distinct causes and the fix addresses both at the right level: (1) routing the long-running query through a single call site + warming it up means the closure/TypeTag artifact classes are loaded on the executor before any interrupt, closing the first-time-class-fetch race that produced RemoteClassLoaderError; (2) try/finally { finished = true } stops the background interruptor from leaking into subsequent tests. The cascade (1 → 7 failures) was the leaked interruptor, so the finally is the more important of the two.

Correctness

Warm-up reuses the same def runMapQuery call site as the timed runs, so it loads the exact closure/TypeTag classes the interrupted runs need — the warm-up is only effective because it's the same call site, which the single-def refactor guarantees. Good.
interrupted.length == 2 still holds: the warm-up runs before the interruptor starts, so it isn't counted.

Non-blocking

The same warm-up was added to interrupt all - background queries, foreground interrupt, which wasn't the reported-flaky one — fine as defensive hardening (same race exists), just noting the scope is slightly broader than the failing test.

Validated by re-running the suite 8× + the full connect module on a fork (links in the PR description).

HyukjinKwon · 2026-06-24T10:44:36Z

Merged to master and branch-4.x.

…errupt-all tests ### What changes were proposed in this pull request? Make the two `SparkSessionE2ESuite` "interrupt all" tests robust against two flakiness sources: 1. **Class-fetch race.** Run each long-running typed `map` query through a single call site and warm it up once (`sleep=0`) before any interrupt, so the closure/`TypeTag` artifact classes are already loaded on the executor. Otherwise an `interruptAll()` landing during the executor's first-time remote fetch of `$$typecreatorNN` surfaces as `RemoteClassLoaderError` instead of `OPERATION_CANCELED`. 2. **Leaked interruptor / cascade.** Wrap the foreground-interrupt test body in `try/finally { finished = true }`. Previously, if an assertion failed, the background `interruptor` Future kept calling `interruptAll()` and canceled the operations of *subsequent* tests — turning one failure into a cascade of `OPERATION_CANCELED` failures across the whole suite. ### Why are the changes needed? `SparkSessionE2ESuite` intermittently fails in CI. Confirmed flaky (the same module group passes on other runs of the same commit). **Before (failing in apache/spark CI):** one `RemoteClassLoaderError` cascading into 7 failures — https://github.com/apache/spark/actions/runs/28037082466/job/82997066297 **After (this change, validated on a fork):** full connect module `Total 1614, Failed 0`, and `SparkSessionE2ESuite` re-run 8x with 0 failures — https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and https://github.com/HyukjinKwon/spark/actions/runs/28074772169 ### Does this PR introduce any user-facing change? No, test-only. ### How was this patch tested? Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI (links above); all green. ### Was this patch authored or co-authored using generative AI tooling? Yes, drafted with Claude Code. Closes #56728 from HyukjinKwon/SPARK-57656. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit dae1847) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

HyukjinKwon mentioned this pull request Jun 24, 2026

[DO-NOT-MERGE][CONNECT] Stabilize flaky SparkSessionE2ESuite + ClientStreamingQuerySuite listener events #56716

Closed

dongjoon-hyun approved these changes Jun 24, 2026

View reviewed changes

HyukjinKwon commented Jun 24, 2026

View reviewed changes

HyukjinKwon closed this in dae1847 Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728

[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57656

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon left a comment

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

HyukjinKwon commented Jun 24, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

HyukjinKwon left a comment

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants