Skip to content

[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57656
Closed

[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57656

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Make the two SparkSessionE2ESuite "interrupt all" tests robust against two flakiness sources:

  1. Class-fetch race. Run each long-running typed map query through a single call site and warm
    it up once (sleep=0) before any interrupt, so the closure/TypeTag artifact classes are already
    loaded on the executor. Otherwise an interruptAll() landing during the executor's first-time
    remote fetch of $$typecreatorNN surfaces as RemoteClassLoaderError instead of
    OPERATION_CANCELED.
  2. Leaked interruptor / cascade. Wrap the foreground-interrupt test body in
    try/finally { finished = true }. Previously, if an assertion failed, the background interruptor
    Future kept calling interruptAll() and canceled the operations of subsequent tests — turning
    one failure into a cascade of OPERATION_CANCELED failures across the whole suite.

Why are the changes needed?

SparkSessionE2ESuite intermittently fails in CI. Confirmed flaky (the same module group passes on
other runs of the same commit).

Before (failing in apache/spark CI): one RemoteClassLoaderError cascading into 7 failures —
https://github.com/apache/spark/actions/runs/28037082466/job/82997066297

After (this change, validated on a fork): full connect module Total 1614, Failed 0, and
SparkSessionE2ESuite re-run 8x with 0 failures —
https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and
https://github.com/HyukjinKwon/spark/actions/runs/28074772169

Does this PR introduce any user-facing change?

No, test-only.

How was this patch tested?

Re-ran SparkSessionE2ESuite 8x plus the full connect module on CI (links above); all green.

Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code.

…terrupt-all tests

### What changes were proposed in this pull request?

Make the two `SparkSessionE2ESuite` "interrupt all" tests robust:
1. Run each long-running typed `map` query through a single call site and warm it up once
   (sleep=0) before any interrupt, so the closure/`TypeTag` artifact classes are already loaded
   on the executor. Otherwise an interrupt landing during that first-time remote class fetch
   surfaces as `RemoteClassLoaderError` (`...SparkSessionE2ESuite$$typecreatorNN$1.class`) instead
   of `OPERATION_CANCELED`.
2. Wrap the foreground-interrupt test body in `try/finally { finished = true }` so a failed
   assertion cannot leave the background `interruptor` Future running and canceling the operations
   of subsequent tests (which previously cascaded into ~7 failures across the suite).

### Why are the changes needed?

`SparkSessionE2ESuite` intermittently fails in master push Build (SBT) and Maven (Scala 2.13,
JDK 21/25). Confirmed flaky (the same module group passes on other runs of the same commit).

### Does this PR introduce any user-facing change?

No, test-only.

### How was this patch tested?

Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI; all green.

### Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code.

Co-authored-by: Isaac

@HyukjinKwon HyukjinKwon left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review (Claude Code, /spark-dev:review) — posted as a comment, not an approval.

Verdict: looks correct; test-only, low risk. No blocking findings.

Design

  • The flakiness has two distinct causes and the fix addresses both at the right level: (1) routing the long-running query through a single call site + warming it up means the closure/TypeTag artifact classes are loaded on the executor before any interrupt, closing the first-time-class-fetch race that produced RemoteClassLoaderError; (2) try/finally { finished = true } stops the background interruptor from leaking into subsequent tests. The cascade (1 → 7 failures) was the leaked interruptor, so the finally is the more important of the two.

Correctness

  • Warm-up reuses the same def runMapQuery call site as the timed runs, so it loads the exact closure/TypeTag classes the interrupted runs need — the warm-up is only effective because it's the same call site, which the single-def refactor guarantees. Good.
  • interrupted.length == 2 still holds: the warm-up runs before the interruptor starts, so it isn't counted.

Non-blocking

  • The same warm-up was added to interrupt all - background queries, foreground interrupt, which wasn't the reported-flaky one — fine as defensive hardening (same race exists), just noting the scope is slightly broader than the failing test.

Validated by re-running the suite 8× + the full connect module on a fork (links in the PR description).

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Merged to master and branch-4.x.

HyukjinKwon added a commit that referenced this pull request Jun 24, 2026
…errupt-all tests

### What changes were proposed in this pull request?

Make the two `SparkSessionE2ESuite` "interrupt all" tests robust against two flakiness sources:

1. **Class-fetch race.** Run each long-running typed `map` query through a single call site and warm
   it up once (`sleep=0`) before any interrupt, so the closure/`TypeTag` artifact classes are already
   loaded on the executor. Otherwise an `interruptAll()` landing during the executor's first-time
   remote fetch of `$$typecreatorNN` surfaces as `RemoteClassLoaderError` instead of
   `OPERATION_CANCELED`.
2. **Leaked interruptor / cascade.** Wrap the foreground-interrupt test body in
   `try/finally { finished = true }`. Previously, if an assertion failed, the background `interruptor`
   Future kept calling `interruptAll()` and canceled the operations of *subsequent* tests — turning
   one failure into a cascade of `OPERATION_CANCELED` failures across the whole suite.

### Why are the changes needed?

`SparkSessionE2ESuite` intermittently fails in CI. Confirmed flaky (the same module group passes on
other runs of the same commit).

**Before (failing in apache/spark CI):** one `RemoteClassLoaderError` cascading into 7 failures —
https://github.com/apache/spark/actions/runs/28037082466/job/82997066297

**After (this change, validated on a fork):** full connect module `Total 1614, Failed 0`, and
`SparkSessionE2ESuite` re-run 8x with 0 failures —
https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and
https://github.com/HyukjinKwon/spark/actions/runs/28074772169

### Does this PR introduce any user-facing change?

No, test-only.

### How was this patch tested?

Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI (links above); all green.

### Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with Claude Code.

Closes #56728 from HyukjinKwon/SPARK-57656.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit dae1847)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants