[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728
Closed
HyukjinKwon wants to merge 1 commit into
Closed
[SPARK-57656][CONNECT][TESTS] Stabilize flaky SparkSessionE2ESuite interrupt-all tests#56728HyukjinKwon wants to merge 1 commit into
HyukjinKwon wants to merge 1 commit into
Conversation
…terrupt-all tests
### What changes were proposed in this pull request?
Make the two `SparkSessionE2ESuite` "interrupt all" tests robust:
1. Run each long-running typed `map` query through a single call site and warm it up once
(sleep=0) before any interrupt, so the closure/`TypeTag` artifact classes are already loaded
on the executor. Otherwise an interrupt landing during that first-time remote class fetch
surfaces as `RemoteClassLoaderError` (`...SparkSessionE2ESuite$$typecreatorNN$1.class`) instead
of `OPERATION_CANCELED`.
2. Wrap the foreground-interrupt test body in `try/finally { finished = true }` so a failed
assertion cannot leave the background `interruptor` Future running and canceling the operations
of subsequent tests (which previously cascaded into ~7 failures across the suite).
### Why are the changes needed?
`SparkSessionE2ESuite` intermittently fails in master push Build (SBT) and Maven (Scala 2.13,
JDK 21/25). Confirmed flaky (the same module group passes on other runs of the same commit).
### Does this PR introduce any user-facing change?
No, test-only.
### How was this patch tested?
Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI; all green.
### Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with Claude Code.
Co-authored-by: Isaac
dongjoon-hyun
approved these changes
Jun 24, 2026
HyukjinKwon
commented
Jun 24, 2026
HyukjinKwon
left a comment
Member
Author
There was a problem hiding this comment.
Automated review (Claude Code, /spark-dev:review) — posted as a comment, not an approval.
Verdict: looks correct; test-only, low risk. No blocking findings.
Design
- The flakiness has two distinct causes and the fix addresses both at the right level: (1) routing the long-running query through a single call site + warming it up means the closure/
TypeTagartifact classes are loaded on the executor before any interrupt, closing the first-time-class-fetch race that producedRemoteClassLoaderError; (2)try/finally { finished = true }stops the background interruptor from leaking into subsequent tests. The cascade (1 → 7 failures) was the leaked interruptor, so thefinallyis the more important of the two.
Correctness
- Warm-up reuses the same
def runMapQuerycall site as the timed runs, so it loads the exact closure/TypeTag classes the interrupted runs need — the warm-up is only effective because it's the same call site, which the single-defrefactor guarantees. Good. interrupted.length == 2still holds: the warm-up runs before the interruptor starts, so it isn't counted.
Non-blocking
- The same warm-up was added to
interrupt all - background queries, foreground interrupt, which wasn't the reported-flaky one — fine as defensive hardening (same race exists), just noting the scope is slightly broader than the failing test.
Validated by re-running the suite 8× + the full connect module on a fork (links in the PR description).
Member
Author
|
Merged to master and branch-4.x. |
HyukjinKwon
added a commit
that referenced
this pull request
Jun 24, 2026
…errupt-all tests
### What changes were proposed in this pull request?
Make the two `SparkSessionE2ESuite` "interrupt all" tests robust against two flakiness sources:
1. **Class-fetch race.** Run each long-running typed `map` query through a single call site and warm
it up once (`sleep=0`) before any interrupt, so the closure/`TypeTag` artifact classes are already
loaded on the executor. Otherwise an `interruptAll()` landing during the executor's first-time
remote fetch of `$$typecreatorNN` surfaces as `RemoteClassLoaderError` instead of
`OPERATION_CANCELED`.
2. **Leaked interruptor / cascade.** Wrap the foreground-interrupt test body in
`try/finally { finished = true }`. Previously, if an assertion failed, the background `interruptor`
Future kept calling `interruptAll()` and canceled the operations of *subsequent* tests — turning
one failure into a cascade of `OPERATION_CANCELED` failures across the whole suite.
### Why are the changes needed?
`SparkSessionE2ESuite` intermittently fails in CI. Confirmed flaky (the same module group passes on
other runs of the same commit).
**Before (failing in apache/spark CI):** one `RemoteClassLoaderError` cascading into 7 failures —
https://github.com/apache/spark/actions/runs/28037082466/job/82997066297
**After (this change, validated on a fork):** full connect module `Total 1614, Failed 0`, and
`SparkSessionE2ESuite` re-run 8x with 0 failures —
https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and
https://github.com/HyukjinKwon/spark/actions/runs/28074772169
### Does this PR introduce any user-facing change?
No, test-only.
### How was this patch tested?
Re-ran `SparkSessionE2ESuite` 8x plus the full connect module on CI (links above); all green.
### Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with Claude Code.
Closes #56728 from HyukjinKwon/SPARK-57656.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit dae1847)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Make the two
SparkSessionE2ESuite"interrupt all" tests robust against two flakiness sources:mapquery through a single call site and warmit up once (
sleep=0) before any interrupt, so the closure/TypeTagartifact classes are alreadyloaded on the executor. Otherwise an
interruptAll()landing during the executor's first-timeremote fetch of
$$typecreatorNNsurfaces asRemoteClassLoaderErrorinstead ofOPERATION_CANCELED.try/finally { finished = true }. Previously, if an assertion failed, the backgroundinterruptorFuture kept calling
interruptAll()and canceled the operations of subsequent tests — turningone failure into a cascade of
OPERATION_CANCELEDfailures across the whole suite.Why are the changes needed?
SparkSessionE2ESuiteintermittently fails in CI. Confirmed flaky (the same module group passes onother runs of the same commit).
Before (failing in apache/spark CI): one
RemoteClassLoaderErrorcascading into 7 failures —https://github.com/apache/spark/actions/runs/28037082466/job/82997066297
After (this change, validated on a fork): full connect module
Total 1614, Failed 0, andSparkSessionE2ESuitere-run 8x with 0 failures —https://github.com/HyukjinKwon/spark/actions/runs/28066097371 and
https://github.com/HyukjinKwon/spark/actions/runs/28074772169
Does this PR introduce any user-facing change?
No, test-only.
How was this patch tested?
Re-ran
SparkSessionE2ESuite8x plus the full connect module on CI (links above); all green.Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with Claude Code.