[SPARK-57650][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state timeouts#56715
Open
HyukjinKwon wants to merge 1 commit into
…rnClusterSuite YarnClusterSuite tests intermittently fail on memory-constrained CI runners with 'handle.getState().isFinal() was false' after a 3-minute timeout. The mini CapacityScheduler set up in BaseYarnClusterSuite never sets maximum-am-resource-percent, so it defaults to 0.1: the queue's total AM resource budget becomes ~10% of capacity (~1GB on CI), which is smaller than the 1-2GB AM/driver memory the tests request. Applications then wedge in the ACCEPTED state (never activated) and the suite times out. Set maximum-am-resource-percent to 1.0 (global and root.default) so AMs can use the whole test queue and applications are always activated. Co-authored-by: Isaac
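The budget arithmetic in the commit message can be sketched as follows. This is a deliberately simplified model, not the CapacityScheduler's actual activation logic (which also considers per-user limits, partitions, and other special cases). The function names and the 10240 MB queue capacity are hypothetical, chosen only so that 10% yields the 1024 MB limit quoted in the YARN diagnostics.

```python
# Simplified model of the queue-wide AM budget; NOT the real scheduler code.
# ASSUMPTION: a 10240 MB queue, picked so that 0.1 * capacity = 1024 MB,
# matching "Queue Resource Limit for AM = <memory:1024>" in the diagnostics.
ASSUMED_QUEUE_CAPACITY_MB = 10240


def am_budget_mb(queue_capacity_mb: int, max_am_percent: float) -> int:
    """Memory (MB) that all ApplicationMasters in the queue may use together."""
    return round(queue_capacity_mb * max_am_percent)


def can_activate(request_mb: int, used_am_mb: int,
                 queue_capacity_mb: int, max_am_percent: float) -> bool:
    """True if the new AM request still fits inside the queue's AM budget."""
    budget = am_budget_mb(queue_capacity_mb, max_am_percent)
    return used_am_mb + request_mb <= budget


# Default percent (0.1): a 2048 MB AM request exceeds the 1024 MB budget.
blocked = not can_activate(2048, 0, ASSUMED_QUEUE_CAPACITY_MB, 0.1)

# With the percent raised to 1.0, the same request fits.
activated = can_activate(2048, 0, ASSUMED_QUEUE_CAPACITY_MB, 1.0)
```

Under this model the request is rejected at 0.1 and accepted at 1.0, which mirrors the before/after behavior the commit message describes.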
HyukjinKwon
commented
Jun 24, 2026
HyukjinKwon
left a comment
Member
Author
Self-review (code-review skill, recall-biased) — comments only, not an approval.
Reviewed the diff across correctness, removed-behavior, cross-file impact, and conventions. Conclusion: the change is correct and safe.
- Correctness ✅ `yarn.scheduler.capacity.maximum-am-resource-percent` and the `root.default` variant are valid CapacityScheduler keys; `1.0f` (100%) is read as a float in [0,1]. This only lifts an artificial AM cap; it does not change executor scheduling.
- No regression to failure-path tests ✅ Tests that expect an app not to succeed (`run Spark in yarn-cluster mode unsuccessfully` and `timeout to get SparkContext in cluster mode triggers failure`, the latter via `AM_MAX_WAIT_TIME`) fail for reasons unrelated to AM resource starvation, so raising the AM percent doesn't mask them.
- Inherited cleanly ✅ `YarnShuffleIntegrationSuite` also extends `BaseYarnClusterSuite` and simply benefits from the same fix.
Minor (non-blocking) observation:
- Setting the global `maximum-am-resource-percent` is technically redundant with the per-queue `root.default.maximum-am-resource-percent`, since every app submits to `root.default` and the per-queue value takes precedence. Keeping both is harmless/defensive; feel free to drop the global line if you prefer minimal config.
Validated: resource-managers/yarn tests on a fork — YarnClusterSuite 30/30 (the 6 formerly-failing tests now ~11s vs the prior 180s timeout).
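The precedence noted in the minor observation (per-queue value over global value over the 0.1 default) can be illustrated with a small lookup sketch. This is an illustration only, not Hadoop's configuration code: `effective_am_percent` and `queue_key` are hypothetical helpers, and a plain dict stands in for the scheduler configuration.

```python
# Illustrative lookup order only; a dict stands in for the scheduler config.
GLOBAL_KEY = "yarn.scheduler.capacity.maximum-am-resource-percent"
DEFAULT_AM_PERCENT = 0.1  # default cited in the PR description


def queue_key(queue_path: str) -> str:
    """Per-queue key, e.g. queue_path='root.default' (hypothetical helper)."""
    return f"yarn.scheduler.capacity.{queue_path}.maximum-am-resource-percent"


def effective_am_percent(conf: dict, queue_path: str) -> float:
    """Resolve per-queue value first, then global, then the default."""
    global_value = float(conf.get(GLOBAL_KEY, DEFAULT_AM_PERCENT))
    return float(conf.get(queue_key(queue_path), global_value))


# Only the per-queue key is set: the queue already gets 1.0,
# which is why the global line is redundant for apps in root.default.
per_queue_only = {queue_key("root.default"): "1.0"}
resolved = effective_am_percent(per_queue_only, "root.default")
```

In this sketch, dropping the global key changes nothing for `root.default`, but a queue without its own setting would fall back to the global value or the default.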
uros-b
approved these changes
Jun 24, 2026
uros-b
left a comment
Member
Thank you @HyukjinKwon, this is a minimal, correct, and low-risk test-only fix. LGTM!
What changes were proposed in this pull request?
`BaseYarnClusterSuite` configures a mini `CapacityScheduler` but never sets `yarn.scheduler.capacity.maximum-am-resource-percent`, so it defaults to `0.1`. On memory-constrained CI runners the queue's total AM resource budget becomes ~1GB, which is smaller than the 1–2GB AM/driver memory these tests request. Applications then wedge in the `ACCEPTED` state (never activated) and the suite times out after 3 minutes with `handle.getState().isFinal() was false`.
This sets `maximum-am-resource-percent` to `1.0` (global + `root.default`) so AMs can use the whole test queue and applications are always activated.
Why are the changes needed?
`YarnClusterSuite` fails 6 tests with a 3-minute `eventually` timeout on the scheduled Maven builds (`resource-managers#yarn` module):
The YARN diagnostics show `Queue's AM resource limit exceeded. AM Resource Request = <memory:2048>; Queue Resource Limit for AM = <memory:1024>` repeated >1000 times.
Failing job (before): https://github.com/apache/spark/actions/runs/28045133937/job/83029837948 (`Build / Maven (branch-4.2, Scala 2.13, JDK 21)`, `resource-managers#yarn`, 6 failures).
Passing job (with this fix): https://github.com/HyukjinKwon/spark/actions/runs/28066027338/job/83090387029 (`resource-managers/yarn` tests: `YarnClusterSuite` 30/30 pass; the 6 formerly-failing tests now complete in ~11s each, down from the 180s timeout).
Does this PR introduce any user-facing change?
No. Test-only.
How was this patch tested?
Ran the `resource-managers/yarn` module tests on a fork (link above); `YarnClusterSuite` passes 30/30.
This pull request and its description were written by Isaac.