[SPARK-57650][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state timeouts by HyukjinKwon · Pull Request #56715 · apache/spark

HyukjinKwon · 2026-06-24T00:15:21Z

What changes were proposed in this pull request?

BaseYarnClusterSuite configures a mini CapacityScheduler but never sets yarn.scheduler.capacity.maximum-am-resource-percent, so the value defaults to 0.1. The queue's total AM resource budget is then about 10% of its capacity, which is ~1GB on memory-constrained CI runners and smaller than the 1–2GB AM/driver memory these tests request. Applications then wedge in the ACCEPTED state (never activated) and the suite times out after 3 minutes with "handle.getState().isFinal() was false".

This sets maximum-am-resource-percent to 1.0 (global + root.default) so AMs can use the whole test queue and applications are always activated.
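The effect of the setting can be illustrated with a small arithmetic sketch. This is a simplified model, not the CapacityScheduler's actual code: the function names are hypothetical, the 10GB queue size is an assumption chosen so that 10% matches the ~1GB budget described above, and user limits, vcores, and allocation rounding are ignored.

```python
# Simplified model of the queue-wide AM resource limit.
# Assumption: the limit is queue memory multiplied by the configured percent.


def am_limit_mb(queue_memory_mb: int, max_am_resource_percent: float) -> int:
    """Memory budget (MB) shared by all ApplicationMasters in the queue."""
    return int(queue_memory_mb * max_am_resource_percent)


def can_activate(am_request_mb: int, queue_memory_mb: int,
                 max_am_resource_percent: float, am_used_mb: int = 0) -> bool:
    """True if one more AM of the requested size fits within the AM budget."""
    limit = am_limit_mb(queue_memory_mb, max_am_resource_percent)
    return am_used_mb + am_request_mb <= limit


# Assumed queue size: ~10GB, so the default 0.1 yields a ~1GB AM budget.
QUEUE_MEMORY_MB = 10 * 1024

before = can_activate(2048, QUEUE_MEMORY_MB, 0.1)  # default percent
after = can_activate(2048, QUEUE_MEMORY_MB, 1.0)   # percent set by this PR
print(f"default 0.1 -> activates: {before}; with 1.0 -> activates: {after}")
```

Under these assumptions a 2GB AM request cannot fit in the 1GB budget at the default percent, while at 1.0 the budget equals the whole queue.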

Why are the changes needed?

YarnClusterSuite fails 6 tests with a 3-minute eventually timeout on the scheduled Maven builds (resource-managers#yarn module):

run Spark in yarn-client/cluster mode with different configurations, ensuring redaction
yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)
SPARK-35672: additional jar using URI scheme 'local' (client, cluster, client + gateway-replacement)

The YARN diagnostics show the message "Queue's AM resource limit exceeded. AM Resource Request = <memory:2048>; Queue Resource Limit for AM = <memory:1024>" repeated more than 1000 times.
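When triaging similar failures, the two numbers in that diagnostic can be extracted mechanically. The sketch below is a hypothetical helper, not part of Spark or YARN, and it assumes the message has exactly the memory-only shape quoted above; real diagnostics may carry additional fields such as vCores.

```python
import re

# Sample text copied from the diagnostic quoted in this PR description.
DIAG = ("Queue's AM resource limit exceeded. "
        "AM Resource Request = <memory:2048>; "
        "Queue Resource Limit for AM = <memory:1024>")

# Assumption: memory-only resources, formatted exactly as in the sample.
_PATTERN = re.compile(
    r"AM Resource Request = <memory:(\d+)>; "
    r"Queue Resource Limit for AM = <memory:(\d+)>"
)


def parse_am_limit_diag(text):
    """Return (am_request_mb, queue_am_limit_mb), or None if no match."""
    match = _PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


parsed = parse_am_limit_diag(DIAG)
if parsed is not None:
    request_mb, limit_mb = parsed
    print(f"AM request {request_mb} MB exceeds AM limit {limit_mb} MB: "
          f"{request_mb > limit_mb}")
```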

Failing job (before): https://github.com/apache/spark/actions/runs/28045133937/job/83029837948 — Build / Maven (branch-4.2, Scala 2.13, JDK 21), resource-managers#yarn (6 failures).

Passing job (with this fix): https://github.com/HyukjinKwon/spark/actions/runs/28066027338/job/83090387029 (resource-managers/yarn tests). YarnClusterSuite passes 30/30; the 6 formerly failing tests now complete in ~11s each (previously a 180s timeout).

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

Ran the resource-managers/yarn module tests on a fork (link above); YarnClusterSuite passes 30/30.

This pull request and its description were written by Isaac.

…rnClusterSuite

YarnClusterSuite tests intermittently fail on memory-constrained CI runners with 'handle.getState().isFinal() was false' after a 3-minute timeout. The mini CapacityScheduler set up in BaseYarnClusterSuite never sets maximum-am-resource-percent, so it defaults to 0.1: the queue's total AM resource budget becomes ~10% of capacity (~1GB on CI), which is smaller than the 1-2GB AM/driver memory the tests request. Applications then wedge in the ACCEPTED state (never activated) and the suite times out.

Set maximum-am-resource-percent to 1.0 (global and root.default) so AMs can use the whole test queue and applications are always activated.

Co-authored-by: Isaac

HyukjinKwon

Self-review (code-review skill, recall-biased) — comments only, not an approval.

Reviewed the diff across correctness, removed-behavior, cross-file impact, and conventions. Conclusion: the change is correct and safe.

Correctness ✅ yarn.scheduler.capacity.maximum-am-resource-percent and the root.default variant are valid CapacityScheduler keys; 1.0f (100%) is read as a float in [0,1]. This only lifts an artificial AM cap; it does not change executor scheduling.
No regression to failure-path tests ✅ Tests that expect an app not to succeed — run Spark in yarn-cluster mode unsuccessfully, timeout to get SparkContext in cluster mode triggers failure (via AM_MAX_WAIT_TIME) — fail for reasons unrelated to AM resource starvation, so raising the AM percent doesn't mask them.
Inherited cleanly ✅ YarnShuffleIntegrationSuite also extends BaseYarnClusterSuite and simply benefits from the same fix.

Minor (non-blocking) observation:

Setting the global maximum-am-resource-percent is technically redundant with the per-queue root.default.maximum-am-resource-percent, since every app submits to root.default and the per-queue value takes precedence. Keeping both is harmless/defensive; feel free to drop the global line if you prefer minimal config.
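The precedence described in this observation can be sketched as a lookup: per-queue key first, then the global key, then the 0.1 default. The helper names and the plain-dict configuration are hypothetical stand-ins for illustration; this is not how the scheduler reads its configuration internally.

```python
GLOBAL_KEY = "yarn.scheduler.capacity.maximum-am-resource-percent"
DEFAULT_PERCENT = 0.1  # default cited in the PR description


def queue_key(queue_path: str) -> str:
    """Per-queue key, e.g. for queue path 'root.default'."""
    return f"yarn.scheduler.capacity.{queue_path}.maximum-am-resource-percent"


def effective_am_percent(conf: dict, queue_path: str) -> float:
    """Resolve the AM percent: per-queue value, then global, then default."""
    for key in (queue_key(queue_path), GLOBAL_KEY):
        if key in conf:
            return float(conf[key])
    return DEFAULT_PERCENT


# With only the per-queue key set, the result for root.default is already
# 1.0, which is why the global line is redundant for this test queue.
per_queue_only = {queue_key("root.default"): "1.0"}
print(effective_am_percent(per_queue_only, "root.default"))
```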

Validated: resource-managers/yarn tests on a fork — YarnClusterSuite 30/30 (the 6 formerly-failing tests now ~11s vs the prior 180s timeout).

uros-b

Thank you @HyukjinKwon, this is a minimal, correct, and low-risk test-only fix. LGTM!

HyukjinKwon force-pushed the ci-fix/yarn-cluster-am-resource-percent branch from 9816494 to f631b5b Compare June 24, 2026 05:21

HyukjinKwon changed the title ~~[DO-NOT-MERGE][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state hangs~~ [SPARK-57650][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state timeouts Jun 24, 2026

HyukjinKwon marked this pull request as ready for review June 24, 2026 05:22

HyukjinKwon commented Jun 24, 2026

uros-b approved these changes Jun 24, 2026

HyukjinKwon wants to merge 1 commit into apache:branch-4.2 from HyukjinKwon:ci-fix/yarn-cluster-am-resource-percent
