
[SPARK-57650][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state timeouts #56715

Open
HyukjinKwon wants to merge 1 commit into apache:branch-4.2 from HyukjinKwon:ci-fix/yarn-cluster-am-resource-percent

@HyukjinKwon HyukjinKwon commented Jun 24, 2026

What changes were proposed in this pull request?

BaseYarnClusterSuite configures a mini CapacityScheduler but never sets yarn.scheduler.capacity.maximum-am-resource-percent, so it defaults to 0.1. On memory-constrained CI runners the queue's total AM resource budget becomes ~1GB, which is smaller than the 1–2GB AM/driver memory these tests request. Applications then wedge in the ACCEPTED state (never activated) and the suite times out after 3 minutes with the assertion failure 'handle.getState().isFinal() was false'.

This sets maximum-am-resource-percent to 1.0 (global + root.default) so AMs can use the whole test queue and applications are always activated.
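
The arithmetic behind the two paragraphs above can be sketched as follows. This is a deliberately simplified model, not the CapacityScheduler's actual code: it only multiplies queue memory by the AM percent and compares the result with one AM request, ignoring per-user limits, vcores, allocation rounding, and any other activation rules the real scheduler applies. The 10 GB queue size is an assumption, chosen because it reproduces the 1024 MB limit reported in the diagnostics at the 0.1 default; the function names are hypothetical.

```python
def am_budget_mb(queue_memory_mb: int, max_am_percent: float) -> int:
    """Memory (MB) the queue may devote to ApplicationMasters in this simplified model."""
    return round(queue_memory_mb * max_am_percent)


def am_fits(am_request_mb: int, queue_memory_mb: int, max_am_percent: float) -> bool:
    """True if a single AM request fits inside the queue's AM budget."""
    return am_request_mb <= am_budget_mb(queue_memory_mb, max_am_percent)


# Assumed queue size (not stated in the PR): 10 GB.
QUEUE_MEMORY_MB = 10 * 1024
# AM request size taken from the YARN diagnostics quoted in this description.
AM_REQUEST_MB = 2048

for percent in (0.1, 1.0):
    budget = am_budget_mb(QUEUE_MEMORY_MB, percent)
    fits = am_fits(AM_REQUEST_MB, QUEUE_MEMORY_MB, percent)
    print(f"percent={percent}: AM budget={budget} MB, 2048 MB AM fits: {fits}")
```

Under these assumptions the default leaves a 1024 MB budget that a 2048 MB AM cannot fit into, while 1.0 makes the whole queue available to AMs.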

Why are the changes needed?

YarnClusterSuite fails 6 tests with a 3-minute eventually timeout on the scheduled Maven builds (resource-managers#yarn module):

  • run Spark in yarn-client/cluster mode with different configurations, ensuring redaction
  • yarn-cluster should respect conf overrides in SparkHadoopUtil (SPARK-16414, SPARK-23630)
  • SPARK-35672: additional jar using URI scheme 'local' (client, cluster, client + gateway-replacement)

The YARN diagnostics show the message "Queue's AM resource limit exceeded. AM Resource Request = <memory:2048>; Queue Resource Limit for AM = <memory:1024>" repeated more than 1000 times.
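
For triaging similar failures, a small helper along these lines could pull the two numbers out of a diagnostics string. It is only a sketch: the regular expression is written against the message text quoted above, and it assumes (without confirmation from the PR) that other resource fields such as vcores, if present, appear after the memory value inside the angle brackets. `AmLimitHit` and `parse_am_limit` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class AmLimitHit(NamedTuple):
    request_mb: int  # memory the AM asked for
    limit_mb: int    # queue's AM memory limit at that moment


# Matches the message quoted above; [^>]* tolerates extra fields after the memory value.
_PATTERN = re.compile(
    r"AM Resource Request = <memory:(\d+)[^>]*>;\s*"
    r"Queue Resource Limit for AM = <memory:(\d+)[^>]*>"
)


def parse_am_limit(diagnostics: str) -> Optional[AmLimitHit]:
    """Return the AM request and queue limit if the diagnostics mention them, else None."""
    match = _PATTERN.search(diagnostics)
    if match is None:
        return None
    return AmLimitHit(int(match.group(1)), int(match.group(2)))


sample = (
    "Queue's AM resource limit exceeded. AM Resource Request = <memory:2048>; "
    "Queue Resource Limit for AM = <memory:1024>"
)
hit = parse_am_limit(sample)
if hit is not None and hit.request_mb > hit.limit_mb:
    print(f"AM needs {hit.request_mb} MB but the queue allows {hit.limit_mb} MB for AMs")
```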

Failing job (before): https://github.com/apache/spark/actions/runs/28045133937/job/83029837948 Build / Maven (branch-4.2, Scala 2.13, JDK 21), resource-managers#yarn (6 failures).

Passing job (with this fix): https://github.com/HyukjinKwon/spark/actions/runs/28066027338/job/83090387029 resource-managers/yarn tests: YarnClusterSuite 30/30 pass; the 6 formerly-failing tests now complete in ~11s each (was 180s timeout).

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

Ran the resource-managers/yarn module tests on a fork (link above); YarnClusterSuite passes 30/30.

This pull request and its description were written by Isaac.

…rnClusterSuite

YarnClusterSuite tests intermittently fail on memory-constrained CI runners with
'handle.getState().isFinal() was false' after a 3-minute timeout. The mini
CapacityScheduler set up in BaseYarnClusterSuite never sets
maximum-am-resource-percent, so it defaults to 0.1: the queue's total AM resource
budget becomes ~10% of capacity (~1GB on CI), which is smaller than the 1-2GB AM/
driver memory the tests request. Applications then wedge in the ACCEPTED state
(never activated) and the suite times out.

Set maximum-am-resource-percent to 1.0 (global and root.default) so AMs can use the
whole test queue and applications are always activated.

Co-authored-by: Isaac
@HyukjinKwon HyukjinKwon force-pushed the ci-fix/yarn-cluster-am-resource-percent branch from 9816494 to f631b5b Compare June 24, 2026 05:21
@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state hangs [SPARK-57650][YARN][TESTS] Allow AMs to use the whole queue in BaseYarnClusterSuite to fix ACCEPTED-state timeouts Jun 24, 2026
@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 24, 2026 05:22

@HyukjinKwon HyukjinKwon left a comment

Self-review (code-review skill, recall-biased) — comments only, not an approval.

Reviewed the diff across correctness, removed-behavior, cross-file impact, and conventions. Conclusion: the change is correct and safe.

  • Correctness ✅ yarn.scheduler.capacity.maximum-am-resource-percent and the root.default variant are valid CapacityScheduler keys; 1.0f (100%) is read as a float in [0,1]. This only lifts an artificial AM cap; it does not change executor scheduling.
  • No regression to failure-path tests ✅ Tests that expect an app not to succeed — run Spark in yarn-cluster mode unsuccessfully, timeout to get SparkContext in cluster mode triggers failure (via AM_MAX_WAIT_TIME) — fail for reasons unrelated to AM resource starvation, so raising the AM percent doesn't mask them.
  • Inherited cleanly ✅ YarnShuffleIntegrationSuite also extends BaseYarnClusterSuite and simply benefits from the same fix.

Minor (non-blocking) observation:

  • Setting the global maximum-am-resource-percent is technically redundant with the per-queue root.default.maximum-am-resource-percent, since every app submits to root.default and the per-queue value takes precedence. Keeping both is harmless/defensive; feel free to drop the global line if you prefer minimal config.
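
The precedence described in this observation can be illustrated with a small lookup sketch. It models the configuration as a plain dictionary rather than using any Hadoop API, and the per-queue key is written in the yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent form; `effective_am_percent` and `queue_key` are hypothetical helper names, and the 0.1 fallback is the default stated in the PR description.

```python
GLOBAL_KEY = "yarn.scheduler.capacity.maximum-am-resource-percent"
DEFAULT_AM_PERCENT = 0.1  # default quoted in the PR description


def queue_key(queue_path: str) -> str:
    """Per-queue variant of the AM percent key, e.g. for 'root.default'."""
    return f"yarn.scheduler.capacity.{queue_path}.maximum-am-resource-percent"


def effective_am_percent(conf: dict, queue_path: str) -> float:
    """Per-queue value wins, then the global value, then the default."""
    for key in (queue_key(queue_path), GLOBAL_KEY):
        if key in conf:
            return float(conf[key])
    return DEFAULT_AM_PERCENT


both = {GLOBAL_KEY: "1.0", queue_key("root.default"): "1.0"}
queue_only = {queue_key("root.default"): "1.0"}

# In this model, dropping the global line does not change the value seen by root.default.
print(effective_am_percent(both, "root.default"))
print(effective_am_percent(queue_only, "root.default"))
print(effective_am_percent({}, "root.default"))
```

In this model the global setting only matters for queues without their own value, which is why keeping it is defensive rather than required.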

Validated: resource-managers/yarn tests on a fork — YarnClusterSuite 30/30 (the 6 formerly-failing tests now ~11s vs the prior 180s timeout).

@uros-b uros-b left a comment

Thank you @HyukjinKwon, this is a minimal, correct, and low-risk test-only fix. LGTM!
