[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage' by HyukjinKwon · Pull Request #56724 · apache/spark

HyukjinKwon · 2026-06-24T05:24:31Z

What changes were proposed in this pull request?

Group the abort-path test MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage" by the high-cardinality id column instead of the 5-value low_cardinality_col, so every one of the 20 reducer partitions reads the corrupted mapper-0.

Why are the changes needed?

The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied asynchronously after the first result task succeeds (RESULT_STAGE_DELAY=1). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled after the corruption landed — a scheduling race. Grouping by the high-cardinality id makes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, on local[2]) deterministically hit it and the abort always fires.

Does this PR introduce any user-facing change?

No, test only.

How was this patch tested?

Ran the suite 20× under Maven (the environment where it flaked) on a fork — all 20 passed.

❌ Before (flaky, scheduled Build / Maven (Scala 2.13, JDK 21)): https://github.com/apache/spark/actions/runs/28035705490
❌ Before (flaky, scheduled Build / Maven (Scala 2.13, JDK 25)): https://github.com/apache/spark/actions/runs/28035606804
✅ After (this fix, MetricsFailureInjectionSuite ×20 under Maven, all green): https://github.com/HyukjinKwon/spark/actions/runs/28066715792

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

… checksum mismatch aborts a downstream ResultStage' ### What changes were proposed in this pull request? Group the abort-path test by the high-cardinality `id` column instead of the 5-value `low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0. ### Why are the changes needed? The test 'Force checksum mismatch aborts a downstream ResultStage' was flaky under Maven (~3/10 scheduled runs; always passed on SBT). With only ~5 of 20 reducer partitions reading mapper-0, and the mapper-0 corruption applied asynchronously after the first result task succeeds (RESULT_STAGE_DELAY=1), the indeterminate-stage abort only fired if one of those few partitions happened to be scheduled after the corruption -- a scheduling race. Grouping by the high-cardinality `id` makes every reducer depend on mapper-0, so once corruption lands the remaining result tasks (dispatched after the first completes on local[2]) deterministically hit it and the abort always fires. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Ran the suite 20x under Maven (the environment where it flaked) on a fork: all 20 passed. ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code Co-authored-by: Isaac

HyukjinKwon · 2026-06-24T06:01:48Z

Automated code review (posted as a comment, not an approval) — via spark-dev:code-review.

Verdict: no blocking findings. Test-only change; the fix is sound and well-justified.

What it does: Groups the abort-path test by the high-cardinality id column instead of the 5-value low_cardinality_col, so all 20 reducer partitions depend on the corrupted mapper-0 and the indeterminate-stage abort fires deterministically (previously only ~5 partitions read mapper-0, so the abort hinged on scheduling order — the flake).

Design (Pass A): Clean. No architecture/layer impact. The local[2] + SHUFFLE_PARTITIONS=20 reasoning matches the sibling test in the same suite (MetricsFailureInjectionSuite.scala:597), so it's consistent with the established pattern. The scenario the test is named for (2-stage query → ResultStage abort under RESULT_STAGE_DELAY=1 + forced checksum mismatch) is unchanged — only the grouping key's cardinality changes.

Correctness (Pass B): id exists in the fixture (setUpTestTable, line 48). The assertion (intercept[SparkException] containing "indeterminate") is unaffected. No metric-value assertions depend on the grouping key in this test (unlike the sibling tests), so switching to id is safe.

Nit (non-blocking): the new comment says every one of the 20 reducers reads mapper-0 and at least one is guaranteed to hit the corrupted mapper. That's a practical near-certainty rather than a strict guarantee — it depends on 300 id values hashing across all 20 partitions and on local[2] dispatching the remaining result tasks after the corruption event is processed. Both hold overwhelmingly, but "guaranteed"/"every" slightly overstate it; consider "effectively always" to be precise. Not worth blocking on.

Verification: suite run 20× under Maven (the env where it flaked) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28066715792

HyukjinKwon · 2026-06-24T10:42:03Z

Merged to master and branch-4.x.

…checksum mismatch aborts a downstream ResultStage' ### What changes were proposed in this pull request? Group the abort-path test `MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage"` by the high-cardinality `id` column instead of the 5-value `low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0. ### Why are the changes needed? The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied **asynchronously** after the first result task succeeds (`RESULT_STAGE_DELAY=1`). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled *after* the corruption landed — a scheduling race. Grouping by the high-cardinality `id` makes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, on `local[2]`) deterministically hit it and the abort always fires. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Ran the suite **20×** under Maven (the environment where it flaked) on a fork — all 20 passed. - ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 21)`): https://github.com/apache/spark/actions/runs/28035705490 - ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 25)`): https://github.com/apache/spark/actions/runs/28035606804 - ✅ After (this fix, MetricsFailureInjectionSuite ×20 under Maven, all green): https://github.com/HyukjinKwon/spark/actions/runs/28066715792 ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code This pull request and its description were written by Isaac. Closes #56724 from HyukjinKwon/SPARK-57654. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit 0ba8d2a) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

dongjoon-hyun approved these changes Jun 24, 2026

View reviewed changes

HyukjinKwon closed this in 0ba8d2a Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage'#56724

[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage'#56724
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57654

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

HyukjinKwon commented Jun 24, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants