[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage'#56724
Conversation
… checksum mismatch aborts a downstream ResultStage' ### What changes were proposed in this pull request? Group the abort-path test by the high-cardinality `id` column instead of the 5-value `low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0. ### Why are the changes needed? The test 'Force checksum mismatch aborts a downstream ResultStage' was flaky under Maven (~3/10 scheduled runs; always passed on SBT). With only ~5 of 20 reducer partitions reading mapper-0, and the mapper-0 corruption applied asynchronously after the first result task succeeds (RESULT_STAGE_DELAY=1), the indeterminate-stage abort only fired if one of those few partitions happened to be scheduled after the corruption -- a scheduling race. Grouping by the high-cardinality `id` makes every reducer depend on mapper-0, so once corruption lands the remaining result tasks (dispatched after the first completes on local[2]) deterministically hit it and the abort always fires. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Ran the suite 20x under Maven (the environment where it flaked) on a fork: all 20 passed. ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code Co-authored-by: Isaac
|
Automated code review (posted as a comment, not an approval) — via Verdict: no blocking findings. Test-only change; the fix is sound and well-justified. What it does: Groups the abort-path test by the high-cardinality Design (Pass A): Clean. No architecture/layer impact. The Correctness (Pass B): Nit (non-blocking): the new comment says every one of the 20 reducers reads mapper-0 and at least one is guaranteed to hit the corrupted mapper. That's a practical near-certainty rather than a strict guarantee — it depends on 300 Verification: suite run 20× under Maven (the env where it flaked) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28066715792 |
|
Merged to master and branch-4.x. |
…checksum mismatch aborts a downstream ResultStage' ### What changes were proposed in this pull request? Group the abort-path test `MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage"` by the high-cardinality `id` column instead of the 5-value `low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0. ### Why are the changes needed? The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied **asynchronously** after the first result task succeeds (`RESULT_STAGE_DELAY=1`). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled *after* the corruption landed — a scheduling race. Grouping by the high-cardinality `id` makes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, on `local[2]`) deterministically hit it and the abort always fires. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Ran the suite **20×** under Maven (the environment where it flaked) on a fork — all 20 passed. - ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 21)`): https://github.com/apache/spark/actions/runs/28035705490 - ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 25)`): https://github.com/apache/spark/actions/runs/28035606804 - ✅ After (this fix, MetricsFailureInjectionSuite ×20 under Maven, all green): https://github.com/HyukjinKwon/spark/actions/runs/28066715792 ### Was this patch authored or co-authored using generative AI tooling? Yes, Generated-by: Claude Code This pull request and its description were written by Isaac. Closes #56724 from HyukjinKwon/SPARK-57654. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit 0ba8d2a) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
What changes were proposed in this pull request?
Group the abort-path test
MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage"by the high-cardinalityidcolumn instead of the 5-valuelow_cardinality_col, so every one of the 20 reducer partitions reads the corrupted mapper-0.Why are the changes needed?
The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied asynchronously after the first result task succeeds (
RESULT_STAGE_DELAY=1). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled after the corruption landed — a scheduling race. Grouping by the high-cardinalityidmakes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, onlocal[2]) deterministically hit it and the abort always fires.Does this PR introduce any user-facing change?
No, test only.
How was this patch tested?
Ran the suite 20× under Maven (the environment where it flaked) on a fork — all 20 passed.
Build / Maven (Scala 2.13, JDK 21)): https://github.com/apache/spark/actions/runs/28035705490Build / Maven (Scala 2.13, JDK 25)): https://github.com/apache/spark/actions/runs/28035606804Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code
This pull request and its description were written by Isaac.