Skip to content

[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage'#56724

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57654
Closed

[SPARK-57654][SQL][TESTS] Deflake MetricsFailureInjectionSuite 'Force checksum mismatch aborts a downstream ResultStage'#56724
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:SPARK-57654

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Group the abort-path test MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage" by the high-cardinality id column instead of the 5-value low_cardinality_col, so every one of the 20 reducer partitions reads the corrupted mapper-0.

Why are the changes needed?

The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied asynchronously after the first result task succeeds (RESULT_STAGE_DELAY=1). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled after the corruption landed — a scheduling race. Grouping by the high-cardinality id makes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, on local[2]) deterministically hit it and the abort always fires.

Does this PR introduce any user-facing change?

No, test only.

How was this patch tested?

Ran the suite 20× under Maven (the environment where it flaked) on a fork — all 20 passed.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

… checksum mismatch aborts a downstream ResultStage'

### What changes were proposed in this pull request?
Group the abort-path test by the high-cardinality `id` column instead of the 5-value
`low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0.

### Why are the changes needed?
The test 'Force checksum mismatch aborts a downstream ResultStage' was flaky under Maven
(~3/10 scheduled runs; always passed on SBT). With only ~5 of 20 reducer partitions reading
mapper-0, and the mapper-0 corruption applied asynchronously after the first result task
succeeds (RESULT_STAGE_DELAY=1), the indeterminate-stage abort only fired if one of those few
partitions happened to be scheduled after the corruption -- a scheduling race. Grouping by the
high-cardinality `id` makes every reducer depend on mapper-0, so once corruption lands the
remaining result tasks (dispatched after the first completes on local[2]) deterministically hit
it and the abort always fires.

### Does this PR introduce any user-facing change?
No, test only.

### How was this patch tested?
Ran the suite 20x under Maven (the environment where it flaked) on a fork: all 20 passed.

### Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code

Co-authored-by: Isaac
@HyukjinKwon

Copy link
Copy Markdown
Member Author

Automated code review (posted as a comment, not an approval) — via spark-dev:code-review.

Verdict: no blocking findings. Test-only change; the fix is sound and well-justified.

What it does: Groups the abort-path test by the high-cardinality id column instead of the 5-value low_cardinality_col, so all 20 reducer partitions depend on the corrupted mapper-0 and the indeterminate-stage abort fires deterministically (previously only ~5 partitions read mapper-0, so the abort hinged on scheduling order — the flake).

Design (Pass A): Clean. No architecture/layer impact. The local[2] + SHUFFLE_PARTITIONS=20 reasoning matches the sibling test in the same suite (MetricsFailureInjectionSuite.scala:597), so it's consistent with the established pattern. The scenario the test is named for (2-stage query → ResultStage abort under RESULT_STAGE_DELAY=1 + forced checksum mismatch) is unchanged — only the grouping key's cardinality changes.

Correctness (Pass B): id exists in the fixture (setUpTestTable, line 48). The assertion (intercept[SparkException] containing "indeterminate") is unaffected. No metric-value assertions depend on the grouping key in this test (unlike the sibling tests), so switching to id is safe.

Nit (non-blocking): the new comment says every one of the 20 reducers reads mapper-0 and at least one is guaranteed to hit the corrupted mapper. That's a practical near-certainty rather than a strict guarantee — it depends on 300 id values hashing across all 20 partitions and on local[2] dispatching the remaining result tasks after the corruption event is processed. Both hold overwhelmingly, but "guaranteed"/"every" slightly overstate it; consider "effectively always" to be precise. Not worth blocking on.

Verification: suite run 20× under Maven (the env where it flaked) — all green: https://github.com/HyukjinKwon/spark/actions/runs/28066715792

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Merged to master and branch-4.x.

HyukjinKwon added a commit that referenced this pull request Jun 24, 2026
…checksum mismatch aborts a downstream ResultStage'

### What changes were proposed in this pull request?
Group the abort-path test `MetricsFailureInjectionSuite."Force checksum mismatch aborts a downstream ResultStage"` by the high-cardinality `id` column instead of the 5-value `low_cardinality_col`, so every one of the 20 reducer partitions reads the corrupted mapper-0.

### Why are the changes needed?
The test was flaky under Maven (~3/10 scheduled runs; it always passed on SBT). Only ~5 of the 20 reducer partitions held mapper-0's few low-cardinality keys, and the mapper-0 corruption is applied **asynchronously** after the first result task succeeds (`RESULT_STAGE_DELAY=1`). The indeterminate-stage abort therefore only fired if one of those few partitions happened to be scheduled *after* the corruption landed — a scheduling race. Grouping by the high-cardinality `id` makes every reducer depend on mapper-0, so once the corruption lands the remaining result tasks (dispatched only after the first completes, on `local[2]`) deterministically hit it and the abort always fires.

### Does this PR introduce any user-facing change?
No, test only.

### How was this patch tested?
Ran the suite **20×** under Maven (the environment where it flaked) on a fork — all 20 passed.

- ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 21)`): https://github.com/apache/spark/actions/runs/28035705490
- ❌ Before (flaky, scheduled `Build / Maven (Scala 2.13, JDK 25)`): https://github.com/apache/spark/actions/runs/28035606804
- ✅ After (this fix, MetricsFailureInjectionSuite ×20 under Maven, all green): https://github.com/HyukjinKwon/spark/actions/runs/28066715792

### Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code

This pull request and its description were written by Isaac.

Closes #56724 from HyukjinKwon/SPARK-57654.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit 0ba8d2a)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants