[SPARK-57658][SS] Report available snapshot versions when a RocksDB state snapshot load fails#56718
Conversation
…load fails When loading state from a snapshot (e.g. reading with the snapshotStartBatchId option) the snapshot zip for the requested version can be missing, most commonly because the asynchronous maintenance thread has not uploaded it yet. Today this surfaces only as: CANNOT_LOAD_STATE_STORE.UNCATEGORIZED Caused by: java.io.FileNotFoundException: .../state/0/1/2.zip does not exist which gives no clue whether the snapshot was never created or just not uploaded in time. This has caused hard-to-diagnose intermittent failures in scheduled CI (snapshotStartBatchId transformWithState tests). Enrich the FileNotFoundException thrown from RocksDBFileManager with the snapshot (.zip) and changelog (.changelog) files that ARE present in the DFS checkpoint root, so any future occurrence is self-diagnosing from the logs. The listing is best-effort and never throws. A unit test in RocksDBSuite validates the message. Co-authored-by: Isaac
f917ca5 to
af41d59
Compare
HyukjinKwon
left a comment
There was a problem hiding this comment.
Self-review (code-review skill, recall-biased) — comments only, not an approval.
Reviewed across correctness, removed-behavior, cross-file impact, and altitude. The change is correct and well-placed: the diagnostic lives in unzipBatchZipFileFromDfs, so it covers both snapshot-load callers (loadCheckpointFromDfs and getImmutableFilesFromVersionZip); the catch only intercepts FileNotFoundException and rethrows the same type, so existing callers are unaffected; and listDfsFilesForDiagnostics is best-effort and swallows NonFatal, so the diagnostic can never mask the real failure.
Findings (both non-blocking):
- Lost cause / stack trace (minor, but ironic for a diagnostics change). The handler rethrows
new FileNotFoundException(msg)without chaining the originale, so the underlying stack trace (the exactRawLocalFileSystem/unzipframe) is dropped. Consider preserving it:val enriched = new FileNotFoundException(s"${e.getMessage}\n...") enriched.initCause(e) throw enriched
- Message redundancy (cosmetic).
e.getMessagealready contains the missing zip path, and the enriched message repeats it viafrom ${dfsBatchZipFile(...)}. Harmless, but the explicitversion/checkpointUniqueIdis the part that adds value; the path repeat could be dropped.
Validated: new RocksDBSuite test missing snapshot during load reports the available versions passes, and the snapshot suites (base + WithRowChecksum + CheckpointV2) ran 5× clean on a fork. Happy to fold in the initCause tweak if you'd like.
Please don't forget to update this part. |
| // The version-1 snapshot that does exist must be surfaced in the diagnostic. | ||
| assert(ex.getMessage.contains("snapshots=[1.zip]")) | ||
| } | ||
|
|
There was a problem hiding this comment.
Nit regarding coverage gap: the new test only exercises the else-branch (loadCheckpointFromDfs with no checkpointUniqueId, local fs.open). The if (checkpointUniqueId.isDefined) fm.open (V2/checksum) path is not directly tested.
What changes were proposed in this pull request?
When loading state from a snapshot (e.g. reading with the
snapshotStartBatchIdoption) the snapshot zip for the requested version can be missing — most often because the asynchronous maintenance thread has not uploaded it yet. Today that surfaces only as:with no indication of whether the snapshot was never created or merely not uploaded in time. This PR enriches the
FileNotFoundExceptionthrown fromRocksDBFileManagerwith the snapshot (.zip) and changelog (.changelog) files that are present in the DFS checkpoint root, so the situation is self-diagnosing from logs (e.g. asked for 2.zip, onlysnapshots=[1.zip]present clearly indicates the async-upload race). The listing is best-effort and never throws.Why are the changes needed?
The
snapshotStartBatchId ... transformWithStatetests have failed intermittently in scheduled Maven jobs with exactly this opaqueFileNotFoundException, in different suite variants per run:sql#core - slow tests: https://github.com/apache/spark/actions/runs/28048347820 (plainStateDataSourceTransformWithStateSuite)sql#core - slow tests: https://github.com/apache/spark/actions/runs/28045133937/job/83029837937 (StateDataSourceTransformWithStateSuiteWithRowChecksum)The deterministic test-side wait was handled separately (SPARK-57XXX); this change makes any future recurrence — in any suite or scheduled job — immediately diagnosable instead of requiring artifact spelunking.
Does this PR introduce any user-facing change?
No (improves an error message on an already-failing path).
How was this patch tested?
RocksDBFileManager: missing snapshot during load reports the available versionsinRocksDBSuite.This pull request and its description were written by Isaac.