Skip to content

[SPARK-57658][SS] Report available snapshot versions when a RocksDB state snapshot load fails#56718

Open
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/snapshot-load-diagnostics
Open

[SPARK-57658][SS] Report available snapshot versions when a RocksDB state snapshot load fails#56718
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/snapshot-load-diagnostics

Conversation

@HyukjinKwon

@HyukjinKwon HyukjinKwon commented Jun 24, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

When loading state from a snapshot (e.g. reading with the snapshotStartBatchId option) the snapshot zip for the requested version can be missing — most often because the asynchronous maintenance thread has not uploaded it yet. Today that surfaces only as:

[CANNOT_LOAD_STATE_STORE.UNCATEGORIZED] ...
Caused by: java.io.FileNotFoundException: .../state/0/1/2.zip does not exist

with no indication of whether the snapshot was never created or merely not uploaded in time. This PR enriches the FileNotFoundException thrown from RocksDBFileManager with the snapshot (.zip) and changelog (.changelog) files that are present in the DFS checkpoint root, so the situation is self-diagnosing from logs (e.g. asked for 2.zip, only snapshots=[1.zip] present clearly indicates the async-upload race). The listing is best-effort and never throws.

Why are the changes needed?

The snapshotStartBatchId ... transformWithState tests have failed intermittently in scheduled Maven jobs with exactly this opaque FileNotFoundException, in different suite variants per run:

The deterministic test-side wait was handled separately (SPARK-57XXX); this change makes any future recurrence — in any suite or scheduled job — immediately diagnosable instead of requiring artifact spelunking.

Does this PR introduce any user-facing change?

No (improves an error message on an already-failing path).

How was this patch tested?

This pull request and its description were written by Isaac.

…load fails

When loading state from a snapshot (e.g. reading with the snapshotStartBatchId
option) the snapshot zip for the requested version can be missing, most commonly
because the asynchronous maintenance thread has not uploaded it yet. Today this
surfaces only as:

  CANNOT_LOAD_STATE_STORE.UNCATEGORIZED
  Caused by: java.io.FileNotFoundException: .../state/0/1/2.zip does not exist

which gives no clue whether the snapshot was never created or just not uploaded
in time. This has caused hard-to-diagnose intermittent failures in scheduled CI
(snapshotStartBatchId transformWithState tests).

Enrich the FileNotFoundException thrown from RocksDBFileManager with the snapshot
(.zip) and changelog (.changelog) files that ARE present in the DFS checkpoint
root, so any future occurrence is self-diagnosing from the logs. The listing is
best-effort and never throws. A unit test in RocksDBSuite validates the message.

Co-authored-by: Isaac
@HyukjinKwon HyukjinKwon force-pushed the ci-fix/snapshot-load-diagnostics branch from f917ca5 to af41d59 Compare June 24, 2026 05:41
@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][SS] Report available snapshot versions when a state snapshot load fails [SPARK-57658][SS] Report available snapshot versions when a RocksDB state snapshot load fails Jun 24, 2026
@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 24, 2026 05:41

@HyukjinKwon HyukjinKwon left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-review (code-review skill, recall-biased) — comments only, not an approval.

Reviewed across correctness, removed-behavior, cross-file impact, and altitude. The change is correct and well-placed: the diagnostic lives in unzipBatchZipFileFromDfs, so it covers both snapshot-load callers (loadCheckpointFromDfs and getImmutableFilesFromVersionZip); the catch only intercepts FileNotFoundException and rethrows the same type, so existing callers are unaffected; and listDfsFilesForDiagnostics is best-effort and swallows NonFatal, so the diagnostic can never mask the real failure.

Findings (both non-blocking):

  1. Lost cause / stack trace (minor, but ironic for a diagnostics change). The handler rethrows new FileNotFoundException(msg) without chaining the original e, so the underlying stack trace (the exact RawLocalFileSystem/unzip frame) is dropped. Consider preserving it:
    val enriched = new FileNotFoundException(s"${e.getMessage}\n...")
    enriched.initCause(e)
    throw enriched
  2. Message redundancy (cosmetic). e.getMessage already contains the missing zip path, and the enriched message repeats it via from ${dfsBatchZipFile(...)}. Harmless, but the explicit version/checkpointUniqueId is the part that adds value; the path repeat could be dropped.

Validated: new RocksDBSuite test missing snapshot during load reports the available versions passes, and the snapshot suites (base + WithRowChecksum + CheckpointV2) ran 5× clean on a fork. Happy to fold in the initCause tweak if you'd like.

@uros-b

uros-b commented Jun 24, 2026

Copy link
Copy Markdown
Member

The deterministic test-side wait was handled separately (SPARK-57XXX)

Please don't forget to update this part.

// The version-1 snapshot that does exist must be surfaced in the diagnostic.
assert(ex.getMessage.contains("snapshots=[1.zip]"))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit regarding coverage gap: the new test only exercises the else-branch (loadCheckpointFromDfs with no checkpointUniqueId, local fs.open). The if (checkpointUniqueId.isDefined) fm.open (V2/checksum) path is not directly tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants