Fix validation skipped when IterableDataset exhausts early #21552
Open
avocardio wants to merge 2 commits into Lightning-AI:master
Conversation
…length

When an IterableDataset reports a length via `__len__` but produces fewer batches (due to shard boundaries, rounding, or `drop_last=True` with multiple workers), `StopIteration` is raised in `_DataFetcher.__next__` before `fetched >= length`. This `StopIteration` propagates to the training epoch loop's `run()` method, where `except StopIteration: break` exits the loop — skipping `on_advance_end()` and the validation check it contains.

The fix adds a post-loop validation check: when the data fetcher is done (`StopIteration` was caught) and validation should run at the epoch boundary, we set `is_last_batch=True` and run the validation check that was skipped.

Fixes Lightning-AI#19624
Fixes #19624
Problem
If an `IterableDataset` implements `__len__` but yields fewer batches than expected (common with webdataset, DALI, or any streaming dataset where shard boundaries / `drop_last` / worker splitting cause the actual count to differ from the estimate), validation never runs — not just for that epoch, but for every subsequent epoch too.

The root cause is in `_TrainingEpochLoop.run()`. When `_DataFetcher.__next__()` hits the end of the underlying iterator, it sets `done = True` and re-raises `StopIteration`. The `except StopIteration: break` in the training loop then skips `on_advance_end()`, which is where the validation check lives. Since no training batch was actually processed on that final iteration (the fetch itself failed), skipping the per-batch bookkeeping is fine — but the end-of-epoch validation should still fire.

Fix
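The failure mode can be sketched in a few lines of plain Python (simplified stand-ins, not Lightning's actual code): the per-batch check that fires validation only runs after a batch is fetched, so when the fetcher raises `StopIteration` early, the `break` exits the loop before the "last batch" condition is ever seen.

```python
# Sketch of the bug: a fetcher that claims 10 batches but yields 8
# (e.g. a shard boundary cut the stream short).

class ShortFetcher:
    def __init__(self):
        self.length, self.fetched, self.done = 10, 0, False
        self._it = iter(range(8))  # actual stream is shorter than reported

    def __next__(self):
        try:
            batch = next(self._it)
        except StopIteration:
            self.done = True  # set before re-raising, as described above
            raise
        self.fetched += 1
        self.done = self.fetched >= self.length
        return batch

val_checks = 0
fetcher = ShortFetcher()
while not fetcher.done:
    try:
        batch = next(fetcher)
    except StopIteration:
        break  # exits the loop, skipping the check below
    # stand-in for on_advance_end(): validate if this was the last batch
    if fetcher.done:
        val_checks += 1

print(val_checks)  # 0: the last-batch condition never becomes true
```

Because `fetched` never reaches `length`, `done` is only set on the failed fetch, and the only path that observes it is the `break`.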
After the while loop exits, check if the fetcher was exhausted and run validation if appropriate. The check mirrors the existing logic in `on_advance_end` but only triggers the validation part, since no batch was processed.

I ran into this while training a ViT-B/32 model on webdataset shards. Validation was silently skipped every epoch until I traced it to this codepath. After the fix, validation fires reliably — confirmed across 20+ epochs on two separate multi-GPU runs.
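The shape of the fix, sketched with simplified stand-ins (the `Trainer` class here is a dummy; the real change lives in `_TrainingEpochLoop`): after the loop exits via `StopIteration`, detect the early exhaustion and run the validation that `on_advance_end()` would have triggered.

```python
class Trainer:
    """Dummy trainer tracking only what the sketch needs."""
    def __init__(self):
        self.is_last_batch = False
        self.val_runs = 0

    def on_advance_end(self, fetcher):
        # Per-batch path: fires validation only when the fetcher believes
        # this was the last batch (fetched >= reported length).
        if fetcher.done:
            self.is_last_batch = True
            self.val_runs += 1

class ShortFetcher:
    """Reports 10 batches but yields 8."""
    def __init__(self):
        self.length, self.fetched, self.done = 10, 0, False
        self._it = iter(range(8))

    def __next__(self):
        try:
            batch = next(self._it)
        except StopIteration:
            self.done = True
            raise
        self.fetched += 1
        self.done = self.fetched >= self.length
        return batch

def run_epoch(fetcher, trainer):
    while not fetcher.done:
        try:
            batch = next(fetcher)
        except StopIteration:
            break  # previously: validation silently skipped from here on
        trainer.on_advance_end(fetcher)
    # The fix: fetcher exhausted early, so the per-batch path never saw
    # the epoch boundary; mark it and run the skipped validation check.
    if fetcher.done and not trainer.is_last_batch:
        trainer.is_last_batch = True
        trainer.val_runs += 1

trainer = Trainer()
run_epoch(ShortFetcher(), trainer)
print(trainer.val_runs)  # 1: validation fires despite early exhaustion
```

The guard `not trainer.is_last_batch` keeps the well-behaved case (where the last fetch succeeded and the per-batch path already validated) from validating twice.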
Test
Added a regression test with a minimal `IterableDataset` that reports `len=10` but only yields 8 samples.

📚 Documentation preview 📚: https://pytorch-lightning--21552.org.readthedocs.build/en/21552/
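The shape of such a dataset, as a plain-Python sketch (the actual test would subclass `torch.utils.data.IterableDataset`; the class name here is illustrative):

```python
class ShortIterable:
    """__len__ reports 10, but iteration stops after 8 samples."""

    def __len__(self):
        return 10              # the estimate the training loop trusts

    def __iter__(self):
        return iter(range(8))  # the stream actually ends here

ds = ShortIterable()
assert len(ds) == 10
assert len(list(ds)) == 8  # the mismatch that used to skip validation
```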