
Integrate SmartDiskCache for hash-based persistent caching#1411

Open
BitcrushedHeart wants to merge 11 commits into Nerogar:master from BitcrushedHeart:SmartCache

Conversation

BitcrushedHeart (Contributor) commented Apr 6, 2026

SmartDiskCache Integration

What This Is

Wires OneTrainer into the new `SmartDiskCache` module from the companion mgds PR (Nerogar/mgds#49). The cache becomes persistent and content-addressed: it grows over time and only rebuilds what is genuinely stale, rather than being wiped and rebuilt every time a file changes.

What Changed

Config

Adds a `sourceless_training` field to TrainConfig (default `False`), with a config migration (`migration_10`). The `clear_cache_before_training` default changes to `False`, since SmartCache makes forced rebuilds unnecessary in most cases.
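A minimal sketch of what a migration like `migration_10` might do. The function name and dict-shaped config are assumptions for illustration; only the keys and defaults come from this PR:

```python
def migrate_10(config: dict) -> dict:
    """Hypothetical sketch of migration_10: add the new key with its
    default. The real TrainConfig migration machinery may differ."""
    migrated = dict(config)
    # New field: train purely from cached .pt files when enabled.
    migrated.setdefault("sourceless_training", False)
    # New default is False; existing user settings are left untouched.
    migrated.setdefault("clear_cache_before_training", False)
    return migrated
```

Using `setdefault` means an existing explicit `clear_cache_before_training: true` survives the migration; only configs missing the key pick up the new default.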

UI

  • Sourceless Training toggle in the Data tab: trains from cached .pt files without source images or text
  • Clean Cache button in the Data tab: shows a preview of orphaned cache files (count + MB) before deleting anything, and handles both the text and image cache directories
  • Updated `clear_cache_before_training` tooltip to reflect that SmartCache validates incrementally and detects model type changes automatically
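The Clean Cache preview step can be sketched as a dry-run scan. This is a hypothetical helper, not OneTrainer's actual implementation; the real orphan check presumably consults the cache index rather than a plain name set:

```python
import os

def preview_orphans(cache_dirs, valid_names):
    """Count orphaned cache files and their total size without deleting
    anything. Sketch only: walks each cache directory (e.g. image and
    text) and flags files whose names are not in the valid set."""
    count, total_bytes = 0, 0
    for cache_dir in cache_dirs:
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                if name not in valid_names:
                    count += 1
                    total_bytes += os.path.getsize(os.path.join(root, name))
    return count, total_bytes / (1024 * 1024)  # (files, MB)
```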

Dataloaders

All dataloaders that previously used `DiskCache` now use `SmartDiskCache` through `DataLoaderText2ImageMixin._cache_modules()`. The mixin passes `model_type`, `source_path_in_name`, and `sourceless` to the `SmartDiskCache` constructor.
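In sketch form, the wiring might look like the following. The real `SmartDiskCache` signature lives in the companion mgds PR; the stub below models only the three arguments named here, and the config shape is an assumption:

```python
from dataclasses import dataclass

@dataclass
class SmartDiskCacheStub:
    """Stand-in for the mgds SmartDiskCache; only the three constructor
    arguments named in this PR are modeled."""
    model_type: str            # lets the cache detect model type changes
    source_path_in_name: str   # which input name holds the source file path
    sourceless: bool           # skip source-file validation entirely

def cache_modules(config: dict) -> SmartDiskCacheStub:
    # Mirrors what DataLoaderText2ImageMixin._cache_modules() passes.
    return SmartDiskCacheStub(
        model_type=config["model_type"],
        source_path_in_name=config["source_path_in_name"],
        sourceless=config["sourceless_training"],
    )
```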

When `sourceless_training` and `latent_caching` are both enabled, `_create_dataset()` short-circuits to `[cache_modules, output_modules]`, skipping the file enumeration, loading, augmentation, and preparation modules entirely.
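Assuming a list-of-module-lists pipeline (parameter names hypothetical), the short-circuit might look like:

```python
def create_dataset_modules(cache_modules, output_modules,
                           enumeration_modules, load_modules,
                           aug_modules, prep_modules,
                           sourceless_training: bool, latent_caching: bool):
    """Sketch of the _create_dataset() short-circuit described above."""
    if sourceless_training and latent_caching:
        # Cached .pt files are the only data source: everything that
        # reads or transforms source files is dropped from the pipeline.
        return [cache_modules, output_modules]
    return [enumeration_modules, load_modules, aug_modules,
            prep_modules, cache_modules, output_modules]
```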

Interruptible Caching

Pressing "Stop Training" during caching now finishes the current file, saves the cache index, and stops gracefully. The next run picks up where it left off.
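The names `stop_check_fun` and `CachingStoppedException` come from this PR's commits; the loop shape below is an assumption, sketching the cooperative-stop pattern:

```python
class CachingStoppedException(Exception):
    """Raised when caching is stopped cleanly between files."""

def cache_files(files, write_cache_entry, save_index, stop_check_fun):
    """Sketch: the stop check runs between files, so the file in
    progress always completes and the index is saved before stopping.
    The next run resumes from the persisted index."""
    for f in files:
        write_cache_entry(f)        # current file always finishes
        if stop_check_fun():
            save_index()            # persist progress for the next run
            raise CachingStoppedException()
    save_index()
```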

GenericTrainer

`__clear_cache()` now prints a message explaining that SmartCache makes clearing unnecessary. The wipe logic is preserved (it still deletes the image/, text/, and epoch-* directories) but is off by default.

Dependencies

Requires Nerogar/mgds#49 (SmartDiskCache module).

Testing

Test branch: `SmartcacheTests` on the mgds repo contains 69 tests covering the full cache system.


Closes #280
Closes #109

Commits

- Replace `DiskCache` with `SmartDiskCache` in all dataloaders; add the `sourceless_training` config field with a UI toggle; add the Clean Cache button with a preview dialog; change the `clear_cache_before_training` default to `False`; add xxhash to requirements.
- Update the `clear_cache_before_training` tooltip: SmartCache validates incrementally and detects model type changes automatically, so the old warning about disabling cache clearing is no longer accurate.
- Fix import ordering: the `SmartDiskCache` import was placed after `CollectPaths`/`DecodeVAE` instead of in alphabetical order after `SingleAspectCalculation`.
- Raise a clear error at dataset creation time when sourceless mode is combined with text encoder training: text encoder training requires re-tokenizing prompts from source files, which are not available in sourceless mode. Failing early beats failing mid-training.
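That fail-fast check can be sketched as follows. The function name and error message are hypothetical; only the incompatibility itself comes from the PR:

```python
def check_sourceless_compat(sourceless_training: bool, text_encoder_train: bool):
    """Sketch: reject incompatible settings at dataset creation time.
    Text encoder training re-tokenizes prompts from source files, which
    do not exist in sourceless mode."""
    if sourceless_training and text_encoder_train:
        raise ValueError(
            "sourceless_training is incompatible with text encoder "
            "training: prompts cannot be re-tokenized without source files"
        )
```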
- Fix `source_path_in_name`: `prompt_path` -> `image_path` for the text cache
- Add `stop_check_fun` to `SmartDiskCache` for interruptible caching
- Catch `CachingStoppedException` in the trainer epoch loop
- Closes Nerogar#109
- Text cache now validates against `sample_prompt_path` instead of `image_path`
- Clean Cache button disabled while training is running, to prevent concurrent access
- Pick up upstream mgds SmartCache commit f65c2de, "Add fast validation to skip expensive per-file cache checks": replaces the 20+ minute full stat walk with a directory-mtime + sampled spot-check path that returns in under a second on unchanged datasets.
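A rough sketch of such a two-tier validation. This is a hypothetical function, not the mgds implementation; it only illustrates the directory-mtime gate plus sampled spot-check described above:

```python
import os
import random

def fast_validate(data_dir, recorded_dir_mtime, cached_paths, sample_size=16):
    """Sketch: return True if the dataset looks unchanged without
    statting every file. A False result means callers should fall back
    to the full per-file validation pass."""
    # 1. Cheap gate: if the directory mtime moved, assume changes.
    if os.path.getmtime(data_dir) != recorded_dir_mtime:
        return False
    # 2. Spot-check: stat only a small random sample of cached source
    #    paths instead of walking the whole dataset.
    sample = random.sample(cached_paths, min(sample_size, len(cached_paths)))
    return all(os.path.exists(p) for p in sample)
```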
- Upstream mgds SmartCache now caches validated source filepaths in a per-process set and short-circuits start-of-epoch validation when every required path is already in that set. Previously, even with the fast-validate path available, each epoch still re-stat'd the dataset; now only the first epoch validates, and every epoch after that returns immediately.
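In sketch form (names hypothetical), that per-process memoization might look like:

```python
# Module-level, so it lives for the whole process: paths validated once
# are never re-validated, no matter how many epochs run.
_validated_paths: set = set()

def needs_validation(required_paths):
    """Return only the paths not yet validated this process. On the
    first epoch this is everything; afterwards it is empty and the
    start-of-epoch stat walk is skipped entirely."""
    return [p for p in required_paths if p not in _validated_paths]

def mark_validated(paths):
    """Record paths that passed validation."""
    _validated_paths.update(paths)
```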

Issues this pull request may close:

- [Feat]: Smarter caching instead of all or nothing - graceful fails, reduce repeated work, portable cache
- [Feat] "Stop Training" should stop caching