Add SmartDiskCache module with hash-based persistent caching #49
Open
BitcrushedHeart wants to merge 16 commits into Nerogar:master from
Conversation
Introduces SmartDiskCache as a drop-in replacement for DiskCache with per-file xxhash64 content validation, content-addressed cache filenames, a cache.json index with deduplication support, atomic writes with crash recovery, garbage collection, sourceless training mode, and a sample selection fix for the SAMPLES balancing strategy.
- Rebuild validation status now cleans hash_index before re-queuing, matching the behavior of content_changed/resolution_changed/missing_pt
- Remove unused all_input_files set from __refresh_cache
- Store loss_weight, type, name, path, seed from concept dict in .pt files at build time (follows existing __cache_version pattern)
- In sourceless mode, reconstruct concept dict from stored metadata so OutputPipelineModule can resolve concept.loss_weight
- Add concept to sourceless get_outputs() so pipeline resolution finds SmartDiskCache instead of walking back to ConceptPipelineModule
- Bump CACHE_VERSION to 2 (forces cache rebuild for sourceless mode, normal mode unaffected)
Call before_cache_fun before falling through to upstream pipeline modules in get_item, so the model is on the correct device when re-encoding uncached items at training time.
The real bug was in OneTrainer passing 'prompt_path' (nonexistent) as source_path_in_name for the text cache, causing every text lookup to miss. With the correct key ('image_path'), the fallback path should never be reached after a fresh cache build.
- Add .pt existence check on mtime fast-path to prevent FileNotFoundError
- Replace shutil.move with os.replace for atomic writes on Windows
- Rewrite _load_cache_index with 3-stage fallback (cache.json → .tmp → .bak)
- Extend _index_lock to cover full save operation (write + backup + rename)
- Switch to time-based flush interval (30s) with compact JSON for intermediate flushes
- Cache os.path.realpath once in __init__, use _real_pt_path consistently
- Cache source paths at epoch start, eliminate per-item pipeline traversal
- Load aggregate data into RAM at epoch start, serve from memory in get_item
Shows tqdm progress during the validation loop and aggregate cache loading so the terminal doesn't appear frozen between phases.
The generator expression caused as_completed to submit futures lazily, one at a time, preventing the executor from pipelining the next item while the current one's I/O completes.
On repeat runs where nothing changed, cache validation was taking 20+ minutes due to stat-ing every source file individually. This adds a fast path that checks directory mtimes and spot-checks a sample of entries, reducing validation to under a second for unchanged datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cache validation was running at the start of every epoch, even when the same filepaths were being delivered (which is the common case since users configure repeats rather than custom samples_per_epoch). On larger datasets the per-file validation loop was noticeable at each epoch boundary despite no actual dataset change.

Track validated filepaths in a per-process set and short-circuit _reshuffle_and_prepare when every required path is already in that set and still present in the on-disk index. Fall through to the existing fast-validate / full-validate paths otherwise.

Trade-off: within-run edits to source files are no longer detected. Cross-run detection (via cache.json + fast validation) is unchanged. Training against a mutating dataset within a single process was never well-defined anyway.
Fix device mismatch on cache miss during training. Call before_cache_fun before falling through to upstream pipeline modules in get_item, so the model is on the correct device when re-encoding uncached items at training time. The fallback is reachable whenever individual files fail to cache (build_failed / missing / hash_failed), so the band-aid from c22be2f was removed prematurely in 28795b1.
Persist a zero-tensor sentinel during cache validation using any successful entry as a shape template. On cache miss, return the sentinel directly instead of re-running upstream encoders.

Rationale: files that fail to cache (build_failed / missing / hash_failed) leave gaps in the index. At training time the text encoder is on the temp device (CPU) and bringing it back to GPU to re-encode a single sample risks both a device mismatch and an OOM since the main model is already on GPU. The before_cache_fun re-encode path is kept as a last-resort fallback for the edge case where no valid entries exist yet (e.g. caching interrupted before any file succeeded).
SmartDiskCache - Hash-Based Persistent Caching
What This Is
A replacement for 'DiskCache' that makes caching persistent and content-addressed rather than ephemeral. Adding one image to a 100k dataset caches one file, not 100k. Editing one caption recaches one text embedding, not all of them. Moving files between concepts (same content, different path) reuses existing cache via hash matching. Switching between training configs that differ only in non-cache-relevant settings never triggers recaching.
The cache becomes a content-addressed store that grows over time and only rebuilds what's genuinely stale.
How It Works
Hashing
Every source file gets an xxhash64 hash of its contents. xxhash64 is faster than MD5/SHA-256 and has excellent collision resistance for non-cryptographic purposes. The full 64-bit hash is used internally for comparison. Cache filenames use a 12 hex char truncation (48 bits, ~281 trillion possible values) to keep paths manageable.
Image cache files: '{hash12}_{resolution}_{variation}.pt'
Text cache files: '{hash12}_{variation}.pt'
Validation Flow
Per-file validation runs for each file needed in the current epoch. The mtime check is the fast path: hash computation only happens when the mtime has changed. This means validating a 100k dataset where nothing changed is essentially free - 100k 'stat()' calls, no file reads.
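The cheap-first flow above can be sketched as a single decision function. The field names ('mtime', 'hash') and the 'file_hash' helper are illustrative; the helper uses hashlib as a dependency-free stand-in for the module's xxhash64.

```python
import hashlib
import os

def file_hash(path: str) -> str:
    # Stand-in content hasher; the real module uses xxhash64 here.
    h = hashlib.md5()
    with open(path, "rb") as f:
        h.update(f.read())
    return h.hexdigest()

def validate_entry(source_path: str, entry: dict) -> str:
    """Cheap-first validation of one index entry."""
    try:
        st = os.stat(source_path)      # fast path: one stat() call
    except FileNotFoundError:
        return "missing"
    if st.st_mtime == entry["mtime"]:
        return "valid"                 # mtime unchanged: no file read at all
    if file_hash(source_path) == entry["hash"]:
        return "valid"                 # touched but identical: refresh stored mtime
    return "content_changed"           # genuinely stale: re-encode this file
```

Only the "content_changed" outcome triggers re-encoding; a touched-but-identical file just gets its stored mtime refreshed.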
Cache Index
Each cache directory ('image/' and 'text/') maintains a 'cache.json' index with per-file entries (filename, hash, mtime, modeltype, resolution, cache_file, cache_version) and a 'hash_index' mapping hashes to lists of filepaths for dedup lookups. The index uses atomic writes (write to '.tmp', backup to '.bak', rename) with crash recovery on startup.
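The atomic write and 3-stage fallback described above might look roughly like this. A minimal sketch with hypothetical function names; the actual index schema and locking granularity live in SmartDiskCache.py.

```python
import json
import os
import threading

_index_lock = threading.Lock()

def save_index(cache_dir: str, index: dict) -> None:
    path = os.path.join(cache_dir, "cache.json")
    tmp, bak = path + ".tmp", path + ".bak"
    with _index_lock:  # lock covers the full write + backup + rename
        with open(tmp, "w") as f:
            json.dump(index, f)
            f.flush()
            os.fsync(f.fileno())
        if os.path.exists(path):
            os.replace(path, bak)   # keep the last good index as backup
        os.replace(tmp, path)       # atomic rename on POSIX and Windows

def load_index(cache_dir: str) -> dict:
    path = os.path.join(cache_dir, "cache.json")
    # 3-stage fallback: cache.json -> .tmp -> .bak
    for candidate in (path, path + ".tmp", path + ".bak"):
        try:
            with open(candidate) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            continue
    return {"entries": {}, "hash_index": {}}  # fresh start
```

'os.replace' (rather than 'shutil.move') is what makes the rename atomic on Windows, which is why the commits above switched to it.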
Deduplication
When a new file is encountered, its hash is checked against the 'hash_index'. If a match exists with the same modeltype and resolution, the existing cache entry is reused - no encoding needed. This handles the common case of the same image appearing in multiple concepts.
When one copy of a deduplicated file is edited, it gets a new hash and new cache files. The unedited copy still points to the old cache entry. When all references to a hash are gone, the cache files become eligible for garbage collection.
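The dedup lookup reduces to a small function over the index. Field and function names here are illustrative, not the module's actual API.

```python
def find_reusable(hash_index: dict, entries: dict,
                  file_hash: str, modeltype: str, resolution: str):
    """Return an existing cache file for this content hash, if any other
    source file was already encoded with the same modeltype and resolution."""
    for other_path in hash_index.get(file_hash, []):
        entry = entries.get(other_path)
        if entry and entry["modeltype"] == modeltype \
                 and entry["resolution"] == resolution:
            return entry["cache_file"]  # reuse: no encoding needed
    return None  # no match: encode and register a new entry
```

Matching on modeltype and resolution as well as the hash is what keeps a 512px cache entry from being reused for a 1024px run of the same image.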
Sourceless Training
If all necessary training data is embedded in the '.pt' cache files, users can train from cache alone without the source images/text files. A 'sourceless_training' toggle in the config enables this. When active, the dataloader skips file enumeration, loading, and augmentation modules entirely - the pipeline collapses to just '[cache_modules, output_modules]'.
On startup in sourceless mode, 'SmartDiskCache' validates that all cache entries have sufficient 'cache_version', correct 'modeltype', and existing '.pt' files. Clear errors are raised if anything is missing.
This enables dataset sharing without distributing original files. Cached latents can't be decoded back to pixel-space images without the VAE decoder, so this is a one-way transform - useful for privacy-sensitive datasets.
Garbage Collection
A "Clean Cache" button in the UI identifies orphaned cache files (source file no longer exists, or '.pt' files with no 'cache.json' entry) and shows a preview with file counts and sizes before deleting anything. Dedup-shared '.pt' files are preserved as long as at least one source file still references them.
Sample Selection Fix
The SAMPLES balancing strategy now shuffles the full file pool then takes N, rather than taking the first N then shuffling. This gives genuinely random sampling across epochs when using large datasets with sample limits.
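The fix amounts to reordering two operations; a minimal sketch (function name hypothetical):

```python
import random

def select_samples(files, n, rng: random.Random):
    # Shuffle the full pool first, then take n: every file has equal
    # probability of being selected each epoch. The old order (take the
    # first n, then shuffle) only ever sampled from the same fixed prefix.
    pool = list(files)
    rng.shuffle(pool)
    return pool[:n]
```

With take-then-shuffle, files beyond index n were never seen regardless of how many epochs ran; shuffle-then-take fixes that.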
What Changed
New File
'src/mgds/pipelineModules/SmartDiskCache.py' - the entire module. 'PipelineModule' + 'SingleVariationRandomAccessPipelineModule', drop-in replacement for 'DiskCache' with additional constructor params ('modeltype', 'source_path_in_name', 'sourceless').
Testing
Test branch: 'SmartcacheTests' - 69 tests covering hashing, cache validation flow, deduplication, atomic writes/crash recovery, garbage collection, sourceless training, sample selection, DiskCache regression, and issue regression scenarios.
Why not replace DiskCache?
While mgds is built for OneTrainer, I have no idea what else could be using mgds, so this allows existing repos to continue using DiskCache even as OneTrainer shifts to SmartDiskCache. If desired, we could raise a deprecation warning when DiskCache is used if this is merged.
Closes #41