[feat] Optional cache-dit step caching for the Wan DiT #1426
Mister-Raggs wants to merge 4 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds optional cache-dit step-caching support for the Wan DiT pipeline and includes a Modal-based A/B harness to measure wall-time wins vs. SSIM quality cost.
Changes:
- Introduces `FastVideoArgs` + CLI flags for cache-dit configuration (Fn/Bn/threshold/warmup/TaylorSeer).
- Integrates cache-dit enable/refresh logic into the denoising stage with an explicit offload incompatibility guard.
- Adds a Modal A/B runner and inner script to benchmark baseline vs cachedit and compute SSIM deltas.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds an optional extra for installing cache-dit. |
| fastvideo/fastvideo_args.py | Defines cache-dit args and exposes them via CLI flags. |
| fastvideo/pipelines/stages/denoising.py | Wires cache-dit into the denoising stage and blocks incompatible offload modes. |
| fastvideo/tests/modal/cachedit_ab.py | Adds a Modal harness to run baseline vs cachedit passes and print deltas. |
| fastvideo/tests/modal/_cachedit_ab_inner.py | Implements the per-pass runner and CPU SSIM computation between outputs. |
    import cache_dit
    from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)

    for b, p in zip(baseline, patched, strict=True):
        assert b["i"] == p["i"]
        vals = _ssim(b["mp4"], p["mp4"])

    def _ssim(p1, p2):
        f1, f2 = _read_video_frames(p1), _read_video_frames(p2)
        n = min(f1.shape[0], f2.shape[0])
        f1 = (f1[:n].float() / 255.0).contiguous()
        f2 = (f2[:n].float() / 255.0).contiguous()
        return [pm_ssim(f1[i:i + 1], f2[i:i + 1], data_range=1.0).item() for i in range(n)]

    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
        json.dump(config, f)
        config_path = f.name
    results_path = config_path.replace(".json", ".results.json")
    inner = "/FastVideo/fastvideo/tests/modal/_cachedit_ab_inner.py"
    cmd = (f"source /opt/venv/bin/activate && exec python {inner} "
           f"--config-json {config_path} --results-json {results_path}")
    subprocess.run(["/bin/bash", "-lc", cmd], check=True)

    uv pip install -e ".[test]"
    uv pip install cache-dit
Code Review
This pull request integrates cache-dit step caching into the Wan DiT pipeline to accelerate video generation, adding configuration options, integration logic, a Modal-based A/B testing harness, and an optional package extra. The review feedback highlights several key improvements: adding a compatibility check to prevent using step caching with torch.compile due to dynamic control flow, correcting the Hugging Face login command in the Modal harness, wrapping optional imports in a try-except block for clearer error messages, and broadening exception handling during video frame loading to ensure robust fallback decoding.
    if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
        raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
                         "blocks, but the layerwise/CPU offload hook assumes every block "
                         "runs each step. Set dit_layerwise_offload=False and "
                         "dit_cpu_offload=False (the model must fit in GPU memory).")
Add a check to prevent using use_cachedit with enable_torch_compile. Since cache-dit dynamically skips blocks based on residual changes, it introduces data-dependent dynamic control flow that is incompatible with torch.compile and can lead to compilation failures or significant overhead.
Suggested change:

      if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
          raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
                           "blocks, but the layerwise/CPU offload hook assumes every block "
                           "runs each step. Set dit_layerwise_offload=False and "
                           "dit_cpu_offload=False (the model must fit in GPU memory).")
    + if fastvideo_args.use_cachedit and fastvideo_args.enable_torch_compile:
    +     raise ValueError("use_cachedit is currently incompatible with torch.compile because cache-dit "
    +                      "introduces data-dependent dynamic control flow. Please set enable_torch_compile=False.")
    uv pip install cache-dit
    cd fastvideo-kernel && ./build.sh && cd ..
    export HF_HOME=/root/data/.cache
    hf auth login --token "$HF_TOKEN"
Use the standard huggingface-cli login command instead of hf auth login. huggingface-cli is the official CLI tool provided by the huggingface_hub library, whereas hf is not standard and will likely fail with a command-not-found error.
Suggested change:

    - hf auth login --token "$HF_TOKEN"
    + huggingface-cli login --token "$HF_TOKEN"
    import cache_dit
    from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)
Wrap the optional cache_dit imports in a try-except block to provide a more helpful error message if the package is not installed. This improves usability when users enable --use-cachedit without installing the optional [cache] extra.
    try:
        import cache_dit
        from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)
    except ImportError as e:
        raise ImportError(
            "cache-dit is not installed. Please install it using `pip install cache-dit` "
            "or `pip install \"fastvideo[cache]\"` to use step caching."
        ) from e

    except (ImportError, AttributeError):
        pass
Catch all exceptions (Exception) instead of only ImportError and AttributeError when attempting to read video frames with torchvision.io.read_video. read_video can raise RuntimeError or other exceptions if the underlying video reader backend (like FFmpeg or PyAV) fails or is missing, which would prevent falling back to the PyAV decoder.
Suggested change:

    - except (ImportError, AttributeError):
    -     pass
    + except Exception:
    +     pass
- denoising: raise a clear ValueError when use_cachedit is combined with enable_torch_compile (cache-dit's data-dependent block-skipping can't be traced by torch.compile; eager-only for now). Wrap the optional cache-dit import in try/except with an actionable install hint.
- harness inner: broaden the torchvision decode fallback to any Exception (FFmpeg/PyAV backends can raise RuntimeError), guard against zero decoded frames before SSIM, and replace an index-match assert with a raised ValueError (asserts are stripped under python -O).
- harness driver: install cache-dit via the declared ".[test,cache]" extra instead of a separate step; clean up temp config/results/ssim files.

Kept `hf auth login` (gemini suggested huggingface-cli): `hf` is the current huggingface_hub entrypoint and the bootstrap is validated working on the fastvideo-dev image.
The cachedit schema-parity failure is fixed in the latest push. The other CI failure (test_ltx2_presets_registered) is pre-existing on main - ltx2_3_base is registered and is the LTX2 default, but the test still expects the old 3-preset set; this PR touches no LTX2/preset code. Filed separately as #1427. Re: the hf auth login suggestion - keeping it. hf is the current huggingface_hub CLI entrypoint (huggingface-cli is the legacy alias), and this bootstrap line is validated working on the fastvideo-dev image in our A/B runs. |
/merge |
Adds opt-in, default-off step caching for Wan via the cache-dit library (https://github.com/vipshop/cache-dit). When enabled, DBCache skips DiT blocks on denoising steps whose features barely change and reuses a cached residual; an optional TaylorSeer calibrator extrapolates that residual (Taylor expansion) instead of holding it constant, for higher fidelity at the same skip rate. Lossy (SSIM<1.0), so default OFF and gated behind --use-cachedit.

cache-dit is wired in via a transformer-only BlockAdapter (no diffusers pipeline needed): DenoisingStage._enable_or_refresh_cachedit() enables the cache once per transformer and refreshes the cache context per generation (num_inference_steps + refresh_context) so state never leaks across prompts.

- fastvideo_args: use_cachedit + cachedit_{fn,bn}_compute_blocks / residual_threshold / max_warmup_steps + cachedit_taylorseer[_order], with matching CLI flags.
- denoising: enable/refresh hook + a guard that raises if caching is combined with DiT offloading (caching skips blocks; the offload hook assumes every block runs each step).
- pyproject: cache-dit as an optional [cache] extra (lazy-imported, not a runtime dependency).

Measured on Wan2.1-T2V-1.3B, L40S, 720x1280/77f/30steps, 5 prompts (eager): F8B0 + TaylorSeer, threshold 0.08 -> -18% wall at SSIM mean 0.957 / worst 0.941, visually clean (incl. high-detail scenes). Eager only; compile path untested.
Single-container baseline-vs-cachedit A/B on Wan: per-prompt wall + pairwise SSIM, --taylorseer toggle, --threshold/--fn/--bn/--warmup sweep knobs. Disables DiT offload on both passes (caching skips blocks) and installs cache-dit in the image. This is the harness that produced the PR numbers.
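To make the reported figures concrete, here is a minimal sketch of how per-prompt wall time and per-frame SSIM could be reduced to a "wall delta / SSIM mean / SSIM worst" summary. The function name `summarize_ab` and the record keys `wall_s` and `ssim` are hypothetical and are not taken from `cachedit_ab.py`; treating "worst" as the lowest per-prompt mean is also an assumption.

```python
from statistics import mean


def summarize_ab(baseline, cached):
    """Reduce paired A/B records to a wall-time delta and an SSIM summary.

    Each record is a dict with hypothetical keys:
      "i"      -> prompt index
      "wall_s" -> wall-clock seconds for that prompt
    Cached records also carry "ssim": per-frame SSIM against the baseline video.
    """
    if len(baseline) != len(cached):
        raise ValueError("baseline and cached runs must cover the same prompts")

    per_prompt_ssim = []
    for b, c in zip(baseline, cached):
        # Explicit checks instead of assert, so they survive `python -O`.
        if b["i"] != c["i"]:
            raise ValueError(f"prompt index mismatch: {b['i']} != {c['i']}")
        if not c["ssim"]:
            raise ValueError(f"no frames decoded for prompt {c['i']}")
        per_prompt_ssim.append(mean(c["ssim"]))

    base_wall = sum(b["wall_s"] for b in baseline)
    cached_wall = sum(c["wall_s"] for c in cached)
    return {
        # Negative means the cached pass was faster than the baseline.
        "wall_delta_pct": 100.0 * (cached_wall - base_wall) / base_wall,
        "ssim_mean": mean(per_prompt_ssim),
        "ssim_worst": min(per_prompt_ssim),
    }
```

Summing wall time over prompts before taking the ratio weights long prompts more heavily than averaging per-prompt percentages would; either convention is defensible as long as it is stated next to the numbers.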
test_fastvideo_args_fields_are_classified requires every FastVideoArgs field to appear in docs/design/inference_schema_parity_inventory.yaml. Add the 7 new use_cachedit/cachedit_* fields under internal_only (opt-in runtime caching config, not part of the canonical typed inference API).
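The parity check described here amounts to a set comparison between dataclass fields and the inventory. The sketch below illustrates the idea only: `ArgsSketch` stands in for `FastVideoArgs`, the field names are expanded from the shorthand in the commit message, the field types are assumptions, and a plain dict replaces the YAML file. The real test may be structured differently.

```python
from dataclasses import dataclass, fields


@dataclass
class ArgsSketch:
    """Stand-in for FastVideoArgs; only the cache-dit fields are shown.

    Field types are assumptions made for this illustration.
    """
    use_cachedit: bool
    cachedit_fn_compute_blocks: int
    cachedit_bn_compute_blocks: int
    cachedit_residual_threshold: float
    cachedit_max_warmup_steps: int
    cachedit_taylorseer: bool
    cachedit_taylorseer_order: int


def unclassified_fields(args_cls, inventory):
    """Return the dataclass field names that no inventory category lists.

    `inventory` maps a category name (for example "internal_only") to a list
    of field names, mirroring what a parsed YAML inventory might look like.
    """
    classified = set()
    for names in inventory.values():
        classified.update(names)
    return sorted(f.name for f in fields(args_cls) if f.name not in classified)
```

A test built on this helper would assert that the returned list is empty, so adding a field without classifying it fails loudly and names the offending field.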
f98973b to e12c0fc
Generalizes the Wan-only cache-dit wiring (hao-ai-lab#1426) into a small per-model spec registry (`_CACHEDIT_MODEL_SPECS` + `_resolve_cachedit_spec`, MRO-matched so subclasses inherit) and adds HunyuanVideo. Each spec declares the transformer block-list attribute(s) and the cache-dit ForwardPattern each follows.

Wan stays a single-stream Pattern_2 entry (`blocks`). HunyuanVideo is dual-stream MMDiT: `double_blocks` (Pattern_0, returns img+txt) + `single_blocks` (Pattern_3, concatenated single tensor). FastVideo's block signatures differ from diffusers', so patterns are passed explicitly with check_forward_pattern=False.

The CFG mode is no longer hard-coded: `enable_separate_cfg` / `BlockAdapter.has_separate_cfg` are derived per generation from `batch.do_classifier_free_guidance`, so one spec covers a model's classic-CFG and distilled/embedded-guidance configs. For the validated Wan-with-CFG path this is identical to hao-ai-lab#1426 (do_cfg=True).

Behavior unchanged when use_cachedit is off. Validation (lossy SSIM A/B on HunyuanVideo) tracked separately.
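An MRO-matched registry of this kind can be sketched as follows. This illustrates the lookup technique and is not the code in the commit. The class names, the `SpecSketch` layout, keying the registry by class name, and the use of plain strings in place of cache-dit's `ForwardPattern` members are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecSketch:
    # (block-list attribute, forward-pattern name) pairs. Strings stand in
    # for cache-dit's ForwardPattern enum members in this illustration.
    block_patterns: tuple


# Hypothetical transformer classes used only to demonstrate the lookup.
class WanTransformerSketch:
    pass


class HunyuanTransformerSketch:
    pass


class WanVariantSketch(WanTransformerSketch):
    """A subclass with no registry entry of its own."""


SPECS_SKETCH = {
    "WanTransformerSketch": SpecSketch((("blocks", "Pattern_2"),)),
    "HunyuanTransformerSketch": SpecSketch(
        (("double_blocks", "Pattern_0"), ("single_blocks", "Pattern_3"))
    ),
}


def resolve_spec_sketch(transformer):
    """Return the spec of the nearest registered class in the instance's MRO."""
    for cls in type(transformer).__mro__:
        spec = SPECS_SKETCH.get(cls.__name__)
        if spec is not None:
            return spec
    return None  # unsupported model: the caller decides whether to raise
```

Walking `__mro__` from most to least derived means a subclass picks up its parent's spec automatically, while a subclass that needs different blocks can still register an entry that shadows it.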
Optional step caching for the Wan DiT via cache-dit
What & why. Adds opt-in, default-off step caching for Wan inference using the cache-dit library. Diffusion runs the DiT once per denoising step, and adjacent steps produce nearly identical features; cache-dit's DBCache skips the middle DiT blocks on steps whose leading-block residual barely changes and reuses a cached result. An optional TaylorSeer calibrator extrapolates that residual (Taylor expansion) instead of holding it constant, giving higher fidelity at the same skip rate. Lossy (SSIM < 1.0), so default-off and gated behind `--use-cachedit`.

How. cache-dit attaches through a transformer-only `BlockAdapter` (no diffusers pipeline needed). `DenoisingStage._enable_or_refresh_cachedit()` enables the cache once per transformer and refreshes the cache context each generation (`num_inference_steps` + `refresh_context`) so state never leaks across prompts. Wan's separate cond/uncond forwards are handled via `enable_separate_cfg=True` / `cfg_compute_first=False`.

Results: Wan2.1-T2V-1.3B, 1×L40S, 720×1280 / 77f / 30 steps, 5 prompts, eager. Two operating points via `--cachedit-residual-threshold` (both F8B0 + TaylorSeer):

TaylorSeer is the key to the aggressive preset: spot-checked frames including a high-detail space-station cupola are visually clean at both thresholds. Without TaylorSeer (plain DBCache constant-residual reuse), the same −32% operating point shows undersolved-noise artifacts on that high-detail content; extrapolating the residual is what keeps the aggressive setting clean.

Caveats.
- Incompatible with DiT offloading (`--dit-cpu-offload False --dit-layerwise-offload False`).
- Eager only: the `torch.compile` path is rejected with a clear error (caching adds data-dependent control flow). Follow-up.
- cache-dit is an optional `[cache]` extra, lazy-imported, not a runtime dependency. (Flagging the new optional dependency explicitly for maintainer visibility.)

Test evidence. Modal A/B harness `fastvideo/tests/modal/cachedit_ab.py` (baseline vs cached, per-prompt wall + SSIM) produced the numbers above. Reproduce: `modal run fastvideo/tests/modal/cachedit_ab.py --gpu L40S --taylorseer` (quality) or `--taylorseer --threshold 0.15` (performance).
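
The skip rule described under "What & why" can be illustrated with a toy model. The sketch below is not cache-dit's implementation: the relative-L1 metric, the choice to update the reference residual only on computed steps, and every function name are assumptions made to show the shape of the idea on plain Python lists.

```python
def rel_l1(a, b):
    """Sum of absolute differences between a and b, relative to the magnitude of b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1e-12  # avoid division by zero
    return num / den


def run_steps(first_residuals, middle_fn, threshold, warmup):
    """Decide, per denoising step, whether to run or skip the middle blocks.

    first_residuals: cheap leading-block residual for each step (list of lists)
    middle_fn(step): stands in for the expensive middle blocks
    threshold:       skip when the leading residual moved by less than this
    warmup:          number of initial steps that are always computed
    Returns a list of ("compute" | "reuse", residual) tuples.
    """
    ref = None     # leading residual at the last computed step
    cached = None  # middle-block residual from the last computed step
    log = []
    for step, r in enumerate(first_residuals):
        can_skip = (
            step >= warmup
            and ref is not None
            and rel_l1(r, ref) < threshold
        )
        if can_skip:
            log.append(("reuse", cached))  # constant-residual reuse
        else:
            cached = middle_fn(step)
            ref = r
            log.append(("compute", cached))
    return log


def taylor_first_order(prev, last, prev_step, last_step, step):
    """Extrapolate a residual linearly from its last two computed values.

    This is the first-order case of a Taylor-style predictor: instead of
    holding `last` constant on a skipped step, follow its recent slope.
    """
    scale = (step - last_step) / (last_step - prev_step)
    return [l + (l - p) * scale for p, l in zip(prev, last)]
```

In this toy, a higher `threshold` skips more steps and saves more time at the cost of reusing staler residuals, which is the trade-off the two operating points explore. Swapping the constant reuse for `taylor_first_order` corresponds to the role the description assigns to TaylorSeer.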