[feat] Optional cache-dit step caching for the Wan DiT by Mister-Raggs · Pull Request #1426 · hao-ai-lab/FastVideo

Mister-Raggs · 2026-06-03T00:46:16Z

Optional step caching for the Wan DiT via cache-dit

What & why. Adds opt-in, default-off step caching for Wan inference using the cache-dit library. Diffusion runs the DiT once per denoising step, and adjacent steps produce nearly-identical features — cache-dit's DBCache skips the middle DiT blocks on steps whose leading-block residual barely changes and reuses a cached result. An optional TaylorSeer calibrator extrapolates that residual (Taylor expansion) instead of holding it constant, giving higher fidelity at the same skip rate. Lossy (SSIM < 1.0), so default-off and gated behind --use-cachedit.
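The skip-and-reuse idea described above can be illustrated with a toy sketch. This is not cache-dit's implementation: `ToyStepCache`, `rel_diff`, the `warmup` parameter, and the synthetic blocks in `run` are all hypothetical, and the first-order extrapolation only stands in for what a TaylorSeer-style calibrator does.

```python
def rel_diff(curr, prev):
    """L1 distance between two residuals, relative to the magnitude of prev."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-12
    return num / den


class ToyStepCache:
    """Toy residual cache (hypothetical; only illustrates the idea)."""

    def __init__(self, threshold, warmup=2, use_taylor=False):
        self.threshold = threshold
        self.warmup = warmup        # the first `warmup` steps always compute
        self.use_taylor = use_taylor
        self.prev_lead = None       # leading-block residual from the previous step
        self.history = []           # last two (step, skipped-block residual) pairs
        self.skipped = []           # steps on which the remaining blocks were skipped

    def step(self, i, x, lead_blocks, rest_blocks):
        h = lead_blocks(x)          # leading blocks always run
        lead_res = [a - b for a, b in zip(h, x)]
        stable = (self.prev_lead is not None
                  and rel_diff(lead_res, self.prev_lead) < self.threshold)
        self.prev_lead = lead_res
        if i >= self.warmup and stable:
            self.skipped.append(i)
            return [a + r for a, r in zip(h, self._predict(i))]
        out = rest_blocks(h)        # full compute, then remember the residual
        res = [o - a for o, a in zip(out, h)]
        self.history = (self.history + [(i, res)])[-2:]
        return out

    def _predict(self, i):
        t1, r1 = self.history[-1]
        if not self.use_taylor or len(self.history) < 2:
            return r1               # hold the cached residual constant
        t0, r0 = self.history[-2]
        # first-order step: r(i) ~ r1 + (r1 - r0) / (t1 - t0) * (i - t1)
        return [b + (b - a) / (t1 - t0) * (i - t1) for a, b in zip(r0, r1)]


def run(cache, steps=6):
    """Drive the cache with synthetic blocks whose residual drifts linearly."""
    x = [1.0, 2.0]
    out = None
    for i in range(steps):
        lead = lambda v, i=i: [e + 0.1 + 0.001 * i for e in v]
        rest = lambda v, i=i: [e + 0.5 + 0.05 * i for e in v]
        out = cache.step(i, x, lead, rest)
    return out


held = ToyStepCache(threshold=0.08)
taylor = ToyStepCache(threshold=0.08, use_taylor=True)
# The uncached value of the first element at the last step would be 1.855.
print(run(held)[0], run(taylor)[0])
```

In this synthetic case the residual drifts linearly, so the extrapolated variant tracks the uncached value while the constant-residual variant lags behind it. Real features are not linear, so this says nothing about the size of the effect on an actual model.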

How. cache-dit attaches through a transformer-only BlockAdapter (no diffusers pipeline needed). DenoisingStage._enable_or_refresh_cachedit() enables the cache once per transformer and refreshes the cache context each generation (num_inference_steps + refresh_context) so state never leaks across prompts. Wan's separate cond/uncond forwards are handled via enable_separate_cfg=True / cfg_compute_first=False.
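The "enable once, refresh per generation" pattern can be sketched without the library. Everything here is hypothetical (`ToyCacheContext`, `enable_or_refresh`, the `_toy_cache_ctx` attribute); it only shows why resetting per-generation state keeps one prompt's cached residuals from reaching the next.

```python
class ToyCacheContext:
    """Per-transformer cache state (hypothetical stand-in for a cache context)."""

    def __init__(self, num_inference_steps):
        self.num_inference_steps = num_inference_steps
        self.residuals = {}  # state that must not outlive one generation


def enable_or_refresh(transformer, num_inference_steps):
    """Attach a context on first use; afterwards only reset its state."""
    ctx = getattr(transformer, "_toy_cache_ctx", None)
    if ctx is None:
        transformer._toy_cache_ctx = ToyCacheContext(num_inference_steps)
        return "enabled"
    ctx.num_inference_steps = num_inference_steps
    ctx.residuals.clear()  # drop anything cached for the previous prompt
    return "refreshed"


class ToyTransformer:
    """Placeholder for a DiT module."""


model = ToyTransformer()
print(enable_or_refresh(model, 30))       # first generation
model._toy_cache_ctx.residuals[0] = 1.0   # pretend a step cached something
print(enable_or_refresh(model, 50))       # second generation
```

Storing the context on the module, rather than in a global keyed by `id()`, is a design choice made for this sketch so that the state disappears with the module.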

Results — Wan2.1-T2V-1.3B, 1×L40S, 720×1280 / 77f / 30 steps, 5 prompts, eager. Two operating points via --cachedit-residual-threshold (both F8B0 + TaylorSeer):

Preset	Threshold	Wall time vs. baseline	SSIM mean / worst
Quality (default)	0.08	−18%	0.957 / 0.941
Performance	0.15	−32%	0.943 / 0.920
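For readers who think in speedup factors, the wall-time reductions convert as follows. This assumes "−18% wall" means the cached run took 18% less wall time than the baseline; `speedup_from_wall_delta` is a hypothetical helper, not part of the harness.

```python
def speedup_from_wall_delta(delta: float) -> float:
    """Convert a relative wall-time change (e.g. -0.18) into a speedup factor."""
    return 1.0 / (1.0 + delta)


for name, delta in [("Quality", -0.18), ("Performance", -0.32)]:
    print(f"{name}: {speedup_from_wall_delta(delta):.2f}x")
```

Under that assumption the two presets correspond to roughly 1.22x and 1.47x end-to-end speedups.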

TaylorSeer is the key to the aggressive preset: spot-checked frames including a high-detail space-station cupola are visually clean at both thresholds. Without TaylorSeer (plain DBCache constant-residual reuse), the same −32% operating point shows undersolved-noise artifacts on that high-detail content — extrapolating the residual is what keeps the aggressive setting clean.

Caveats.

- Incompatible with DiT offloading: caching skips blocks, but the offload hook assumes every block runs each step. Guarded with a clear error (set --dit-cpu-offload False --dit-layerwise-offload False).
- Eager only: the torch.compile path is rejected with a clear error (caching adds data-dependent control flow). Compile support is a follow-up.
- Wan DiT only for now.
- cache-dit is added as an optional [cache] extra and is lazy-imported; it is not a runtime dependency. (Flagging the new optional dependency explicitly for maintainer visibility.)

Test evidence. Modal A/B harness fastvideo/tests/modal/cachedit_ab.py (baseline vs cached, per-prompt wall + SSIM) — produced the numbers above. Reproduce: modal run fastvideo/tests/modal/cachedit_ab.py --gpu L40S --taylorseer (quality) or --taylorseer --threshold 0.15 (performance).

mergify · 2026-06-03T00:47:05Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

#approved-reviews-by>=1

This rule is failing.

#approved-reviews-by>=1
check-success=fastcheck-passed
check-success=full-suite-passed
check-success~=pre-commit
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
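The title rule can be checked locally before pushing. This sketch assumes the pattern behaves the same under Python's `re` module as it does in Mergify; `title_ok` is a hypothetical helper.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_RULE = re.compile(
    r"(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore"
    r"|kernel|new.?model|skill|skills|infra)\]"
)


def title_ok(title: str) -> bool:
    """Return True when the PR title starts with an accepted [tag]."""
    return TITLE_RULE.search(title) is not None


print(title_ok("[feat] Optional cache-dit step caching for the Wan DiT"))
print(title_ok("Optional cache-dit step caching"))
```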

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds optional cache-dit step-caching support for the Wan DiT pipeline and includes a Modal-based A/B harness to measure wall-time wins vs. SSIM quality cost.

Changes:

- Introduces FastVideoArgs + CLI flags for cache-dit configuration (Fn/Bn/threshold/warmup/TaylorSeer).
- Integrates cache-dit enable/refresh logic into the denoising stage with an explicit offload incompatibility guard.
- Adds a Modal A/B runner and inner script to benchmark baseline vs cachedit and compute SSIM deltas.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
pyproject.toml	Adds an optional extra for installing cache-dit.
fastvideo/fastvideo_args.py	Defines cache-dit args and exposes them via CLI flags.
fastvideo/pipelines/stages/denoising.py	Wires cache-dit into the denoising stage and blocks incompatible offload modes.
fastvideo/tests/modal/cachedit_ab.py	Adds a Modal harness to run baseline vs cachedit passes and print deltas.
fastvideo/tests/modal/_cachedit_ab_inner.py	Implements the per-pass runner and CPU SSIM computation between outputs.


+        import cache_dit
+        from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)


+    for b, p in zip(baseline, patched, strict=True):
+        assert b["i"] == p["i"]
+        vals = _ssim(b["mp4"], p["mp4"])


+    def _ssim(p1, p2):
+        f1, f2 = _read_video_frames(p1), _read_video_frames(p2)
+        n = min(f1.shape[0], f2.shape[0])
+        f1 = (f1[:n].float() / 255.0).contiguous()
+        f2 = (f2[:n].float() / 255.0).contiguous()
+        return [pm_ssim(f1[i:i + 1], f2[i:i + 1], data_range=1.0).item() for i in range(n)]


+    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
+        json.dump(config, f)
+        config_path = f.name
+    results_path = config_path.replace(".json", ".results.json")
+    inner = "/FastVideo/fastvideo/tests/modal/_cachedit_ab_inner.py"
+    cmd = (f"source /opt/venv/bin/activate && exec python {inner} "
+           f"--config-json {config_path} --results-json {results_path}")
+    subprocess.run(["/bin/bash", "-lc", cmd], check=True)


+uv pip install -e ".[test]"
+uv pip install cache-dit


gemini-code-assist

Code Review

This pull request integrates cache-dit step caching into the Wan DiT pipeline to accelerate video generation, adding configuration options, integration logic, a Modal-based A/B testing harness, and an optional package extra. The review feedback highlights several key improvements: adding a compatibility check to prevent using step caching with torch.compile due to dynamic control flow, correcting the Hugging Face login command in the Modal harness, wrapping optional imports in a try-except block for clearer error messages, and broadening exception handling during video frame loading to ensure robust fallback decoding.


gemini-code-assist · 2026-06-03T00:47:39Z

+        if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
+            raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
+                             "blocks, but the layerwise/CPU offload hook assumes every block "
+                             "runs each step. Set dit_layerwise_offload=False and "
+                             "dit_cpu_offload=False (the model must fit in GPU memory).")


Add a check to prevent using use_cachedit with enable_torch_compile. Since cache-dit dynamically skips blocks based on residual changes, it introduces data-dependent dynamic control flow that is incompatible with torch.compile and can lead to compilation failures or significant overhead.
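The two guards discussed here (offloading and torch.compile) can be exercised in isolation with a small stand-in. `ToyArgs` and `check_cachedit_args` are hypothetical; only the four field names come from the PR, and the messages are shortened.

```python
from dataclasses import dataclass


@dataclass
class ToyArgs:
    """Hypothetical stand-in carrying only the fields the guards read."""
    use_cachedit: bool = False
    dit_layerwise_offload: bool = False
    dit_cpu_offload: bool = False
    enable_torch_compile: bool = False


def check_cachedit_args(args: ToyArgs) -> None:
    """Raise ValueError for option combinations that step caching cannot support."""
    if not args.use_cachedit:
        return  # nothing to validate when caching is off
    if args.dit_layerwise_offload or args.dit_cpu_offload:
        raise ValueError("use_cachedit is incompatible with DiT offloading")
    if args.enable_torch_compile:
        raise ValueError("use_cachedit is incompatible with torch.compile")


check_cachedit_args(ToyArgs(use_cachedit=True))  # passes silently
```

Returning early when caching is off keeps the default path free of any new failure mode.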

Suggested change

-        if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
-            raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
-                             "blocks, but the layerwise/CPU offload hook assumes every block "
-                             "runs each step. Set dit_layerwise_offload=False and "
-                             "dit_cpu_offload=False (the model must fit in GPU memory).")
+        if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
+            raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
+                             "blocks, but the layerwise/CPU offload hook assumes every block "
+                             "runs each step. Set dit_layerwise_offload=False and "
+                             "dit_cpu_offload=False (the model must fit in GPU memory).")
+        if fastvideo_args.use_cachedit and fastvideo_args.enable_torch_compile:
+            raise ValueError("use_cachedit is currently incompatible with torch.compile because cache-dit "
+                             "introduces data-dependent dynamic control flow. Please set enable_torch_compile=False.")

gemini-code-assist · 2026-06-03T00:47:39Z

+uv pip install cache-dit
+cd fastvideo-kernel && ./build.sh && cd ..
+export HF_HOME=/root/data/.cache
+hf auth login --token "$HF_TOKEN"


Use the standard huggingface-cli login command instead of hf auth login. huggingface-cli is the official CLI tool provided by the huggingface_hub library, whereas hf is not standard and will likely fail with a command-not-found error.

Suggested change

-hf auth login --token "$HF_TOKEN"
+huggingface-cli login --token "$HF_TOKEN"

gemini-code-assist · 2026-06-03T00:47:39Z

+        import cache_dit
+        from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)


Wrap the optional cache_dit imports in a try-except block to provide a more helpful error message if the package is not installed. This improves usability when users enable --use-cachedit without installing the optional [cache] extra.

+        try:
+            import cache_dit
+            from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)
+        except ImportError as e:
+            raise ImportError(
+                "cache-dit is not installed. Please install it using `pip install cache-dit` "
+                "or `pip install \"fastvideo[cache]\"` to use step caching."
+            ) from e
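The same lazy-import pattern can be written as a reusable helper. `require_optional` is a hypothetical name, and the install hint simply reuses the `fastvideo[cache]` extra named in the suggestion.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency, or fail with an actionable install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is not installed; install the optional extra with "
            f'`pip install "fastvideo[{extra}]"`.'
        ) from exc


# A stdlib module is used here so the example runs without the extra installed.
json_mod = require_optional("json", extra="cache")
print(json_mod.dumps({"use_cachedit": True}))
```

Chaining with `from exc` keeps the original traceback, which helps when the import fails for a reason other than the package being absent.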

gemini-code-assist · 2026-06-03T00:47:39Z

+        except (ImportError, AttributeError):
+            pass


Catch all exceptions (Exception) instead of only ImportError and AttributeError when attempting to read video frames with torchvision.io.read_video. read_video can raise RuntimeError or other exceptions if the underlying video reader backend (like FFmpeg or PyAV) fails or is missing, which would prevent falling back to the PyAV decoder.

Suggested change

-        except (ImportError, AttributeError):
-            pass
+        except Exception:
+            pass
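A decoder fallback chain with a broad catch might look like the sketch below. The decoders are fakes and `read_frames_with_fallback` is hypothetical; unlike a bare `pass`, this version records why each decoder failed so the final error is still diagnosable.

```python
def read_frames_with_fallback(path, decoders):
    """Try each (name, decode_fn) in order and return the first successful result."""
    errors = []
    for name, decode in decoders:
        try:
            return decode(path)
        except Exception as exc:  # deliberately broad: backends raise varied types
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"all decoders failed for {path}: " + "; ".join(errors))


def broken_backend(path):
    """Fake decoder that fails the way a missing video backend might."""
    raise RuntimeError("video backend unavailable")


def working_backend(path):
    """Fake decoder that returns three placeholder frames."""
    return ["frame0", "frame1", "frame2"]


frames = read_frames_with_fallback(
    "clip.mp4", [("primary", broken_backend), ("fallback", working_backend)]
)
print(len(frames))
```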

- denoising: raise a clear ValueError when use_cachedit is combined with enable_torch_compile (cache-dit's data-dependent block-skipping can't be traced by torch.compile; eager-only for now). Wrap the optional cache-dit import in try/except with an actionable install hint.
- harness inner: broaden the torchvision decode fallback to any Exception (FFmpeg/PyAV backends can raise RuntimeError), guard against zero decoded frames before SSIM, and replace an index-match assert with a raised ValueError (asserts are stripped under python -O).
- harness driver: install cache-dit via the declared ".[test,cache]" extra instead of a separate step; clean up temp config/results/ssim files. Kept `hf auth login` (gemini suggested huggingface-cli): `hf` is the current huggingface_hub entrypoint and the bootstrap is validated working on the fastvideo-dev image.
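The two harness-inner fixes (raising instead of asserting, and guarding against zero frames) can be sketched as follows. `pair_by_index` and `summarize_ssim` are hypothetical helpers, and taking "worst" to be the minimum per-frame value is an assumption about how the reported numbers were aggregated.

```python
def pair_by_index(baseline, patched):
    """Pair per-prompt results, raising (not asserting) on any mismatch."""
    if len(baseline) != len(patched):
        raise ValueError(f"result count mismatch: {len(baseline)} != {len(patched)}")
    pairs = []
    for b, p in zip(baseline, patched):
        if b["i"] != p["i"]:
            # A raised error survives `python -O`; an assert would be stripped.
            raise ValueError(f"prompt index mismatch: {b['i']} != {p['i']}")
        pairs.append((b, p))
    return pairs


def summarize_ssim(per_frame):
    """Return (mean, worst) over per-frame SSIM values."""
    if not per_frame:
        raise ValueError("no frames decoded; cannot compute SSIM")
    return sum(per_frame) / len(per_frame), min(per_frame)


pairs = pair_by_index([{"i": 0}, {"i": 1}], [{"i": 0}, {"i": 1}])
print(len(pairs), summarize_ssim([0.9, 1.0]))
```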

Mister-Raggs · 2026-06-03T01:19:29Z

The cachedit schema-parity failure is fixed in the latest push. The other CI failure (test_ltx2_presets_registered) is pre-existing on main - ltx2_3_base is registered and is the LTX2 default, but the test still expects the old 3-preset set; this PR touches no LTX2/preset code. Filed separately as #1427.

Re: the hf auth login suggestion - keeping it. hf is the current huggingface_hub CLI entrypoint (huggingface-cli is the legacy alias), and this bootstrap line is validated working on the fastvideo-dev image in our A/B runs.

SolitaryThinker · 2026-06-08T23:53:19Z

/merge

Adds opt-in, default-off step caching for Wan via the cache-dit library (https://github.com/vipshop/cache-dit). When enabled, DBCache skips DiT blocks on denoising steps whose features barely change and reuses a cached residual; an optional TaylorSeer calibrator extrapolates that residual (Taylor expansion) instead of holding it constant, for higher fidelity at the same skip rate. Lossy (SSIM<1.0), so default OFF and gated behind --use-cachedit.

cache-dit is wired in via a transformer-only BlockAdapter (no diffusers pipeline needed): DenoisingStage._enable_or_refresh_cachedit() enables the cache once per transformer and refreshes the cache context per generation (num_inference_steps + refresh_context) so state never leaks across prompts.

- fastvideo_args: use_cachedit + cachedit_{fn,bn}_compute_blocks / residual_threshold / max_warmup_steps + cachedit_taylorseer[_order], with matching CLI flags.
- denoising: enable/refresh hook + a guard that raises if caching is combined with DiT offloading (caching skips blocks; the offload hook assumes every block runs each step).
- pyproject: cache-dit as an optional [cache] extra (lazy-imported, not a runtime dependency).

Measured on Wan2.1-T2V-1.3B, L40S, 720x1280/77f/30steps, 5 prompts (eager): F8B0 + TaylorSeer, threshold 0.08 -> -18% wall at SSIM mean 0.957 / worst 0.941, visually clean (incl. high-detail scenes). Eager only; compile path untested.

Single-container baseline-vs-cachedit A/B on Wan: per-prompt wall + pairwise SSIM, --taylorseer toggle, --threshold/--fn/--bn/--warmup sweep knobs. Disables DiT offload on both passes (caching skips blocks) and installs cache-dit in the image. This is the harness that produced the PR numbers.


test_fastvideo_args_fields_are_classified requires every FastVideoArgs field to appear in docs/design/inference_schema_parity_inventory.yaml. Add the 7 new use_cachedit/cachedit_* fields under internal_only (opt-in runtime caching config, not part of the canonical typed inference API).
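A parity check of this kind can be sketched with `dataclasses.fields`. `ToyArgs`, `INVENTORY`, and `unclassified_fields` are hypothetical and much smaller than the real FastVideoArgs and YAML inventory; the sketch only shows why adding a field without classifying it makes such a test fail.

```python
from dataclasses import dataclass, fields


@dataclass
class ToyArgs:
    """Hypothetical miniature of an args dataclass."""
    num_inference_steps: int = 30
    use_cachedit: bool = False
    cachedit_residual_threshold: float = 0.08


# Hypothetical inventory: category name -> set of classified field names.
INVENTORY = {
    "public": {"num_inference_steps"},
    "internal_only": {"use_cachedit"},
}


def unclassified_fields(args_cls, inventory):
    """List dataclass fields that appear in no inventory category."""
    classified = set().union(*inventory.values())
    return sorted(f.name for f in fields(args_cls) if f.name not in classified)


print(unclassified_fields(ToyArgs, INVENTORY))
```

Adding the missing name under `internal_only` empties the list, which mirrors the fix this commit describes.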

Generalizes the Wan-only cache-dit wiring (hao-ai-lab#1426) into a small per-model spec registry (`_CACHEDIT_MODEL_SPECS` + `_resolve_cachedit_spec`, MRO-matched so subclasses inherit) and adds HunyuanVideo.

Each spec declares the transformer block-list attribute(s) and the cache-dit ForwardPattern each follows. Wan stays a single-stream Pattern_2 entry (`blocks`). HunyuanVideo is dual-stream MMDiT: `double_blocks` (Pattern_0, returns img+txt) + `single_blocks` (Pattern_3, concatenated single tensor). FastVideo's block signatures differ from diffusers', so patterns are passed explicitly with check_forward_pattern=False.

The CFG mode is no longer hard-coded: `enable_separate_cfg` / `BlockAdapter.has_separate_cfg` are derived per generation from `batch.do_classifier_free_guidance`, so one spec covers a model's classic-CFG and distilled/embedded-guidance configs. For the validated Wan-with-CFG path this is identical to hao-ai-lab#1426 (do_cfg=True).

Behavior unchanged when use_cachedit is off. Validation (lossy SSIM A/B on HunyuanVideo) tracked separately.
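MRO-based spec resolution can be sketched as below. The class names, `ToySpec`, `SPECS`, and `resolve_spec` are hypothetical, and keying the registry by class name is an assumption rather than the commit's confirmed design; the pattern labels are plain strings copied from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical per-model spec: block-list attributes and their patterns."""
    block_attrs: tuple
    patterns: tuple


class WanTransformer:
    """Placeholder base model class."""


class WanVariantTransformer(WanTransformer):
    """Placeholder subclass with no registry entry of its own."""


class HunyuanTransformer:
    """Placeholder dual-stream model class."""


SPECS = {
    "WanTransformer": ToySpec(("blocks",), ("Pattern_2",)),
    "HunyuanTransformer": ToySpec(
        ("double_blocks", "single_blocks"), ("Pattern_0", "Pattern_3")
    ),
}


def resolve_spec(transformer):
    """Walk the MRO so a subclass inherits the nearest registered ancestor's spec."""
    for cls in type(transformer).__mro__:
        spec = SPECS.get(cls.__name__)
        if spec is not None:
            return spec
    return None


print(resolve_spec(WanVariantTransformer()).block_attrs)
```

Walking the MRO from most to least specific means a subclass can later register its own entry and override the inherited one without touching the lookup.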


Copilot AI review requested due to automatic review settings June 3, 2026 00:46

mergify Bot added the labels "type: feat" (New feature or capability), "scope: inference" (Inference pipeline, serving, CLI), and "scope: infra" (CI, tests, Docker, build) Jun 3, 2026

Copilot AI reviewed Jun 3, 2026


gemini-code-assist Bot reviewed Jun 3, 2026


mergify Bot added the "scope: docs" (Documentation) label Jun 3, 2026

Mister-Raggs mentioned this pull request Jun 3, 2026

test_ltx2_presets_registered fails on main: ltx2_3_base registered but missing from expected preset set #1427

Closed

github-actions Bot added the "ready" (PR is ready to merge) label Jun 8, 2026

Mister-Raggs added 4 commits June 8, 2026 18:33

Mister-Raggs force-pushed the perf/wan-cachedit branch from f98973b to e12c0fc Compare June 9, 2026 01:34


Mister-Raggs wants to merge 4 commits into hao-ai-lab:main from Mister-Raggs:perf/wan-cachedit


3 participants
