[feat] Optional cache-dit step caching for the Wan DiT #1426

Open
Mister-Raggs wants to merge 4 commits into hao-ai-lab:main from Mister-Raggs:perf/wan-cachedit

@Mister-Raggs Mister-Raggs commented Jun 3, 2026

Optional step caching for the Wan DiT via cache-dit

What & why. Adds opt-in, default-off step caching for Wan inference using the cache-dit library. Diffusion runs the DiT once per denoising step, and adjacent steps produce nearly identical features. cache-dit's DBCache exploits this: on steps where the leading-block residual barely changes, it skips the middle DiT blocks and reuses a cached result. An optional TaylorSeer calibrator extrapolates that residual (Taylor expansion) instead of holding it constant, giving higher fidelity at the same skip rate. The technique is lossy (SSIM < 1.0), so it is default-off and gated behind --use-cachedit.
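To make the mechanism concrete, here is a toy, pure-Python sketch of the skip decision and the two reuse strategies. It is not cache-dit's implementation: the function names, the relative-L1 change metric, and the two-point first-order extrapolation are all illustrative assumptions.

```python
from typing import List, Optional

Vector = List[float]


def relative_l1(curr: Vector, prev: Vector) -> float:
    """Total absolute change relative to the previous magnitude (assumed metric)."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-12  # avoid division by zero
    return num / den


def should_skip(curr_lead: Vector, prev_lead: Optional[Vector], threshold: float) -> bool:
    """Skip the middle blocks when the leading-block residual barely moved."""
    if prev_lead is None:  # nothing cached yet, so the step must be computed
        return False
    return relative_l1(curr_lead, prev_lead) < threshold


def hold_constant(last: Vector) -> Vector:
    """Plain reuse: return the cached residual unchanged."""
    return list(last)


def taylor_first_order(last: Vector, before_last: Vector, steps_ahead: int = 1) -> Vector:
    """First-order extrapolation from the last two computed residuals (toy version)."""
    return [l + steps_ahead * (l - b) for l, b in zip(last, before_last)]


prev_lead = [1.00, 2.00, 3.00]
curr_lead = [1.02, 2.02, 3.02]  # relative change is about 0.01, below 0.08
print(should_skip(curr_lead, prev_lead, threshold=0.08))
print(taylor_first_order([1.2, 2.2], [1.0, 2.0]))
```

The contrast between `hold_constant` and `taylor_first_order` is the point: when the residual is drifting steadily, extrapolating follows the drift, whereas holding it constant falls behind by one step's worth of change on every skipped step.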

How. cache-dit attaches through a transformer-only BlockAdapter (no diffusers pipeline needed). DenoisingStage._enable_or_refresh_cachedit() enables the cache once per transformer and refreshes the cache context each generation (num_inference_steps + refresh_context) so state never leaks across prompts. Wan's separate cond/uncond forwards are handled via enable_separate_cfg=True / cfg_compute_first=False.
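The enable-once, refresh-per-generation pattern can be sketched without cache-dit at all. The classes below are hypothetical stand-ins (neither `ToyCacheContext` nor `ToyCacheManager` exists in FastVideo or cache-dit); they only illustrate why a fresh context per generation keeps state from leaking across prompts.

```python
class ToyCacheContext:
    """Hypothetical per-generation cache state."""

    def __init__(self, num_inference_steps: int):
        self.num_inference_steps = num_inference_steps
        self.cached_residual = None  # filled in during denoising
        self.skipped_steps = 0


class ToyCacheManager:
    """Installs caching once per transformer, then resets state per generation."""

    def __init__(self):
        self._contexts = {}  # id(transformer) -> ToyCacheContext
        self.enable_calls = 0

    def enable_or_refresh(self, transformer, num_inference_steps: int) -> ToyCacheContext:
        key = id(transformer)
        if key not in self._contexts:
            # One-time setup (hook installation) would happen here.
            self.enable_calls += 1
        # Every generation starts from a brand-new context.
        self._contexts[key] = ToyCacheContext(num_inference_steps)
        return self._contexts[key]


manager = ToyCacheManager()
transformer = object()  # placeholder for the DiT module

first = manager.enable_or_refresh(transformer, num_inference_steps=30)
first.cached_residual = [0.1, 0.2]  # state written while serving prompt 1

second = manager.enable_or_refresh(transformer, num_inference_steps=30)
print(manager.enable_calls, second.cached_residual)
```

After the second call the setup counter is still 1 and the cached residual is empty again, which is the property the refresh step is meant to guarantee.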

Results — Wan2.1-T2V-1.3B, 1×L40S, 720×1280 / 77f / 30 steps, 5 prompts, eager. Two operating points via --cachedit-residual-threshold (both F8B0 + TaylorSeer):

| Preset | Threshold | Wall time vs. baseline | SSIM mean / worst |
| --- | --- | --- | --- |
| Quality (default) | 0.08 | −18% | 0.957 / 0.941 |
| Performance | 0.15 | −32% | 0.943 / 0.920 |
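For readers who think in speedup factors rather than wall-time deltas, the conversion is a one-liner. This assumes the percentages mean "cached wall time relative to baseline wall time", which is how the A/B harness is described; the helper name is made up for illustration.

```python
def speedup_from_wall_delta(delta: float) -> float:
    """Convert a relative wall-time change (e.g. -0.18) into a speedup factor."""
    if delta <= -1.0:
        raise ValueError("a wall-time reduction must be smaller than 100%")
    return 1.0 / (1.0 + delta)


for name, delta in [("Quality", -0.18), ("Performance", -0.32)]:
    print(f"{name}: {speedup_from_wall_delta(delta):.2f}x")
# Quality: 1.22x
# Performance: 1.47x
```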

TaylorSeer is the key to the aggressive preset: spot-checked frames including a high-detail space-station cupola are visually clean at both thresholds. Without TaylorSeer (plain DBCache constant-residual reuse), the same −32% operating point shows undersolved-noise artifacts on that high-detail content — extrapolating the residual is what keeps the aggressive setting clean.

Caveats.

  • Incompatible with DiT offloading: caching skips blocks, but the offload hook assumes every block runs each step. Guarded with a clear error (set --dit-cpu-offload False --dit-layerwise-offload False).
  • Eager only; the torch.compile path is rejected with a clear error (caching adds data-dependent control flow). Compile support is left as a follow-up.
  • Wan DiT only for now.
  • cache-dit is added as an optional [cache] extra, lazy-imported — not a runtime dependency. (Flagging the new optional dependency explicitly for maintainer visibility.)
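The first two caveats amount to a pair of argument checks. A minimal standalone sketch follows; the field names come from this PR, but `ToyArgs` and `validate_cachedit_args` are hypothetical and the real guard lives in the denoising stage with its own wording.

```python
from dataclasses import dataclass


@dataclass
class ToyArgs:
    """Hypothetical stand-in for the relevant FastVideoArgs fields."""
    use_cachedit: bool = False
    dit_cpu_offload: bool = False
    dit_layerwise_offload: bool = False
    enable_torch_compile: bool = False


def validate_cachedit_args(args: ToyArgs) -> None:
    """Raise ValueError for combinations that step caching cannot support."""
    if not args.use_cachedit:
        return  # caching is off, so nothing to check
    if args.dit_cpu_offload or args.dit_layerwise_offload:
        raise ValueError(
            "use_cachedit is incompatible with DiT offloading: caching skips "
            "blocks, but the offload hook assumes every block runs each step.")
    if args.enable_torch_compile:
        raise ValueError(
            "use_cachedit is eager-only for now: block skipping adds "
            "data-dependent control flow.")


validate_cachedit_args(ToyArgs(use_cachedit=True))  # passes silently
try:
    validate_cachedit_args(ToyArgs(use_cachedit=True, dit_cpu_offload=True))
except ValueError as exc:
    print(f"rejected: {exc}")
```

Checking everything up front, before any model work starts, means a bad flag combination fails immediately rather than partway through a generation.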

Test evidence. Modal A/B harness fastvideo/tests/modal/cachedit_ab.py (baseline vs cached, per-prompt wall + SSIM) — produced the numbers above. Reproduce: modal run fastvideo/tests/modal/cachedit_ab.py --gpu L40S --taylorseer (quality) or --taylorseer --threshold 0.15 (performance).

Copilot AI review requested due to automatic review settings June 3, 2026 00:46
@mergify mergify Bot added type: feat New feature or capability scope: inference Inference pipeline, serving, CLI scope: infra CI, tests, Docker, build labels Jun 3, 2026
mergify Bot commented Jun 3, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • #approved-reviews-by>=1
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds optional cache-dit step-caching support for the Wan DiT pipeline and includes a Modal-based A/B harness to measure wall-time wins vs. SSIM quality cost.

Changes:

  • Introduces FastVideoArgs + CLI flags for cache-dit configuration (Fn/Bn/threshold/warmup/TaylorSeer).
  • Integrates cache-dit enable/refresh logic into the denoising stage with an explicit offload incompatibility guard.
  • Adds a Modal A/B runner and inner script to benchmark baseline vs cachedit and compute SSIM deltas.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Adds an optional extra for installing cache-dit. |
| fastvideo/fastvideo_args.py | Defines cache-dit args and exposes them via CLI flags. |
| fastvideo/pipelines/stages/denoising.py | Wires cache-dit into the denoising stage and blocks incompatible offload modes. |
| fastvideo/tests/modal/cachedit_ab.py | Adds a Modal harness to run baseline vs cachedit passes and print deltas. |
| fastvideo/tests/modal/_cachedit_ab_inner.py | Implements the per-pass runner and CPU SSIM computation between outputs. |

Comment thread fastvideo/pipelines/stages/denoising.py Outdated
Comment on lines +82 to +83
import cache_dit
from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)
Comment on lines +159 to +161
for b, p in zip(baseline, patched, strict=True):
    assert b["i"] == p["i"]
    vals = _ssim(b["mp4"], p["mp4"])
Comment on lines +146 to +151
def _ssim(p1, p2):
    f1, f2 = _read_video_frames(p1), _read_video_frames(p2)
    n = min(f1.shape[0], f2.shape[0])
    f1 = (f1[:n].float() / 255.0).contiguous()
    f2 = (f2[:n].float() / 255.0).contiguous()
    return [pm_ssim(f1[i:i + 1], f2[i:i + 1], data_range=1.0).item() for i in range(n)]
Comment thread fastvideo/tests/modal/cachedit_ab.py Outdated
Comment on lines +116 to +123
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    json.dump(config, f)
    config_path = f.name
results_path = config_path.replace(".json", ".results.json")
inner = "/FastVideo/fastvideo/tests/modal/_cachedit_ab_inner.py"
cmd = (f"source /opt/venv/bin/activate && exec python {inner} "
       f"--config-json {config_path} --results-json {results_path}")
subprocess.run(["/bin/bash", "-lc", cmd], check=True)
Comment thread fastvideo/tests/modal/cachedit_ab.py Outdated
Comment on lines +91 to +92
uv pip install -e ".[test]"
uv pip install cache-dit

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request integrates cache-dit step caching into the Wan DiT pipeline to accelerate video generation, adding configuration options, integration logic, a Modal-based A/B testing harness, and an optional package extra. The review feedback highlights several key improvements: adding a compatibility check to prevent using step caching with torch.compile due to dynamic control flow, correcting the Hugging Face login command in the Modal harness, wrapping optional imports in a try-except block for clearer error messages, and broadening exception handling during video frame loading to ensure robust fallback decoding.

Comment on lines +326 to +330
if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
    raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
                     "blocks, but the layerwise/CPU offload hook assumes every block "
                     "runs each step. Set dit_layerwise_offload=False and "
                     "dit_cpu_offload=False (the model must fit in GPU memory).")

high

Add a check to prevent using use_cachedit with enable_torch_compile. Since cache-dit dynamically skips blocks based on residual changes, it introduces data-dependent dynamic control flow that is incompatible with torch.compile and can lead to compilation failures or significant overhead.

Suggested change
if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
    raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
                     "blocks, but the layerwise/CPU offload hook assumes every block "
                     "runs each step. Set dit_layerwise_offload=False and "
                     "dit_cpu_offload=False (the model must fit in GPU memory).")
if fastvideo_args.use_cachedit and (fastvideo_args.dit_layerwise_offload or fastvideo_args.dit_cpu_offload):
    raise ValueError("use_cachedit is incompatible with DiT offloading: caching skips "
                     "blocks, but the layerwise/CPU offload hook assumes every block "
                     "runs each step. Set dit_layerwise_offload=False and "
                     "dit_cpu_offload=False (the model must fit in GPU memory).")
if fastvideo_args.use_cachedit and fastvideo_args.enable_torch_compile:
    raise ValueError("use_cachedit is currently incompatible with torch.compile because cache-dit "
                     "introduces data-dependent dynamic control flow. Please set enable_torch_compile=False.")

uv pip install cache-dit
cd fastvideo-kernel && ./build.sh && cd ..
export HF_HOME=/root/data/.cache
hf auth login --token "$HF_TOKEN"

high

Use the standard huggingface-cli login command instead of hf auth login. huggingface-cli is the official CLI tool provided by the huggingface_hub library, whereas hf is not standard and will likely fail with a command-not-found error.

Suggested change
hf auth login --token "$HF_TOKEN"
huggingface-cli login --token "$HF_TOKEN"

Comment thread fastvideo/pipelines/stages/denoising.py Outdated
Comment on lines +82 to +83
import cache_dit
from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)

medium

Wrap the optional cache_dit imports in a try-except block to provide a more helpful error message if the package is not installed. This improves usability when users enable --use-cachedit without installing the optional [cache] extra.

        try:
            import cache_dit
            from cache_dit import (BlockAdapter, DBCacheConfig, ForwardPattern, TaylorSeerCalibratorConfig)
        except ImportError as e:
            raise ImportError(
                "cache-dit is not installed. Please install it using `pip install cache-dit` "
                "or `pip install \"fastvideo[cache]\"` to use step caching."
            ) from e

Comment on lines +134 to +135
except (ImportError, AttributeError):
    pass

medium

Catch all exceptions (Exception) instead of only ImportError and AttributeError when attempting to read video frames with torchvision.io.read_video. read_video can raise RuntimeError or other exceptions if the underlying video reader backend (like FFmpeg or PyAV) fails or is missing, which would prevent falling back to the PyAV decoder.

Suggested change
except (ImportError, AttributeError):
    pass
except Exception:
    pass
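The fallback idea in this comment can be shown in isolation. The sketch below is a generic "try each decoder in order" helper with made-up names (`first_successful`, `flaky_primary`, `simple_fallback`); it does not call torchvision or PyAV, and the harness's actual decode function may be structured differently.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_successful(path: str, decoders: Sequence[Callable[[str], T]]) -> T:
    """Return the result of the first decoder that does not raise."""
    errors: List[str] = []
    for decode in decoders:
        try:
            return decode(path)
        except Exception as exc:  # broad on purpose: backends may raise RuntimeError
            name = getattr(decode, "__name__", repr(decode))
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"all decoders failed for {path}: " + "; ".join(errors))


def flaky_primary(path: str) -> List[int]:
    # Simulates a backend failure that is neither ImportError nor AttributeError.
    raise RuntimeError("backend unavailable")


def simple_fallback(path: str) -> List[int]:
    return [0, 1, 2]  # pretend these are decoded frames


print(first_successful("clip.mp4", [flaky_primary, simple_fallback]))
```

Collecting the error messages instead of discarding them keeps the final failure debuggable, which a bare `pass` does not.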

Mister-Raggs added a commit to Mister-Raggs/FastVideo that referenced this pull request Jun 3, 2026
- denoising: raise a clear ValueError when use_cachedit is combined with
  enable_torch_compile (cache-dit's data-dependent block-skipping can't be
  traced by torch.compile; eager-only for now). Wrap the optional cache-dit
  import in try/except with an actionable install hint.
- harness inner: broaden the torchvision decode fallback to any Exception
  (FFmpeg/PyAV backends can raise RuntimeError), guard against zero decoded
  frames before SSIM, and replace an index-match assert with a raised
  ValueError (asserts are stripped under python -O).
- harness driver: install cache-dit via the declared ".[test,cache]" extra
  instead of a separate step; clean up temp config/results/ssim files.

Kept `hf auth login` (gemini suggested huggingface-cli): `hf` is the current
huggingface_hub entrypoint and the bootstrap is validated working on the
fastvideo-dev image.
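Two of the harness fixes listed in this commit (a raised ValueError instead of an assert, and a zero-frame guard before SSIM) are easy to picture in a few lines. This is a sketch with hypothetical helper names, not the harness code; the `"i"` key mirrors the reviewed snippet, and the mean/worst summary is an assumption about how per-frame values are aggregated.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def paired_by_index(baseline: Sequence[Dict],
                    patched: Sequence[Dict]) -> Iterator[Tuple[Dict, Dict]]:
    """Yield (baseline, patched) pairs, raising if prompt indices disagree."""
    if len(baseline) != len(patched):
        raise ValueError(f"result count mismatch: {len(baseline)} vs {len(patched)}")
    for b, p in zip(baseline, patched):
        if b["i"] != p["i"]:
            # A raised error still fires under `python -O`, unlike an assert.
            raise ValueError(f"prompt index mismatch: {b['i']} vs {p['i']}")
        yield b, p


def summarize_ssim(per_frame: List[float]) -> Tuple[float, float]:
    """Return (mean, worst) SSIM, refusing to summarize an empty list."""
    if not per_frame:
        raise ValueError("no decoded frames to compare")
    return sum(per_frame) / len(per_frame), min(per_frame)


pairs = list(paired_by_index([{"i": 0}, {"i": 1}], [{"i": 0}, {"i": 1}]))
print(len(pairs), summarize_ssim([0.96, 0.94, 0.95]))
```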
@mergify mergify Bot added the scope: docs Documentation label Jun 3, 2026
@Mister-Raggs

The cachedit schema-parity failure is fixed in the latest push. The other CI failure (test_ltx2_presets_registered) is pre-existing on main - ltx2_3_base is registered and is the LTX2 default, but the test still expects the old 3-preset set; this PR touches no LTX2/preset code. Filed separately as #1427.

Re: the hf auth login suggestion - keeping it. hf is the current huggingface_hub CLI entrypoint (huggingface-cli is the legacy alias), and this bootstrap line is validated working on the fastvideo-dev image in our A/B runs.

@SolitaryThinker

/merge

@github-actions github-actions Bot added the ready PR is ready to merge label Jun 8, 2026
Adds opt-in, default-off step caching for Wan via the cache-dit library
(https://github.com/vipshop/cache-dit). When enabled, DBCache skips DiT
blocks on denoising steps whose features barely change and reuses a cached
residual; an optional TaylorSeer calibrator extrapolates that residual
(Taylor expansion) instead of holding it constant, for higher fidelity at
the same skip rate. Lossy (SSIM<1.0), so default OFF and gated behind
--use-cachedit.

cache-dit is wired in via a transformer-only BlockAdapter (no diffusers
pipeline needed): DenoisingStage._enable_or_refresh_cachedit() enables the
cache once per transformer and refreshes the cache context per generation
(num_inference_steps + refresh_context) so state never leaks across prompts.

- fastvideo_args: use_cachedit + cachedit_{fn,bn}_compute_blocks /
  residual_threshold / max_warmup_steps + cachedit_taylorseer[_order],
  with matching CLI flags.
- denoising: enable/refresh hook + a guard that raises if caching is
  combined with DiT offloading (caching skips blocks; the offload hook
  assumes every block runs each step).
- pyproject: cache-dit as an optional [cache] extra (lazy-imported, not a
  runtime dependency).

Measured on Wan2.1-T2V-1.3B, L40S, 720x1280/77f/30steps, 5 prompts (eager):
F8B0 + TaylorSeer, threshold 0.08 -> -18% wall at SSIM mean 0.957 /
worst 0.941, visually clean (incl. high-detail scenes). Eager only; compile
path untested.

Single-container baseline-vs-cachedit A/B on Wan: per-prompt wall + pairwise
SSIM, --taylorseer toggle, --threshold/--fn/--bn/--warmup sweep knobs.
Disables DiT offload on both passes (caching skips blocks) and installs
cache-dit in the image. This is the harness that produced the PR numbers.

test_fastvideo_args_fields_are_classified requires every FastVideoArgs
field to appear in docs/design/inference_schema_parity_inventory.yaml. Add
the 7 new use_cachedit/cachedit_* fields under internal_only (opt-in runtime
caching config, not part of the canonical typed inference API).
Mister-Raggs added a commit to Mister-Raggs/FastVideo that referenced this pull request Jun 9, 2026
Generalizes the Wan-only cache-dit wiring (hao-ai-lab#1426) into a small per-model
spec registry (`_CACHEDIT_MODEL_SPECS` + `_resolve_cachedit_spec`, MRO-
matched so subclasses inherit) and adds HunyuanVideo.

Each spec declares the transformer block-list attribute(s) and the cache-dit
ForwardPattern each follows. Wan stays a single-stream Pattern_2 entry
(`blocks`). HunyuanVideo is dual-stream MMDiT: `double_blocks` (Pattern_0,
returns img+txt) + `single_blocks` (Pattern_3, concatenated single tensor) —
FastVideo's block signatures differ from diffusers', so patterns are passed
explicitly with check_forward_pattern=False.

The CFG mode is no longer hard-coded: `enable_separate_cfg` /
`BlockAdapter.has_separate_cfg` are derived per generation from
`batch.do_classifier_free_guidance`, so one spec covers a model's classic-CFG
and distilled/embedded-guidance configs. For the validated Wan-with-CFG path
this is identical to hao-ai-lab#1426 (do_cfg=True).

Behavior unchanged when use_cachedit is off. Validation (lossy SSIM A/B on
HunyuanVideo) tracked separately.
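An MRO-matched registry of the kind this commit describes might look roughly like the following. Everything here is a toy: the class names, the `ToySpec` fields, and keying the table by class name are assumptions, and the pattern names are plain strings rather than cache-dit's ForwardPattern values.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToySpec:
    """Which block lists to cache and which forward pattern each follows."""
    block_attrs: Tuple[str, ...]
    patterns: Tuple[str, ...]


# Hypothetical registry keyed by transformer class name.
TOY_SPECS: Dict[str, ToySpec] = {
    "ToyWanTransformer": ToySpec(("blocks",), ("Pattern_2",)),
    "ToyHunyuanTransformer": ToySpec(("double_blocks", "single_blocks"),
                                     ("Pattern_0", "Pattern_3")),
}


def resolve_spec(model: object) -> ToySpec:
    """Walk the model's MRO so subclasses inherit their parent's spec."""
    for cls in type(model).__mro__:
        spec = TOY_SPECS.get(cls.__name__)
        if spec is not None:
            return spec
    raise ValueError(f"no cache spec registered for {type(model).__name__}")


class ToyWanTransformer:
    pass


class ToyCustomWan(ToyWanTransformer):  # not registered itself
    pass


print(resolve_spec(ToyCustomWan()).block_attrs)
```

Because the lookup walks the MRO from most to least specific, a subclass can also register its own entry to override the inherited one.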