ChituDiffusion: High-Performance Diffusion Inference — Distributed Parallelism · Cache Acceleration · Reproducible Benchmarks
Models diffusion inference as a unified stage pipeline. A built-in scheduler orchestrates multi-model, multi-strategy execution, so there is no manual glue code to maintain.

Native CFG Parallel and Hybrid Context Parallel (Graph-Ring / Async-Ulysses / AGKV). Parallel topologies can be composed freely and scale near-linearly across dozens of GPUs.

A single API covers both parallel execution and multi-granularity caching (step, layer, and token level). Strategies are hot-swappable and cover SOTA methods from top AI conferences.

A Feature-Cache-native evaluation suite compares speed, quality, and memory side by side. Visual snapshots and contact sheets make the results easy to read at a glance.

Full tables, commands, and figures live in ChituBench/result.md.
| Workload | 🏆 Best Headline Result | Details |
|---|---|---|
| Flux1-dev Attention | SageAttention reaches 1.160x (quality-preserving) | → |
| Flux1-dev FlexCache | MeanCache reaches 4.989x; Cubic & TaylorSeer cover mid/high-speed frontier | → |
| Flux1-dev Sequence Parallel | 8-GPU Ulysses reaches 4.843x (vs 1 GPU) | → |
| Flux2-klein Attention | SageAttention reaches 1.163x | → |
| Wan2.1-T2V-1.3B Attention | Sparge reaches 2.228x; Torch SDPA keeps best quality | → |
| Wan2.1-T2V-1.3B Parallel | 16-GPU reaches 12.813x (vs 1 GPU) 🔥 | → |
| Wan2.1-T2V-1.3B FlexCache | MeanCache30 reaches 1.658x (PSNR 35.60); Cubic 1.568-2.203x | → |
| Qwen-Image Parallel | 8-GPU CFG + image CP4 reaches 5.404x | → |
| Qwen-Image FlexCache | MeanCache spans 3.616x / 5.331x / 9.092x speed-quality points | → |
| Z-Image FlexCache | Runtime path, single-GPU FlashAttention, MeanCache, FreeCache replay, and TracePlanner probes are integrated | → |
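
One way to read the parallel rows above is to divide each reported speedup by its GPU count, which gives the fraction of ideal linear scaling that was achieved. The sketch below does only that arithmetic on three figures copied from the table; the dictionary labels and the `parallel_efficiency` helper are illustrative names, not part of ChituDiffusion.

```python
# Reported speedups over a single GPU, copied from the results table.
# Each entry maps a label to (number of GPUs, reported speedup).
reported = {
    "Flux1-dev, 8-GPU Ulysses": (8, 4.843),
    "Wan2.1-T2V-1.3B, 16-GPU": (16, 12.813),
    "Qwen-Image, 8-GPU CFG + image CP4": (8, 5.404),
}


def parallel_efficiency(num_gpus: int, speedup: float) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / num_gpus


efficiencies = {
    name: parallel_efficiency(num_gpus, speedup)
    for name, (num_gpus, speedup) in reported.items()
}

for name, efficiency in efficiencies.items():
    print(f"{name}: {efficiency:.1%} of linear scaling")
```

On these numbers the 16-GPU Wan2.1-T2V-1.3B run retains roughly 80% of ideal linear scaling, and the 8-GPU Flux1-dev Ulysses run roughly 61%.
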
| Area | What's Included |
|---|---|
| Runtime | chitu run, config loading, distributed launch, task execution, output packaging |
| Parallelism | CFG parallelism, context parallelism, Ring, Ulysses, mixed CP/CFG layouts |
| Attention | Torch SDPA, FlashAttention, SageAttention, SpargeAttention, FlashInfer |
| FlexCache | TeaCache, PAB, BlockDance, Cubic, MeanCache, TaylorSeer, DiTango |
| DiTango | Planner/runtime experiments for cache-aware distributed attention |
| Evaluation | PSNR, SSIM, LPIPS, FID, FVD, HPSv3 utilities |
| Observability | Timing JSON, memory JSON, run logs, task metadata, debug visualizations |
Legend: ✅ supported, ❌ unsupported or not applicable, 👷 planned or still being validated.
| Model | Type | Runtime | Sage/Sparge | CFG Parallel | Hybrid CP | VAE Parallel | FlexCache | ChituBench |
|---|---|---|---|---|---|---|---|---|
| Flux1-dev | T2I | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| FLUX.2-klein-4B | T2I | ✅ | ✅ | ❌ | ✅ | ✅ | 👷 | ✅ |
| Qwen-Image | T2I | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Z-Image | T2I | ✅ | 👷 | ✅ | ❌ | ✅ | ✅ | ✅ |
| Wan2.1-T2V-1.3B | T2V | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| Wan2.1-T2V-14B | T2V | ✅ | ✅ | ✅ | ✅ | ❌ | 👷 | 👷 |
| Wan2.2-T2V-A14B | T2V | ✅ | ✅ | ✅ | ✅ | ❌ | 👷 | 👷 |

Model availability depends on local checkpoint paths and the corresponding config under chitu_diffusion/core/config/models/.

    git clone <repo-url>
    cd ChituDiffusion
    git submodule update --init --recursive
    uv sync
    source .venv/bin/activate
    chitu --help

Alternatively, use uv run chitu ... to skip activating the environment.

Edit system_config.yaml to point to your local checkpoint:

    model:
      name: Wan2.1-T2V-1.3B
      ckpt_dir: /path/to/Wan2.1-T2V-1.3B
    launch:
      num_nodes: 1
      gpus_per_node: 8
    parallel:
      cfp: 2
      up: 8
    infer:
      attn_type: torch_sdpa

Then launch the run, optionally passing launch and parallel settings as flags:

    chitu run system_config.yaml
    chitu run system_config.yaml --gpus-per-node 8 --cfp 2

Install only the acceleration or evaluation stack you need:

    uv sync --extra sage        # SageAttention
    uv sync --extra sparge      # SpargeAttention
    uv sync --extra flash       # FlashAttention
    uv sync --extra flashinfer  # FlashInfer
    uv sync --extra eval        # Metrics (PSNR, LPIPS, HPSv3...)

⚠️ Build CUDA extension extras on a GPU compute node whose CUDA toolkit matches the selected PyTorch build.
Manual environments are also supported:

    pip install -r requirements.txt
    pip install -e .

FlexCache is request-driven: pass strategy parameters only when a task should use an acceleration strategy, and omit them to take the default full-compute path.

| Strategy | Principle | Paper |
|---|---|---|
| MeanCache | Step-level noise-prediction cache + JVP velocity update | ICLR 2026 |
| Cubic (Jano) | Region-aware / token-selective forward | CVPR Findings 2026 |
| TeaCache | Residual reuse driven by timestep-embedding change | CVPR 2025 |
| TaylorSeer | Module-output cache + Taylor expansion forecast | ICCV 2025 |
| BlockDance | Layerwise block reuse + active denoising window | CVPR 2025 |
| PAB | Attention-output broadcast reuse | ICLR 2025 |
Full documentation: chitu_diffusion/flexcache/README.md
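
To make the "cache, then forecast" idea behind forecast-style strategies more concrete, here is a toy sketch. It is not the TaylorSeer algorithm and does not use the FlexCache API: it keeps the last three fully computed outputs of a stand-in function and extrapolates the next one with a second-order backward-difference (Newton) formula, which plays the same role as a truncated Taylor forecast. `ForecastCache` and `expensive_module` are hypothetical names introduced for this example only.

```python
from collections import deque


class ForecastCache:
    """Toy cache: keeps the last three computed outputs and extrapolates later ones."""

    def __init__(self) -> None:
        self.history = deque(maxlen=3)  # most recent fully computed outputs

    def update(self, value: float) -> None:
        """Record the output of a fully computed step."""
        self.history.append(value)

    def forecast(self, steps_ahead: int) -> float:
        """Extrapolate `steps_ahead` steps past the last computed output.

        Uses Newton's backward-difference formula up to second order and
        assumes the computed steps are evenly spaced.
        """
        if len(self.history) < 3:
            raise ValueError("need three computed steps before forecasting")
        f0, f1, f2 = self.history
        d1 = f2 - f1                # first backward difference
        d2 = (f2 - f1) - (f1 - f0)  # second backward difference
        k = steps_ahead
        return f2 + k * d1 + k * (k + 1) / 2 * d2


def expensive_module(t: int) -> float:
    """Stand-in for a costly model block (hypothetical, quadratic in the step index)."""
    return 0.5 * t * t - 3.0 * t + 1.0


cache = ForecastCache()
for t in range(3):  # steps 0, 1, and 2 are computed in full
    cache.update(expensive_module(t))

predicted = cache.forecast(1)  # step 3 is forecast instead of computed
actual = expensive_module(3)
print(predicted, actual)
```

Because the stand-in function is quadratic, a second-order extrapolation reproduces it exactly. Real module outputs are tensors and are not polynomial in the step index, so a practical strategy has to decide how often to refresh the cache with a full computation.
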
Enable evaluation from system_config.yaml:

    eval:
      eval_type: [psnr, lpips]
      reference_path: /path/to/reference/videos

Install metric dependencies:

    uv sync --extra eval

For reproducible public results, prefer the ChituBench scripts and protocol in ChituBench/README.md.
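
For readers unfamiliar with PSNR, the sketch below shows the standard definition, 10 · log10(MAX² / MSE), in plain Python. It is only an illustration of the metric: it is not the evaluation code shipped in chitu_diffusion/evaluation/, and the `psnr` function and sample pixel values are made up for the example.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    candidate: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0, 64, 128, 255]
candidate = [0, 64, 128, 245]  # one pixel differs by 10
print(f"{psnr(reference, candidate):.2f} dB")
```

Higher values mean the accelerated output is closer to the reference, which is why the benchmark rows report PSNR next to the speedup.
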
Each run writes a structured output directory:

    outputs/<tag>-<YYYYMMDD_HHMMSS>-<taskid>/
      request_params.json       # Request parameters
      system_params.json        # System parameters
      run_config.yaml           # Runtime configuration
      results/<task_id>/
        *.mp4 / *.png           # Generated media
        *.json                  # Sidecar metadata
      metrics/
        timing/summary.json     # Timing summary
        memory/rank<N>.json     # Memory stats
        quality/summary.json    # Quality evaluation
      logs/
        command.log             # Full launch output
        run.log                 # Run logs

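
Since every run writes its metrics as JSON, results from several runs can be collected with a few lines of Python. The sketch below is a hedged example: it assumes only that each run directory contains a timing/summary.json somewhere beneath it, it makes no assumption about the fields inside that file, and `load_timing_summaries` is a hypothetical helper, not a ChituDiffusion API.

```python
import json
from pathlib import Path
from typing import Any


def load_timing_summaries(outputs_root: Path) -> dict[str, Any]:
    """Map each run directory name to its parsed timing/summary.json."""
    summaries: dict[str, Any] = {}
    if not outputs_root.is_dir():
        return summaries
    for summary_path in sorted(outputs_root.rglob("summary.json")):
        if summary_path.parent.name != "timing":
            continue  # skip quality/summary.json and any other summaries
        # The run directory is the first path component below outputs_root.
        # If a run holds several timing summaries, the last one found wins.
        run_name = summary_path.relative_to(outputs_root).parts[0]
        with summary_path.open(encoding="utf-8") as handle:
            summaries[run_name] = json.load(handle)
    return summaries


# Print whatever each run recorded; an absent outputs/ directory prints nothing.
for run_name, summary in load_timing_summaries(Path("outputs")).items():
    print(run_name, summary)
```

Searching recursively keeps the helper independent of how deep the metrics directory sits inside a run.
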

    chitu_diffusion/core/            Configuration, schemas, distributed utilities, registry
    chitu_diffusion/runtime/         Backend, generator, scheduler, task, runtime API
    chitu_diffusion/modules/         Model-specific and reusable diffusion modules
    chitu_diffusion/flexcache/       FlexCache strategies and shared cache utilities
    chitu_diffusion/ditango/         DiTango planner, runtime attention, visualization
    chitu_diffusion/evaluation/      Evaluation manager, strategies, metric helpers
    chitu_diffusion/observability/   Timing and magnitude logging helpers
    ChituBench/                      Reproducible benchmark workspace and result figures
    service_framework/               Long-lived web service
    script/                          Launch helpers for local and Slurm execution
    test/                            Generation and acceleration test entry points
    system_config.yaml               Default runtime configuration


    python - <<'PY'
    import chitu_diffusion.core
    from chitu_diffusion.runtime.task import DiffusionUserParams
    from chitu_diffusion.observability import Timer
    print("imports ok")
    PY

    pytest test

Some tests require CUDA, local checkpoints, and distributed launch settings.
Repository-specific Codex skills live under .codex/skills/. They capture ChituDiffusion conventions for model adaptation, FlexCache benchmarking, result visualization, cleanup, and commit slicing.
Install them into the local Codex skill directory:

    ./.venv/bin/python script/install_codex_skills.py --force

By default this creates symlinks in ${CODEX_HOME:-~/.codex}/skills, so updates pulled from the repository are visible to Codex without reinstalling. Use --copy if symlinks are not desired.
ChituDiffusion has multiple academic publications:
| 🎉 Paper | Venue | Description |
|---|---|---|
| DiTango | HPDC 2026 | Cache-accelerated parallelism under communication constraints |
| Jano | CVPR Findings 2026 | Precursor to FlexCache-Cubic |
| Difflow | PPoPP 2026 | Stage-level scheduling origin for ChituDiffusion |
This project is licensed under the Apache License 2.0. See LICENSE for details.


