[docs] Add NVIDIA DGX Spark (GB10) installation guide by SolitaryThinker · Pull Request #1447 · hao-ai-lab/FastVideo

SolitaryThinker · 2026-06-10T07:50:03Z

Summary

Adds a dedicated installation guide for the NVIDIA DGX Spark (GB10) and wires
it into the docs and README.

The Spark is ARM64 (aarch64) with the CUDA 13 toolkit, a combination none of
the existing guides cover. The quick-start path (uv pip install fastvideo /
uv pip install -e .) fails on it for two compounding reasons:

1. fastvideo-kernel has no prebuilt aarch64 wheel (PyPI ships x86_64
   only), so it must be compiled from source on the box.
2. The system Python lacks the dev headers that the kernel's CMake build
   needs (find_package(Python ... Development.Module) → Python.h), so the
   build aborts before it starts.

This guide gets a Spark from a fresh clone to a verified, GPU-accelerated install
without sudo.
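
The second failure is easy to check for before starting a build. The sketch below is not part of the guide; it only assumes that CMake's Python development lookup needs a `Python.h` under the interpreter's include directory, and the function names are hypothetical.

```python
import platform
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Return where this interpreter expects Python.h to live."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def preflight() -> dict:
    """Collect the two facts that decide whether a source build can start."""
    header = python_header_path()
    return {
        "machine": platform.machine(),        # "aarch64" on the Spark
        "python_h": str(header),
        "has_dev_headers": header.is_file(),  # False: the build cannot find Python.h
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

Run with the system interpreter and then with the uv-managed one: per the description above, `has_dev_headers` should differ between the two on a stock Spark.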

What the guide does (and the non-obvious decisions)

- Managed-Python venv (`uv venv --python-preference only-managed`): uv's
  standalone CPython bundles the dev headers, so the kernel compiles without
  installing python3.12-dev system-wide.
- cu130 torch, not the pinned cu128: on aarch64, PyPI torch==2.11.0
  resolves to 2.11.0+cu130 (CUDA 13), which matches the system nvcc 13.
  Pairing them gives a clean toolchain for compiling the kernel; the repo's
  pyproject.toml cu128 pin would force a CUDA-12.8-vs-13.0 mismatch.
- Compile fastvideo-kernel for sm_121 via `--no-build-isolation` (so it
  links against the real cu130 torch and detects the GB10's arch) with
  ThunderKittens disabled (Hopper-only; Triton fallbacks at runtime).
- `uv pip install -e . --no-sources` for the rest, so the cu128 source pin
  doesn't reinstall torch and break the freshly-built kernel's ABI.

The page also includes a verify section (versions, fastvideo --help, and a
compiled-kernel execution check) and a troubleshooting table mapping each failure
mode to its fix.

Changes

- docs/getting_started/installation/spark.md (new): the guide above,
  following the structure/voice of the sibling gpu.md / mps.md pages and
  linked from the installation index the same way (not added to the mkdocs
  nav, matching the existing convention).
- docs/getting_started/installation.md: adds the Spark page to the
  "supported hardware platforms" list.
- README.md: a one-line DGX Spark callout under Getting Started, plus a new
  "Install with an AI coding agent" section: a paste-able prompt that has a
  coding agent detect the platform (uname -m / nvidia-smi / nvcc) and follow
  the matching per-platform guide.
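
The routing that the agent prompt asks for can be sketched as a small decision function. This is an illustration only: the README prompt itself is not reproduced here, `pick_guide` and `detect` are hypothetical names, and treating "Linux + aarch64 + NVIDIA tooling present" as "use the Spark guide" is an assumption that would misroute other ARM64 NVIDIA systems.

```python
import platform
import shutil


def pick_guide(system: str, machine: str, has_nvidia_gpu: bool) -> str:
    """Map platform facts to a guide under docs/getting_started/installation/."""
    if system == "Darwin" and machine == "arm64":
        return "mps.md"    # assumption: Apple Silicon uses the MPS guide
    if system == "Linux" and machine == "aarch64" and has_nvidia_gpu:
        return "spark.md"  # assumption: ARM64 + NVIDIA means a Spark-like box
    return "gpu.md"        # default CUDA guide


def detect() -> str:
    """Gather the facts locally, the way `uname -m` / `nvidia-smi` would."""
    # shutil.which only checks that nvidia-smi is on PATH, a cheap stand-in
    # for actually querying the driver.
    has_gpu = shutil.which("nvidia-smi") is not None
    return pick_guide(platform.system(), platform.machine(), has_gpu)
```

Keeping the decision in a pure function separate from the probing makes the routing easy to check on any machine.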

Testing

Verified end-to-end on an actual DGX Spark (GB10, aarch64, CUDA 13.0, driver
580): clean install from a fresh venv, import fastvideo + fastvideo --help
work, torch sees the GB10, and the compiled int8 GEMM kernel runs correctly on
sm_121 (~1% error vs an fp32 reference, as expected for int8). The repo's
codespell config (the lint hook that applies to markdown) passes on all three
changed files.
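
For readers wondering why an error of around 1% is unsurprising for int8, the following pure-Python sketch quantizes two random matrices symmetrically to int8, multiplies them with integer accumulation, and compares against a floating-point reference. It is not the check that was run on the GB10: the matrix sizes, the per-tensor scaling scheme, and the Frobenius-norm error metric are all assumptions, and the reference here uses Python's double-precision floats rather than fp32.

```python
import random


def quantize_int8(matrix):
    """Symmetric per-tensor quantization: returns (int values, scale)."""
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in matrix]
    return q, scale


def matmul(a, b):
    """Naive (M x K) @ (K x N) product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relative_error(m=16, k=64, n=16, seed=0):
    """Relative Frobenius error of an int8 product vs a float reference."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
    b = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
    reference = matmul(a, b)  # floating-point reference
    qa, scale_a = quantize_int8(a)
    qb, scale_b = quantize_int8(b)
    # Accumulate in integers, then rescale once, as an int8 GEMM would.
    approx = [[v * scale_a * scale_b for v in row] for row in matmul(qa, qb)]
    num = sum((x - y) ** 2 for r, s in zip(reference, approx) for x, y in zip(r, s))
    den = sum(x ** 2 for r in reference for x in r)
    return (num / den) ** 0.5


if __name__ == "__main__":
    print(f"relative error: {relative_error():.4f}")
```

The exact figure depends on the input distribution and the quantization scheme, so it should not be read as a reproduction of the number reported above.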

Notes for reviewers

- Docs-only change; no code paths touched.
- Version pins in the guide (torch==2.11.0+cu130, fastvideo-kernel==0.2.6,
  flash-attn==2.8.1) reflect what was verified in 2026-05; the
  compute-capability values (121 / 12.1) are called out as substitutable for
  other Blackwell steppings, with a one-liner to detect them.
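
The guide's own detection one-liner is not quoted in this thread. As one possible approach, the sketch below queries `nvidia-smi` for the compute capability and derives both spellings used in the build (`12.1` for `TORCH_CUDA_ARCH_LIST`, `121` for `CMAKE_CUDA_ARCHITECTURES`). The function names are hypothetical, and `--query-gpu=compute_cap` requires a reasonably recent driver.

```python
import subprocess


def arch_values(compute_cap: str) -> tuple[str, str]:
    """Turn "12.1" into ("12.1", "121"): the dotted form and the CMake form."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}.{int(minor)}", f"{int(major)}{int(minor)}"


def query_compute_cap() -> str:
    """Ask nvidia-smi for the first GPU's compute capability, e.g. "12.1"."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()


if __name__ == "__main__":
    dotted, compact = arch_values(query_compute_cap())
    print(f"TORCH_CUDA_ARCH_LIST={dotted}")
    print(f"CMAKE_CUDA_ARCHITECTURES={compact}")
```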

🤖 Generated with Claude Code

mergify · 2026-06-10T07:50:45Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

- check-success=fastcheck-passed
- check-success=full-suite-passed

This rule is failing.

- check-success=fastcheck-passed
- check-success=full-suite-passed
- #approved-reviews-by>=1
- check-success~=pre-commit
- title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

gemini-code-assist

Code Review

This pull request adds installation documentation and guidance for running FastVideo on the NVIDIA DGX Spark (GB10 / ARM64 + CUDA 13) platform, including updates to the README and a new dedicated installation guide. The review feedback suggests limiting parallel compilation jobs by setting CMAKE_BUILD_PARALLEL_LEVEL and MAX_JOBS when building fastvideo-kernel and flash-attn from source to prevent potential out-of-memory (OOM) issues or CPU thrashing on high-core-count Grace CPUs.

gemini-code-assist · 2026-06-10T07:51:42Z

+export CUDA_HOME=/usr/local/cuda
+export CUDACXX=/usr/local/cuda/bin/nvcc
+export TORCH_CUDA_ARCH_LIST=12.1
+export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=121 -DFASTVIDEO_KERNEL_BUILD_TK=OFF -DGPU_BACKEND=CUDA"
+uv pip install ./fastvideo-kernel -v --no-build-isolation


On high-core-count systems like the NVIDIA Grace CPU (which has 72 or 144 cores), compiling heavy CUDA/C++ extensions can spawn too many parallel compiler processes. This can easily lead to system memory exhaustion (OOM) or extreme CPU thrashing. It is highly recommended to limit the number of parallel build jobs by setting the CMAKE_BUILD_PARALLEL_LEVEL environment variable.

Suggested change

 export CUDA_HOME=/usr/local/cuda
 export CUDACXX=/usr/local/cuda/bin/nvcc
 export TORCH_CUDA_ARCH_LIST=12.1
+export CMAKE_BUILD_PARALLEL_LEVEL=8
 export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=121 -DFASTVIDEO_KERNEL_BUILD_TK=OFF -DGPU_BACKEND=CUDA"
 uv pip install ./fastvideo-kernel -v --no-build-isolation

gemini-code-assist · 2026-06-10T07:51:42Z

+it. If you want it (uses the same `CUDA_HOME` / `CUDACXX` env as the kernel step):
+
+```bash
+uv pip install flash-attn==2.8.1 --no-cache-dir --no-build-isolation -v


Similarly to the kernel compilation, compiling flash-attn from source on high-core-count Grace CPUs can trigger excessive parallel compilation jobs, leading to OOM or hangs. Limiting the build concurrency using the MAX_JOBS environment variable is highly recommended.

Suggested change

-uv pip install flash-attn==2.8.1 --no-cache-dir --no-build-isolation -v
+MAX_JOBS=4 uv pip install flash-attn==2.8.1 --no-cache-dir --no-build-isolation -v
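
The two suggestions hard-code 8 and 4 jobs. A minimal sketch of deriving the cap from the machine instead is shown below; the 4 GiB-per-compile-job budget is an assumed rule of thumb, not a measured figure, and both helper names are hypothetical.

```python
import os


def pick_jobs(cpu_count: int, mem_gib: float, gib_per_job: float = 4.0) -> int:
    """Cap parallel compile jobs by both core count and available memory."""
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


def capped_build_env(jobs: int) -> dict:
    """Copy the current environment and set both job-limit variables."""
    env = dict(os.environ)
    env["CMAKE_BUILD_PARALLEL_LEVEL"] = str(jobs)  # honored by CMake-driven builds
    env["MAX_JOBS"] = str(jobs)                    # honored by flash-attn's build
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching either `uv pip install` command from a script.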

Satyam-53

LGTM

Mister-Raggs · 2026-06-16T22:31:39Z

Small addition for the install steps: --prerelease=allow (needed for the cu130 torch on aarch64) also pulls transformers 5.x over the repo's transformers==4.57.3 pin. Wan models tolerate it, but Qwen2.5-VL-based models (e.g. Cosmos-Predict2.5) then fail at load with AttributeError: 'Qwen2_5_VLConfig' object has no attribute 'pad_token_id' (5.x nests it under text_config). Pinning it back after install fixes it:

uv pip install "transformers==4.57.3"

Verified on a GB10. Happy to push the doc line if helpful.
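
To illustrate the layout difference described above, here is a minimal sketch of a lookup that tolerates both shapes. It uses `SimpleNamespace` stand-ins rather than real transformers config classes, the token id 7 is arbitrary, and `get_pad_token_id` is a hypothetical helper, not something FastVideo or transformers provides. Pinning transformers remains the fix proposed in this comment.

```python
from types import SimpleNamespace


def get_pad_token_id(config, default=None):
    """Look for pad_token_id at the top level, then under text_config."""
    if hasattr(config, "pad_token_id"):
        return config.pad_token_id
    text_config = getattr(config, "text_config", None)
    return getattr(text_config, "pad_token_id", default)


# Stand-ins for the two layouts; real configs come from transformers.
flat = SimpleNamespace(pad_token_id=7)                                 # 4.x-style
nested = SimpleNamespace(text_config=SimpleNamespace(pad_token_id=7))  # 5.x-style

if __name__ == "__main__":
    print(get_pad_token_id(flat), get_pad_token_id(nested))
```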

The standard CUDA install path does not work on the DGX Spark (GB10, aarch64, CUDA 13): fastvideo-kernel has no prebuilt ARM wheel and the system Python lacks the headers needed to compile it from source.

- docs/getting_started/installation/spark.md: dedicated guide covering a managed-Python venv (for the build headers), cu130 torch to match the system nvcc, compiling fastvideo-kernel for sm_121, and `-e . --no-sources`.
- Link it from the installation index.
- README: a DGX Spark callout plus an "Install with an AI coding agent" section (a paste-able prompt that routes to the right per-platform guide).

build.sh probes the GPU (via torch) and injects the arch, but standards-based builds (pip / uv pip install, sdist) skip build.sh and let torch's cmake auto-detect, which on CUDA 13 yields a wrong arch (e.g. compute_20) and kernels that don't run on the device. Resolve TORCH_CUDA_ARCH_LIST in CMakeLists.txt before find_package(Torch) (torch forces CMAKE_CUDA_ARCHITECTURES=OFF and drives -gencode from TORCH_CUDA_ARCH_LIST): honor an existing value, else translate a pinned CMAKE_CUDA_ARCHITECTURES, else probe the visible GPU, else fail loudly. This also routes around the ThunderKittens AUTO over-trigger on Blackwell. Verified on GB10 (sm_121): auto-detect and TORCH_CUDA_ARCH_LIST=12.1 both yield compute_121,sm_121; -DCMAKE_CUDA_ARCHITECTURES=90a yields sm_90a; a GPU-less build with no hint now errors instead of mis-compiling.
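
The precedence described in this commit (existing `TORCH_CUDA_ARCH_LIST`, else a pinned `CMAKE_CUDA_ARCHITECTURES`, else a GPU probe, else an error) lives in CMakeLists.txt. The Python below is only a sketch of the same decision order for readers who do not read CMake; the function names and error text are hypothetical.

```python
import re


def cmake_arch_to_torch(arch: str) -> str:
    """Translate a CMake arch such as "121" or "90a" to "12.1" / "9.0a"."""
    match = re.fullmatch(r"(\d+)(\d)([a-z]?)", arch.strip())
    if match is None:
        raise ValueError(f"unrecognized CUDA architecture: {arch!r}")
    major, minor, suffix = match.groups()
    return f"{major}.{minor}{suffix}"


def resolve_arch_list(env_value=None, cmake_archs=None, probed=None) -> str:
    """Pick the arch list using the precedence from the commit message."""
    if env_value:                 # 1. honor an existing TORCH_CUDA_ARCH_LIST
        return env_value
    if cmake_archs:               # 2. translate a pinned CMAKE_CUDA_ARCHITECTURES
        return ";".join(cmake_arch_to_torch(a) for a in cmake_archs.split(";"))
    if probed is not None:        # 3. use the probed (major, minor) of the GPU
        major, minor = probed
        return f"{major}.{minor}"
    # 4. fail loudly instead of letting a wrong default be compiled
    raise RuntimeError("no CUDA arch given and no GPU visible; "
                       "set TORCH_CUDA_ARCH_LIST explicitly")
```

For example, `resolve_arch_list(cmake_archs="90a")` returns `"9.0a"`, which corresponds to the `-DCMAKE_CUDA_ARCHITECTURES=90a` case in the verification notes.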

No prebuilt aarch64 wheel for fastvideo-kernel exists on PyPI, so on ARM (e.g. the DGX Spark / GB10) point uv at the in-tree source via a platform_machine == 'aarch64' marker; x86_64 keeps the PyPI wheel. The source table is uv-only and not part of the published wheel metadata. Targets the CUDA 13 (cu130) torch index to match the Spark's toolkit.

With the kernel's CMake arch auto-detection and the aarch64 uv source in place, a fresh Spark install is now just: managed-python venv -> submodule init -> uv pip install -e . (verified cold: kernel rebuilds, targets sm_121, and the int8 GEMM runs on the GB10). Drop the obsolete manual arch exports (TORCH_CUDA_ARCH_LIST/CMAKE_ARGS), the separate kernel build step, and --no-sources from the happy path; keep a correct condensed manual fallback. Add a CI/GPU-less note (pass TORCH_CUDA_ARCH_LIST=12.1) and refresh the troubleshooting table. Update the README callout to match.

SolitaryThinker · 2026-06-19T03:51:23Z

/merge

Collapse the four near-identical docker/Dockerfile.python3.* files into a single parameterized, multi-arch-aware docker/Dockerfile and build the images as a CUDA matrix in CI.

- docker/Dockerfile: one file replaces py3.10/3.11/3.12/3.12-cuda12.9.1, selected via ARGs (PYTHON_VERSION, CUDA_VERSION, UBUNTU_VERSION, TORCH_CUDA_ARCH_LIST). BuildKit uv cache mounts (drop --no-cache-dir), TARGETARCH-gated flash-attn (x86 prebuilt wheel vs arm64 source build), and CUDA_HOME via the /usr/local/cuda symlink so any CUDA_VERSION works.
- Fix the flash-attn wheel to cu130torch2.12 (release v0.9.17) to match the pinned torch 2.12.1+cu130; the old cu128torch2.11 pin was stale.
- infra-build-image.yml: CUDA matrix (py3.12 x {12.8.0, 12.9.1, 13.0.0}) plus a dreamverse matrix (backend + UI x same CUDAs). latest -> py3.12 via a new mark_as_latest template input, replacing the brittle python==3.10 check.
- apps/dreamverse/docker/Dockerfile: parameterized base + CUDA_HOME symlink fix + uv cache mounts so it builds on all three CUDA versions.
- Add a root .dockerignore (excludes the ~6.5G .venv etc.; keeps .git, which the fastvideo-kernel submodule build needs).
- Repoint buildkite path-triggers and docs from Dockerfile.python3.12 to the unified docker/Dockerfile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Flip mark_as_latest in the CUDA matrix from the 12.8.0 cell to the 13.0.0 cell so the global `latest` tag tracks the torch-matched (torch 2.12.1+cu130) image instead of CUDA 12.8.0, and CI stops reverting a manual retag. Note the SSIM seeding skill: `latest` now differs from `py3.12-latest` (CUDA 12.8.0, what CI pins via IMAGE_VERSION), so the explicit pin is required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The link was correct relative to the source README's location, but breaks once docs/generate_examples.py flattens the file into docs/training/examples/ (where the docs link checker runs): `../../../../../docs/training/data_preprocess.md` overshoots the repo root from there. Point it at `../data_preprocess.md`, which resolves correctly in the generated docs tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…index

Drop the [tool.uv.sources] torch/torchvision/torchaudio pins and the pytorch-cpu / pytorch-cu130 [[tool.uv.index]] blocks from pyproject. Torch is now chosen at install time via uv's UV_TORCH_BACKEND / --torch-backend:

- Docker images set UV_TORCH_BACKEND=cu130 (GPU-less builders can't auto-detect a driver, which would fall back to CPU). CI installs run inside the image and inherit it, so editable reinstalls no longer override the baked torch.
- Install docs use UV_TORCH_BACKEND=auto (matches the user's GPU driver). torch 2.12.1 is published only on cu129/cu130, so auto needs a CUDA 12.9+/13 driver.
- Spark guide: dropped the now-outdated "why this is just uv pip install -e ." explanation (it described the removed cu130 pin) and wired UV_TORCH_BACKEND.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

torch 2.12 is not published for cu128, so a CUDA 12.8.0 image can't be torch-2.12-consistent. Remove the 12.8.0 cell from the CUDA and dreamverse matrices and from the Dockerfile default (CUDA_VERSION 12.8.0 -> 12.9.1). The CUDA matrix is now {12.9.1, 13.0.0}; 12.9.1 takes the bare py3.12 tag and `latest`, 13.0.0 keeps the -cuda13.0.0 suffix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the hard torch==2.12.1 pin with a >=2.11,<2.13 range and select the PyTorch build per CUDA base via UV_TORCH_BACKEND + a matching prebuilt flash-attn wheel. Each cell is now internally consistent (kernel nvcc, torch, and flash-attn all aligned) and runnable on its own driver tier:

- 12.6.3 -> cu126, torch 2.12.x, cu126torch2.12 FA (v0.9.17). Default `latest`; broadest, runs on a CUDA 12.6+ driver.
- 12.8.0 -> cu128, torch 2.11.0, cu128torch2.11 FA (v0.9.4). Backward-compat image (cu128 has no torch 2.12).
- 13.0.0 -> cu130, torch 2.12.x, cu130torch2.12 FA (v0.9.17). Newest CUDA.

Drops 12.9.1 (cu129 has no flash-attn wheel for torch >=2.10). Dreamverse matrix follows the 2.12 lane: {12.6.3, 13.0.0} with matching backends. Default Dockerfile CUDA_VERSION 12.9.1 -> 12.6.3; dreamverse UV_TORCH_BACKEND is now an ARG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

uv does not auto-detect the CUDA build; without UV_TORCH_BACKEND a plain `uv pip install` pulls PyPI's default torch instead of a driver-matched wheel (verified: unset -> torch==2.12.1, auto -> torch==2.12.1+cu130). Prefix the canonical install commands across the user-facing + contributor install docs so copy-paste gets the right PyTorch build, matching the Spark guide. Covers README, AGENTS.md, comfyui + dreamverse READMEs, the getting_started guides (installation/gpu/mps/quick_start), inference_quick_start, and the contributing setup docs. Secondary `[eval]`/extra installs (run after the base install, torch already present) and non-uv `pip install` lines are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make CUDA 13.0.0 (cu130, torch 2.12.1) the default `py3.12` / `latest` image:

- docker/Dockerfile + dreamverse Dockerfile defaults -> 13.0.0 / cu130 / cu130torch2.12
- infra-build-image.yml: 13.0.0 takes the bare `py3.12` tag + `latest`; 12.6.3 -> `py3.12-cuda12.6.3`, 12.8.0 -> `py3.12-cuda12.8.0`; mark_as_latest -> 13.0.0

Pin UV_TORCH_BACKEND in the Modal test env (pr_test / ssim_test / launch_l40s_job) keyed off IMAGE_VERSION (cuda12.6->cu126, cuda12.8->cu128, else cu130), so the in-container editable reinstall resolves torch's CUDA deps from the matching index (a true no-op against the baked torch) instead of drifting to PyPI. Not `auto` -- CI stays deterministic, independent of the test box's driver.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
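
The IMAGE_VERSION keying described in this commit amounts to a three-way mapping. A minimal sketch follows, assuming the tag contains a `cuda12.6` or `cuda12.8` substring as in the image names above; the actual Modal test code is not shown in this thread, so the function name and the substring matching are assumptions.

```python
def torch_backend_for(image_version: str) -> str:
    """Map an image tag to the UV_TORCH_BACKEND value it was built with."""
    if "cuda12.6" in image_version:
        return "cu126"
    if "cuda12.8" in image_version:
        return "cu128"
    return "cu130"  # the bare py3.12 / latest image is CUDA 13.0.0


if __name__ == "__main__":
    for tag in ("py3.12", "py3.12-cuda12.6.3", "py3.12-cuda12.8.0"):
        print(tag, "->", torch_backend_for(tag))
```

Falling through to `cu130` mirrors the "else cu130" wording, so an unrecognized tag gets the default image's backend rather than an error.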

flash_attn.cute (FA4) loads the cutlass CuTe DSL, whose API can be version-skewed against the installed flash-attn wheel -- on the cu130/torch-2.12 image this surfaces as "module 'cutlass.cute.core' has no attribute 'ThrMma'". That import raises AttributeError, but the FA4->FA3->FA2 fallback chains in flash_attn.py and flash_attn_no_pad.py only catch ImportError, so worker init crashed (seen on an L40S CI run) instead of degrading. Re-raise as ImportError at the single flash_attn.cute import site so every caller falls back to FA2/FA3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…talled)

Distinguish the two flash_attn.cute (FA4) failure modes at the import site:

- ImportError (not installed) -> re-raise quietly; callers fall back to FA3/FA2.
- any other import error (installed but broken, e.g. the nvidia-cutlass-dsl skew "module 'cutlass.cute.core' has no attribute 'ThrMma'") -> log a warning that points at the fix (pin a compatible nvidia-cutlass-dsl) before re-raising as ImportError so the FA3/FA2 fallback still engages.

So a fixable FA4 degradation is visible in the logs instead of silent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
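
The pattern in this commit can be sketched generically as follows. This is not FastVideo's code: `import_optional_backend` is a hypothetical helper, and the `importer` parameter exists only so the broken-install case can be simulated without a real flash-attn.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def import_optional_backend(name: str, importer=importlib.import_module):
    """Import an optional module, normalizing every failure to ImportError."""
    try:
        return importer(name)
    except ImportError:
        # Not installed: stay quiet and let the caller's fallback chain run.
        raise
    except Exception as exc:
        # Installed but broken (e.g. an AttributeError from a version skew):
        # make it visible, then convert so `except ImportError` still catches it.
        logger.warning(
            "%s is installed but failed to import (%s: %s); "
            "check for a dependency version mismatch. Falling back.",
            name, type(exc).__name__, exc,
        )
        raise ImportError(f"{name} is installed but unusable") from exc
```

Callers that already wrap the import in `try: ... except ImportError:` need no change, which is the point of converting at the single import site.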

…the stale fork The XOR-op fa4-compile fork is 1 commit (a redundant PT2-compile shim) ahead and 248 behind upstream, stuck on the cutlass-4.4 `cute.core.ThrMma` API that crashes on cutlass-dsl 4.5. Upstream cute (`flash-attn-4`) uses `cute.ThrMma`, pins `nvidia-cutlass-dsl>=4.5.2` (co-installable with flashinfer/quack), and its `_flash_attn_fwd`/`_bwd` signatures match FastVideo's wrappers. torch.compile is provided by FastVideo's own custom_op wrappers, so the fork's shim isn't needed. Repoint flash-attn-cute -> flash-attn-4 (Dao-AILab @ 940cd968, subdirectory flash_attn/cute) in the root + apps/dreamverse pyprojects. NVFP4 is a separate package (hao-ai-lab/flash-attention-fp4) and is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@FP4

…ention-fp4 The fp4 branch's cute uses cute.core.ThrMma, which crashes on cutlass-dsl 4.5 (now required by flashinfer/quack). Point the NVFP4 FA4 install at the fix/cutlass-dsl-4.5 branch (hao-ai-lab/flash-attention-fp4#2) and bump the shown nvidia-cutlass-dsl pin to >=4.5.2 so NVFP4 loads on the current stack. Revert to @FP4 once the PR merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The prebuilt flash-attn wheel ships an FA4 flash_attn.cute built against the cutlass-4.4 cute.core.ThrMma API, which crashes on the cutlass-dsl 4.5 that flashinfer/quack pull in -- so the default x86 image fell back to FA2. After the wheel install, overlay the cutlass-4.5-safe upstream flash-attn-4 cute (pinned via FA4_CUTE_REF, same commit as pyproject) plus its runtime stack (cutlass-dsl, quack-kernels, apache-tvm-ffi, torch-c-dlpack-ext) -- the same deps the [dreamverse] extra already installs -- so the image runs FA4. Only flash_attn/cute is replaced (the cute package never ships flash_attn/__init__.py, so FA2 / varlen / bert_padding stay from the wheel); a post-install import check fails the build rather than shipping an FA2-less image. x86 only -- arm64/GB10 (sm_121) keeps the FA2 fallback since the FA4 stack is unvalidated there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mergify Bot added type: docs Documentation only scope: docs Documentation labels Jun 10, 2026

gemini-code-assist Bot reviewed Jun 10, 2026

Satyam-53 reviewed Jun 10, 2026

Satyam-53 approved these changes Jun 10, 2026

SolitaryThinker force-pushed the docs/dgx-spark-install branch from b58d185 to 572bc64 Compare June 18, 2026 21:14

SolitaryThinker force-pushed the docs/dgx-spark-install branch from 572bc64 to 239c9ed Compare June 18, 2026 21:17

SolitaryThinker added 2 commits June 18, 2026 15:56

mergify Bot added the scope: kernel CUDA kernels, fastvideo-kernel label Jun 18, 2026

SolitaryThinker force-pushed the docs/dgx-spark-install branch from 4adb3b9 to cbcc074 Compare June 18, 2026 23:33

github-actions Bot added the ready PR is ready to merge label Jun 19, 2026

mergify Bot and others added 2 commits June 19, 2026 03:52

Merge branch 'main' into docs/dgx-spark-install

493652a

mergify Bot added the scope: infra CI, tests, Docker, build label Jun 19, 2026

mergify Bot and others added 6 commits June 19, 2026 08:05

Merge branch 'main' into docs/dgx-spark-install

34a1a54

trigger ci

d6fd0dd

Merge branch 'main' into docs/dgx-spark-install

c83a93b

trigger ci

5c81577

mergify Bot added the scope: training Training pipeline, methods, configs label Jun 21, 2026

SolitaryThinker and others added 3 commits June 20, 2026 18:13

SolitaryThinker and others added 3 commits June 20, 2026 21:39

mergify Bot added the scope: attention Attention backends (VSA, STA, Flash, etc.) label Jun 21, 2026

SolitaryThinker and others added 4 commits June 20, 2026 22:56

SolitaryThinker wants to merge 22 commits into main from docs/dgx-spark-install

3 participants
