[docs] Add NVIDIA DGX Spark (GB10) installation guide #1447
SolitaryThinker wants to merge 22 commits
Code Review
This pull request adds installation documentation and guidance for running FastVideo on the NVIDIA DGX Spark (GB10 / ARM64 + CUDA 13) platform, including updates to the README and a new dedicated installation guide. The review feedback suggests limiting parallel compilation jobs by setting CMAKE_BUILD_PARALLEL_LEVEL and MAX_JOBS when building fastvideo-kernel and flash-attn from source to prevent potential out-of-memory (OOM) issues or CPU thrashing on high-core-count Grace CPUs.

    export CUDA_HOME=/usr/local/cuda
    export CUDACXX=/usr/local/cuda/bin/nvcc
    export TORCH_CUDA_ARCH_LIST=12.1
    export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=121 -DFASTVIDEO_KERNEL_BUILD_TK=OFF -DGPU_BACKEND=CUDA"
    uv pip install ./fastvideo-kernel -v --no-build-isolation

Compiling heavy CUDA/C++ extensions can spawn too many parallel compiler processes, which can exhaust system memory (OOM) or cause severe CPU thrashing. Server Grace CPUs have 72 or 144 cores; the GB10 in the DGX Spark has a 20-core Arm CPU, but its memory is shared with the GPU, so memory pressure during the build is still a concern. Limit the number of parallel build jobs by setting the CMAKE_BUILD_PARALLEL_LEVEL environment variable.
Suggested change:

    export CUDA_HOME=/usr/local/cuda
    export CUDACXX=/usr/local/cuda/bin/nvcc
    export TORCH_CUDA_ARCH_LIST=12.1
    export CMAKE_BUILD_PARALLEL_LEVEL=8
    export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=121 -DFASTVIDEO_KERNEL_BUILD_TK=OFF -DGPU_BACKEND=CUDA"
    uv pip install ./fastvideo-kernel -v --no-build-isolation

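Rather than hard-coding a job count, the value could be derived from both the core count and the memory available on the box. The sketch below is illustrative only: the `pick_build_jobs` helper and its 4 GiB-per-job budget are assumptions made for this example, not something the PR or FastVideo defines.

```python
import os


def pick_build_jobs(cores: int, mem_gib: float, gib_per_job: float = 4.0) -> int:
    """Return a parallel-job count bounded by CPU cores and a memory budget.

    gib_per_job is an assumed per-compiler-process budget, not a measured value.
    """
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cores, by_memory))


def total_memory_gib() -> float:
    """Total physical memory in GiB, read from POSIX sysconf values."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


if __name__ == "__main__":
    jobs = pick_build_jobs(os.cpu_count() or 1, total_memory_gib())
    # Emit shell exports that can be eval'd before the `uv pip install` steps.
    print(f"export CMAKE_BUILD_PARALLEL_LEVEL={jobs}")
    print(f"export MAX_JOBS={jobs}")
```

The per-job budget is the knob to tune: a lower value allows more jobs, a higher value is more conservative on a machine whose memory is shared with the GPU.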
it. If you want it (uses the same `CUDA_HOME` / `CUDACXX` env as the kernel step):

```bash
uv pip install flash-attn==2.8.1 --no-cache-dir --no-build-isolation -v
```
As with the kernel build, compiling flash-attn from source on a many-core CPU can launch too many parallel compilation jobs and lead to OOM or hangs. Limit build concurrency with the MAX_JOBS environment variable.
Suggested change:

    MAX_JOBS=4 uv pip install flash-attn==2.8.1 --no-cache-dir --no-build-isolation -v

Small addition for the install steps: `uv pip install "transformers==4.57.3"`. Verified on a GB10. Happy to push the doc line if helpful.
Force-pushed from b58d185 to 572bc64.
The standard CUDA install path does not work on the DGX Spark (GB10, aarch64, CUDA 13): fastvideo-kernel has no prebuilt ARM wheel and the system Python lacks the headers needed to compile it from source.
- docs/getting_started/installation/spark.md: dedicated guide covering a managed-Python venv (for the build headers), cu130 torch to match the system nvcc, compiling fastvideo-kernel for sm_121, and `-e . --no-sources`.
- Link it from the installation index.
- README: a DGX Spark callout plus an "Install with an AI coding agent" section (a paste-able prompt that routes to the right per-platform guide).
Force-pushed from 572bc64 to 239c9ed.
build.sh probes the GPU (via torch) and injects the arch, but standards-based builds (pip / uv pip install, sdist) skip build.sh and let torch's cmake auto-detect, which on CUDA 13 yields a wrong arch (e.g. compute_20) and kernels that don't run on the device. Resolve TORCH_CUDA_ARCH_LIST in CMakeLists.txt before find_package(Torch) (torch forces CMAKE_CUDA_ARCHITECTURES=OFF and drives -gencode from TORCH_CUDA_ARCH_LIST): honor an existing value, else translate a pinned CMAKE_CUDA_ARCHITECTURES, else probe the visible GPU, else fail loudly. This also routes around the ThunderKittens AUTO over-trigger on Blackwell. Verified on GB10 (sm_121): auto-detect and TORCH_CUDA_ARCH_LIST=12.1 both yield compute_121,sm_121; -DCMAKE_CUDA_ARCHITECTURES=90a yields sm_90a; a GPU-less build with no hint now errors instead of mis-compiling.
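The resolution order described in this commit (existing `TORCH_CUDA_ARCH_LIST`, else a pinned `CMAKE_CUDA_ARCHITECTURES`, else a GPU probe, else an error) can be modeled in a few lines. This is a Python sketch of the precedence only; the real logic lives in CMakeLists.txt, and the function names, the `probe_gpu` callback, and the simplified arch parsing (no `-real` / `-virtual` suffixes) are assumptions made for illustration.

```python
import re
from typing import Callable, Mapping, Optional


def cmake_arch_to_torch(arch: str) -> str:
    """Translate a CMake-style arch such as '121' or '90a' to '12.1' / '9.0a'."""
    match = re.fullmatch(r"(\d+)(\d)([a-z]?)", arch.strip())
    if match is None:
        raise ValueError(f"unrecognized CUDA architecture: {arch!r}")
    major, minor, suffix = match.groups()
    return f"{major}.{minor}{suffix}"


def resolve_arch_list(
    env: Mapping[str, str],
    cmake_archs: Optional[str],
    probe_gpu: Callable[[], Optional[str]],
) -> str:
    """Pick the value to hand to torch as TORCH_CUDA_ARCH_LIST."""
    # 1. Honor an explicit TORCH_CUDA_ARCH_LIST.
    explicit = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if explicit:
        return explicit
    # 2. Translate a pinned CMAKE_CUDA_ARCHITECTURES (semicolon-separated).
    if cmake_archs:
        parts = [a for a in cmake_archs.split(";") if a.strip()]
        return ";".join(cmake_arch_to_torch(a) for a in parts)
    # 3. Probe the visible GPU (hypothetical callback; returns None if absent).
    probed = probe_gpu()
    if probed:
        return probed
    # 4. Fail loudly instead of letting auto-detection pick a wrong arch.
    raise RuntimeError("no CUDA arch hint and no visible GPU; set TORCH_CUDA_ARCH_LIST")
```

Failing in the last step mirrors the commit's stated intent: a GPU-less build with no hint should error rather than mis-compile.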
No prebuilt aarch64 wheel for fastvideo-kernel exists on PyPI, so on ARM (e.g. the DGX Spark / GB10) point uv at the in-tree source via a platform_machine == 'aarch64' marker; x86_64 keeps the PyPI wheel. The source table is uv-only and not part of the published wheel metadata. Targets the CUDA 13 (cu130) torch index to match the Spark's toolkit.
With the kernel's CMake arch auto-detection and the aarch64 uv source in place, a fresh Spark install is now just: managed-python venv -> submodule init -> uv pip install -e . (verified cold: kernel rebuilds, targets sm_121, and the int8 GEMM runs on the GB10). Drop the obsolete manual arch exports (TORCH_CUDA_ARCH_LIST/CMAKE_ARGS), the separate kernel build step, and --no-sources from the happy path; keep a correct condensed manual fallback. Add a CI/GPU-less note (pass TORCH_CUDA_ARCH_LIST=12.1) and refresh the troubleshooting table. Update the README callout to match.
Force-pushed from 4adb3b9 to cbcc074.
/merge
Collapse the four near-identical docker/Dockerfile.python3.* files into a
single parameterized, multi-arch-aware docker/Dockerfile and build the images
as a CUDA matrix in CI.
- docker/Dockerfile: one file replaces py3.10/3.11/3.12/3.12-cuda12.9.1,
selected via ARGs (PYTHON_VERSION, CUDA_VERSION, UBUNTU_VERSION,
TORCH_CUDA_ARCH_LIST). BuildKit uv cache mounts (drop --no-cache-dir),
TARGETARCH-gated flash-attn (x86 prebuilt wheel vs arm64 source build),
and CUDA_HOME via the /usr/local/cuda symlink so any CUDA_VERSION works.
- Fix the flash-attn wheel to cu130torch2.12 (release v0.9.17) to match the
pinned torch 2.12.1+cu130; the old cu128torch2.11 pin was stale.
- infra-build-image.yml: CUDA matrix (py3.12 x {12.8.0, 12.9.1, 13.0.0}) plus
a dreamverse matrix (backend + UI x same CUDAs). latest -> py3.12 via a new
mark_as_latest template input, replacing the brittle python==3.10 check.
- apps/dreamverse/docker/Dockerfile: parameterized base + CUDA_HOME symlink
fix + uv cache mounts so it builds on all three CUDA versions.
- Add a root .dockerignore (excludes the ~6.5G .venv etc.; keeps .git, which
the fastvideo-kernel submodule build needs).
- Repoint buildkite path-triggers and docs from Dockerfile.python3.12 to the
unified docker/Dockerfile.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Flip mark_as_latest in the CUDA matrix from the 12.8.0 cell to the 13.0.0 cell so the global `latest` tag tracks the torch-matched (torch 2.12.1+cu130) image instead of CUDA 12.8.0, and CI stops reverting a manual retag. Note the SSIM seeding skill: `latest` now differs from `py3.12-latest` (CUDA 12.8.0, what CI pins via IMAGE_VERSION), so the explicit pin is required. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The link was correct relative to the source README's location, but breaks once docs/generate_examples.py flattens the file into docs/training/examples/ (where the docs link checker runs): `../../../../../docs/training/data_preprocess.md` overshoots the repo root from there. Point it at `../data_preprocess.md`, which resolves correctly in the generated docs tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
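The overshoot is easy to reproduce by resolving both links against the generated directory. The sketch below uses only `posixpath`; the `resolve_link` and `escapes_root` helpers are names invented for this example and are not part of the docs link checker.

```python
import posixpath


def resolve_link(doc_dir: str, link: str) -> str:
    """Resolve a relative Markdown link against the directory holding the doc."""
    return posixpath.normpath(posixpath.join(doc_dir, link))


def escapes_root(resolved: str) -> bool:
    """True when a repo-relative path climbs above the repository root."""
    return resolved == ".." or resolved.startswith("../")


# Directory the flattened file ends up in, relative to the repo root.
generated_dir = "docs/training/examples"

old = resolve_link(generated_dir, "../../../../../docs/training/data_preprocess.md")
new = resolve_link(generated_dir, "../data_preprocess.md")

print(old, escapes_root(old))
print(new, escapes_root(new))
```

From `docs/training/examples/`, five `../` segments climb two levels past the repo root, while the single `../` of the fixed link stays inside `docs/training/`.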
…index
Drop the [tool.uv.sources] torch/torchvision/torchaudio pins and the pytorch-cpu / pytorch-cu130 [[tool.uv.index]] blocks from pyproject. Torch is now chosen at install time via uv's UV_TORCH_BACKEND / --torch-backend:
- Docker images set UV_TORCH_BACKEND=cu130 (GPU-less builders can't auto-detect a driver, which would fall back to CPU). CI installs run inside the image and inherit it, so editable reinstalls no longer override the baked torch.
- Install docs use UV_TORCH_BACKEND=auto (matches the user's GPU driver). torch 2.12.1 is published only on cu129/cu130, so auto needs a CUDA 12.9+/13 driver.
- Spark guide: dropped the now-outdated "why this is just uv pip install -e ." explanation (it described the removed cu130 pin) and wired UV_TORCH_BACKEND.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
torch 2.12 is not published for cu128, so a CUDA 12.8.0 image can't be
torch-2.12-consistent. Remove the 12.8.0 cell from the CUDA and dreamverse
matrices and from the Dockerfile default (CUDA_VERSION 12.8.0 -> 12.9.1). The
CUDA matrix is now {12.9.1, 13.0.0}; 12.9.1 takes the bare py3.12 tag and
`latest`, 13.0.0 keeps the -cuda13.0.0 suffix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hard torch==2.12.1 pin with a >=2.11,<2.13 range and select the
PyTorch build per CUDA base via UV_TORCH_BACKEND + a matching prebuilt flash-attn
wheel. Each cell is now internally consistent (kernel nvcc, torch, and flash-attn
all aligned) and runnable on its own driver tier:
- 12.6.3 -> cu126, torch 2.12.x, cu126torch2.12 FA (v0.9.17). Default `latest`;
broadest, runs on a CUDA 12.6+ driver.
- 12.8.0 -> cu128, torch 2.11.0, cu128torch2.11 FA (v0.9.4). Backward-compat
image (cu128 has no torch 2.12).
- 13.0.0 -> cu130, torch 2.12.x, cu130torch2.12 FA (v0.9.17). Newest CUDA.
Drops 12.9.1 (cu129 has no flash-attn wheel for torch >=2.10). Dreamverse matrix
follows the 2.12 lane: {12.6.3, 13.0.0} with matching backends. Default Dockerfile
CUDA_VERSION 12.9.1 -> 12.6.3; dreamverse UV_TORCH_BACKEND is now an ARG.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
uv does not auto-detect the CUDA build; without UV_TORCH_BACKEND a plain `uv pip install` pulls PyPI's default torch instead of a driver-matched wheel (verified: unset -> torch==2.12.1, auto -> torch==2.12.1+cu130). Prefix the canonical install commands across the user-facing + contributor install docs so copy-paste gets the right PyTorch build, matching the Spark guide. Covers README, AGENTS.md, comfyui + dreamverse READMEs, the getting_started guides (installation/gpu/mps/quick_start), inference_quick_start, and the contributing setup docs. Secondary `[eval]`/extra installs (run after the base install, torch already present) and non-uv `pip install` lines are left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make CUDA 13.0.0 (cu130, torch 2.12.1) the default `py3.12` / `latest` image:
- docker/Dockerfile + dreamverse Dockerfile defaults -> 13.0.0 / cu130 / cu130torch2.12
- infra-build-image.yml: 13.0.0 takes the bare `py3.12` tag + `latest`; 12.6.3 -> `py3.12-cuda12.6.3`, 12.8.0 -> `py3.12-cuda12.8.0`; mark_as_latest -> 13.0.0
Pin UV_TORCH_BACKEND in the Modal test env (pr_test / ssim_test / launch_l40s_job) keyed off IMAGE_VERSION (cuda12.6->cu126, cuda12.8->cu128, else cu130), so the in-container editable reinstall resolves torch's CUDA deps from the matching index (a true no-op against the baked torch) instead of drifting to PyPI. Not `auto` -- CI stays deterministic, independent of the test box's driver.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
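The IMAGE_VERSION-to-backend mapping stated above amounts to a small lookup with a default. A minimal sketch, assuming substring matching on the image tag; the `torch_backend_for` name and the matching strategy are illustrative and may differ from what the Modal test scripts actually do.

```python
import os


def torch_backend_for(image_version: str) -> str:
    """Map an image tag to a UV_TORCH_BACKEND value (cu130 is the default)."""
    if "cuda12.6" in image_version:
        return "cu126"
    if "cuda12.8" in image_version:
        return "cu128"
    return "cu130"


if __name__ == "__main__":
    # IMAGE_VERSION is the variable named in the commit; the fallback tag
    # used here when it is unset is an assumption for the example.
    tag = os.environ.get("IMAGE_VERSION", "py3.12")
    print(f"UV_TORCH_BACKEND={torch_backend_for(tag)}")
```

Deriving the backend from the tag rather than using `auto` keeps the result independent of whatever driver the test machine happens to have.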
flash_attn.cute (FA4) loads the cutlass CuTe DSL, whose API can be version-skewed against the installed flash-attn wheel -- on the cu130/torch-2.12 image this surfaces as "module 'cutlass.cute.core' has no attribute 'ThrMma'". That import raises AttributeError, but the FA4->FA3->FA2 fallback chains in flash_attn.py and flash_attn_no_pad.py only catch ImportError, so worker init crashed (seen on an L40S CI run) instead of degrading. Re-raise as ImportError at the single flash_attn.cute import site so every caller falls back to FA2/FA3. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…talled)
Distinguish the two flash_attn.cute (FA4) failure modes at the import site:
- ImportError (not installed) -> re-raise quietly; callers fall back to FA3/FA2.
- any other import error (installed but broken, e.g. the nvidia-cutlass-dsl skew "module 'cutlass.cute.core' has no attribute 'ThrMma'") -> log a warning that points at the fix (pin a compatible nvidia-cutlass-dsl) before re-raising as ImportError so the FA3/FA2 fallback still engages.
So a fixable FA4 degradation is visible in the logs instead of silent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
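The two failure modes can be separated with a pair of `except` clauses around the import. The following is a sketch of that pattern, not the code in FastVideo's `flash_attn.py`: the `import_fa4_cute` name and the injectable `importer` parameter (added so the behavior can be exercised without flash-attn installed) are assumptions.

```python
import importlib
import logging
from types import ModuleType
from typing import Callable

logger = logging.getLogger(__name__)


def import_fa4_cute(
    importer: Callable[[str], ModuleType] = importlib.import_module,
) -> ModuleType:
    """Import flash_attn.cute, normalizing every failure to ImportError."""
    try:
        return importer("flash_attn.cute")
    except ImportError:
        # Not installed: stay quiet and let callers fall back to FA3/FA2.
        raise
    except Exception as exc:
        # Installed but broken (for example an API skew in a dependency):
        # surface the cause, then convert so existing fallbacks still engage.
        logger.warning(
            "flash_attn.cute is installed but failed to import (%s); "
            "check the installed nvidia-cutlass-dsl version", exc)
        raise ImportError(f"flash_attn.cute failed to import: {exc}") from exc
```

Callers that already wrap the import in `try: ... except ImportError:` need no changes, since both branches end in an `ImportError`; only the broken-install branch logs.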
…the stale fork The XOR-op fa4-compile fork is 1 commit (a redundant PT2-compile shim) ahead and 248 behind upstream, stuck on the cutlass-4.4 `cute.core.ThrMma` API that crashes on cutlass-dsl 4.5. Upstream cute (`flash-attn-4`) uses `cute.ThrMma`, pins `nvidia-cutlass-dsl>=4.5.2` (co-installable with flashinfer/quack), and its `_flash_attn_fwd`/`_bwd` signatures match FastVideo's wrappers. torch.compile is provided by FastVideo's own custom_op wrappers, so the fork's shim isn't needed. Repoint flash-attn-cute -> flash-attn-4 (Dao-AILab @ 940cd968, subdirectory flash_attn/cute) in the root + apps/dreamverse pyprojects. NVFP4 is a separate package (hao-ai-lab/flash-attention-fp4) and is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ention-fp4 The fp4 branch's cute uses cute.core.ThrMma, which crashes on cutlass-dsl 4.5 (now required by flashinfer/quack). Point the NVFP4 FA4 install at the fix/cutlass-dsl-4.5 branch (hao-ai-lab/flash-attention-fp4#2) and bump the shown nvidia-cutlass-dsl pin to >=4.5.2 so NVFP4 loads on the current stack. Revert to @FP4 once the PR merges. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The prebuilt flash-attn wheel ships an FA4 flash_attn.cute built against the cutlass-4.4 cute.core.ThrMma API, which crashes on the cutlass-dsl 4.5 that flashinfer/quack pull in -- so the default x86 image fell back to FA2. After the wheel install, overlay the cutlass-4.5-safe upstream flash-attn-4 cute (pinned via FA4_CUTE_REF, same commit as pyproject) plus its runtime stack (cutlass-dsl, quack-kernels, apache-tvm-ffi, torch-c-dlpack-ext) -- the same deps the [dreamverse] extra already installs -- so the image runs FA4. Only flash_attn/cute is replaced (the cute package never ships flash_attn/__init__.py, so FA2 / varlen / bert_padding stay from the wheel); a post-install import check fails the build rather than shipping an FA2-less image. x86 only -- arm64/GB10 (sm_121) keeps the FA2 fallback since the FA4 stack is unvalidated there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds a dedicated installation guide for the NVIDIA DGX Spark (GB10) and wires
it into the docs and README.
The Spark is ARM64 (`aarch64`) with the CUDA 13 toolkit, a combination none of the existing guides cover. The quick-start path (`uv pip install fastvideo` / `uv pip install -e .`) fails on it for two compounding reasons:
- `fastvideo-kernel` has no prebuilt `aarch64` wheel (PyPI ships `x86_64` only), so it must be compiled from source on the box.
- The system Python lacks the development headers the build needs (`find_package(Python ... Development.Module)` → `Python.h`), so the build aborts before it starts.
This guide gets a Spark from a fresh clone to a verified, GPU-accelerated install without `sudo`.
What the guide does (and the non-obvious decisions)
- Managed-Python venv (`uv venv --python-preference only-managed`): uv's standalone CPython bundles the dev headers, so the kernel compiles without installing `python3.12-dev` system-wide.
- cu130 torch: on `aarch64`, PyPI `torch==2.11.0` resolves to `2.11.0+cu130` (CUDA 13), which matches the system `nvcc` 13. Pairing them gives a clean toolchain for compiling the kernel; the repo's `pyproject.toml` cu128 pin would force a CUDA-12.8-vs-13.0 mismatch.
- Compile `fastvideo-kernel` for `sm_121` via `--no-build-isolation` (so it links against the real cu130 torch and detects the GB10's arch) with ThunderKittens disabled (Hopper-only; Triton fallbacks at runtime).
- `uv pip install -e . --no-sources` for the rest, so the cu128 source pin doesn't reinstall torch and break the freshly-built kernel's ABI.
The page also includes a verify section (versions, `fastvideo --help`, and a compiled-kernel execution check) and a troubleshooting table mapping each failure mode to its fix.
Changes
- `docs/getting_started/installation/spark.md` (new): the guide above, following the structure/voice of the sibling `gpu.md` / `mps.md` pages and linked from the installation index the same way (not added to the mkdocs nav, matching the existing convention).
- `docs/getting_started/installation.md`: adds the Spark page to the "supported hardware platforms" list.
- `README.md`: a one-line DGX Spark callout under Getting Started, plus a new "Install with an AI coding agent" section: a paste-able prompt that has a coding agent detect the platform (`uname -m` / `nvidia-smi` / `nvcc`) and follow the matching per-platform guide.
Testing
Verified end-to-end on an actual DGX Spark (GB10, aarch64, CUDA 13.0, driver 580): clean install from a fresh venv, `import fastvideo` + `fastvideo --help` work, torch sees the GB10, and the compiled int8 GEMM kernel runs correctly on `sm_121` (~1% error vs an fp32 reference, as expected for int8). The repo's `codespell` config (the lint hook that applies to markdown) passes on all three changed files.
Notes for reviewers
- The pinned versions (`torch==2.11.0+cu130`, `fastvideo-kernel==0.2.6`, `flash-attn==2.8.1`) reflect what was verified in 2026-05; the compute-capability values (`121` / `12.1`) are called out as substitutable for other Blackwell steppings, with a one-liner to detect them.
🤖 Generated with Claude Code