[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int by rich7420 · Pull Request #1414 · hao-ai-lab/FastVideo

rich7420 · 2026-05-30T06:00:21Z

Summary

matrixgame2 causal inference stored the KV-cache end indices
(global_end_index / local_end_index) as GPU tensors. The consumers in
causal_model.py and action_module.py then call int(tensor.item()) on
every access — each .item() is a GPU→CPU sync that serializes the
autoregressive denoise loop.

Both consumers already handle a plain Python int (their
isinstance(..., torch.Tensor) check falls through to the int path), so this
just initializes the indices as ints in _initialize_kv_cache /
_initialize_action_kv_cache.

How it was found (nsys-ai)

On an H100 matrixgame2 trace, host_sync_parent_ranges attributed
10,244 aten::item syncs to the denoising stage (~264 ms), and
cpu_gpu_pipeline showed ~29.7k GPU starvation events from host-blocking
between dispatches.

Result

Denoise-stage .item() syncs: 10,244 → 644 (−94%). Output unchanged — the
consumer sites already accepted plain ints.

gemini-code-assist

Code Review

This pull request simplifies the initialization of the KV cache in matrixgame2_denoising.py by replacing torch.tensor([0], dtype=torch.long, device=device) with Python integers for both global_end_index and local_end_index. There are no review comments, and we have no additional feedback to provide.

mergify · 2026-05-30T06:01:01Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

#approved-reviews-by>=1
check-success=full-suite-passed
check-success~=pre-commit

This rule is failing.

#approved-reviews-by>=1
check-success=full-suite-passed
check-success~=pre-commit
check-success=fastcheck-passed
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

SolitaryThinker · 2026-06-02T03:05:44Z

Hi @rich7420 — automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off.

TL;DR

The int conversion is directionally correct: the MatrixGame2 causal and action cache consumers accept Python int values and then keep the cache indices as ints, avoiding the old CUDA .item() path. The one gap is regression coverage: existing MatrixGame2 tests do not assert this cache-index type/sync invariant, and one nearby transformer test is skipped.

Verdict: Changes requested: add regression coverage

S0 (blockers): 0
S1 (must-fix): 0
S2 (should-fix; surfaced if persistent or important): 1
S3 (discussion): not shown here; see review.md

Findings (formatted for upload)

[S2] Missing regression coverage for Python-int KV-cache indices

What: The PR changes MatrixGame2 denoising cache initialization from CUDA tensor counters to Python int counters, but it does not add a test that would fail if these keys regress back to tensors.

Why it matters: The old tensor counters route through .item() in the MatrixGame2 causal/action cache consumers, which forces host synchronization on CUDA and can silently reappear without a targeted type assertion. The existing MatrixGame2 SSIM test is an expensive end-to-end quality check, while fastvideo/tests/transformers/test_matrixgame2.py is skipped and does not exercise this denoising-stage cache initialization.

Suggested fix: Add a lightweight stage-level regression test that initializes _initialize_kv_cache() and _initialize_action_kv_cache() and asserts global_end_index and local_end_index are plain int values for the normal, mouse, and keyboard caches. If practical, also run one minimal forward/update path with fake cache tensors and assert the consumers preserve int indices after updating.

Evidence: fastvideo/pipelines/stages/matrixgame2_denoising.py:281

— Gob (@SolitaryThinker's AI reviewer). Full review (including S3 items) is archived locally.

…thon int, not GPU tensor Storing kv_cache["global_end_index"] / ["local_end_index"] as torch.tensor([0], device=device) forces the consumer sites in causal_model.py and action_module.py through their isinstance(Tensor) branch, which does int(tensor.item()) on every access. Each .item() is a GPU->CPU sync that serializes the autoregressive denoise loop and prevents kernel-queue lookahead. Both call sites already handle the Python int case (their isinstance branch goes to else and uses the int directly). The fix is to initialize as plain int in _initialize_kv_cache / _initialize_action_kv_cache and let the producer ('fill_' vs assignment) and consumer (item vs int cast) branches resolve. Measured on Modal H100x1, num_frames=117 (10 AR blocks * 3 DMD steps = 30 DiT forwards): Denoise stage syncs: 10,244 -> 644 (-94%) Denoise stage sync_ms: 264.1 -> 11.2 (-96%) avg_queue_delay_us: 42,919 -> 20,465 (-52%) Inference wall: 18.3s -> 17.2s (-6%) Note the modest wall improvement: the remaining ~77% GPU-idle is Python-interpreter throughput bound (5,455 kernels/forward at 14.8us avg), not sync-bound. This fix is a prerequisite for any subsequent CUDA Graph / persistent-kernel work but not the full answer.

gemini-code-assist Bot reviewed May 30, 2026

View reviewed changes

mergify Bot added type: bugfix Bug fix scope: inference Inference pipeline, serving, CLI labels May 30, 2026

rich7420 force-pushed the matrixgame2-fix-item-sync branch from 9b9db07 to ec0dc3e Compare June 5, 2026 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414

[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414
rich7420 wants to merge 1 commit into
hao-ai-lab:mainfrom
rich7420:matrixgame2-fix-item-sync

rich7420 commented May 30, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

mergify Bot commented May 30, 2026 •

edited

Loading

Uh oh!

SolitaryThinker commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rich7420 commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it was found (nsys-ai)

Result

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

mergify Bot commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Protections

🔴 PR merge requirements

Uh oh!

SolitaryThinker commented Jun 2, 2026

TL;DR

Verdict: Changes requested: add regression coverage

Findings (formatted for upload)

[S2] Missing regression coverage for Python-int KV-cache indices

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rich7420 commented May 30, 2026 •

edited

Loading

mergify Bot commented May 30, 2026 •

edited

Loading