Skip to content

[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414

Open
rich7420 wants to merge 1 commit into
hao-ai-lab:mainfrom
rich7420:matrixgame2-fix-item-sync
Open

[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414
rich7420 wants to merge 1 commit into
hao-ai-lab:mainfrom
rich7420:matrixgame2-fix-item-sync

Conversation

@rich7420

@rich7420 rich7420 commented May 30, 2026

Copy link
Copy Markdown
Contributor

Summary

matrixgame2 causal inference stored the KV-cache end indices
(global_end_index / local_end_index) as GPU tensors. The consumers in
causal_model.py and action_module.py then call int(tensor.item()) on
every access — each .item() is a GPU→CPU sync that serializes the
autoregressive denoise loop.

Both consumers already handle a plain Python int (their
isinstance(..., torch.Tensor) check falls through to the int path), so this
just initializes the indices as ints in _initialize_kv_cache /
_initialize_action_kv_cache.

How it was found (nsys-ai)

On an H100 matrixgame2 trace, host_sync_parent_ranges attributed
10,244 aten::item syncs to the denoising stage (~264 ms), and
cpu_gpu_pipeline showed ~29.7k GPU starvation events from host-blocking
between dispatches.

Result

Denoise-stage .item() syncs: 10,244 → 644 (−94%). Output unchanged — the
consumer sites already accepted plain ints.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request simplifies the initialization of the KV cache in matrixgame2_denoising.py by replacing torch.tensor([0], dtype=torch.long, device=device) with Python integers for both global_end_index and local_end_index. There are no review comments, and we have no additional feedback to provide.

@mergify mergify Bot added type: bugfix Bug fix scope: inference Inference pipeline, serving, CLI labels May 30, 2026
@mergify

mergify Bot commented May 30, 2026

Copy link
Copy Markdown
Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success~=pre-commit
This rule is failing.
  • #approved-reviews-by>=1
  • check-success=full-suite-passed
  • check-success~=pre-commit
  • check-success=fastcheck-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]

@SolitaryThinker

Copy link
Copy Markdown
Collaborator

Hi @rich7420 — automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off.

TL;DR

The int conversion is directionally correct: the MatrixGame2 causal and action cache consumers accept Python int values and then keep the cache indices as ints, avoiding the old CUDA .item() path. The one gap is regression coverage: existing MatrixGame2 tests do not assert this cache-index type/sync invariant, and one nearby transformer test is skipped.

Verdict: Changes requested: add regression coverage

  • S0 (blockers): 0
  • S1 (must-fix): 0
  • S2 (should-fix; surfaced if persistent or important): 1
  • S3 (discussion): not shown here; see review.md

Findings (formatted for upload)

[S2] Missing regression coverage for Python-int KV-cache indices

What: The PR changes MatrixGame2 denoising cache initialization from CUDA tensor counters to Python int counters, but it does not add a test that would fail if these keys regress back to tensors.

Why it matters: The old tensor counters route through .item() in the MatrixGame2 causal/action cache consumers, which forces host synchronization on CUDA and can silently reappear without a targeted type assertion. The existing MatrixGame2 SSIM test is an expensive end-to-end quality check, while fastvideo/tests/transformers/test_matrixgame2.py is skipped and does not exercise this denoising-stage cache initialization.

Suggested fix: Add a lightweight stage-level regression test that initializes _initialize_kv_cache() and _initialize_action_kv_cache() and asserts global_end_index and local_end_index are plain int values for the normal, mouse, and keyboard caches. If practical, also run one minimal forward/update path with fake cache tensors and assert the consumers preserve int indices after updating.

Evidence: fastvideo/pipelines/stages/matrixgame2_denoising.py:281


— Gob (@SolitaryThinker's AI reviewer). Full review (including S3 items) is archived locally.

…thon int, not GPU tensor

Storing kv_cache["global_end_index"] / ["local_end_index"] as
torch.tensor([0], device=device) forces the consumer sites in
causal_model.py and action_module.py through their isinstance(Tensor)
branch, which does int(tensor.item()) on every access. Each .item()
is a GPU->CPU sync that serializes the autoregressive denoise loop
and prevents kernel-queue lookahead.

Both call sites already handle the Python int case (their isinstance
branch goes to else and uses the int directly). The fix is to
initialize as plain int in _initialize_kv_cache /
_initialize_action_kv_cache and let the producer ('fill_' vs
assignment) and consumer (item vs int cast) branches resolve.

Measured on Modal H100x1, num_frames=117 (10 AR blocks * 3 DMD steps
= 30 DiT forwards):

  Denoise stage syncs:   10,244 -> 644   (-94%)
  Denoise stage sync_ms: 264.1  -> 11.2  (-96%)
  avg_queue_delay_us:    42,919 -> 20,465 (-52%)
  Inference wall:        18.3s  -> 17.2s  (-6%)

Note the modest wall improvement: the remaining ~77% GPU-idle is
Python-interpreter throughput bound (5,455 kernels/forward at
14.8us avg), not sync-bound. This fix is a prerequisite for any
subsequent CUDA Graph / persistent-kernel work but not the full
answer.
@rich7420 rich7420 force-pushed the matrixgame2-fix-item-sync branch from 9b9db07 to ec0dc3e Compare June 5, 2026 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scope: inference Inference pipeline, serving, CLI type: bugfix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants