[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414
[bugfix] matrixgame2 causal denoise: store KV-cache end indices as Python int#1414rich7420 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request simplifies the initialization of the KV cache in matrixgame2_denoising.py by replacing torch.tensor([0], dtype=torch.long, device=device) with Python integers for both global_end_index and local_end_index. There are no review comments, and we have no additional feedback to provide.
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 PR merge requirementsWaiting for
This rule is failing.
|
|
Hi @rich7420 — automated review from Gob, one of @SolitaryThinker's AI reviewers. Findings aren't all human-verified; ping @SolitaryThinker if anything looks off. TL;DRThe int conversion is directionally correct: the MatrixGame2 causal and action cache consumers accept Python Verdict: Changes requested: add regression coverage
Findings (formatted for upload)[S2] Missing regression coverage for Python-int KV-cache indicesWhat: The PR changes MatrixGame2 denoising cache initialization from CUDA tensor counters to Python Why it matters: The old tensor counters route through Suggested fix: Add a lightweight stage-level regression test that initializes Evidence: — Gob (@SolitaryThinker's AI reviewer). Full review (including S3 items) is archived locally. |
…thon int, not GPU tensor
Storing kv_cache["global_end_index"] / ["local_end_index"] as
torch.tensor([0], device=device) forces the consumer sites in
causal_model.py and action_module.py through their isinstance(Tensor)
branch, which does int(tensor.item()) on every access. Each .item()
is a GPU->CPU sync that serializes the autoregressive denoise loop
and prevents kernel-queue lookahead.
Both call sites already handle the Python int case (their isinstance
branch goes to else and uses the int directly). The fix is to
initialize as plain int in _initialize_kv_cache /
_initialize_action_kv_cache and let the producer ('fill_' vs
assignment) and consumer (item vs int cast) branches resolve.
Measured on Modal H100x1, num_frames=117 (10 AR blocks * 3 DMD steps
= 30 DiT forwards):
Denoise stage syncs: 10,244 -> 644 (-94%)
Denoise stage sync_ms: 264.1 -> 11.2 (-96%)
avg_queue_delay_us: 42,919 -> 20,465 (-52%)
Inference wall: 18.3s -> 17.2s (-6%)
Note the modest wall improvement: the remaining ~77% GPU-idle is
Python-interpreter throughput bound (5,455 kernels/forward at
14.8us avg), not sync-bound. This fix is a prerequisite for any
subsequent CUDA Graph / persistent-kernel work but not the full
answer.
9b9db07 to
ec0dc3e
Compare
Summary
matrixgame2 causal inference stored the KV-cache end indices
(
global_end_index/local_end_index) as GPU tensors. The consumers incausal_model.pyandaction_module.pythen callint(tensor.item())onevery access — each
.item()is a GPU→CPU sync that serializes theautoregressive denoise loop.
Both consumers already handle a plain Python
int(theirisinstance(..., torch.Tensor)check falls through to the int path), so thisjust initializes the indices as ints in
_initialize_kv_cache/_initialize_action_kv_cache.How it was found (nsys-ai)
On an H100 matrixgame2 trace,
host_sync_parent_rangesattributed10,244
aten::itemsyncs to the denoising stage (~264 ms), andcpu_gpu_pipelineshowed ~29.7k GPU starvation events from host-blockingbetween dispatches.
Result
Denoise-stage
.item()syncs: 10,244 → 644 (−94%). Output unchanged — theconsumer sites already accepted plain ints.