perf(decode): replayable transducer CUDA graphs (per-step + batched) #46
Open
sims1253 wants to merge 3 commits into
…apture)
The GPU transducer decode loop (tdt_greedy / rnnt_decode_frames) was
launch-overhead bound: it called Backend::compute twice per token step
(PredictionNet::step + Joint::step_logits), and each compute did a full
ggml_init -> build -> ggml_gallocr_alloc_graph -> compute -> readback ->
ggml_free on a fresh context.
Root cause: ggml-cuda keys its internal CUDA graph on cgraph->nodes[0], a tensor
pointer owned by the compute context. Because Backend::compute allocates and
frees the context every call, every per-step graph gets a NEW context -> NEW
node pointers -> a different key -> CUDA-graph capture NEVER warms up, so every
tiny per-step op is launched directly.
Fix: a ReplayGraph helper that builds a graph ONCE and keeps its ggml context,
cgraph, input tensors and output alive across calls. step_logits / PredictionNet
::step build their per-step graph on the first call and replay it every
subsequent step, feeding fresh inputs via ggml_backend_tensor_set and reading
the result (and, for the prediction net, the per-layer LSTM captures) back.
Keeping the context alive makes nodes[0] a stable pointer, so ggml-cuda warms
up and replays the captured per-step graph -- the C++ analogue of megapar's
torch.cuda.CUDAGraph step capture, realized through ggml's own capture.
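The effect of a stable nodes[0] can be sketched with a toy model. This is not ggml-cuda's actual capture logic: ToyNode, ToyCaptureCache, decode_rebuilding and decode_replaying are hypothetical names, and the "replay only if the key equals the previous call's key" rule is a simplifying assumption made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy stand-in for a graph node; only its address matters here.
struct ToyNode { int op = 0; };

// Hypothetical model of a capture cache keyed on the first node's address.
// It replays only when the key matches the key seen on the previous call.
struct ToyCaptureCache {
    const ToyNode* last_key = nullptr;
    int replays = 0;
    int direct_launches = 0;

    void run(const ToyNode* first_node) {
        assert(first_node != nullptr);
        if (first_node == last_key) {
            ++replays;          // same key as last call: replay the capture
        } else {
            last_key = first_node;
            ++direct_launches;  // new key: launch directly and start over
        }
    }
};

// Old path: every step builds a fresh node, so the key changes every step.
// Retired nodes are kept alive so the allocator cannot recycle an address,
// which keeps this toy deterministic.
inline ToyCaptureCache decode_rebuilding(int steps) {
    ToyCaptureCache cache;
    std::vector<std::unique_ptr<ToyNode>> retired;
    for (int i = 0; i < steps; ++i) {
        retired.push_back(std::make_unique<ToyNode>());
        cache.run(retired.back().get());
    }
    return cache;
}

// New path: build once, keep the node alive, present the same address.
inline ToyCaptureCache decode_replaying(int steps) {
    ToyCaptureCache cache;
    auto persistent = std::make_unique<ToyNode>();
    for (int i = 0; i < steps; ++i) {
        cache.run(persistent.get());
    }
    return cache;
}
```

Under these assumptions, decode_rebuilding(10) never replays, while decode_replaying(10) launches directly once and replays on the remaining nine steps.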
The compute is byte-identical to the prior path (same ops, same order, same
zero-copy weights; transcripts match on all three bench tiers on both backends).
GPU-only: the graph-capture win is launch-overhead bound and only helps GPU.
On CPU the per-step work is already cheap (multithreaded matmul; launch
overhead negligible) and replay's set_input + capture-readback overhead is a
net regression, so step_logits / step gate on Backend::is_gpu() and keep the
original run_graph path on CPU -- no CPU regression, structurally identical to
before.
Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh
A/B vs the unmodified dev baseline; transcripts byte-identical on every row):
decode serial B=1 (the PR target):
7.4 s clip: 60.9 -> 28.9 ms (2.11x)
23 s clip: 169.9 -> 89.0 ms (1.91x)
77 s clip: 639.8 -> 245.3 ms (2.61x) -- win grows with clip length
end-to-end:
23 s clip: 382.3 -> 186.2 ms (2.05x)
77 s clip: 827.3 -> 423.9 ms (1.95x)
CPU decode: unchanged (gated to the original run_graph path).
(The decode loop scales with the number of decode steps, so the speedup grows
with clip length; the short-7s end-to-end is flat because mel+encoder+detok
dominate a clip that short.)
Batched step_logits_batch / step_batch still use the per-call run_graph path,
so the B=8 batched column is ~unchanged (~1.0x); converting them is a follow-up
that unlocks the B>1 throughput win.
AGENTS.md invariants respected: the persistent ggml_gallocr is kept (ReplayGraph
allocates on it once and reuses), zero-copy weights are unchanged, and the
CPU path is byte-for-byte the prior code.
Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
Profiling the replayed GPU decode (from the prior commit) showed the residual
per-step cost is dominated by cudaStreamSynchronize, which ggml-cuda's
buffer_set_tensor / buffer_get_tensor call after EVERY transfer. The prediction
net's step did 5 set_input (x0 + 2*L h/c) per step = 5 host-stalling syncs; the
joint did 2 (enc_proj_t + g). Each sync blocks the host until the GPU drains.
Coalesce each graph's per-step inputs into ONE contiguous host buffer uploaded
with a single set_input (1 sync). The graph slices the single input tensor into
the x0/h_in/c_in (prediction) or enc_proj_t/g (joint) views via ggml_view_1d.
This is numerically a no-op (same values, same order, just packed) -- transcripts
are byte-identical on all three bench tiers on both backends.
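A minimal host-side sketch of the packing idea, assuming each logical input is a flat float array. Slice, PackedInputs, pack_inputs and view are hypothetical names for illustration; in the real code the slicing happens inside the graph via ggml_view_1d and the upload is a single set_input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Where one logical input lives inside the packed buffer (in elements).
struct Slice {
    std::size_t offset;
    std::size_t size;
};

struct PackedInputs {
    std::vector<float> data;    // one contiguous buffer: one upload, one sync
    std::vector<Slice> slices;  // per-input offset and length
};

// Concatenate the inputs in order and record each one's offset.
inline PackedInputs pack_inputs(const std::vector<std::vector<float>>& inputs) {
    PackedInputs p;
    std::size_t total = 0;
    for (const auto& in : inputs) total += in.size();
    p.data.reserve(total);
    for (const auto& in : inputs) {
        p.slices.push_back({p.data.size(), in.size()});
        p.data.insert(p.data.end(), in.begin(), in.end());
    }
    return p;
}

// Read-only pointer to the start of input i inside the packed buffer.
inline const float* view(const PackedInputs& p, std::size_t i) {
    assert(i < p.slices.size());
    return p.data.data() + p.slices[i].offset;
}
```

For the prediction net with L = 2 layers, the five inputs (x0, h0, c0, h1, c1) would become five slices of a single buffer, so the five per-step uploads collapse into one.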
Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B
vs the unmodified dev baseline; cumulative with the prior replay commit):
decode serial B=1:
7.4 s clip: 54.5 -> 21.5 ms (2.53x)
23 s clip: 178.5 -> 59.2 ms (3.01x)
77 s clip: 568.9 -> 188.9 ms (3.01x)
end-to-end:
77 s clip: 754.3 -> 383.0 ms (1.97x)
CPU: unchanged (the coalescing is GPU-gated alongside the replay path).
(vs the prior commit alone this lifts serial decode from ~1.9-2.6x to ~2.5-3.0x.)
GPU-only as before: step_logits / step gate on Backend::is_gpu() and keep the
original run_graph path on CPU, so there is no CPU regression. Batched
step_logits_batch / step_batch still use the per-call path; converting them is a
follow-up.
Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
…capture)
The batched decode path (transducer_greedy_batch -> step_logits_batch +
PredictionNet::step_batch) still used the per-call run_graph path, so on GPU it
was launch-overhead bound just like the single-step path before PR1 (the bench
B=8 column was ~1.0x). The batched decode loop calls both with a FIXED N per
batch (inactive items are masked, never removed), so one captured graph per
batch size N is valid for every round of that batch.
Add a per-N replay cache (unordered_map<int, unique_ptr<...>>) to Joint and
PredictionNet: the first round at a given N captures the graph (keeping its
context alive so cgraph->nodes[0] is stable -> ggml-cuda captures + replays it),
and every later round just re-feeds the inputs and recomputes. GPU-gated on
Backend::is_gpu(); CPU keeps the original run_graph path (the win is launch-
overhead bound and GPU-only). Byte-identical (same ops/order/weights; bench-
decode sanity "batched ids==serial" passes at B=1,4,8; transcripts match on all
three bench tiers on both backends).
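The per-N cache can be sketched as follows. ToyGraph and PerBatchCache are hypothetical stand-ins (the real value type behind unique_ptr<...> is elided in this message), and the build counter exists only to make the reuse visible.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Toy stand-in for a captured graph specialised to one batch size.
struct ToyGraph {
    int batch;
    int runs = 0;
    explicit ToyGraph(int n) : batch(n) {}
};

// One graph per batch size N, built on first use and reused afterwards.
class PerBatchCache {
public:
    ToyGraph& get(int n) {
        assert(n > 0);
        auto it = cache_.find(n);
        if (it == cache_.end()) {
            ++builds_;  // first round at this N: build (capture) the graph
            it = cache_.emplace(n, std::make_unique<ToyGraph>(n)).first;
        }
        ++it->second->runs;  // every round: re-feed inputs and recompute
        return *it->second;
    }

    int builds() const { return builds_; }

private:
    std::unordered_map<int, std::unique_ptr<ToyGraph>> cache_;
    int builds_ = 0;
};
```

Because the decode loop keeps N fixed for the lifetime of a batch, a run at B=8 would build once and hit the cache on every later round; a second batch size adds one more entry.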
The batched prediction net's per-layer (h',c') captures land in STABLE internal
buffers (the caller's out_state is reassigned every round, so it cannot be the
capture dst without aliasing) and are copied out once after compute -- the same
pattern the single-step path already uses.
Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B
vs upstream/master; transcripts byte-identical on every row):
decode batched B=8: 62.4 -> 27.7 ms (2.25x) [7.4s]
181.2 -> 70.1 ms (2.58x) [23s]
631.6 -> 293.9 ms (2.15x) [77s]
decode serial B=1: 644.0 -> 185.2 ms (3.48x) [77s]
end-to-end: 849.0 -> 346.3 ms (2.45x) [77s]
325.1 -> 162.6 ms (2.00x) [23s]
batched clips/sec at B=8: ~17 -> ~262 (15x) on the 7.4s clip
CPU: unchanged (batched step_* are gated to run_graph on CPU).
Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
Summary
Capture the per-step and batched RNN-T decode graphs once and replay them, instead of rebuilding the ggml graph on every token. A new ReplayGraph keeps one ggml context + cgraph alive across calls so the CUDA backend can capture and replay the per-token joint/prediction work rather than launching each op directly.

Problem
The transducer decode loop runs the joint + prediction nets once per emitted token. Each Backend::compute call does a fresh ggml_init + ggml_free, so every per-step graph gets new tensor node pointers. ggml-cuda keys its internal CUDA-graph capture on cgraph->nodes[0], so the capture never warms up and every tiny per-token op is launched individually: the launch-overhead regime that dominates GPU decode.

Solution
- ReplayGraph (src/backend.{hpp,cpp}): build a graph once on the persistent gallocr, then recompute it many times, feeding fresh inputs each step via recorded input-tensor handles. Keeping the context alive gives ggml-cuda a stable nodes[0] to capture + replay.
- Joint (src/joint.cpp): replay the per-step and per-batch-size joint graphs on GPU. Per-step inputs are coalesced into a single host buffer uploaded with one set_input.
- PredictionNet (src/prediction.cpp): replay the per-step and per-batch-size LSTM graphs on GPU, coalescing the x0/h/c inputs into one upload; per-layer (h', c') captures are read back from stable internal buffers.
- CPU: run_graph path unchanged; replay's set_input + readback would regress there. Both paths are byte-identical (same ops, order, weights).

Performance
GPU launch-overhead win on the transducer decode loop. Measured on the 0.6B GPU decode: the per-step joint drops from ~290 ms to tens of ms with replay. CPU behavior is unchanged by design (gated on is_gpu()). 2-3x speedup in single-batch decode and ~15x with B=8.

Testing
- cmake --build build clean.
- ctest -LE model: 11/11 model-independent tests pass (5 GPU/model-dependent tests skip as expected).
- … run_graph paths; CPU decode is untouched.

Notes
Follows the project's performance invariants: the persistent ggml_gallocr is reused (no per-call alloc/free), the ggml_backend_sched is only the per-graph fallback when the GPU backend lacks an op, and zero-copy weights are preserved.