perf(decode): replayable transducer CUDA graphs (per-step + batched) by sims1253 · Pull Request #46 · mudler/parakeet.cpp

sims1253 · 2026-06-29T21:55:26Z

Summary

Capture the per-step and batched RNN-T decode graphs once and replay them, instead of rebuilding the ggml graph on every token. A new ReplayGraph keeps one ggml context + cgraph alive across calls so the CUDA backend can capture and replay the per-token joint/prediction work rather than launching each op directly.

Problem

The transducer decode loop runs the joint + prediction nets once per emitted token. Each Backend::compute call does a fresh ggml_init + ggml_free, so every per-step graph gets new tensor node pointers. ggml-cuda keys its internal CUDA-graph capture on cgraph->nodes[0], so the capture never warms up and every tiny per-token op is launched individually — the launch-overhead regime that dominates GPU decode.
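The cache-miss mechanism described above can be illustrated with a toy model. This is a minimal sketch, not ggml or ggml-cuda code: `Node`, `Graph`, `CaptureCache` and both helper functions are hypothetical stand-ins, and the only behavior modeled is "the capture is keyed on the address of the first node".

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins (not ggml types): a graph is a list of heap nodes.
struct Node { int op; };
struct Graph { std::vector<std::unique_ptr<Node>> nodes; };

Graph build_step_graph() {
    Graph g;
    g.nodes.push_back(std::make_unique<Node>(Node{0}));
    g.nodes.push_back(std::make_unique<Node>(Node{1}));
    return g;
}

// Toy capture cache keyed on the address of nodes[0].
struct CaptureCache {
    const Node* key = nullptr;
    int captures = 0;
    int replays = 0;
    void compute(const Graph& g) {
        const Node* k = g.nodes.front().get();
        if (k == key) { ++replays; }           // same key: replay the capture
        else { key = k; ++captures; }          // new key: capture again
    }
};

// Rebuild the graph on every step. The previous graph is kept alive while the
// next one is built so the allocator cannot hand back the same address by
// coincidence; this only makes the toy deterministic.
int captures_when_rebuilding(int steps) {
    CaptureCache cache;
    Graph prev;
    for (int i = 0; i < steps; ++i) {
        Graph g = build_step_graph();
        cache.compute(g);
        prev = std::move(g);
    }
    return cache.captures;
}

// Build once and reuse the same graph object on every step.
int captures_when_persistent(int steps) {
    CaptureCache cache;
    Graph g = build_step_graph();
    for (int i = 0; i < steps; ++i) { cache.compute(g); }
    return cache.captures;
}
```

In this model the rebuilt graph is captured again on every step, while the persistent graph is captured once and then replayed.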

Solution

ReplayGraph (src/backend.{hpp,cpp}): build a graph once on the persistent gallocr, then recompute it many times, feeding fresh inputs each step via recorded input-tensor handles. Keeping the context alive gives ggml-cuda a stable nodes[0] to capture + replay.
Joint (src/joint.cpp): replay the per-step and per-batch-size joint graphs on GPU. Per-step inputs are coalesced into a single host buffer uploaded with one set_input.
PredictionNet (src/prediction.cpp): replay the per-step and per-batch-size LSTM graphs on GPU, coalescing the x0/h/c inputs into one upload; per-layer (h', c') captures are read back from stable internal buffers.
CPU keeps the original per-call run_graph path unchanged — replay's set_input + readback would regress there. Both paths are byte-identical (same ops, order, weights).
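The build-once / replay-many pattern and the GPU gate can be sketched as follows. This is an illustrative model only, not the PR's ReplayGraph: `ReplaySketch`, `run_per_call` and `step` are hypothetical names, the "graph" is just a sum over the input, and a `std::vector` copy stands in for `set_input`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: persistent input storage, built once, recomputed per step.
class ReplaySketch {
public:
    explicit ReplaySketch(std::size_t n_in) : input_(n_in, 0.0f) {}

    float run(const std::vector<float>& in) {
        if (in.size() != input_.size()) {
            throw std::invalid_argument("input size mismatch");
        }
        if (!built_) { built_ = true; ++builds_; }        // build on first call only
        std::copy(in.begin(), in.end(), input_.begin());  // stands in for set_input
        return std::accumulate(input_.begin(), input_.end(), 0.0f);
    }

    int builds() const { return builds_; }
    const float* input_address() const { return input_.data(); }

private:
    std::vector<float> input_;  // stays at one address across calls
    bool built_ = false;
    int builds_ = 0;
};

// Stand-in for the original per-call path: fresh scratch storage every call.
float run_per_call(const std::vector<float>& in, int& builds) {
    ++builds;
    std::vector<float> scratch(in);
    return std::accumulate(scratch.begin(), scratch.end(), 0.0f);
}

// Gate: replay on GPU, keep the per-call path on CPU. Both compute the same value.
float step(bool is_gpu, ReplaySketch& replay, int& per_call_builds,
           const std::vector<float>& in) {
    return is_gpu ? replay.run(in) : run_per_call(in, per_call_builds);
}
```

The point of the sketch is that the replay path builds once and keeps its input storage at a fixed address, while the result matches the per-call path for the same inputs.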

Performance

GPU launch-overhead win on the transducer decode loop. Measured on the 0.6B GPU decode: the per-step joint drops from ~290 ms to tens of ms with replay. CPU behavior is unchanged by design (gated on is_gpu()). Overall, a 2-3x speedup for single-batch (B=1) decode and ~15x batched clips/sec at B=8.

Testing

cmake --build build clean.
ctest -LE model — 11/11 model-independent tests pass (5 GPU/model-dependent tests skip as expected).
Both replay paths are byte-identical to the existing run_graph paths; CPU decode is untouched.

Notes

Follows the project's performance invariants: the persistent ggml_gallocr is reused (no per-call alloc/free), the ggml_backend_sched is only the per-graph fallback when the GPU backend lacks an op, and zero-copy weights are preserved.

…apture)

The GPU transducer decode loop (tdt_greedy / rnnt_decode_frames) was launch-overhead bound: it called Backend::compute twice per token step (PredictionNet::step + Joint::step_logits), and each compute did a full ggml_init -> build -> ggml_gallocr_alloc_graph -> compute -> readback -> ggml_free on a fresh context.

Root cause: ggml-cuda keys its internal CUDA graph on cgraph->nodes[0], a tensor pointer owned by the compute context. Because Backend::compute allocates and frees the context every call, every per-step graph gets a NEW context -> NEW node pointers -> a different key -> CUDA-graph capture NEVER warms up, so every tiny per-step op is launched directly.

Fix: a ReplayGraph helper that builds a graph ONCE and keeps its ggml context, cgraph, input tensors and output alive across calls. step_logits / PredictionNet::step build their per-step graph on the first call and replay it every subsequent step, feeding fresh inputs via ggml_backend_tensor_set and reading the result (and, for the prediction net, the per-layer LSTM captures) back. Keeping the context alive makes nodes[0] a stable pointer, so ggml-cuda warms up and replays the captured per-step graph -- the C++ analogue of megapar's torch.cuda.CUDAGraph step capture, realized through ggml's own capture.

The compute is byte-identical to the prior path (same ops, same order, same zero-copy weights; transcripts match on all three bench tiers on both backends).

GPU-only: the graph-capture win is launch-overhead bound and only helps GPU. On CPU the per-step work is already cheap (multithreaded matmul; launch overhead negligible) and replay's set_input + capture-readback overhead is a net regression, so step_logits / step gate on Backend::is_gpu() and keep the original run_graph path on CPU -- no CPU regression, structurally identical to before.
Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B vs the unmodified dev baseline; transcripts byte-identical on every row):

decode serial B=1 (the PR target):
  7.4 s clip:   60.9 ->  28.9 ms (2.11x)
  23 s clip:   169.9 ->  89.0 ms (1.91x)
  77 s clip:   639.8 -> 245.3 ms (2.61x) -- win grows with clip length
end-to-end:
  23 s clip:   382.3 -> 186.2 ms (2.05x)
  77 s clip:   827.3 -> 423.9 ms (1.95x)
CPU decode: unchanged (gated to the original run_graph path).

(The decode loop scales with the number of decode steps, so the speedup grows with clip length; the short-7s end-to-end is flat because mel+encoder+detok dominate a clip that short.)

Batched step_logits_batch / step_batch still use the per-call run_graph path, so the B=8 batched column is ~unchanged (~1.0x); converting them is a follow-up that unlocks the B>1 throughput win.

AGENTS.md invariants respected: the persistent ggml_gallocr is kept (ReplayGraph allocates on it once and reuses), zero-copy weights are unchanged, and the CPU path is byte-for-byte the prior code.

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]

Profiling the replayed GPU decode (from the prior commit) showed the residual per-step cost is dominated by cudaStreamSynchronize, which ggml-cuda's buffer_set_tensor / buffer_get_tensor call after EVERY transfer. The prediction net's step did 5 set_input (x0 + 2*L h/c) per step = 5 host-stalling syncs; the joint did 2 (enc_proj_t + g). Each sync blocks the host until the GPU drains.

Coalesce each graph's per-step inputs into ONE contiguous host buffer uploaded with a single set_input (1 sync). The graph slices the single input tensor into the x0/h_in/c_in (prediction) or enc_proj_t/g (joint) views via ggml_view_1d. This is numerically a no-op (same values, same order, just packed) -- transcripts are byte-identical on all three bench tiers on both backends.

Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B vs the unmodified dev baseline; cumulative with the prior replay commit):

decode serial B=1:
  7.4 s clip:   54.5 ->  21.5 ms (2.53x)
  23 s clip:   178.5 ->  59.2 ms (3.01x)
  77 s clip:   568.9 -> 188.9 ms (3.01x)
end-to-end:
  77 s clip:   754.3 -> 383.0 ms (1.97x)
CPU: unchanged (the coalescing is GPU-gated alongside the replay path).

(vs the prior commit alone this lifts serial decode from ~2.1-2.6x to ~2.5-3.0x.)

GPU-only as before: step_logits / step gate on Backend::is_gpu() and keep the original run_graph path on CPU, so there is no CPU regression. Batched step_logits_batch / step_batch still use the per-call path; converting them is a follow-up.

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
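The input coalescing from this commit can be sketched on the host side as below. This is a minimal illustration under stated assumptions, not the PR's code: `View`, `PackedLayout`, `make_layout`, `pack` and `separate_uploads` are hypothetical names, and the packing order (x0, then every layer's h, then every layer's c) is assumed for the example. The offset/length pairs play the role that the ggml_view_1d slices play inside the graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Offset and length (in floats) of one logical input inside the packed buffer.
struct View { std::size_t offset; std::size_t len; };

struct PackedLayout {
    View x0;
    std::vector<View> h;  // one view per LSTM layer
    std::vector<View> c;  // one view per LSTM layer
    std::size_t total;
};

// Assumed order: x0 | h[0..L) | c[0..L).
PackedLayout make_layout(std::size_t embed, std::size_t hidden, std::size_t layers) {
    PackedLayout lay{};
    std::size_t off = 0;
    lay.x0 = View{off, embed};
    off += embed;
    for (std::size_t l = 0; l < layers; ++l) { lay.h.push_back(View{off, hidden}); off += hidden; }
    for (std::size_t l = 0; l < layers; ++l) { lay.c.push_back(View{off, hidden}); off += hidden; }
    lay.total = off;
    return lay;
}

void write_view(std::vector<float>& buf, const View& v, const std::vector<float>& src) {
    assert(src.size() == v.len);
    std::copy(src.begin(), src.end(), buf.begin() + static_cast<std::ptrdiff_t>(v.offset));
}

// Pack all per-step inputs into one contiguous buffer: one upload, one sync.
std::vector<float> pack(const PackedLayout& lay,
                        const std::vector<float>& x0,
                        const std::vector<std::vector<float>>& h,
                        const std::vector<std::vector<float>>& c) {
    std::vector<float> buf(lay.total);
    write_view(buf, lay.x0, x0);
    for (std::size_t l = 0; l < lay.h.size(); ++l) {
        write_view(buf, lay.h[l], h[l]);
        write_view(buf, lay.c[l], c[l]);
    }
    return buf;
}

// Uploads needed without coalescing: x0 plus h and c for each layer.
std::size_t separate_uploads(std::size_t layers) { return 1 + 2 * layers; }
```

With two LSTM layers, `separate_uploads(2)` gives the 5 uploads per step quoted above, against a single upload of `lay.total` floats for the packed buffer.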

…capture)

The batched decode path (transducer_greedy_batch -> step_logits_batch + PredictionNet::step_batch) still used the per-call run_graph path, so on GPU it was launch-overhead bound just like the single-step path before PR1 (the bench B=8 column was ~1.0x).

The batched decode loop calls both with a FIXED N per batch (inactive items are masked, never removed), so one captured graph per batch size N is valid for every round of that batch. Add a per-N replay cache (unordered_map<int, unique_ptr<...>>) to Joint and PredictionNet: the first round at a given N captures the graph (keeping its context alive so cgraph->nodes[0] is stable -> ggml-cuda captures + replays it), and every later round just re-feeds the inputs and recomputes.

GPU-gated on Backend::is_gpu(); CPU keeps the original run_graph path (the win is launch-overhead bound and GPU-only). Byte-identical (same ops/order/weights; bench-decode sanity "batched ids==serial" passes at B=1,4,8; transcripts match on all three bench tiers on both backends).

The batched prediction net's per-layer (h',c') captures land in STABLE internal buffers (the caller's out_state is reassigned every round, so it cannot be the capture dst without aliasing) and are copied out once after compute -- the same pattern the single-step path already uses.

Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B vs upstream/master; transcripts byte-identical on every row):

decode batched B=8:
   62.4 ->  27.7 ms (2.25x) [7.4s]
  181.2 ->  70.1 ms (2.58x) [23s]
  631.6 -> 293.9 ms (2.15x) [77s]
decode serial B=1:
  644.0 -> 185.2 ms (3.48x) [77s]
end-to-end:
  849.0 -> 346.3 ms (2.45x) [77s]
  325.1 -> 162.6 ms (2.00x) [23s]
batched clips/sec at B=8: ~17 -> ~262 (15x) on the 7.4s clip
CPU: unchanged (batched step_* are gated to run_graph on CPU).

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
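The per-N replay cache can be sketched as follows. This is an assumption-laden illustration rather than the PR's code: `BatchGraph` and `PerBatchCache` are hypothetical names, and "capturing" is reduced to constructing an object once per batch size. The `unique_ptr` values matter because they keep each cached graph at a fixed address even when the map rehashes.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for one captured graph built for batch size n.
struct BatchGraph {
    explicit BatchGraph(int n_) : n(n_) {}
    int n;
    int rounds = 0;  // how many times this graph was re-fed and recomputed
};

class PerBatchCache {
public:
    // Return the graph for batch size n, capturing it on first use only.
    BatchGraph& get(int n) {
        auto it = cache_.find(n);
        if (it == cache_.end()) {
            it = cache_.emplace(n, std::make_unique<BatchGraph>(n)).first;
            ++captures_;
        }
        return *it->second;
    }

    // One decode round at batch size n: look up (or capture), then recompute.
    void round(int n) { ++get(n).rounds; }

    int captures() const { return captures_; }

private:
    std::unordered_map<int, std::unique_ptr<BatchGraph>> cache_;
    int captures_ = 0;
};
```

Because the batch size stays fixed for the life of a batch, every round after the first hits the cache; a second batch size simply adds a second entry.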

sims1253 added 3 commits June 29, 2026 21:07

sims1253 wants to merge 3 commits into mudler:master from sims1253:perf/replayable-decode-graph
