Skip to content

perf(decode): replayable transducer CUDA graphs (per-step + batched)#46

Open
sims1253 wants to merge 3 commits into
mudler:masterfrom
sims1253:perf/replayable-decode-graph
Open

perf(decode): replayable transducer CUDA graphs (per-step + batched)#46
sims1253 wants to merge 3 commits into
mudler:masterfrom
sims1253:perf/replayable-decode-graph

Conversation

@sims1253

Copy link
Copy Markdown

Summary

Capture the per-step and batched RNN-T decode graphs once and replay them, instead of rebuilding the ggml graph on every token. A new ReplayGraph keeps one ggml context + cgraph alive across calls so the CUDA backend can capture and replay the per-token joint/prediction work rather than launching each op directly.

Problem

The transducer decode loop runs the joint + prediction nets once per emitted token. Each Backend::compute call does a fresh ggml_init + ggml_free, so every per-step graph gets new tensor node pointers. ggml-cuda keys its internal CUDA-graph capture on cgraph->nodes[0], so the capture never warms up and every tiny per-token op is launched individually — the launch-overhead regime that dominates GPU decode.

Solution

  • ReplayGraph (src/backend.{hpp,cpp}): build a graph once on the persistent gallocr, then recompute it many times, feeding fresh inputs each step via recorded input-tensor handles. Keeping the context alive gives ggml-cuda a stable nodes[0] to capture + replay.
  • Joint (src/joint.cpp): replay the per-step and per-batch-size joint graphs on GPU. Per-step inputs are coalesced into a single host buffer uploaded with one set_input.
  • PredictionNet (src/prediction.cpp): replay the per-step and per-batch-size LSTM graphs on GPU, coalescing the x0/h/c inputs into one upload; per-layer (h', c') captures are read back from stable internal buffers.
  • CPU keeps the original per-call run_graph path unchanged — replay's set_input + readback would regress there. Both paths are byte-identical (same ops, order, weights).

Performance

GPU launch-overhead win on the transducer decode loop. Measured on the 0.6B GPU decode: the per-step joint drops from ~290 ms to tens of ms with replay. CPU behavior is unchanged by design (gated on is_gpu()). 2-3x speedput in single-batch and ~15x with B=8.

Testing

  • cmake --build build clean.
  • ctest -LE model — 11/11 model-independent tests pass (5 GPU/model-dependent tests skip as expected).
  • Both replay paths are byte-identical to the existing run_graph paths; CPU decode is untouched.

Notes

Follows the project's performance invariants: the persistent ggml_gallocr is reused (no per-call alloc/free), the ggml_backend_sched is only the per-graph fallback when the GPU backend lacks an op, and zero-copy weights are preserved.

sims1253 added 3 commits June 29, 2026 21:07
…apture)

The GPU transducer decode loop (tdt_greedy / rnnt_decode_frames) was
launch-overhead bound: it called Backend::compute twice per token step
(PredictionNet::step + Joint::step_logits), and each compute did a full
ggml_init -> build -> ggml_gallocr_alloc_graph -> compute -> readback ->
ggml_free on a fresh context.

Root cause: ggml-cuda keys its internal CUDA graph on cgraph->nodes[0], a tensor
pointer owned by the compute context. Because Backend::compute allocates and
frees the context every call, every per-step graph gets a NEW context -> NEW
node pointers -> a different key -> CUDA-graph capture NEVER warms up, so every
tiny per-step op is launched directly.

Fix: a ReplayGraph helper that builds a graph ONCE and keeps its ggml context,
cgraph, input tensors and output alive across calls. step_logits / PredictionNet
::step build their per-step graph on the first call and replay it every
subsequent step, feeding fresh inputs via ggml_backend_tensor_set and reading
the result (and, for the prediction net, the per-layer LSTM captures) back.
Keeping the context alive makes nodes[0] a stable pointer, so ggml-cuda warms
up and replays the captured per-step graph -- the C++ analogue of megapar's
torch.cuda.CUDAGraph step capture, realized through ggml's own capture.

The compute is byte-identical to the prior path (same ops, same order, same
zero-copy weights; transcripts match on all three bench tiers on both backends).

GPU-only: the graph-capture win is launch-overhead bound and only helps GPU.
On CPU the per-step work is already cheap (multithreaded matmul; launch
overhead negligible) and replay's set_input + capture-readback overhead is a
net regression, so step_logits / step gate on Backend::is_gpu() and keep the
original run_graph path on CPU -- no CPU regression, structurally identical to
before.

Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh
A/B vs the unmodified dev baseline; transcripts byte-identical on every row):

  decode serial B=1 (the PR target):
    7.4 s clip:  60.9 -> 28.9 ms  (2.11x)
    23  s clip: 169.9 -> 89.0 ms  (1.91x)
    77  s clip: 639.8 -> 245.3 ms (2.61x)   -- win grows with clip length
  end-to-end:
    23  s clip: 382.3 -> 186.2 ms (2.05x)
    77  s clip: 827.3 -> 423.9 ms (1.95x)
  CPU decode: unchanged (gated to the original run_graph path).

(The decode loop scales with the number of decode steps, so the speedup grows
with clip length; the short-7s end-to-end is flat because mel+encoder+detok
dominate a clip that short.)

Batched step_logits_batch / step_batch still use the per-call run_graph path,
so the B=8 batched column is ~unchanged (~1.0x); converting them is a follow-up
that unlocks the B>1 throughput win.

AGENTS.md invariants respected: the persistent ggml_gallocr is kept (ReplayGraph
allocates on it once and reuses), zero-copy weights are unchanged, and the
CPU path is byte-for-byte the prior code.

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
Profiling the replayed GPU decode (from the prior commit) showed the residual
per-step cost is dominated by cudaStreamSynchronize, which ggml-cuda's
buffer_set_tensor / buffer_get_tensor call after EVERY transfer. The prediction
net's step did 5 set_input (x0 + 2*L h/c) per step = 5 host-stalling syncs; the
joint did 2 (enc_proj_t + g). Each sync blocks the host until the GPU drains.

Coalesce each graph's per-step inputs into ONE contiguous host buffer uploaded
with a single set_input (1 sync). The graph slices the single input tensor into
the x0/h_in/c_in (prediction) or enc_proj_t/g (joint) views via ggml_view_1d.
This is numerically a no-op (same values, same order, just packed) -- transcripts
are byte-identical on all three bench tiers on both backends.

Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B
vs the unmodified dev baseline; cumulative with the prior replay commit):

  decode serial B=1:
    7.4 s clip:  54.5 -> 21.5 ms  (2.53x)
    23  s clip: 178.5 -> 59.2 ms  (3.01x)
    77  s clip: 568.9 -> 188.9 ms (3.01x)
  end-to-end:
    77  s clip: 754.3 -> 383.0 ms (1.97x)
  CPU: unchanged (the coalescing is GPU-gated alongside the replay path).

(vs the prior commit alone this lifts serial decode from ~2.1-2.6x to ~2.5-3.0x.)

GPU-only as before: step_logits / step gate on Backend::is_gpu() and keep the
original run_graph path on CPU, so there is no CPU regression. Batched
step_logits_batch / step_batch still use the per-call path; converting them is a
follow-up.

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
…capture)

The batched decode path (transducer_greedy_batch -> step_logits_batch +
PredictionNet::step_batch) still used the per-call run_graph path, so on GPU it
was launch-overhead bound just like the single-step path before PR1 (the bench
B=8 column was ~1.0x). The batched decode loop calls both with a FIXED N per
batch (inactive items are masked, never removed), so one captured graph per
batch size N is valid for every round of that batch.

Add a per-N replay cache (unordered_map<int, unique_ptr<...>>) to Joint and
PredictionNet: the first round at a given N captures the graph (keeping its
context alive so cgraph->nodes[0] is stable -> ggml-cuda captures + replays it),
and every later round just re-feeds the inputs and recomputes. GPU-gated on
Backend::is_gpu(); CPU keeps the original run_graph path (the win is launch-
overhead bound and GPU-only). Byte-identical (same ops/order/weights; bench-
decode sanity "batched ids==serial" passes at B=1,4,8; transcripts match on all
three bench tiers on both backends).

The batched prediction net's per-layer (h',c') captures land in STABLE internal
buffers (the caller's out_state is reassigned every round, so it cannot be the
capture dst without aliasing) and are copied out once after compute -- the same
pattern the single-step path already uses.

Measured clean (idle RTX 5090, tdt-0.6b-v3 f16, best-of-8, bench/ab_bench.sh A/B
vs upstream/master; transcripts byte-identical on every row):

  decode batched B=8:        62.4 -> 27.7 ms (2.25x)  [7.4s]
                            181.2 -> 70.1 ms (2.58x)  [23s]
                            631.6 -> 293.9 ms (2.15x) [77s]
  decode serial B=1:        644.0 -> 185.2 ms (3.48x) [77s]
  end-to-end:               849.0 -> 346.3 ms (2.45x) [77s]
                            325.1 -> 162.6 ms (2.00x) [23s]
  batched clips/sec at B=8: ~17 -> ~262 (15x) on the 7.4s clip

CPU: unchanged (batched step_* are gated to run_graph on CPU).

Assisted-by: GLM:builtin:zai-coding-plan/GLM-5.2 [ZCode]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant