fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession by rubdttcom · Pull Request #39 · mudler/parakeet.cpp

rubdttcom · 2026-06-26T16:30:49Z

Problem

StreamingSession rebuilds the running transcript and the word grouping from
the entire session history on every chunk:

process_emitted() runs detokenize() over all non-special tokens so far.
regroup_words() runs group_words() over all accumulated word tokens.

Neither buffer is trimmed across utterances (the <EOU>/<EOB> reset only
clears the decoder LSTM state). So for a long-lived stream the per-chunk cost
grows with the whole session — O(N²) total CPU — and the repeated growing
allocations make RSS climb. A dictation session left open for hours pegs a core;
a multi-minute continuous stream already shows the per-chunk rate collapsing.

Fix

Make both steps incremental, with identical output:

Text: detokenize is prefix-stable (per-token piece concat + ▁→space,
and the single leading-space strip only touches byte 0). So append only the
newly emitted tokens' text to text_ instead of rebuilding it each chunk.
Words: re-group only the still-open tail (word_tokens_ from the last
▁-word-start) and keep already-finalized words. group_words splits at
▁-word-starts and its only cross-token effects (one-token forward lookahead,
backward punctuation attach) never cross a word-start boundary, so
group_words([0,cursor)) ++ group_words([cursor,end)) == group_words([0,end)).

No public contract changes: text(), tokens(), take_new_text(),
drain_words(), drain_events() return exactly what they did before — now O(1)
amortized per chunk instead of O(N).

Testing

New tests/test_streaming_longrun_bounded.cpp drives a long multi-utterance
stream and asserts (a) the worst single-chunk reprocessing stays bounded (does
not scale with the session) via a lightweight high-water counter, and (b) the
incremental text() is byte-for-byte identical to a full detokenize() of all
non-special tokens (public API). Measured on the streaming nemotron model:
max_chunk_reprocess 3202 → 20 over a 1601-token session.
Existing streaming parity tests stay green against the NeMo reference
(test_streaming_decode, test_streaming_eou_reset, test_streaming_encoder,
test_streaming_nemotron, test_capi_stream_json) — these cover the word
grouping output.

Skips cleanly (exit 77) without PARAKEET_TEST_GGUF_STREAM, matching the other
model-dependent tests.

…sion process_emitted() re-detokenized all non_special_ and regroup_words() re-grouped all word_tokens_ on every chunk, and neither is trimmed across utterances -- so a long stream degrades to O(N^2) CPU with climbing RSS. Detokenize is prefix-stable, so append only the new tokens' text; re-group only the still-open word tail and keep finalized words. Output is identical (text()/tokens()/drain_words()/drain_events() unchanged), now O(1) amortized per chunk. Adds tests/test_streaming_longrun_bounded.cpp: asserts per-chunk reprocessing stays bounded and that the incremental text equals a full detokenize. Signed-off-by: Rubén Fernández <129730697+rubdttcom@users.noreply.github.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39

fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39
rubdttcom wants to merge 1 commit into
mudler:masterfrom
rubdttcom:fix-streaming-on2-reprocessing

rubdttcom commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rubdttcom commented Jun 26, 2026

Problem

Fix

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant