fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39
Open
rubdttcom wants to merge 1 commit into
Open
fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39rubdttcom wants to merge 1 commit into
rubdttcom wants to merge 1 commit into
Conversation
…sion process_emitted() re-detokenized all non_special_ and regroup_words() re-grouped all word_tokens_ on every chunk, and neither is trimmed across utterances -- so a long stream degrades to O(N^2) CPU with climbing RSS. Detokenize is prefix-stable, so append only the new tokens' text; re-group only the still-open word tail and keep finalized words. Output is identical (text()/tokens()/drain_words()/drain_events() unchanged), now O(1) amortized per chunk. Adds tests/test_streaming_longrun_bounded.cpp: asserts per-chunk reprocessing stays bounded and that the incremental text equals a full detokenize. Signed-off-by: Rubén Fernández <129730697+rubdttcom@users.noreply.github.com> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
StreamingSessionrebuilds the running transcript and the word grouping fromthe entire session history on every chunk:
process_emitted()runsdetokenize()over all non-special tokens so far.regroup_words()runsgroup_words()over all accumulated word tokens.Neither buffer is trimmed across utterances (the
<EOU>/<EOB>reset onlyclears the decoder LSTM state). So for a long-lived stream the per-chunk cost
grows with the whole session — O(N²) total CPU — and the repeated growing
allocations make RSS climb. A dictation session left open for hours pegs a core;
a multi-minute continuous stream already shows the per-chunk rate collapsing.
Fix
Make both steps incremental, with identical output:
detokenizeis prefix-stable (per-token piece concat +▁→space,and the single leading-space strip only touches byte 0). So append only the
newly emitted tokens' text to
text_instead of rebuilding it each chunk.word_tokens_from the last▁-word-start) and keep already-finalized words.group_wordssplits at▁-word-starts and its only cross-token effects (one-token forward lookahead,backward punctuation attach) never cross a word-start boundary, so
group_words([0,cursor)) ++ group_words([cursor,end)) == group_words([0,end)).No public contract changes:
text(),tokens(),take_new_text(),drain_words(),drain_events()return exactly what they did before — now O(1)amortized per chunk instead of O(N).
Testing
tests/test_streaming_longrun_bounded.cppdrives a long multi-utterancestream and asserts (a) the worst single-chunk reprocessing stays bounded (does
not scale with the session) via a lightweight high-water counter, and (b) the
incremental
text()is byte-for-byte identical to a fulldetokenize()of allnon-special tokens (public API). Measured on the streaming nemotron model:
max_chunk_reprocess3202 → 20 over a 1601-token session.(
test_streaming_decode,test_streaming_eou_reset,test_streaming_encoder,test_streaming_nemotron,test_capi_stream_json) — these cover the wordgrouping output.
Skips cleanly (exit 77) without
PARAKEET_TEST_GGUF_STREAM, matching the othermodel-dependent tests.