Skip to content

fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39

Open
rubdttcom wants to merge 1 commit into
mudler:masterfrom
rubdttcom:fix-streaming-on2-reprocessing
Open

fix: avoid O(N²) per-chunk reprocessing in long-running StreamingSession#39
rubdttcom wants to merge 1 commit into
mudler:masterfrom
rubdttcom:fix-streaming-on2-reprocessing

Conversation

@rubdttcom

Copy link
Copy Markdown

Problem

StreamingSession rebuilds the running transcript and the word grouping from
the entire session history on every chunk:

  • process_emitted() runs detokenize() over all non-special tokens so far.
  • regroup_words() runs group_words() over all accumulated word tokens.

Neither buffer is trimmed across utterances (the <EOU>/<EOB> reset only
clears the decoder LSTM state). So for a long-lived stream the per-chunk cost
grows with the whole session — O(N²) total CPU — and the repeated growing
allocations make RSS climb. A dictation session left open for hours pegs a core;
a multi-minute continuous stream already shows the per-chunk rate collapsing.

Fix

Make both steps incremental, with identical output:

  • Text: detokenize is prefix-stable (per-token piece concat + →space,
    and the single leading-space strip only touches byte 0). So append only the
    newly emitted tokens' text to text_ instead of rebuilding it each chunk.
  • Words: re-group only the still-open tail (word_tokens_ from the last
    -word-start) and keep already-finalized words. group_words splits at
    -word-starts and its only cross-token effects (one-token forward lookahead,
    backward punctuation attach) never cross a word-start boundary, so
    group_words([0,cursor)) ++ group_words([cursor,end)) == group_words([0,end)).

No public contract changes: text(), tokens(), take_new_text(),
drain_words(), drain_events() return exactly what they did before — now O(1)
amortized per chunk instead of O(N).

Testing

  • New tests/test_streaming_longrun_bounded.cpp drives a long multi-utterance
    stream and asserts (a) the worst single-chunk reprocessing stays bounded (does
    not scale with the session) via a lightweight high-water counter, and (b) the
    incremental text() is byte-for-byte identical to a full detokenize() of all
    non-special tokens (public API). Measured on the streaming nemotron model:
    max_chunk_reprocess 3202 → 20 over a 1601-token session.
  • Existing streaming parity tests stay green against the NeMo reference
    (test_streaming_decode, test_streaming_eou_reset, test_streaming_encoder,
    test_streaming_nemotron, test_capi_stream_json) — these cover the word
    grouping output.

Skips cleanly (exit 77) without PARAKEET_TEST_GGUF_STREAM, matching the other
model-dependent tests.

…sion

process_emitted() re-detokenized all non_special_ and regroup_words()
re-grouped all word_tokens_ on every chunk, and neither is trimmed across
utterances -- so a long stream degrades to O(N^2) CPU with climbing RSS.

Detokenize is prefix-stable, so append only the new tokens' text; re-group only
the still-open word tail and keep finalized words. Output is identical
(text()/tokens()/drain_words()/drain_events() unchanged), now O(1) amortized
per chunk.

Adds tests/test_streaming_longrun_bounded.cpp: asserts per-chunk reprocessing
stays bounded and that the incremental text equals a full detokenize.

Signed-off-by: Rubén Fernández <129730697+rubdttcom@users.noreply.github.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant