spec: avoid all-token outputs during MTP prefill #149
Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.
Merged. Verified independently on a GB10 (sm_121, Blackwell): the patch compiles cleanly, the qwen35moe main decode path produces coherent output, and turbo3 perplexity on Qwen3.6-35B-A3B came out identical to baseline (6.3005), so no regression. Your draft-MTP2/3/4 matrix on sm_86 covers the self-spec path nicely, so between the two arches this is well covered. Clean, correctly scoped change. Thanks for the thorough validation writeup.
Cool! Kudos go 100% to GPT 5.5.
I did not even look at the code. All I did was discover the performance issue and point GPT 5.5 at it; once the solution arrived 30 minutes later, I simply ran the performance and quality tests. And of course kudos also to you, TheTom, for the great work you do here.
This fork's pre-norm MTP optimization is superseded by upstream's maintained post-norm/nextn MTP, brought in by the commits that follow. Removing it here so that lineage applies onto a clean base instead of colliding with it.
Tap the MTP/nextn hidden state before the final output norm in both the trunk and MTP graphs of Qwen35 and Qwen35MoE, reverting the semantic core of the post-norm migration (ggml-org#24025) while keeping the upstream `nextn` naming and the masked (embeddings_nextn_masked) extraction path. This restores the fork's pre-norm Qwen MTP optimization (TheTom#149) on top of the rebased TurboQuant+ KV cache lineage, per explicit request to prefer pre-norm over upstream's post-norm/nextn approach. https://claude.ai/code/session_01HZZQMPA6D6KWgcuDsDarWa
Summary
Declaration: this patch was fully or predominantly AI-generated
use it or not, as you wish...
This PR addresses this issue: #147
Guard `inp_out_ids` graph inputs so that MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs.
Rationale
The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.
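A toy sketch of that distinction, not the fork's graph code: hidden rows are kept for every prompt token, while the output head only runs on the rows selected as outputs. The names `run_prefill`, `apply_head`, and `prefill_result` are hypothetical, and the "head" is reduced to a plain matrix-vector product for illustration.

```cpp
#include <cstddef>
#include <vector>

using row = std::vector<float>;

// Hypothetical stand-in for final norm + LM head: one dot product per vocab row.
std::vector<row> apply_head(const std::vector<row> & hidden,
                            const std::vector<row> & head) {
    std::vector<row> logits;
    for (const row & h : hidden) {
        row out(head.size(), 0.0f);
        for (std::size_t v = 0; v < head.size(); ++v) {
            for (std::size_t d = 0; d < h.size(); ++d) {
                out[v] += head[v][d] * h[d];
            }
        }
        logits.push_back(out);
    }
    return logits;
}

struct prefill_result {
    std::vector<row> pre_norm; // one row per prompt token, kept for MTP sync
    std::vector<row> logits;   // one row per selected output token only
};

// Keep every hidden row, but slice down to out_ids before the head runs.
prefill_result run_prefill(const std::vector<row> & hidden,
                           const std::vector<std::size_t> & out_ids,
                           const std::vector<row> & head) {
    prefill_result r;
    r.pre_norm = hidden;
    std::vector<row> selected;
    for (std::size_t id : out_ids) {
        selected.push_back(hidden.at(id)); // .at() throws on a bad index
    }
    r.logits = apply_head(selected, head);
    return r;
}
```

In this sketch the head cost scales with `out_ids.size()` rather than with the prompt length, which is the saving the rationale describes.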
Before this change, the server-side MTP hidden-state requirement flowed through `need_embd()` and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.
This change keeps the inherent MTP draft-context synchronization, but removes the avoidable all-token output/head overhead during prompt ingestion.
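The effect of separating the two requirements can be sketched as follows. This is a simplified illustration, not the server code: `slot_needs`, `row_marks`, and `mark_prompt_rows` are hypothetical names, and the sketch assumes that only the last prompt token needs a normal output row when no embedding output was requested.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified per-slot requirements.
struct slot_needs {
    bool embd;          // request-level embedding output
    bool embd_pre_norm; // MTP prompt sync wants pre-final-norm hidden rows
};

struct row_marks {
    std::vector<int8_t> output;   // 1 = produce a logits/embeddings row
    std::vector<int8_t> pre_norm; // 1 = keep the pre-norm hidden row
};

// Mark rows for one prompt batch; the two kinds of rows are tracked separately.
row_marks mark_prompt_rows(int n_tokens, const slot_needs & needs) {
    assert(n_tokens >= 0);
    row_marks m;
    m.output.assign(n_tokens, 0);
    m.pre_norm.assign(n_tokens, 0);
    for (int i = 0; i < n_tokens; ++i) {
        const bool is_last = (i == n_tokens - 1);
        // Assumption: the last prompt token always needs logits for sampling.
        m.output[i]   = (needs.embd || is_last) ? 1 : 0;
        m.pre_norm[i] = needs.embd_pre_norm ? 1 : 0;
    }
    return m;
}

// Number of rows that would have to be reserved for a given mark vector.
int count_rows(const std::vector<int8_t> & marks) {
    int n = 0;
    for (int8_t v : marks) {
        n += v;
    }
    return n;
}
```

With `slot_needs{false, true}` and an 8-token prompt, this sketch marks one normal output row and eight pre-norm rows. If the MTP requirement were folded into `embd` instead, all eight tokens would become normal output rows, which is the overhead described above.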
Implementation Notes
- `server_slot::need_embd()` now reflects only request-level embedding output.
- The MTP hidden-state requirement is reported separately through `need_embd_pre_norm()`.
- `llama_context` tracks pre-norm output rows independently from normal logits/embeddings output rows.
- `inp_out_ids` is only populated when the graph actually allocated/uses the tensor.
Validation
Built CUDA server image from the local fix candidate:
Build smoke:
Note: the tested Docker image reported `35db4bd07-dirty` during build/version because the second `out_ids` guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.
Limited draft-MTP2 smoke on Qwen3.6-35B-A3B UD-IQ4_NL:
Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.
Representative full-matrix results:
Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with `--no-cache-prompt`, target prompt sizes up to 196k, and `max_tokens` up to 16k.
`git diff --check origin/feature/turboquant-kv-cache...HEAD` passes.