spec: avoid all-token outputs during MTP prefill by claude-eric-steiner · Pull Request #149 · TheTom/llama-cpp-turboquant

claude-eric-steiner · 2026-05-20T06:48:37Z

Summary

Declaration: this patch was fully or predominantly AI-generated

use it or not, as you wish...

This PR addresses this issue: #147

split MTP pre-norm hidden-state extraction from normal embedding/logit output mode
stop MTP prompt ingestion from marking every prompt token as a normal output row
keep full Qwen35/Qwen35MoE hidden rows available for MTP, then slice before final output norm / LM head
skip host-visible draft pre-norm copies while prompt sync only needs the draft context state updated
guard unused inp_out_ids graph inputs so MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs
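As a rough illustration of the first two points, the sketch below separates the two requirements and marks prompt output rows from the request-level flag only. `slot_sketch` and `mark_prompt_outputs` are hypothetical names made up for this example, not the fork's actual types; only the method names `need_embd()` and `need_embd_pre_norm()` come from this PR. It also assumes the last prompt token is always an output row so the next token can be sampled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server slot (not the real server_slot).
struct slot_sketch {
    bool request_wants_embeddings = false; // client asked for embedding output
    bool mtp_enabled              = false; // MTP draft needs the target hidden state

    // Request-level embedding output only.
    bool need_embd() const { return request_wants_embeddings; }
    // Pre-norm hidden-state extraction for MTP, reported separately.
    bool need_embd_pre_norm() const { return mtp_enabled; }
};

// Mark output rows for a prompt batch of n_tokens tokens.
// Only need_embd() turns every token into an output row; MTP alone does not.
// Assumption: the last token is always an output row (needed for sampling).
std::vector<int8_t> mark_prompt_outputs(const slot_sketch & slot, size_t n_tokens) {
    assert(n_tokens > 0);
    std::vector<int8_t> out(n_tokens, slot.need_embd() ? 1 : 0);
    out.back() = 1;
    return out;
}
```

With `mtp_enabled = true` and no embedding request, a 4-token prompt batch yields the flags `0 0 0 1`, whereas routing the MTP requirement through `need_embd()` would have produced `1 1 1 1`.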

Rationale

The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.

Before this change, the server-side MTP hidden-state requirement flowed through need_embd() and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.

This keeps the inherent MTP draft-context synchronization, but removes the avoidable all-token output/head overhead during prompt ingestion.
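To make the avoidable overhead concrete, here is a back-of-the-envelope sketch of how the logits buffer and the LM-head matmul scale with the number of output rows. The function names are hypothetical, and the formulas are generic dense-matmul estimates (float logits, 2 FLOPs per multiply-add), not measurements of this fork.

```cpp
#include <cassert>
#include <cstdint>

// Approximate logits buffer size in bytes: one float per vocab entry per output row.
uint64_t logits_bytes(uint64_t n_outputs, uint64_t n_vocab) {
    return n_outputs * n_vocab * sizeof(float);
}

// Approximate FLOPs of the LM head matmul:
// [n_outputs x n_embd] * [n_embd x n_vocab], counting 2 FLOPs per multiply-add.
uint64_t lm_head_flops(uint64_t n_outputs, uint64_t n_embd, uint64_t n_vocab) {
    assert(n_embd > 0 && n_vocab > 0);
    return 2 * n_outputs * n_embd * n_vocab;
}
```

Both quantities are linear in `n_outputs`, so marking all 8192 tokens of a prompt batch as output rows instead of one multiplies them by 8192, regardless of the model's embedding width or vocabulary size.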

Implementation Notes

server_slot::need_embd() now reflects only request-level embedding output.
MTP hidden-state extraction is exposed separately through need_embd_pre_norm().
llama_context tracks pre-norm output rows independently from normal logits/embeddings output rows.
Qwen35 and Qwen35MoE keep full pre-norm hidden rows available for MTP, then slice only the actual output rows before final output norm / LM head.
MTP prompt sync disables draft-context pre-norm host copies; draft generation re-enables them.
inp_out_ids is only populated when the graph actually allocated/uses the tensor.
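The row slicing and the `inp_out_ids` guard described above can be sketched on plain host memory as follows. This is a simplified illustration under stated assumptions: `hidden_rows`, `slice_rows`, `out_ids_input`, and `set_out_ids` are hypothetical names, and the real change operates on ggml graph tensors rather than `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major [n_rows x n_embd] hidden states (hypothetical container).
struct hidden_rows {
    size_t n_embd;
    std::vector<float> data;
    size_t n_rows() const { return data.size() / n_embd; }
};

// Gather only the rows listed in out_ids. The full matrix stays untouched,
// so it remains available as the pre-norm hidden state for MTP.
hidden_rows slice_rows(const hidden_rows & full, const std::vector<int32_t> & out_ids) {
    hidden_rows out{full.n_embd, {}};
    out.data.reserve(out_ids.size() * full.n_embd);
    for (int32_t id : out_ids) {
        assert(id >= 0 && (size_t) id < full.n_rows());
        const float * row = full.data.data() + (size_t) id * full.n_embd;
        out.data.insert(out.data.end(), row, row + full.n_embd);
    }
    return out;
}

// Hypothetical graph input: buffer is null when the graph never allocated it.
struct out_ids_input {
    std::vector<int32_t> * buffer = nullptr;
};

// Populate the output IDs only if the graph allocated the tensor.
// Returns true if the IDs were written, false if the input was skipped.
bool set_out_ids(out_ids_input & inp, const std::vector<int32_t> & out_ids) {
    if (inp.buffer == nullptr) {
        return false; // graph does not consume output IDs: nothing to set
    }
    *inp.buffer = out_ids;
    return true;
}
```

Only the sliced rows would then go through final output norm and the LM head, while a graph that never consumes output IDs skips the write instead of dereferencing an unallocated input.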

Validation

Built CUDA server image from the local fix candidate:

agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519

Build smoke:

docker run --rm --entrypoint /app/llama-server agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519 --version
version: 9411 (35db4bd07)
built with GNU 14.2.0 for Linux x86_64

Note: the tested Docker image reported 35db4bd07-dirty during build/version because the second out_ids guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.

Limited draft-MTP2 smoke on Qwen3.6-35B-A3B UD-IQ4_NL:

1024 tokens: passed, prompt 1275.51 tok/s, output 158.68 tok/s
8192 tokens: passed, prompt 3795.18 tok/s, output 150.60 tok/s

Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.

Representative full-matrix results:

1,024 input tokens:
  no-MTP      input 1288.02 tok/s, output 118.10 tok/s, elapsed 1.9s
  draft-MTP2  input 1190.01 tok/s, output 147.63 tok/s, elapsed 1.7s
  draft-MTP3  input 1145.07 tok/s, output 152.45 tok/s, elapsed 1.7s
  draft-MTP4  input 1277.92 tok/s, output 149.66 tok/s, elapsed 1.7s

196,608 input tokens:
  no-MTP      input 1716.97 tok/s, output 25.70 tok/s, elapsed 119.8s
  draft-MTP2  input 1478.25 tok/s, output 64.89 tok/s, elapsed 135.2s
  draft-MTP3  input 1479.37 tok/s, output 69.51 tok/s, elapsed 135.0s
  draft-MTP4  input 1410.82 tok/s, output 72.61 tok/s, elapsed 141.4s
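For reading the matrix, a small helper like the one below turns the raw rates into ratios against the no-MTP row and into an approximate prompt-ingestion time. The function names are hypothetical, and the time estimate assumes the prompt is exactly the nominal token count, which the target prompt sizes may only approximate.

```cpp
#include <cassert>

// Ratio of a measured rate to the no-MTP baseline rate.
double relative_rate(double rate, double baseline) {
    assert(baseline > 0.0);
    return rate / baseline;
}

// Approximate seconds spent on prompt ingestion alone.
double prefill_seconds(double n_tokens, double tok_per_s) {
    assert(tok_per_s > 0.0);
    return n_tokens / tok_per_s;
}
```

Applied to the 196,608-token rows, the output rate ratios come to roughly 2.5x, 2.7x, and 2.8x for draft-MTP2, draft-MTP3, and draft-MTP4, while the draft-MTP2 input rate is about 0.86x of no-MTP. The estimated ingestion times are about 114.5 s for no-MTP and 133.0 s for draft-MTP2.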

Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with --no-cache-prompt, target prompt sizes up to 196k, and max_tokens up to 16k.

git diff --check origin/feature/turboquant-kv-cache...HEAD passes.

Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.

TheTom · 2026-06-04T19:57:21Z

Merged. Verified independently on a GB10 (sm_121, Blackwell): the patch compiles clean, the qwen35moe main decode path produces coherent output, and turbo3 perplexity on Qwen3.6-35B-A3B came out identical to baseline (6.3005), so no regression. Your draft-MTP2/3/4 matrix on sm_86 covers the self-spec path nicely, so between the two arches this is well covered. Clean, correctly scoped change. Thanks for the thorough validation writeup.

claude-eric-steiner · 2026-06-04T20:26:16Z

Cool!

Kudos go 100% to GPT 5.5:

It did the problem analysis
It made the architectural change plan
It wrote the new code
It wrote the problem and solution description and pull request

I did not even look at the code. All I did was discover the performance issue and point GPT 5.5 at it. Once the solution was there 30 minutes later, I simply ran the performance and quality tests.

And of course, kudos also to you, TheTom, for the great work you do here.

This fork's pre-norm MTP optimization is superseded by upstream's maintained post-norm/nextn MTP, brought in by the commits that follow. Removing it here so that lineage applies onto a clean base instead of colliding with it.

Tap the MTP/nextn hidden state before the final output norm in both the trunk and MTP graphs of Qwen35 and Qwen35MoE, reverting the semantic core of the post-norm migration (ggml-org#24025) while keeping the upstream `nextn` naming and the masked (embeddings_nextn_masked) extraction path. This restores the fork's pre-norm Qwen MTP optimization (TheTom#149) on top of the rebased TurboQuant+ KV cache lineage, per explicit request to prefer pre-norm over upstream's post-norm/nextn approach. https://claude.ai/code/session_01HZZQMPA6D6KWgcuDsDarWa

claude-eric-steiner added 2 commits May 20, 2026 08:29

fix: guard unused MTP out_ids inputs

e7a7b93

github-actions Bot added the examples, server, and model labels May 20, 2026

claude-eric-steiner mentioned this pull request May 20, 2026

Misc. bug: current draft MTP implementation very slow input tokens digestion #147

Open

TheTom merged commit 9682c9b into TheTom:feature/turboquant-kv-cache Jun 4, 2026
1 check passed

TheTom mentioned this pull request Jun 8, 2026

Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ #172

Merged

TheTom merged 2 commits into TheTom:feature/turboquant-kv-cache from claude-eric-steiner:codex/mtp-prefill-hidden-state-fix
