
spec: avoid all-token outputs during MTP prefill #149

Merged: TheTom merged 2 commits into TheTom:feature/turboquant-kv-cache from claude-eric-steiner:codex/mtp-prefill-hidden-state-fix on Jun 4, 2026

claude-eric-steiner commented May 20, 2026

Summary

Declaration: this patch was fully or predominantly AI-generated. Use it or not, as you wish.

This PR addresses this issue: #147

  • split MTP pre-norm hidden-state extraction from normal embedding/logit output mode
  • stop MTP prompt ingestion from marking every prompt token as a normal output row
  • keep full Qwen35/Qwen35MoE hidden rows available for MTP, then slice before final output norm / LM head
  • skip host-visible draft pre-norm copies while prompt sync only needs the draft context state updated
  • guard unused inp_out_ids graph inputs so MTP/pre-norm graphs that do not consume output IDs do not crash while setting inputs

Rationale

The MTP prompt-sync path needs the target context's pre-final-norm hidden state for prompt tokens, but it does not need normal logits/embeddings for every prompt token.

Before this change, the server-side MTP hidden-state requirement flowed through need_embd() and therefore made prompt batches mark every prompt token as an output row. That enabled normal embedding/output mode, reserved all-token output buffers, and pushed Qwen35/Qwen35MoE graphs through final output norm / LM head work that MTP prompt sync does not consume.

This change keeps the inherent MTP draft-context synchronization but removes the avoidable all-token output/head overhead during prompt ingestion.
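
The change in output-row marking described above can be illustrated with a toy model. This is a hedged Python sketch, not the fork's C++ code: `mark_output_rows`, `rows_before`, and `rows_after` are hypothetical names, and the sketch assumes that a prompt batch without embedding output needs a normal output row only for its last token.

```python
from typing import List


def mark_output_rows(n_tokens: int, want_embeddings: bool) -> List[bool]:
    """Return one flag per prompt token: True if it is a normal output row.

    Assumption for illustration: with embedding output enabled every token is
    an output row; otherwise only the last token is.
    """
    if n_tokens == 0:
        return []
    if want_embeddings:
        return [True] * n_tokens
    return [False] * (n_tokens - 1) + [True]


def rows_before(n_tokens: int, request_embd: bool, mtp_enabled: bool) -> int:
    # Before the patch (as described in the Rationale): the MTP hidden-state
    # requirement was folded into the same flag as request-level embeddings.
    return sum(mark_output_rows(n_tokens, request_embd or mtp_enabled))


def rows_after(n_tokens: int, request_embd: bool, mtp_enabled: bool) -> int:
    # After the patch: only request-level embedding output affects normal
    # output rows. MTP pre-norm rows are tracked separately (not modeled here).
    return sum(mark_output_rows(n_tokens, request_embd))


if __name__ == "__main__":
    n = 8192
    print("before:", rows_before(n, request_embd=False, mtp_enabled=True))
    print("after: ", rows_after(n, request_embd=False, mtp_enabled=True))
```

In this model an 8192-token prompt with MTP enabled goes from 8192 normal output rows to one, which is the overhead the Rationale calls avoidable.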

Implementation Notes

  • server_slot::need_embd() now reflects only request-level embedding output.
  • MTP hidden-state extraction is exposed separately through need_embd_pre_norm().
  • llama_context tracks pre-norm output rows independently from normal logits/embeddings output rows.
  • Qwen35 and Qwen35MoE keep full pre-norm hidden rows available for MTP, then slice only the actual output rows before final output norm / LM head.
  • MTP prompt sync disables draft-context pre-norm host copies; draft generation re-enables them.
  • inp_out_ids is only populated when the graph actually allocated/uses the tensor.
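
The "keep full rows, slice before the head" and "guard unused `inp_out_ids`" notes can be sketched together. The following Python is only an illustration under stated assumptions: `forward_tail` and `final_norm_and_head` are hypothetical stand-ins (a row sum replaces the real final norm and LM head), and `out_ids=None` models a graph that never allocated the output-ID input.

```python
from typing import List, Optional, Sequence, Tuple

Row = List[float]


def final_norm_and_head(rows: Sequence[Row]) -> List[float]:
    """Stand-in for final output norm + LM head: one scalar per row."""
    return [sum(r) for r in rows]


def forward_tail(
    hidden: Sequence[Row], out_ids: Optional[Sequence[int]]
) -> Tuple[List[Row], List[float]]:
    """Return (pre_norm_rows, logits) for a batch of hidden rows.

    pre_norm_rows always covers every token so MTP can read it.
    logits are computed only for the rows selected by out_ids.
    """
    pre_norm = [list(r) for r in hidden]  # full rows stay available for MTP
    if out_ids is None:
        # Guard: the graph does not consume output IDs, so there is nothing
        # to populate and no head work to run.
        return pre_norm, []
    selected = [hidden[i] for i in out_ids]  # slice before norm / head
    return pre_norm, final_norm_and_head(selected)


if __name__ == "__main__":
    hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    pre, logits = forward_tail(hidden, out_ids=[2])
    print(len(pre), logits)
```

The point of the ordering is that the head cost scales with the number of selected rows, while the pre-norm rows remain complete regardless of how many outputs were requested.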

Validation

Built CUDA server image from the local fix candidate:

agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519

Build smoke:

docker run --rm --entrypoint /app/llama-server agent-of-agent/llama-cpp-turboquant:server-cuda-sm86-mtp-prefillfix-outids3-20260519 --version
version: 9411 (35db4bd07)
built with GNU 14.2.0 for Linux x86_64

Note: the tested Docker image reported 35db4bd07-dirty during build/version because the second out_ids guard was still an uncommitted local patch at image build time. The PR branch now contains that patch as the second clean commit.

Limited draft-MTP2 smoke on Qwen3.6-35B-A3B UD-IQ4_NL:

1024 tokens: passed, prompt 1275.51 tok/s, output 158.68 tok/s
8192 tokens: passed, prompt 3795.18 tok/s, output 150.60 tok/s

Full speed matrix with no-MTP, draft-MTP2, draft-MTP3, draft-MTP4 all passed at 1k, 8k, 32k, 64k, 128k, and 192k target prompt sizes.

Representative full-matrix results:

1,024 input tokens:
  no-MTP      input 1288.02 tok/s, output 118.10 tok/s, elapsed 1.9s
  draft-MTP2  input 1190.01 tok/s, output 147.63 tok/s, elapsed 1.7s
  draft-MTP3  input 1145.07 tok/s, output 152.45 tok/s, elapsed 1.7s
  draft-MTP4  input 1277.92 tok/s, output 149.66 tok/s, elapsed 1.7s

196,608 input tokens:
  no-MTP      input 1716.97 tok/s, output 25.70 tok/s, elapsed 119.8s
  draft-MTP2  input 1478.25 tok/s, output 64.89 tok/s, elapsed 135.2s
  draft-MTP3  input 1479.37 tok/s, output 69.51 tok/s, elapsed 135.0s
  draft-MTP4  input 1410.82 tok/s, output 72.61 tok/s, elapsed 141.4s
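
To read the 196,608-token rows, it may help to derive ratios from the reported figures. The sketch below only does arithmetic on the numbers above; `summarize` is a hypothetical helper, and the prompt-time estimate assumes that "elapsed" includes prompt ingestion and that input tok/s is averaged over the whole prompt.

```python
# (input tok/s, output tok/s, elapsed seconds) copied from the results above.
RESULTS = {
    "no-MTP": (1716.97, 25.70, 119.8),
    "draft-MTP2": (1478.25, 64.89, 135.2),
    "draft-MTP3": (1479.37, 69.51, 135.0),
    "draft-MTP4": (1410.82, 72.61, 141.4),
}
N_INPUT = 196_608


def summarize(results, n_input, baseline="no-MTP"):
    """Compute per-config ratios relative to the baseline run."""
    base_in, base_out, _ = results[baseline]
    summary = {}
    for name, (inp, out, elapsed) in results.items():
        prompt_s = n_input / inp  # estimated prompt-ingestion time
        summary[name] = {
            "output_speedup": out / base_out,
            "input_ratio": inp / base_in,
            "prompt_seconds": prompt_s,
            "prompt_share": prompt_s / elapsed,  # assumes elapsed covers prompt
        }
    return summary


if __name__ == "__main__":
    for name, s in summarize(RESULTS, N_INPUT).items():
        print(
            f"{name:11s} output x{s['output_speedup']:.2f}  "
            f"input x{s['input_ratio']:.2f}  "
            f"prompt ~{s['prompt_seconds']:.1f}s"
        )
```

Under those assumptions, the draft-MTP runs generate output roughly 2.5x to 2.8x faster than no-MTP at this size, while ingesting the prompt at about 0.82x to 0.86x the baseline rate. That would be consistent with the longer elapsed times despite the higher output throughput.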

Code-output matrix with no-MTP, draft-MTP2, and draft-MTP3 also completed successfully with --no-cache-prompt, target prompt sizes up to 196k, and max_tokens up to 16k.

git diff --check origin/feature/turboquant-kv-cache...HEAD passes.

Split MTP pre-norm hidden-state extraction from normal embedding/logit output mode so prompt batches no longer mark every token as an output row. Keep Qwen35/Qwen35MoE hidden rows available for MTP while slicing before the final output head, and skip draft pre-norm host copies during prompt sync.

TheTom merged commit 9682c9b into TheTom:feature/turboquant-kv-cache on Jun 4, 2026
1 check passed

TheTom (Owner) commented Jun 4, 2026

Merged. Verified independently on a GB10 (sm_121, Blackwell): the patch compiles clean, the qwen35moe main decode path produces coherent output, and turbo3 perplexity on Qwen3.6-35B-A3B came out identical to baseline (6.3005), so no regression. Your draft-MTP2/3/4 matrix on sm_86 covers the self-spec path nicely, so between the two arches this is well covered. Clean, correctly scoped change. Thanks for the thorough validation writeup.

claude-eric-steiner (Author) commented Jun 4, 2026

> Merged. Verified independently on a GB10 (sm_121, Blackwell): the patch compiles clean, the qwen35moe main decode path produces coherent output, and turbo3 perplexity on Qwen3.6-35B-A3B came out identical to baseline (6.3005), so no regression. Your draft-MTP2/3/4 matrix on sm_86 covers the self-spec path nicely, so between the two arches this is well covered. Clean, correctly scoped change. Thanks for the thorough validation writeup.

Cool!

Kudos go 100% to GPT 5.5:

  1. It did the problem analysis
  2. It made the architectural change plan
  3. It wrote the new code
  4. It wrote the problem and solution description and pull request

I did not even look at the code. All I did was discover the performance issue and point GPT 5.5 at it; once the solution arrived 30 minutes later, I simply ran performance and quality tests.

And of course, kudos also to you, TheTom, for the great work you do here.

TheTom added a commit that referenced this pull request Jun 8, 2026
This fork's pre-norm MTP optimization is superseded by upstream's maintained
post-norm/nextn MTP, brought in by the commits that follow. Removing it here
so that lineage applies onto a clean base instead of colliding with it.

KGardevoir pushed a commit to KGardevoir/llama-cpp-turboquant that referenced this pull request Jun 16, 2026
Tap the MTP/nextn hidden state before the final output norm in both the
trunk and MTP graphs of Qwen35 and Qwen35MoE, reverting the semantic core
of the post-norm migration (ggml-org#24025) while keeping the upstream `nextn`
naming and the masked (embeddings_nextn_masked) extraction path.

This restores the fork's pre-norm Qwen MTP optimization (TheTom#149) on top of
the rebased TurboQuant+ KV cache lineage, per explicit request to prefer
pre-norm over upstream's post-norm/nextn approach.

https://claude.ai/code/session_01HZZQMPA6D6KWgcuDsDarWa