fix(server): SEGV when --mtp-head + --mmproj are both passed by WillowOneVision · Pull Request #17 · AtomicBot-ai/atomic-llama-cpp-turboquant

WillowOneVision · 2026-05-21T20:54:00Z

Summary

llama-server consistently segfaults during slot initialization when launched with both --mtp-head <assistant.gguf> and --mmproj <vision.gguf>. This PR diagnoses the root cause empirically (gdb backtrace from a core dump) and ships a minimal two-layer defensive fix that lets both flags be passed together without crashing: MTP is cleanly disabled when mmproj is loaded, and all other paths are preserved.

Reproducer

./llama-server \
    -m gemma-4-E4B-it-Q4_K_M.gguf \
    --mtp-head gemma-4-E4B-it-assistant.Q4_K_M.gguf --spec-type mtp \
    --draft-block-size 3 \
    --mmproj gemma-4-E4B-it-mmproj-F16.gguf \
    --host 127.0.0.1 --port 8090 -t 4 -c 4096 \
    -ctk turbo4 -ctv turbo4 -ctkd turbo4 -ctvd turbo4

Pre-patch log:

srv    load_model: loaded multimodal model, '.../mmproj-F16.gguf'
srv    load_model: speculative decoding is not supported by multimodal, it will be disabled
srv    load_model: initializing slots, n_slots = 4
[Segmentation fault (core dumped)]

Backtrace from core

Program terminated with signal SIGSEGV, Segmentation fault.
#0 llama_context::n_batch() const  (from libllama.so.0)
#1 common_speculative_init(common_params_speculative&, llama_context*)
#2 server_context_impl::load_model(common_params&)
#3 main

Root cause chain (10 steps)

1. common/common.cpp::common_init_from_params calls llama_model_load_mtp_from_file() to load the MTP assistant into the target model; by design, no separate draft context exists.
2. tools/server/server-context.cpp:668: when params_spec.type == COMMON_SPECULATIVE_TYPE_MTP, server-context sets params_base.speculative.model_dft = nullptr (correct, since there is no draft context).
3. tools/server/server-context.cpp:736-739: when mctx != nullptr (mmproj loaded), server-context overrides params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE and emits the WARN "speculative decoding is not supported by multimodal, it will be disabled". But it does NOT clear params_base.speculative.mparams_dft.path (= the --mtp-head GGUF path).
4. common/speculative.cpp:1250::common_speculative_is_compat(ctx_tgt) returns true because it only checks that the target context supports 2-token decode + partial sequence removal, independent of speculative type. So can_spec = true at server-context.cpp:777.
5. server-context.cpp:795: slot.spec = common_speculative_init(params_base.speculative, slot.ctx).
6. In common_speculative_init: line 1293 if (params.model_dft && params.type != MTP) { ctx_dft = init_from_model(...); }. model_dft is null (set at step 2), so ctx_dft stays nullptr.
7. Line 1322: has_draft = !params.mparams_dft.path.empty() evaluates to true (path was set by --mtp-head at CLI; never cleared at step 3).
8. Line 1330: has_mtp = (params.type == MTP) evaluates to false (was overridden to NONE at step 3).
9. Lines 1364-1369: if (has_draft) { if (has_mtp) {…} else { configs.push_back(DRAFT); } }. This pushes a COMMON_SPECULATIVE_TYPE_DRAFT config based on the orphaned path.
10. Lines 1383-1390: DRAFT case constructs common_speculative_state_draft(type, ctx_tgt, ctx_dft=nullptr, replacements). The ctor at line 227 calls batch = llama_batch_init(llama_n_batch(ctx_dft), 0, 1) → llama_n_batch(nullptr) → SEGV.

In one sentence: the server's mmproj-disable cleanup at server-context.cpp:738 only clears params.speculative.type but leaves params.speculative.mparams_dft.path set, which lets common_speculative_init push a DRAFT config whose constructor then dereferences a null ctx_dft.
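The chain can be reduced to a small standalone model. The sketch below is not the llama.cpp code: SpecType, SpecParams, select_configs, and the placeholder path "assistant.gguf" are hypothetical stand-ins chosen for illustration, and the MTP branch is left empty because the description elides it. It only shows that, once the type has been reset but the path has not, the selection logic from steps 7-9 still picks a DRAFT config.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the llama.cpp types named in the chain above.
enum class SpecType { None, Draft, Mtp };

struct SpecParams {
    SpecType    type = SpecType::None;
    std::string dft_path;             // models params.mparams_dft.path
    const void* model_dft = nullptr;  // models params.model_dft
};

// Models steps 7-9: whether a DRAFT config is pushed depends on the path,
// not on the speculative type.
std::vector<SpecType> select_configs(const SpecParams& p) {
    const bool has_draft = !p.dft_path.empty();
    const bool has_mtp   = (p.type == SpecType::Mtp);
    std::vector<SpecType> configs;
    if (has_draft) {
        if (has_mtp) {
            // MTP branch is elided in the description; not modeled here.
        } else {
            configs.push_back(SpecType::Draft);
        }
    }
    return configs;
}

// Reproduces the pre-patch parameter state after steps 2 and 3.
SpecParams prepatch_state_with_mmproj() {
    SpecParams p;
    p.type      = SpecType::Mtp;      // --spec-type mtp
    p.dft_path  = "assistant.gguf";   // placeholder for the --mtp-head path
    p.model_dft = nullptr;            // step 2
    p.type      = SpecType::None;     // step 3: the path is left untouched
    return p;
}

// Usage: the orphaned path alone is enough to select a DRAFT config.
void check_orphan_path_selects_draft() {
    const SpecParams p = prepatch_state_with_mmproj();
    const std::vector<SpecType> configs = select_configs(p);
    assert(configs.size() == 1 && configs[0] == SpecType::Draft);
    assert(p.model_dft == nullptr);  // so ctx_dft would also be null
}
```

Calling check_orphan_path_selects_draft() from a main() passes silently, which corresponds to the state in which the real constructor would receive a null ctx_dft.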

Patch (two-layer defense)

Layer 1 — `common/speculative.cpp::common_speculative_init`

Defensive early bail when type == NONE. Any future caller hitting the same conditions also gets protection.

common_speculative * common_speculative_init(
        common_params_speculative & params,
        llama_context             * ctx_tgt) {
    // Defensive: if speculative was disabled upstream (e.g. server disables it when mmproj
    // is loaded), bail out before any impl construction. Without this guard, a caller that
    // sets params.type=NONE but leaves params.mparams_dft.path (set via --mtp-head/--model-draft)
    // would still trigger the DRAFT config below with ctx_dft=nullptr, crashing in the
    // common_speculative_state_draft ctor at llama_n_batch(ctx_dft).
    if (params.type == COMMON_SPECULATIVE_TYPE_NONE) {
        return nullptr;
    }
    // ... existing code ...
}

Layer 2 — `tools/server/server-context.cpp:736-739`

Clear the orphan path + pointer at the disable site.

if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE) {
    params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
    // Also clear the draft model path so common_speculative_init does not
    // observe an orphan has_draft=true with type=NONE (would build a DRAFT
    // config and crash on ctx_dft=nullptr). See common/speculative.cpp init.
    params_base.speculative.mparams_dft.path.clear();
    params_base.speculative.model_dft = nullptr;
    SRV_WRN("%s\n", "speculative decoding is not supported by multimodal, it will be disabled");
}

Either layer alone is sufficient to prevent the crash; both together provide redundancy at the producer (server) and consumer (init) ends.
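The redundancy claim can be checked against a toy model with each layer behind a switch. This is a sketch under stated assumptions, not the patched code: SpecParams, disable_speculative, reaches_draft_ctor_with_null_ctx, and crashes are hypothetical names, and the "crash" is modeled as a boolean rather than an actual null dereference.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; names are illustrative, not the llama.cpp API.
enum class SpecType { None, Draft, Mtp };

struct SpecParams {
    SpecType    type = SpecType::None;
    std::string dft_path;             // models params.mparams_dft.path
    const void* model_dft = nullptr;  // models params.model_dft
};

// Models the server's mmproj-disable site. With layer2 enabled it also
// clears the orphan path and pointer, as the Layer 2 patch does.
void disable_speculative(SpecParams& p, bool layer2) {
    p.type = SpecType::None;
    if (layer2) {
        p.dft_path.clear();
        p.model_dft = nullptr;
    }
}

// Models init: returns true if control would reach the DRAFT constructor
// with a null draft context. With layer1 enabled it bails out on type NONE.
bool reaches_draft_ctor_with_null_ctx(const SpecParams& p, bool layer1) {
    if (layer1 && p.type == SpecType::None) {
        return false;  // Layer 1: early return before any config is built
    }
    const bool has_draft = !p.dft_path.empty();
    const bool has_mtp   = (p.type == SpecType::Mtp);
    return has_draft && !has_mtp && p.model_dft == nullptr;
}

// Runs the MTP + mmproj launch scenario with the chosen layers enabled.
bool crashes(bool layer1, bool layer2) {
    SpecParams p;
    p.type     = SpecType::Mtp;
    p.dft_path = "assistant.gguf";  // placeholder for the --mtp-head path
    disable_speculative(p, layer2);
    return reaches_draft_ctor_with_null_ctx(p, layer1);
}

// Usage: only the configuration with neither layer reaches the bad state.
void check_layers() {
    assert(crashes(false, false));
    assert(!crashes(true, false));
    assert(!crashes(false, true));
    assert(!crashes(true, true));
}
```

In this model, Layer 1 protects any caller that leaves stale parameters behind, while Layer 2 removes the stale state so that later readers of the parameters see a consistent picture.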

Verification

| Test | Cmdline | Result |
|---|---|---|
| MTP + mmproj launch (the original SEGV path) | `--mtp-head + --mmproj` | ✅ Server READY 18s, slots log "speculative decoding context not initialized" 4/4, no crash, no coredump |
| Text-only request (mmproj loaded, MTP disabled) | same cmdline + text query | ✅ "Water sustains all life on Earth." dur 4.4s, finish=stop |
| Image+text request (mmproj loaded, MTP disabled) | same cmdline + image query | ✅ 100 image tokens generated, eval 2.34 tok/s, no crash |
| Regression: MTP-only (no mmproj) | `--mtp-head` (no mmproj) | ✅ slots log "speculative decoding context initialized" 4/4, "Photosynthesis is the process by which plants convert light energy..." dur 10.2s, finish=stop |

The regression test is critical: it confirms the patch does not break the existing MTP-only path. The "initialized" status per slot proves the MTP draft head is loaded and ready to draft when text-only requests come in.

Not a true coexistence patch

This fix prevents the crash but does NOT achieve simultaneous MTP+vision (per-request dispatch). It cleanly disables MTP whenever mmproj is loaded, matching the WARN message's intent. True coexistence would require:

- Per-batch detection of image tokens
- Conditional MTP draft-head invocation (text-only batches → MTP speedup; image-containing batches → standard decode)
- Possibly distinct slot configuration per request type

That work is significantly more invasive and is documented as a follow-up.
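To make the shape of that follow-up concrete, here is a minimal sketch of the per-batch decision only. Everything in it is an assumption for illustration: BatchToken, its is_image flag, DecodePath, and choose_decode_path do not exist in the codebase, and the real work would also have to handle draft-head state, KV cache consistency, and slot configuration, none of which is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical types; the real server represents batches differently.
enum class DecodePath { Standard, MtpDraft };

struct BatchToken {
    int  id;
    bool is_image;  // assumed per-token marker for image embeddings
};

// Per-batch detection of image tokens.
bool batch_has_image_tokens(const std::vector<BatchToken>& batch) {
    return std::any_of(batch.begin(), batch.end(),
                       [](const BatchToken& t) { return t.is_image; });
}

// Conditional draft-head invocation: text-only batches may use MTP,
// image-containing batches fall back to standard decode.
DecodePath choose_decode_path(const std::vector<BatchToken>& batch,
                              bool mtp_available) {
    if (!mtp_available || batch_has_image_tokens(batch)) {
        return DecodePath::Standard;
    }
    return DecodePath::MtpDraft;
}

// Usage: a text-only batch is routed to MTP, a mixed batch is not.
void check_dispatch() {
    const std::vector<BatchToken> text_only = {{1, false}, {2, false}};
    const std::vector<BatchToken> mixed     = {{1, false}, {2, true}};
    assert(choose_decode_path(text_only, true)  == DecodePath::MtpDraft);
    assert(choose_decode_path(mixed, true)      == DecodePath::Standard);
    assert(choose_decode_path(text_only, false) == DecodePath::Standard);
}
```

The current patch corresponds to always passing mtp_available = false whenever mmproj is loaded.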

Test plan

- Reproducer command in description triggers SEGV on master, succeeds on this branch.
- gdb backtrace from core matches the 10-step chain analysis.
- Regression: text-only MTP path still gets MTP speedup post-patch.
- No new behavior for non-MTP, non-mmproj launches (unchanged code paths).
- (Suggested) Add a regression test in tests/test-speculative-mtp.cpp exercising the mmproj-loaded path.

When llama-server is launched with both --mtp-head <assistant.gguf> (Gemma 4 MTP speculative decoding) and --mmproj <vision.gguf> (multimodal projector), the server consistently segfaults during slot initialization.

Crash signature (gdb backtrace from core):

  #0 llama_context::n_batch()
  #1 common_speculative_init(common_params_speculative&, llama_context*)
  #2 server_context_impl::load_model
  #3 main

Root cause chain:

1. tools/server/server-context.cpp:738 sets params.speculative.type = NONE when mmproj is loaded (the WARN "speculative decoding is not supported by multimodal, it will be disabled").
2. But params.speculative.mparams_dft.path (set via --mtp-head) is NOT cleared. Stale state.
3. common_speculative_init evaluates
     has_draft = !params.mparams_dft.path.empty() -> true
     has_mtp = (params.type == MTP) -> false (overridden)
   and falls into the legacy DRAFT branch.
4. common_speculative_state_draft ctor at speculative.cpp:227 calls llama_batch_init(llama_n_batch(ctx_dft), ...) where ctx_dft is nullptr (because params.model_dft was zeroed for the MTP case). Null deref.

Two-layer defensive fix:

- common/speculative.cpp: common_speculative_init returns nullptr early when params.type == COMMON_SPECULATIVE_TYPE_NONE. Defensive against any caller that disables speculative but leaves orphan params.
- tools/server/server-context.cpp: at the mmproj-disable site, also clear params_base.speculative.mparams_dft.path and ...model_dft. Removes the orphan state at the source.

Either layer alone is sufficient; both together provide defense-in-depth.

Verified:

- Repro before patch: SEGV on launch.
- After patch: server boots cleanly, slots log "speculative decoding context not initialized", text-only and image+text requests both work, no crash.
- Regression: same binary without --mmproj still gets MTP speedup (slots log "speculative decoding context initialized" 4/4).

This patch prevents the crash but does NOT achieve concurrent MTP+vision operation (per-batch dispatch). It matches the WARN message intent (MTP cleanly disabled when mmproj is loaded). True coexistence is a separate scope (per-batch image-token detection + conditional draft head invocation).


github-actions Bot added examples server labels May 21, 2026

WillowOneVision wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from WillowOneVision:cecil/fix-mtp-mmproj-segv
