Environment
- Device: Apple M4 Pro (macOS 15.5, Darwin 24.6.0, arm64)
- Model: CosyVoice3-2512_Q4_K_S.gguf from Lourdle/Fun-CosyVoice3-0.5B-2512-GGUF
- Frontend: speech_tokenizer_v3.onnx + campplus.onnx
- GGML version: 0.9.11 (bundled)
- Build: cmake -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON with SIMDe patch for ARM64 AVX2→NEON translation
Build
Build succeeds with the SIMDe patch (PR #2). libcosyvoice.dylib and cosyvoice-cli are produced without errors.
Issue
Metal backend — crashes with unsupported op
$ cosyvoice-cli --model CosyVoice3-2512_Q4_K_S.gguf \
--speech-tokenizer speech_tokenizer_v3.onnx \
--campplus campplus.onnx \
--prompt-audio ref.wav --prompt-text "テスト" \
--text "こんにちは、テストです。" \
--output out.wav --mode zero-shot
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_op_encode_impl: error: unsupported op 'PAD'
The crash occurs during token2wav (Flow + HiFT stage), specifically at:
ggml_metal_op_encode + 2500
ggml_metal_graph_compute + 588
ggml_backend_sched_graph_compute + 2244
cosyvoice_model_3::token2wav + 948
cosyvoice_tts + 168
CPU backend — "Too many stop tokens"
With GGML_METAL=OFF (CPU-only build), generation starts but produces only 0.48–0.96 seconds of audio before aborting:
Too many stop tokens sampled, something might be wrong with the model or the sampling parameters.
Error: TTS generation failed.
This is consistent with the README Known Issues ("CPU / Vulkan backends: Produced noisy output in tests").
Analysis
The log above shows the Metal backend rejecting a PAD op that the CosyVoice Flow/HiFT graph in token2wav requires. This is likely an upstream GGML issue: the CPU backend implements PAD, but the Metal backend in the bundled GGML 0.9.11 does not appear to support it for this graph.
Possible solutions:
- Upstream GGML: Implement PAD in ggml-metal.metal (preferred long-term fix)
- Workaround: Force PAD ops to CPU via ggml_backend_sched offloading while keeping other ops on Metal
- Graph rewrite: Replace PAD with equivalent ops that are already Metal-supported (e.g., concatenation with zero tensors)
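To make the graph-rewrite option concrete, here is a minimal standalone sketch (plain C++, no GGML dependency) showing that trailing zero padding gives the same result as concatenating with zero blocks. The Tensor2D type and the pad_end, zeros, concat and pad_via_concat helpers are hypothetical names made up for this illustration; they are not GGML or CosyVoice APIs, and the sketch assumes the rejected PAD only appends zeros at the end of a dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal 2-D float tensor. ne0 is the fastest-varying
// dimension, mirroring the ne[0] ordering used by ggml tensors.
struct Tensor2D {
    std::size_t ne0;          // elements per row
    std::size_t ne1;          // number of rows
    std::vector<float> data;  // ne0 * ne1 values

    float at(std::size_t i0, std::size_t i1) const { return data[i1 * ne0 + i0]; }
};

// Reference behaviour: append p0 zeros to every row and p1 zero rows.
Tensor2D pad_end(const Tensor2D & a, std::size_t p0, std::size_t p1) {
    Tensor2D out{a.ne0 + p0, a.ne1 + p1,
                 std::vector<float>((a.ne0 + p0) * (a.ne1 + p1), 0.0f)};
    for (std::size_t i1 = 0; i1 < a.ne1; ++i1) {
        for (std::size_t i0 = 0; i0 < a.ne0; ++i0) {
            out.data[i1 * out.ne0 + i0] = a.at(i0, i1);
        }
    }
    return out;
}

// All-zero tensor of the given shape.
Tensor2D zeros(std::size_t ne0, std::size_t ne1) {
    return Tensor2D{ne0, ne1, std::vector<float>(ne0 * ne1, 0.0f)};
}

// Concatenate b after a along dim 0 (within rows) or dim 1 (extra rows).
Tensor2D concat(const Tensor2D & a, const Tensor2D & b, int dim) {
    assert(dim == 0 || dim == 1);
    if (dim == 0) {
        assert(a.ne1 == b.ne1);
        Tensor2D out{a.ne0 + b.ne0, a.ne1, {}};
        out.data.reserve(out.ne0 * out.ne1);
        for (std::size_t i1 = 0; i1 < a.ne1; ++i1) {
            out.data.insert(out.data.end(), a.data.begin() + i1 * a.ne0,
                            a.data.begin() + (i1 + 1) * a.ne0);
            out.data.insert(out.data.end(), b.data.begin() + i1 * b.ne0,
                            b.data.begin() + (i1 + 1) * b.ne0);
        }
        return out;
    }
    assert(a.ne0 == b.ne0);
    Tensor2D out{a.ne0, a.ne1 + b.ne1, a.data};
    out.data.insert(out.data.end(), b.data.begin(), b.data.end());
    return out;
}

// The rewrite: express trailing zero padding as one concat per padded dim.
Tensor2D pad_via_concat(const Tensor2D & a, std::size_t p0, std::size_t p1) {
    Tensor2D out = a;
    if (p0 > 0) out = concat(out, zeros(p0, out.ne1), 0);
    if (p1 > 0) out = concat(out, zeros(out.ne0, p1), 1);
    return out;
}
```

For any input, pad_via_concat(a, p0, p1) should hold the same data as pad_end(a, p0, p1). In a real graph the same idea would presumably map to ggml_concat with a zero-filled tensor in place of ggml_pad, but whether that is Metal-supported for the shapes and dtypes used in token2wav has not been verified here. If the rejected PAD pads at the start of a dimension, the zero block would have to come first in the concat.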
Related
PAD in its Metal shader implementations