Summary
On Apple Silicon (M3), the Q4_K_S quantized model (`CosyVoice3-2512_Q4_K_S.gguf`) produces output that sounds like noise or machine-like sound rather than intelligible speech. The LLM stage generates speech tokens successfully, but the final audio from token2wav (Flow + HiFT) is not recognizable as speech.
Environment
- Apple Silicon (M3)
- Model: Lourdle/Fun-CosyVoice3-0.5B-2512-GGUF (Q4_K_S, 551MB)
- `speech_tokenizer_v3.onnx` + `campplus.onnx`
Symptoms
1. HiFT vocoder IM2COL memory explosion (long sequences)
For sequences with more than ~400 speech tokens, HiFT Conv1d layers produce IM2COL tensors with `ne[1] > 0xFFFF` (e.g. `ne={896, 149761, 1, 1}`). There are 27 such tensors per HiFT pass, each requiring ~200MB+, causing segfaults.
The current workaround (`node->op = GGML_OP_NONE` for oversized IM2COL) prevents the crash but completely corrupts vocoder output — producing a repeating 4-sample pattern:
`[0.002363, -0.002809, 0.006419, -0.003742, 0.002363, -0.002809, ...]`
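For scale, a back-of-the-envelope estimate of one such intermediate. The 128-channel / kernel-7 factorization of `ne[0] = 896` is an assumption for illustration (ggml's actual layer shapes may differ), as is the f16 element size:

```python
def im2col_bytes(in_channels, kernel_size, out_len, dtype_bytes=2):
    """Size of the materialized IM2COL intermediate for a 1-D conv.

    ggml lowers Conv1d to IM2COL + matmul: the intermediate has
    ne[0] = in_channels * kernel_size columns and ne[1] = out_len rows.
    """
    return in_channels * kernel_size * out_len * dtype_bytes

# Observed tensor ne={896, 149761, 1, 1}; 128 * 7 = 896 is a
# hypothetical split of ne[0] into channels * kernel.
size = im2col_bytes(128, 7, 149761)
print(f"{size / 1e6:.0f} MB")  # ~268 MB at f16, ~537 MB at f32
```

With 27 such tensors per HiFT pass, the transient footprint easily reaches multiple gigabytes, which matches the observed segfaults.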
2. LLM token generation instability
The Q4_K_S model generates wildly inconsistent speech token counts. For 13 single-sentence inputs with the same reference audio:
- 4 produced output of reasonable length (3-12 s)
- 9 produced near-empty output (0.2-0.4 s)
3. Non-speech output even for "working" lengths
Even when the LLM generates a reasonable number of tokens and IM2COL is not disabled (short sequences), the output waveform has speech-like statistical properties (zero-crossing rate, autocorrelation) but does not sound like speech to human listeners.
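The statistical checks referred to above can be reproduced with a small pure-Python sketch (the thresholds and the 100 Hz test tone are illustrative, not the values used in the original analysis):

```python
import math

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)

def autocorr(x, lag):
    """Normalized autocorrelation of x at the given sample lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    return sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / var

# A periodic signal (here a 100 Hz tone at 24 kHz) shows a low ZCR and a
# strong autocorrelation peak at its period lag — voiced speech behaves
# similarly at pitch-period lags, which is why these metrics alone can
# look "speech-like" even when the audio is unintelligible.
tone = [math.sin(2 * math.pi * 100 * i / 24000) for i in range(2400)]
print(zero_crossing_rate(tone), autocorr(tone, 240))
```

The point of the sketch is that these metrics measure periodicity, not intelligibility, so they can pass on corrupted output.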
Analysis
| Factor | Impact |
| --- | --- |
| Q4_K_S quantization (4-bit) | Model precision severely degraded. Stop-token suppression partially helps, but the LLM remains unstable |
| HiFT IM2COL limit | Conv1d intermediate tensors explode for sequences >~400 tokens; disabling IM2COL corrupts output |
| Metal/CPU mixed execution | Resolved via CPU-only workaround in token2wav (PR #2), at a performance cost |
Potential fixes
- Higher-precision quantization — Q8_0 or Q5_K_M should significantly improve LLM stability and vocoder quality. Is a Q8_0 GGUF available, or can one be produced from the FP16 weights?
- Chunked HiFT processing — Split `speech_feat` into overlapping chunks with proper Conv1d receptive-field padding, run HiFT on each chunk, then crossfade and concatenate. This would avoid the IM2COL memory explosion.
- Streaming/tiled IM2COL — Implement IM2COL in tiles rather than materializing the full intermediate tensor. This is a ggml-level change.
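The chunked-processing idea could be prototyped along these lines. All names, the chunk/overlap sizes, and the upsample factor are illustrative; a real implementation would use HiFT's actual frame-to-sample ratio and pad each chunk to cover the Conv1d receptive field:

```python
def chunked_vocode(feat, vocoder, chunk=300, overlap=50, upsample=2):
    """Sketch of chunk-wise vocoding with a linear crossfade.

    feat: list of feature frames; vocoder: maps a frame list to audio
    at `upsample` samples per frame. Overlapping regions of adjacent
    chunks are blended to hide boundary artifacts.
    """
    hop = chunk - overlap
    fade = overlap * upsample   # crossfade length in samples
    out = []
    pos = 0
    while pos < len(feat):
        piece = vocoder(feat[pos:pos + chunk])
        if out:
            k = min(fade, len(piece))  # guard against a short final chunk
            # Linear crossfade: ramp the old tail down, the new head up
            mixed = [t * (1 - i / k) + h * (i / k)
                     for i, (t, h) in enumerate(zip(out[-k:], piece[:k]))]
            out[-k:] = mixed
            out.extend(piece[k:])
        else:
            out.extend(piece)
        pos += hop
    return out
```

As a sanity check, with an identity "vocoder" that repeats each frame `upsample` times, a constant input comes back unchanged and the output length equals `len(feat) * upsample`.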
Reproduction
```bash
# Build with PR #2 patches
cosyvoice-cli \
  --model CosyVoice3-2512_Q4_K_S.gguf \
  --speech-tokenizer speech_tokenizer_v3.onnx \
  --campplus campplus.onnx \
  --prompt-audio reference.wav \
  --prompt-text "reference transcript" \
  --text "Any Japanese text" \
  --output output.wav

# Analyze output
python3 -c "
import struct
with open('output.wav', 'rb') as f:
    data = f.read()
# Locate the WAV 'data' chunk; +8 skips the chunk ID and size fields
idx = data.find(b'data') + 8
floats = struct.unpack_from(f'<{(len(data)-idx)//4}f', data, idx)
print(f'max={max(abs(v) for v in floats):.4f}')
# Check for the 4-sample repeat pattern (corrupted HiFT)
if len(floats) > 200:
    is_loop = all(abs(floats[i]-floats[i+4])<0.0001 for i in range(100,120))
    print(f'corrupted={is_loop}')
"
```
Related