
Q4_K_S: token2wav produces noise/non-speech output on Apple Silicon #4

@jasagiri


Summary

On Apple Silicon (M3), the Q4_K_S quantized model (CosyVoice3-2512_Q4_K_S.gguf) produces output that sounds like noise or machine-like sound rather than intelligible speech. The LLM stage generates speech tokens successfully, but the final audio from token2wav (Flow + HiFT) is not recognizable as speech.

Environment

  • Hardware: Apple Silicon (M3)
  • Model: CosyVoice3-2512_Q4_K_S.gguf (Q4_K_S quantization)

Symptoms

1. HiFT vocoder IM2COL memory explosion (long sequences)

For sequences longer than roughly 400 speech tokens, the HiFT Conv1d layers produce IM2COL tensors with ne[1] > 0xFFFF (e.g. ne={896, 149761, 1, 1}). Each HiFT pass creates 27 such tensors, each requiring roughly 200 MB or more, which causes segfaults.
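For scale, the footprint of a single such intermediate can be estimated from its shape alone. The f16 element type below is an assumption (it lines up with the reported ~200 MB+ figure); if ggml materializes f32 instead, every number doubles:

```python
# Rough size estimate for one IM2COL intermediate with ne = {896, 149761, 1, 1}.
# Element type (f16, 2 bytes) is an assumption, not confirmed from the backend.
ne0, ne1 = 896, 149761
size_f16 = ne0 * ne1 * 2                                  # bytes per tensor
print(f"{size_f16 / 2**20:.0f} MiB per tensor")           # ~256 MiB
print(f"{27 * size_f16 / 2**30:.1f} GiB per HiFT pass")   # 27 such tensors
```

Even at f16 the 27 tensors of one pass total several GiB, which is consistent with the observed segfaults on long sequences.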

The current workaround (node->op = GGML_OP_NONE for oversized IM2COL) prevents the crash but completely corrupts vocoder output — producing a repeating 4-sample pattern:

[0.002363, -0.002809, 0.006419, -0.003742, 0.002363, -0.002809, ...]

2. LLM token generation instability

The Q4_K_S model generates wildly inconsistent speech token counts. For 13 single-sentence inputs with the same reference audio:

  • 4 produced reasonable length (3-12s)
  • 9 produced near-empty output (0.2-0.4s)

3. Non-speech output even for "working" lengths

Even when the LLM generates a reasonable number of tokens and IM2COL is not disabled (short sequences), the output waveform has speech-like statistical properties (zero-crossing rate, autocorrelation) but does not sound like speech to human listeners.
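For reference, the statistics mentioned above can be checked with stdlib Python alone. This is an illustrative sketch of the kind of check involved, not the analysis script actually used for this issue:

```python
import math

# Zero-crossing rate: fraction of adjacent sample pairs that change sign.
def zero_crossing_rate(x):
    return sum((a < 0) != (b < 0) for a, b in zip(x, x[1:])) / (len(x) - 1)

# Normalized autocorrelation at a given lag (near 1.0 for periodic signals).
def autocorr(x, lag=1):
    mean = sum(x) / len(x)
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(len(x) - lag))
    return ck / c0 if c0 else 0.0

# Sanity check on a 440 Hz sine at 24 kHz: ZCR ~= 2 * 440 / 24000 ~= 0.037
sine = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
print(f"zcr={zero_crossing_rate(sine):.3f} r1={autocorr(sine):.3f}")
```

Values in the typical speech range for both metrics are what makes the "statistically speech-like but unintelligible" symptom notable: the corruption is not visible at this level of analysis.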

Analysis

  • Q4_K_S quantization (4-bit): model precision is severely degraded. Stop-token suppression partially helps, but the LLM remains unstable.
  • HiFT IM2COL limit: Conv1d intermediate tensors explode for sequences longer than ~400 tokens; disabling IM2COL corrupts the output.
  • Metal/CPU mixed execution: resolved via the CPU-only workaround in token2wav (PR #2), at a performance cost.

Potential fixes

  1. Higher precision quantization — Q8_0 or Q5_K_M should significantly improve LLM stability and vocoder quality. Is a Q8_0 GGUF available or can one be produced from the FP16 weights?

  2. Chunked HiFT processing — Split speech_feat into overlapping chunks with proper Conv1d receptive field padding, run HiFT on each chunk, crossfade and concatenate. This would avoid the IM2COL memory explosion.

  3. Streaming/tiled IM2COL — Implement IM2COL in tiles rather than materializing the full intermediate tensor. This is a ggml-level change.
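The chunking idea in fix 2 can be sketched as below. The function names, the chunk/overlap sizes, and the `vocode` callback (standing in for a HiFT forward pass that emits `hop_samples` waveform samples per feature frame) are hypothetical placeholders, not code from this repository:

```python
# Sketch of fix 2: run the vocoder on overlapping chunks and crossfade.
def chunked_vocode(feat, vocode, chunk=400, overlap=50, hop_samples=480):
    out, step, pos = [], chunk - overlap, 0
    while pos < len(feat):
        wav = vocode(feat[pos:pos + chunk])          # waveform for this chunk
        if out:
            fade = overlap * hop_samples             # crossfade span in samples
            n = min(fade, len(out[-1]), len(wav))    # guard against short tails
            if n:
                tail, head = out[-1][-n:], wav[:n]
                xf = [t * (1 - i / n) + h * (i / n)  # linear crossfade
                      for i, (t, h) in enumerate(zip(tail, head))]
                out[-1] = out[-1][:-n] + xf
                wav = wav[n:]
        out.append(wav)
        pos += step
    return [s for seg in out for s in seg]           # concatenate segments
```

Note that for the crossfaded regions to actually agree, each chunk must additionally be padded by the Conv1d receptive field on both sides, as the fix description says; the sketch omits that padding.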

Reproduction

# Run a build that includes the PR #2 patches
cosyvoice-cli \
  --model CosyVoice3-2512_Q4_K_S.gguf \
  --speech-tokenizer speech_tokenizer_v3.onnx \
  --campplus campplus.onnx \
  --prompt-audio reference.wav \
  --prompt-text "reference transcript" \
  --text "Any Japanese text" \
  --output output.wav

# Analyze output
python3 -c "
import struct
# Assumes a 32-bit float WAV; 'data' + 8 skips the chunk id and size fields
with open('output.wav', 'rb') as fh:
    data = fh.read()
idx = data.find(b'data') + 8
floats = struct.unpack_from(f'<{(len(data)-idx)//4}f', data, idx)
print(f'max={max(abs(s) for s in floats):.4f}')
# Check for the repeating 4-sample pattern left by corrupted HiFT output
if len(floats) > 200:
    is_loop = all(abs(floats[i]-floats[i+4]) < 0.0001 for i in range(100, 120))
    print(f'corrupted={is_loop}')
"
