
Q4_K_S: token2wav produces noise/non-speech output on Apple Silicon #4

@jasagiri


Summary

On Apple Silicon (M3), the Q4_K_S quantized model (CosyVoice3-2512_Q4_K_S.gguf) produces output that sounds like noise or machine-like sound rather than intelligible speech. The LLM stage generates speech tokens successfully, but the final audio from token2wav (Flow + HiFT) is not recognizable as speech.

Environment

  • Hardware: Apple Silicon (M3)
  • Model: CosyVoice3-2512_Q4_K_S.gguf (Q4_K_S quantization)

Symptoms

1. HiFT vocoder IM2COL memory explosion (long sequences)

For sequences longer than roughly 400 speech tokens, the HiFT Conv1d layers produce IM2COL tensors with ne[1] > 0xFFFF (e.g. ne={896, 149761, 1, 1}). Each HiFT pass creates 27 such tensors, each requiring roughly 200 MB or more, which causes segfaults.
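For scale, the footprint of a single such intermediate can be estimated from its shape alone. The f16 element type below is an assumption (it lines up with the reported ~200 MB+ figure); if ggml materializes f32 instead, every number doubles:

```python
# Rough size estimate for one IM2COL intermediate with ne = {896, 149761, 1, 1}.
# Element type (f16, 2 bytes) is an assumption, not confirmed from the backend.
ne0, ne1 = 896, 149761
size_f16 = ne0 * ne1 * 2                                  # bytes per tensor
print(f"{size_f16 / 2**20:.0f} MiB per tensor")           # ~256 MiB
print(f"{27 * size_f16 / 2**30:.1f} GiB per HiFT pass")   # 27 such tensors
```

Even at f16 the 27 tensors of one pass total several GiB, which is consistent with the observed segfaults on long sequences.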

The current workaround (node->op = GGML_OP_NONE for oversized IM2COL) prevents the crash but completely corrupts vocoder output — producing a repeating 4-sample pattern:

[0.002363, -0.002809, 0.006419, -0.003742, 0.002363, -0.002809, ...]

2. LLM token generation instability

The Q4_K_S model generates wildly inconsistent speech token counts. For 13 single-sentence inputs with the same reference audio:

  • 4 produced reasonable length (3-12s)
  • 9 produced near-empty output (0.2-0.4s)

3. Non-speech output even for "working" lengths

Even when the LLM generates a reasonable number of tokens and IM2COL is not disabled (short sequences), the output waveform has speech-like statistical properties (zero-crossing rate, autocorrelation) but does not sound like speech to human listeners.
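For reference, the statistics mentioned above can be checked with stdlib Python alone. This is an illustrative sketch of the kind of check involved, not the analysis script actually used for this issue:

```python
import math

# Zero-crossing rate: fraction of adjacent sample pairs that change sign.
def zero_crossing_rate(x):
    return sum((a < 0) != (b < 0) for a, b in zip(x, x[1:])) / (len(x) - 1)

# Normalized autocorrelation at a given lag (near 1.0 for periodic signals).
def autocorr(x, lag=1):
    mean = sum(x) / len(x)
    c0 = sum((v - mean) ** 2 for v in x)
    ck = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(len(x) - lag))
    return ck / c0 if c0 else 0.0

# Sanity check on a 440 Hz sine at 24 kHz: ZCR ~= 2 * 440 / 24000 ~= 0.037
sine = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
print(f"zcr={zero_crossing_rate(sine):.3f} r1={autocorr(sine):.3f}")
```

Values in the typical speech range for both metrics are what makes the "statistically speech-like but unintelligible" symptom notable: the corruption is not visible at this level of analysis.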

Analysis

  • Q4_K_S quantization (4-bit): model precision is severely degraded. Stop-token suppression partially helps, but the LLM remains unstable.
  • HiFT IM2COL limit: Conv1d intermediate tensors explode for sequences longer than ~400 tokens; disabling IM2COL corrupts the output.
  • Metal/CPU mixed execution: resolved via the CPU-only workaround in token2wav (PR #2), at a performance cost.

Potential fixes

  1. Higher precision quantization — Q8_0 or Q5_K_M should significantly improve LLM stability and vocoder quality. Is a Q8_0 GGUF available or can one be produced from the FP16 weights?

  2. Chunked HiFT processing — Split speech_feat into overlapping chunks with proper Conv1d receptive field padding, run HiFT on each chunk, crossfade and concatenate. This would avoid the IM2COL memory explosion.

  3. Streaming/tiled IM2COL — Implement IM2COL in tiles rather than materializing the full intermediate tensor. This is a ggml-level change.
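The chunking idea in fix 2 can be sketched as below. The function names, the chunk/overlap sizes, and the `vocode` callback (standing in for a HiFT forward pass that emits `hop_samples` waveform samples per feature frame) are hypothetical placeholders, not code from this repository:

```python
# Sketch of fix 2: run the vocoder on overlapping chunks and crossfade.
def chunked_vocode(feat, vocode, chunk=400, overlap=50, hop_samples=480):
    out, step, pos = [], chunk - overlap, 0
    while pos < len(feat):
        wav = vocode(feat[pos:pos + chunk])          # waveform for this chunk
        if out:
            fade = overlap * hop_samples             # crossfade span in samples
            n = min(fade, len(out[-1]), len(wav))    # guard against short tails
            if n:
                tail, head = out[-1][-n:], wav[:n]
                xf = [t * (1 - i / n) + h * (i / n)  # linear crossfade
                      for i, (t, h) in enumerate(zip(tail, head))]
                out[-1] = out[-1][:-n] + xf
                wav = wav[n:]
        out.append(wav)
        pos += step
    return [s for seg in out for s in seg]           # concatenate segments
```

Note that for the crossfaded regions to actually agree, each chunk must additionally be padded by the Conv1d receptive field on both sides, as the fix description says; the sketch omits that padding.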

Reproduction

# Run a build that includes the PR #2 patches
cosyvoice-cli \
  --model CosyVoice3-2512_Q4_K_S.gguf \
  --speech-tokenizer speech_tokenizer_v3.onnx \
  --campplus campplus.onnx \
  --prompt-audio reference.wav \
  --prompt-text "reference transcript" \
  --text "Any Japanese text" \
  --output output.wav

# Analyze output
python3 -c "
import struct
# Assumes a 32-bit float WAV; 'data' + 8 skips the chunk id and size fields
with open('output.wav', 'rb') as fh:
    data = fh.read()
idx = data.find(b'data') + 8
floats = struct.unpack_from(f'<{(len(data)-idx)//4}f', data, idx)
print(f'max={max(abs(s) for s in floats):.4f}')
# Check for the repeating 4-sample pattern left by corrupted HiFT output
if len(floats) > 200:
    is_loop = all(abs(floats[i]-floats[i+4]) < 0.0001 for i in range(100, 120))
    print(f'corrupted={is_loop}')
"
