21 commits
- `4042cd8` feat(cohere-transcribe): CoreML conversion + quantization pipeline (Alex-Wengg, Apr 22, 2026)
- `3bc70d8` fix(cohere-transcribe): host-side feature extraction + CJK detok + cr… (Alex-Wengg, Apr 22, 2026)
- `8013c05` experiment(cohere-transcribe): q8 EOS bias diagnosis + fix (Alex-Wengg, Apr 22, 2026)
- `91a3ca3` docs(cohere-transcribe): capture host-side + q8 EOS findings, drop on… (Alex-Wengg, Apr 22, 2026)
- `c5de37e` chore(cohere-transcribe): drop f16/ and q8/ HF upload bundles (Alex-Wengg, Apr 22, 2026)
- `f2cdf4e` chore(cohere-transcribe): aggressive PR cleanup (-126K lines, -57 files) (Alex-Wengg, Apr 22, 2026)
- `010aa1f` revert(cohere-transcribe): restore cache-external decoder pipeline (Alex-Wengg, Apr 22, 2026)
- `fb6dd30` chore(cohere-transcribe): drop python_cache_external_*.json result ca… (Alex-Wengg, Apr 23, 2026)
- `160454e` chore(cohere-transcribe): drop hf-upload cache-external mirror (Alex-Wengg, Apr 23, 2026)
- `114c84c` chore(cohere-transcribe): drop hf-upload workflow docs (Alex-Wengg, Apr 23, 2026)
- `f01d2d0` chore(cohere-transcribe): move CACHE_EXTERNAL_*.md into docs/ (Alex-Wengg, Apr 23, 2026)
- `3c77edc` chore(cohere-transcribe): drop superseded / off-canonical docs (Alex-Wengg, Apr 23, 2026)
- `ab9c2ce` chore(cohere-transcribe): consolidate cache-external exports under ex… (Alex-Wengg, Apr 23, 2026)
- `8bb9714` chore(cohere-transcribe): move download-fleurs-for-swift.py into tools/ (Alex-Wengg, Apr 23, 2026)
- `806143a` chore(cohere-transcribe): drop off-canonical decoder exports (Alex-Wengg, Apr 23, 2026)
- `f700df8` chore(cohere-transcribe): drop zombie stateful bench scripts (Alex-Wengg, Apr 23, 2026)
- `b12fb40` chore(cohere-transcribe/tools): drop stateful-decoder zombies (Alex-Wengg, Apr 23, 2026)
- `ad4398d` docs(cohere-transcribe): align README and CACHE_INVESTIGATION_SUMMARY… (Alex-Wengg, Apr 23, 2026)
- `9c31edf` fix(cohere/tools): drop stale stateful-decoder compile note (Alex-Wengg, Apr 23, 2026)
- `544468a` fix(cohere/exports): align ct.convert with repo convention (Alex-Wengg, Apr 23, 2026)
- `e0ee826` docs(cohere): drop --precision from decoder Quick Start command (Alex-Wengg, Apr 23, 2026)
12 changes: 11 additions & 1 deletion .gitignore
@@ -9,4 +9,14 @@ __pycache__
*.mlmodelc

# Large numpy arrays (exported constants - regenerate via export_constants.py)
*.npy

# PyTorch model weights (download from HuggingFace)
*.safetensors
*.bin
*.pt
*.pth
*.ckpt

# ONNX models
*.onnx
48 changes: 48 additions & 0 deletions models/stt/cohere-transcribe-03-2026/coreml/.gitignore
@@ -0,0 +1,48 @@
# Virtual environments
.venv/
.venv312/

# Build artifacts
build/
*.mlpackage
*.mlmodelc

# Large model files
onnx-models/
*.onnx

# Audio test files
*.wav
*.mp3
*.flac

# Logs
*.log

# Python
__pycache__/
*.pyc
*.pyo
*.egg-info/

# Misc
.DS_Store
cross_caches.pkl

# Test results and temporary files
*_results.json
benchmark_*.json
*_fleurs_*.json
*_cache_external*.json
.hf_cache/

# Reference models (external git repo)
barathwaj-models/

# HF source clone (download via huggingface-cli, do not vendor)
../cohere-pytorch/

# Local working dirs for the quantization toolchain
# (populated via huggingface-cli download FluidInference/cohere-transcribe-03-2026-coreml)
f16/
q8/
122 changes: 122 additions & 0 deletions models/stt/cohere-transcribe-03-2026/coreml/README.md
@@ -0,0 +1,122 @@
# Cohere Transcribe CoreML Export

CoreML export of [CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026) for on-device speech recognition on Apple Silicon.

## Status: Cache-External Decoder

The canonical pipeline exports the decoder with **host-managed KV cache**: the Swift loader allocates K/V cache tensors and passes them through each decoding step. This avoids the CoreML State API (macOS 15+/iOS 18+) and works everywhere FluidAudio runs.

| Component | Format | Notes |
|-----------|--------|-------|
| Encoder | `.mlpackage` (FP16) | 3500 frames (35 s), projection fused |
| Decoder (cache-external) | `.mlpackage` (FP16/INT8) | Per-step `input_id` + `k_cache_*` / `v_cache_*` I/O |
| Mel preprocessing | Pure Python / numpy | No `transformers` dependency |

### Shipped artifacts

The corresponding HF model repos consumed by FluidAudio:

- [`FluidInference/cohere-transcribe-cache-external-coreml`](https://huggingface.co/FluidInference/cohere-transcribe-cache-external-coreml) — FP16
- [`FluidInference/cohere-transcribe-q8-cache-external-coreml`](https://huggingface.co/FluidInference/cohere-transcribe-q8-cache-external-coreml) — INT8 hybrid (encoder Q8, decoder FP16)

Loader: `CohereFixedPipeline` in [FluidAudio PR #487](https://github.com/FluidInference/FluidAudio/pull/487).

## Decoder I/O contract

The Swift loader expects this exact decoder signature.

**Inputs**
- `input_id`: shape `(1, 1)` — token at the current step
- `position_id`: shape `(1, 1)` — current position
- `encoder_hidden_states`: shape `(1, enc_len, 1024)`
- `cross_attention_mask`: shape `(1, 1, 1, enc_len)`
- `attention_mask`: shape `(1, 1, 1, max_seq_len)` — masks unfilled cache slots
- `k_cache_0..7`, `v_cache_0..7`: shape `(1, num_heads, max_seq_len, head_dim)`

**Outputs**
- `logits`: shape `(1, 16384)`
- `k_cache_0_out..7_out`, `v_cache_0_out..7_out`: updated caches written back to host

Constants: `num_layers=8`, `num_heads=8`, `head_dim=128`, `hidden=1024`, `vocab=16384`, `max_seq_len=108`.
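A minimal host-side setup for this contract might look like the following. This is a sketch, not the FluidAudio loader: the constants mirror the table above, but the additive-mask convention (large negative value for unfilled slots, zero for attendable positions) is an assumption.

```python
import numpy as np

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 8, 8, 128
MAX_SEQ_LEN = 108
NEG_INF = -1e4  # additive mask value for slots not yet filled (assumption)

def make_decoder_state(enc_len: int):
    """Allocate the host-managed KV caches and attention masks."""
    k_caches = {f"k_cache_{i}": np.zeros((1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM), dtype=np.float16)
                for i in range(NUM_LAYERS)}
    v_caches = {f"v_cache_{i}": np.zeros((1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM), dtype=np.float16)
                for i in range(NUM_LAYERS)}
    # Start with every self-attention slot masked out; unmask slots as tokens land.
    self_mask = np.full((1, 1, 1, MAX_SEQ_LEN), NEG_INF, dtype=np.float16)
    # Attend to all encoder frames by default.
    cross_mask = np.zeros((1, 1, 1, enc_len), dtype=np.float16)
    return k_caches, v_caches, self_mask, cross_mask
```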

## Quick Start

```bash
# Encoder (FP16)
uv run python3 exports/export-encoder.py --output-dir build --precision float16

# Decoder (cache-external)
uv run python3 exports/export-decoder-cache-external.py --output-dir build

# Optional: INT8 encoder
uv run python3 tools/quantize_to_int8.py
uv run python3 tools/compile_encoder_to_mlmodelc.py
```

## Decoding loop (Python reference)

```python
PROMPT_IDS = [13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]
# ▁ <|startofcontext|> <|startoftranscript|> <|emo:undefined|>
# <|en|> <|en|> <|pnc|> <|noitn|> <|notimestamp|> <|nodiarize|>
EOS_TOKEN_ID = 3
MAX_SEQ_LEN = 108

# Pre-fill the caches with the prompt, then loop one token at a time.
# Assumption: the prompt has already been fed through the decoder during
# pre-fill (populating k_caches/v_caches), so generation resumes from the
# last prompt token.
token = PROMPT_IDS[-1]
for step in range(len(PROMPT_IDS), MAX_SEQ_LEN):
    out = decoder.predict({
        "input_id": np.array([[token]], dtype=np.int32),
        "position_id": np.array([[step - 1]], dtype=np.int32),
        "encoder_hidden_states": encoder_hidden,
        "cross_attention_mask": cross_mask,
        "attention_mask": self_mask,
        **k_caches, **v_caches,
    })
    token = int(np.argmax(out["logits"][0]))
    # write k_cache_*_out / v_cache_*_out back into k_caches / v_caches
    if token == EOS_TOKEN_ID:
        break
```
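The write-back step mentioned in the loop's comment can be made explicit. A sketch, assuming the cache dictionaries are keyed exactly as the model's input names and that one slot of an additive attention mask is opened per committed token:

```python
import numpy as np

NUM_LAYERS = 8

def commit_step(out, k_caches, v_caches, self_mask, step):
    """Copy updated caches back into the next step's inputs and unmask the new slot."""
    for i in range(NUM_LAYERS):
        k_caches[f"k_cache_{i}"] = out[f"k_cache_{i}_out"]
        v_caches[f"v_cache_{i}"] = out[f"v_cache_{i}_out"]
    # Slot `step` now holds a valid key/value pair, so stop masking it.
    self_mask[..., step] = 0.0
```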

## Files

```
exports/
export-encoder.py # Encoder + projection
export-decoder-cache-external.py # Canonical decoder export

tools/
cohere_features_v2.py # Numpy mel-spectrogram extractor
compile_encoder_to_mlmodelc.py # mlpackage → mlmodelc (encoder)
download-fleurs-for-swift.py # FLEURS fetcher for Swift benches
quantize_to_int8.py # Encoder INT8 quantization

tests/
test-feature-parity.py # PyTorch vs CoreML mel parity check

docs/
CACHE_EXTERNAL_ANALYSIS.md
CACHE_EXTERNAL_DELIVERED.md
CACHE_INVESTIGATION_SUMMARY.md
COHERE_ARCHITECTURE_ANALYSIS.md
HOST_SIDE_FIXES.md
Q8_EOS_BIAS.md
```

## Background

Earlier iterations explored a stateful (CoreML State API) decoder and a stateless (re-process all tokens per step) decoder. Both were dropped:

- **Stateful** — required macOS 15+/iOS 18+ and added cache-management complexity in the model.
- **Stateless** — O(n²), produced wrong outputs on longer sequences during validation.

`docs/CACHE_INVESTIGATION_SUMMARY.md` documents the earlier sliding-window cache bug that motivated moving cache management out of the model entirely.

## Requirements

- macOS 14+ / iOS 17+
- Python 3.10+, dependencies in `pyproject.toml` (managed with `uv`)

## License

GPL-3.0 (matches upstream CoreML conversion). Base model: Apache-2.0 ([CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026)).
@@ -0,0 +1,206 @@
# Cache-External Decoder Analysis: Python vs Swift

**Date**: April 8, 2026
**Model**: Cohere Transcribe 03-2026 (Cache-External Decoder)
**Test Dataset**: FLEURS multilingual (10 samples per language)

## Executive Summary

Both Python (CoreML) and Swift implementations of the cache-external decoder exhibit **severe multilingual hallucination issues**, but Swift is significantly worse. The root cause is that **neither implementation uses language conditioning**, and the exported CoreML decoder does not preserve the model's language detection capabilities.

## WER Comparison

| Language | Python WER | Swift WER | Swift vs Python |
|----------|-----------|-----------|-----------------|
| **English** | 55.02% | 263% | **4.8x worse** |
| **French** | 92.33% | 150% | **1.6x worse** |
| **Spanish** | 24.26% | 43% | **1.8x worse** |
| **Chinese** | 105.09% | 111% | Similar (both hallucinating) |
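For reference, WER figures like these are word-level Levenshtein edit distance divided by reference length, which is how values above 100% arise (more edits than reference words). A minimal sketch of the metric, not the project's actual scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```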

## Detailed Findings

### 1. Language Hallucination Patterns

Both implementations produce **non-target-language output** for most languages:

#### English Samples (Python):
- **Sample 0**: Arabic script `ولو انهم يحبون انهم يحبون...` (100% WER)
- **Sample 1**: Correct English transcription (62% WER)
- **Sample 4**: Arabic script `مين بصوتك في مكانك...` (267% WER)

#### French Samples (Python):
- **Sample 0**: Arabic script `نحن نعلم ان هناك من يحمل حياتنا...` (100% WER)
- **Sample 7**: Partial French transcription (58% WER)
- **Samples 2-6**: all Arabic hallucinations (100% WER each)

#### Spanish Samples (Python):
- **Sample 2**: Nearly perfect `"fue tanta la cantidad de gente que se concentró..."` (4.5% WER)
- **Sample 0**: Good quality Spanish (13.8% WER)
- **Average**: Best performance across all languages (24.26% WER)

#### Chinese Samples (Python):
- **Sample 0**: Polish script `"to tylko szybko odkryć..."` (100% WER)
- **Sample 1**: Arabic script `كعكعك يا شوشو...` (100% WER)
- **Sample 4**: English `"i'm sure the government..."` (122% WER)
- **All samples**: Complete hallucination (105% WER overall)

### 2. Swift Implementation Issues

Swift cache-external decoder produces **even worse hallucinations**:

- **English**: 263% WER (vs Python 55%)
- **French**: 150% WER (vs Python 92%)
- **Spanish**: 43% WER (vs Python 24%) - still best language
- **Chinese**: 111% WER (vs Python 105%)

**Why Swift is worse** (hypotheses, not yet confirmed):
1. Bugs in KV cache management
2. Incorrect attention mask sizing
3. Position ID handling issues

All symptoms point to corrupted or stale KV-cache state on the Swift side.

### 3. Root Cause Analysis

#### Neither Implementation Uses Language Conditioning

**Python code** (test-fleurs-wer.py:109):
```python
current_token = START_TOKEN # Just token 4, no language token
```

**Swift code** (CohereAsrManager.swift):
```swift
let prompt = language?.promptSequence ?? [CohereAsrConfig.SpecialTokens.startToken]
```

Although the Swift code does support language prompts, both tests were run with `START_TOKEN` only (see Test Configuration), so the comparison reflects unconditioned decoding; the working assumption was that a properly exported model should still transcribe without explicit language tokens.
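The difference between the two prompting modes can be sketched as below. The token IDs come from the repo README's reference prompt (English, punctuation on, no timestamps); `build_prompt` is a hypothetical helper, not code from either implementation.

```python
START_TOKEN = 4  # <|startoftranscript|>

# Fully conditioned prompt from the README (English control-token sequence):
PROMPT_IDS = [13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]

def build_prompt(conditioned: bool) -> list:
    """Return the decoder prompt: the full conditioned sequence,
    or the bare start token as used in these WER tests."""
    return list(PROMPT_IDS) if conditioned else [START_TOKEN]
```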

#### The CoreML Export Lost Language Detection

The original PyTorch model likely:
1. Auto-detects language from encoder hidden states
2. Conditions decoder output based on detected language
3. Uses language embeddings in the decoder layers

The CoreML export process:
1. Traced with fixed inputs (no language conditioning)
2. Lost dynamic language detection logic
3. Defaults to Arabic/mixed-language tokens

### 4. Why Spanish Works

Spanish achieves 24-43% WER while other languages hallucinate (>90% WER). Possible reasons:

1. **Training data dominance**: Spanish may be the most represented language in training
2. **Default language mode**: Model defaults to Spanish when language detection fails
3. **Simpler phonetics**: Spanish has more regular phoneme-to-grapheme mapping
4. **Export artifacts**: The specific trace inputs used during export may have been Spanish audio

## Recommendations

### Option 1: Re-export with Language Conditioning (RECOMMENDED)

**Action**: Modify `export-decoder-cache-external.py` to:
1. Accept language token as an additional input
2. Embed language token into the decoder's initial state
3. Export separate decoders per language (or one multilingual with language input)

**Pros**:
- Proper language conditioning
- Matches PyTorch model behavior
- Clean architecture

**Cons**:
- Requires re-export and re-testing
- May increase model size
- Need to test all languages

### Option 2: Use Stateful Decoder (iOS Only)

**Action**: Use the stateful decoder (already exported) which may preserve language state better.

**Pros**:
- CoreML manages state internally
- May preserve language context better
- Simpler Swift code

**Cons**:
- Requires the CoreML State API (macOS 15+ / iOS 18+; unavailable on earlier OS releases)
- Still may have same language detection issues
- Would need iOS device testing

### Option 3: Language-Specific Decoders

**Action**: Export separate decoder models per language.

**Pros**:
- Guaranteed language isolation
- Smaller per-language models
- No language confusion possible

**Cons**:
- 14 separate decoder models to manage
- 14× storage/memory requirements
- Deployment complexity

### Option 4: Accept Spanish-Only

**Action**: Document that cache-external decoder only works for Spanish, use other models for multilingual.

**Pros**:
- Works today (24-43% WER acceptable)
- No additional work required
- Clear user expectations

**Cons**:
- Very limited language support
- Defeats purpose of multilingual model
- Poor user experience for non-Spanish users

## Next Steps

1. **Decide on approach** (recommend Option 1: re-export with language conditioning)
2. **If re-exporting**:
- Modify export script to accept language token input
- Test with all 14 supported languages
- Validate WER across all languages
- Update Swift code to pass language token
3. **If accepting limitations**:
- Document Spanish-only support for cache-external
- Recommend stateful decoder for iOS multilingual use
- Consider alternative models (Whisper, Parakeet) for multilingual

## Technical Details

### Cache-External Decoder Architecture

**Inputs** (17 total):
- `input_id` (1,1) - Current token
- `position_id` (1,1) - Position in sequence
- `encoder_hidden_states` (1, 438, 1024) - Encoder output
- `cross_attention_mask` (1, 1, 1, 438) - Encoder attention mask
- `attention_mask` (1, 1, 1, step+1) - Growing decoder attention mask
- `k_cache_0` through `k_cache_7` (8 arrays: 1, 8, 108, 128) - Key caches for 8 layers
- `v_cache_0` through `v_cache_7` (8 arrays: 1, 8, 108, 128) - Value caches for 8 layers

**Outputs** (17 total):
- `logits` (1, 16384) - Token probabilities
- `k_cache_0_out` through `k_cache_7_out` - Updated key caches
- `v_cache_0_out` through `v_cache_7_out` - Updated value caches

### Test Configuration

- **Python**: CoreMLTools prediction with PyTorch encoder
- **Swift**: Full Swift implementation with encoder + cache-external decoder
- **Dataset**: FLEURS test split (Google's multilingual ASR benchmark)
- **Languages**: en_us, fr_fr, es_419, cmn_hans_cn
- **Samples**: 10 per language (40 total)
- **No language conditioning**: Both tests started with START_TOKEN only

## Conclusion

The cache-external decoder is **fundamentally broken for multilingual use** in both Python and Swift, with Swift significantly worse. The primary failure is not Swift-specific: the **CoreML export process** lost the model's language detection capabilities, and Swift's larger gap suggests additional cache-handling bugs layered on top.

**Spanish is the only language that works** (24-43% WER), suggesting it was the export reference language or the most dominant in training.

To make this model usable for multilingual transcription, we must **re-export the decoder with explicit language conditioning** built into the model inputs, or accept Spanish-only deployment.