21 commits
- `4042cd8` feat(cohere-transcribe): CoreML conversion + quantization pipeline (Alex-Wengg, Apr 22, 2026)
- `3bc70d8` fix(cohere-transcribe): host-side feature extraction + CJK detok + cr… (Alex-Wengg, Apr 22, 2026)
- `8013c05` experiment(cohere-transcribe): q8 EOS bias diagnosis + fix (Alex-Wengg, Apr 22, 2026)
- `91a3ca3` docs(cohere-transcribe): capture host-side + q8 EOS findings, drop on… (Alex-Wengg, Apr 22, 2026)
- `c5de37e` chore(cohere-transcribe): drop f16/ and q8/ HF upload bundles (Alex-Wengg, Apr 22, 2026)
- `f2cdf4e` chore(cohere-transcribe): aggressive PR cleanup (-126K lines, -57 files) (Alex-Wengg, Apr 22, 2026)
- `010aa1f` revert(cohere-transcribe): restore cache-external decoder pipeline (Alex-Wengg, Apr 22, 2026)
- `fb6dd30` chore(cohere-transcribe): drop python_cache_external_*.json result ca… (Alex-Wengg, Apr 23, 2026)
- `160454e` chore(cohere-transcribe): drop hf-upload cache-external mirror (Alex-Wengg, Apr 23, 2026)
- `114c84c` chore(cohere-transcribe): drop hf-upload workflow docs (Alex-Wengg, Apr 23, 2026)
- `f01d2d0` chore(cohere-transcribe): move CACHE_EXTERNAL_*.md into docs/ (Alex-Wengg, Apr 23, 2026)
- `3c77edc` chore(cohere-transcribe): drop superseded / off-canonical docs (Alex-Wengg, Apr 23, 2026)
- `ab9c2ce` chore(cohere-transcribe): consolidate cache-external exports under ex… (Alex-Wengg, Apr 23, 2026)
- `8bb9714` chore(cohere-transcribe): move download-fleurs-for-swift.py into tools/ (Alex-Wengg, Apr 23, 2026)
- `806143a` chore(cohere-transcribe): drop off-canonical decoder exports (Alex-Wengg, Apr 23, 2026)
- `f700df8` chore(cohere-transcribe): drop zombie stateful bench scripts (Alex-Wengg, Apr 23, 2026)
- `b12fb40` chore(cohere-transcribe/tools): drop stateful-decoder zombies (Alex-Wengg, Apr 23, 2026)
- `ad4398d` docs(cohere-transcribe): align README and CACHE_INVESTIGATION_SUMMARY… (Alex-Wengg, Apr 23, 2026)
- `9c31edf` fix(cohere/tools): drop stale stateful-decoder compile note (Alex-Wengg, Apr 23, 2026)
- `544468a` fix(cohere/exports): align ct.convert with repo convention (Alex-Wengg, Apr 23, 2026)
- `e0ee826` docs(cohere): drop --precision from decoder Quick Start command (Alex-Wengg, Apr 23, 2026)
12 changes: 11 additions & 1 deletion .gitignore
@@ -9,4 +9,14 @@ __pycache__
*.mlmodelc

# Large numpy arrays (exported constants - regenerate via export_constants.py)
*.npy

# PyTorch model weights (download from HuggingFace)
*.safetensors
*.bin
*.pt
*.pth
*.ckpt

# ONNX models
*.onnx
48 changes: 48 additions & 0 deletions models/stt/cohere-transcribe-03-2026/coreml/.gitignore
@@ -0,0 +1,48 @@
# Virtual environments
.venv/
.venv312/

# Build artifacts
build/
*.mlpackage
*.mlmodelc

# Large model files
onnx-models/
*.onnx

# Audio test files
*.wav
*.mp3
*.flac

# Logs
*.log

# Python
__pycache__/
*.pyc
*.pyo
*.egg-info/

# Misc
.DS_Store
cross_caches.pkl

# Test results and temporary files
*_results.json
benchmark_*.json
*_fleurs_*.json
*_cache_external*.json
.hf_cache/

# Reference models (external git repo)
barathwaj-models/

# HF source clone (download via huggingface-cli, do not vendor)
../cohere-pytorch/

# Local working dirs for the quantization toolchain
# (populated via huggingface-cli download FluidInference/cohere-transcribe-03-2026-coreml)
f16/
q8/
122 changes: 122 additions & 0 deletions models/stt/cohere-transcribe-03-2026/coreml/README.md
@@ -0,0 +1,122 @@
# Cohere Transcribe CoreML Export

CoreML export of [CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026) for on-device speech recognition on Apple Silicon.

## Status: Cache-External Decoder

The canonical pipeline exports the decoder with **host-managed KV cache**: the Swift loader allocates K/V cache tensors and passes them through each decoding step. This avoids the CoreML State API (macOS 15+/iOS 18+) and works everywhere FluidAudio runs.

| Component | Format | Notes |
|-----------|--------|-------|
| Encoder | `.mlpackage` (FP16) | 3500 frames (35 s), projection fused |
| Decoder (cache-external) | `.mlpackage` (FP16/INT8) | Per-step `input_id` + `k_cache_*` / `v_cache_*` I/O |
| Mel preprocessing | Pure Python / numpy | No `transformers` dependency |

### Shipped artifacts

The corresponding HF model repos consumed by FluidAudio:

- [`FluidInference/cohere-transcribe-cache-external-coreml`](https://huggingface.co/FluidInference/cohere-transcribe-cache-external-coreml) — FP16
- [`FluidInference/cohere-transcribe-q8-cache-external-coreml`](https://huggingface.co/FluidInference/cohere-transcribe-q8-cache-external-coreml) — INT8 hybrid (encoder Q8, decoder FP16)

Loader: `CohereFixedPipeline` in [FluidAudio PR #487](https://github.com/FluidInference/FluidAudio/pull/487).

## Decoder I/O contract

The Swift loader expects this exact decoder signature.

**Inputs**
- `input_id`: shape `(1, 1)` — token at the current step
- `position_id`: shape `(1, 1)` — current position
- `encoder_hidden_states`: shape `(1, enc_len, 1024)`
- `cross_attention_mask`: shape `(1, 1, 1, enc_len)`
- `attention_mask`: shape `(1, 1, 1, max_seq_len)` — masks unfilled cache slots
- `k_cache_0..7`, `v_cache_0..7`: shape `(1, num_heads, max_seq_len, head_dim)`

**Outputs**
- `logits`: shape `(1, 16384)`
- `k_cache_0_out..7_out`, `v_cache_0_out..7_out`: updated caches written back to host

Constants: `num_layers=8`, `num_heads=8`, `head_dim=128`, `hidden=1024`, `vocab=16384`, `max_seq_len=108`.
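A minimal host-side setup for this contract might look like the following. This is a sketch, not the FluidAudio loader: the constants mirror the table above, but the additive-mask convention (large negative value for unfilled slots, zero for attendable positions) is an assumption.

```python
import numpy as np

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 8, 8, 128
MAX_SEQ_LEN = 108
NEG_INF = -1e4  # additive mask value for slots not yet filled (assumption)

def make_decoder_state(enc_len: int):
    """Allocate the host-managed KV caches and attention masks."""
    k_caches = {f"k_cache_{i}": np.zeros((1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM), dtype=np.float16)
                for i in range(NUM_LAYERS)}
    v_caches = {f"v_cache_{i}": np.zeros((1, NUM_HEADS, MAX_SEQ_LEN, HEAD_DIM), dtype=np.float16)
                for i in range(NUM_LAYERS)}
    # Start with every self-attention slot masked out; unmask slots as tokens land.
    self_mask = np.full((1, 1, 1, MAX_SEQ_LEN), NEG_INF, dtype=np.float16)
    # Attend to all encoder frames by default.
    cross_mask = np.zeros((1, 1, 1, enc_len), dtype=np.float16)
    return k_caches, v_caches, self_mask, cross_mask
```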

## Quick Start

```bash
# Encoder (FP16)
uv run python3 exports/export-encoder.py --output-dir build --precision float16

# Decoder (cache-external)
uv run python3 exports/export-decoder-cache-external.py --output-dir build

# Optional: INT8 encoder
uv run python3 tools/quantize_to_int8.py
uv run python3 tools/compile_encoder_to_mlmodelc.py
```

## Decoding loop (Python reference)

```python
PROMPT_IDS = [13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]
# ▁ <|startofcontext|> <|startoftranscript|> <|emo:undefined|>
# <|en|> <|en|> <|pnc|> <|noitn|> <|notimestamp|> <|nodiarize|>
EOS_TOKEN_ID = 3
MAX_SEQ_LEN = 108

# Pre-fill the caches with the prompt, then loop one token at a time.
# Assumption: the prompt has already been fed through the decoder during
# pre-fill (populating k_caches/v_caches), so generation resumes from the
# last prompt token.
token = PROMPT_IDS[-1]
for step in range(len(PROMPT_IDS), MAX_SEQ_LEN):
    out = decoder.predict({
        "input_id": np.array([[token]], dtype=np.int32),
        "position_id": np.array([[step - 1]], dtype=np.int32),
        "encoder_hidden_states": encoder_hidden,
        "cross_attention_mask": cross_mask,
        "attention_mask": self_mask,
        **k_caches, **v_caches,
    })
    token = int(np.argmax(out["logits"][0]))
    # write k_cache_*_out / v_cache_*_out back into k_caches / v_caches
    if token == EOS_TOKEN_ID:
        break
```
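The write-back step mentioned in the loop's comment can be made explicit. A sketch, assuming the cache dictionaries are keyed exactly as the model's input names and that one slot of an additive attention mask is opened per committed token:

```python
import numpy as np

NUM_LAYERS = 8

def commit_step(out, k_caches, v_caches, self_mask, step):
    """Copy updated caches back into the next step's inputs and unmask the new slot."""
    for i in range(NUM_LAYERS):
        k_caches[f"k_cache_{i}"] = out[f"k_cache_{i}_out"]
        v_caches[f"v_cache_{i}"] = out[f"v_cache_{i}_out"]
    # Slot `step` now holds a valid key/value pair, so stop masking it.
    self_mask[..., step] = 0.0
```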

## Files

```
exports/
export-encoder.py # Encoder + projection
export-decoder-cache-external.py # Canonical decoder export

tools/
cohere_features_v2.py # Numpy mel-spectrogram extractor
compile_encoder_to_mlmodelc.py # mlpackage → mlmodelc (encoder)
download-fleurs-for-swift.py # FLEURS fetcher for Swift benches
quantize_to_int8.py # Encoder INT8 quantization

tests/
test-feature-parity.py # PyTorch vs CoreML mel parity check

docs/
CACHE_EXTERNAL_ANALYSIS.md
CACHE_EXTERNAL_DELIVERED.md
CACHE_INVESTIGATION_SUMMARY.md
COHERE_ARCHITECTURE_ANALYSIS.md
HOST_SIDE_FIXES.md
Q8_EOS_BIAS.md
```

## Background

Earlier iterations explored a stateful (CoreML State API) decoder and a stateless (re-process all tokens per step) decoder. Both were dropped:

- **Stateful** — required macOS 15+/iOS 18+ and added cache-management complexity in the model.
- **Stateless** — O(n²), produced wrong outputs on longer sequences during validation.

`docs/CACHE_INVESTIGATION_SUMMARY.md` documents the earlier sliding-window cache bug that motivated moving cache management out of the model entirely.

## Requirements

- macOS 14+ / iOS 17+
- Python 3.10+, dependencies in `pyproject.toml` (managed with `uv`)

## License

GPL-3.0 (matches upstream CoreML conversion). Base model: Apache-2.0 ([CohereLabs/cohere-transcribe-03-2026](https://huggingface.co/CohereLabs/cohere-transcribe-03-2026)).
@@ -0,0 +1,206 @@
# Cache-External Decoder Analysis: Python vs Swift

**Date**: April 8, 2026
**Model**: Cohere Transcribe 03-2026 (Cache-External Decoder)
**Test Dataset**: FLEURS multilingual (10 samples per language)

## Executive Summary

Both Python (CoreML) and Swift implementations of the cache-external decoder exhibit **severe multilingual hallucination issues**, but Swift is significantly worse. The root cause is that **neither implementation uses language conditioning**, and the exported CoreML decoder does not preserve the model's language detection capabilities.

## WER Comparison

| Language | Python WER | Swift WER | Swift vs Python |
|----------|-----------|-----------|-----------------|
| **English** | 55.02% | 263% | **4.8x worse** |
| **French** | 92.33% | 150% | **1.6x worse** |
| **Spanish** | 24.26% | 43% | **1.8x worse** |
| **Chinese** | 105.09% | 111% | Similar (both hallucinating) |
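For reference, WER figures like these are word-level Levenshtein edit distance divided by reference length, which is how values above 100% arise (more edits than reference words). A minimal sketch of the metric, not the project's actual scoring script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```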

## Detailed Findings

### 1. Language Hallucination Patterns

Both implementations produce **non-target-language output** for most languages:

#### English Samples (Python):
- **Sample 0**: Arabic script `ولو انهم يحبون انهم يحبون...` (100% WER)
- **Sample 1**: Correct English transcription (62% WER)
- **Sample 4**: Arabic script `مين بصوتك في مكانك...` (267% WER)

#### French Samples (Python):
- **Sample 0**: Arabic script `نحن نعلم ان هناك من يحمل حياتنا...` (100% WER)
- **Sample 7**: Partial French transcription (58% WER)
- **Samples 2-6**: all Arabic hallucinations (100% WER each)

#### Spanish Samples (Python):
- **Sample 2**: Nearly perfect `"fue tanta la cantidad de gente que se concentró..."` (4.5% WER)
- **Sample 0**: Good quality Spanish (13.8% WER)
- **Average**: Best performance across all languages (24.26% WER)

#### Chinese Samples (Python):
- **Sample 0**: Polish script `"to tylko szybko odkryć..."` (100% WER)
- **Sample 1**: Arabic script `كعكعك يا شوشو...` (100% WER)
- **Sample 4**: English `"i'm sure the government..."` (122% WER)
- **All samples**: Complete hallucination (105% WER overall)

### 2. Swift Implementation Issues

Swift cache-external decoder produces **even worse hallucinations**:

- **English**: 263% WER (vs Python 55%)
- **French**: 150% WER (vs Python 92%)
- **Spanish**: 43% WER (vs Python 24%) - still best language
- **Chinese**: 111% WER (vs Python 105%)

**Why Swift is worse** (hypotheses, not yet confirmed):
1. Bugs in KV cache management
2. Incorrect attention mask sizing
3. Position ID handling issues

All symptoms point to corrupted or stale KV-cache state on the Swift side.

### 3. Root Cause Analysis

#### Neither Implementation Uses Language Conditioning

**Python code** (test-fleurs-wer.py:109):
```python
current_token = START_TOKEN # Just token 4, no language token
```

**Swift code** (CohereAsrManager.swift):
```swift
let prompt = language?.promptSequence ?? [CohereAsrConfig.SpecialTokens.startToken]
```

Although the Swift code does support language prompts, both tests were run with `START_TOKEN` only (see Test Configuration), so the comparison reflects unconditioned decoding; the working assumption was that a properly exported model should still transcribe without explicit language tokens.
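The difference between the two prompting modes can be sketched as below. The token IDs come from the repo README's reference prompt (English, punctuation on, no timestamps); `build_prompt` is a hypothetical helper, not code from either implementation.

```python
START_TOKEN = 4  # <|startoftranscript|>

# Fully conditioned prompt from the README (English control-token sequence):
PROMPT_IDS = [13764, 7, 4, 16, 62, 62, 5, 9, 11, 13]

def build_prompt(conditioned: bool) -> list:
    """Return the decoder prompt: the full conditioned sequence,
    or the bare start token as used in these WER tests."""
    return list(PROMPT_IDS) if conditioned else [START_TOKEN]
```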

#### The CoreML Export Lost Language Detection

The original PyTorch model likely:
1. Auto-detects language from encoder hidden states
2. Conditions decoder output based on detected language
3. Uses language embeddings in the decoder layers

The CoreML export process:
1. Traced with fixed inputs (no language conditioning)
2. Lost dynamic language detection logic
3. Defaults to Arabic/mixed-language tokens

### 4. Why Spanish Works

Spanish achieves 24-43% WER while other languages hallucinate (>90% WER). Possible reasons:

1. **Training data dominance**: Spanish may be the most represented language in training
2. **Default language mode**: Model defaults to Spanish when language detection fails
3. **Simpler phonetics**: Spanish has more regular phoneme-to-grapheme mapping
4. **Export artifacts**: The specific trace inputs used during export may have been Spanish audio

## Recommendations

### Option 1: Re-export with Language Conditioning (RECOMMENDED)

**Action**: Modify `export-decoder-cache-external.py` to:
1. Accept language token as an additional input
2. Embed language token into the decoder's initial state
3. Export separate decoders per language (or one multilingual with language input)

**Pros**:
- Proper language conditioning
- Matches PyTorch model behavior
- Clean architecture

**Cons**:
- Requires re-export and re-testing
- May increase model size
- Need to test all languages

### Option 2: Use Stateful Decoder (iOS Only)

**Action**: Use the stateful decoder (already exported) which may preserve language state better.

**Pros**:
- CoreML manages state internally
- May preserve language context better
- Simpler Swift code

**Cons**:
- Requires the CoreML State API (macOS 15+ / iOS 18+; unavailable on earlier OS releases)
- Still may have same language detection issues
- Would need iOS device testing

### Option 3: Language-Specific Decoders

**Action**: Export separate decoder models per language.

**Pros**:
- Guaranteed language isolation
- Smaller per-language models
- No language confusion possible

**Cons**:
- 14 separate decoder models to manage
- 14× storage/memory requirements
- Deployment complexity

### Option 4: Accept Spanish-Only

**Action**: Document that cache-external decoder only works for Spanish, use other models for multilingual.

**Pros**:
- Works today (24-43% WER acceptable)
- No additional work required
- Clear user expectations

**Cons**:
- Very limited language support
- Defeats purpose of multilingual model
- Poor user experience for non-Spanish users

## Next Steps

1. **Decide on approach** (recommend Option 1: re-export with language conditioning)
2. **If re-exporting**:
- Modify export script to accept language token input
- Test with all 14 supported languages
- Validate WER across all languages
- Update Swift code to pass language token
3. **If accepting limitations**:
- Document Spanish-only support for cache-external
- Recommend stateful decoder for iOS multilingual use
- Consider alternative models (Whisper, Parakeet) for multilingual

## Technical Details

### Cache-External Decoder Architecture

**Inputs** (17 total):
- `input_id` (1,1) - Current token
- `position_id` (1,1) - Position in sequence
- `encoder_hidden_states` (1, 438, 1024) - Encoder output
- `cross_attention_mask` (1, 1, 1, 438) - Encoder attention mask
- `attention_mask` (1, 1, 1, step+1) - Growing decoder attention mask
- `k_cache_0` through `k_cache_7` (8 arrays: 1, 8, 108, 128) - Key caches for 8 layers
- `v_cache_0` through `v_cache_7` (8 arrays: 1, 8, 108, 128) - Value caches for 8 layers

**Outputs** (17 total):
- `logits` (1, 16384) - Token probabilities
- `k_cache_0_out` through `k_cache_7_out` - Updated key caches
- `v_cache_0_out` through `v_cache_7_out` - Updated value caches

### Test Configuration

- **Python**: CoreMLTools prediction with PyTorch encoder
- **Swift**: Full Swift implementation with encoder + cache-external decoder
- **Dataset**: FLEURS test split (Google's multilingual ASR benchmark)
- **Languages**: en_us, fr_fr, es_419, cmn_hans_cn
- **Samples**: 10 per language (40 total)
- **No language conditioning**: Both tests started with START_TOKEN only

## Conclusion

The cache-external decoder is **fundamentally broken for multilingual use** in both Python and Swift, with Swift significantly worse. The primary failure is not Swift-specific: the **CoreML export process** lost the model's language detection capabilities, and Swift's larger gap suggests additional cache-handling bugs layered on top.

**Spanish is the only language that works** (24-43% WER), suggesting it was the export reference language or the most dominant in training.

To make this model usable for multilingual transcription, we must **re-export the decoder with explicit language conditioning** built into the model inputs, or accept Spanish-only deployment.