
feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder)#487

Merged
Alex-Wengg merged 31 commits into main from feat/cohere-transcribe-int8-integration
Apr 23, 2026

Conversation

Member

@Alex-Wengg Alex-Wengg commented Apr 6, 2026

Summary

Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8-encoder +
FP16 cache-external-decoder hybrid (CoherePipeline). Includes one CLI command
for single-file transcription and one for dataset benchmarking (FLEURS and
LibriSpeech).

Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish,
Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese.

What's added

Library (Sources/FluidAudio/ASR/Cohere/)

  • CoherePipeline — encoder + cache-external decoder runner. Allocates
    the K/V cache host-side (no CoreML State API; iOS 17+), applies the
    additive cross-attention mask, and detokenizes via SentencePiece byte
    fallback so CJK comes out as real characters. Accepts separate
    encoderDir / decoderDir to support the q8/f16 split.
  • CohereAsrConfig — per-language prompt sequences and token IDs;
    shared 35 s / 3500-frame audio window and 108-token decoder cache window
    constants. The 35 s cap traces directly to upstream max_audio_clip_s: 35.
  • CohereMelSpectrogram — 128-mel front-end matching the reference
    model (preemph, Slaney mel, CMVN).
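
For intuition, the preemphasis and CMVN stages of a mel front-end like this can be sketched as follows (a simplified illustration, not the shipped CohereMelSpectrogram; the 0.97 coefficient and per-band normalization are common defaults assumed here, and the vDSP-accelerated FFT and Slaney mel-filterbank stages are omitted):

```swift
import Foundation

/// Pre-emphasis: y[n] = x[n] - k * x[n-1]. The coefficient k = 0.97 is a
/// common default and an assumption here, not a value taken from the model.
func preemphasis(_ x: [Float], k: Float = 0.97) -> [Float] {
    guard let first = x.first else { return [] }
    var y = [first]
    for n in 1..<x.count { y.append(x[n] - k * x[n - 1]) }
    return y
}

/// Per-band cepstral mean and variance normalization (CMVN) over time.
func cmvn(_ mel: [[Float]]) -> [[Float]] {
    mel.map { band in
        let mean = band.reduce(0, +) / Float(band.count)
        let varSum = band.reduce(0) { $0 + ($1 - mean) * ($1 - mean) }
        let std = max(sqrt(varSum / Float(band.count)), 1e-8)
        return band.map { ($0 - mean) / std }
    }
}
```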

CLI (Sources/FluidAudioCLI/Commands/ASR/Cohere/)

  • fluidaudiocli cohere-transcribe <audio> --language <lang> — single-file
    transcription. Accepts either --model-dir (single dir with both
    encoder and decoder) or --encoder-dir + --decoder-dir for the q8/f16
    split.
  • fluidaudiocli cohere-benchmark — dataset benchmark with
    --dataset fleurs|librispeech, --subset for LibriSpeech splits,
    --languages for FLEURS codes, --auto-download, and
    --checkpoint-every N (default 100) so long runs persist partial
    results and survive mid-run crashes.
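
The `--checkpoint-every N` behaviour amounts to persisting partial results every N files (a minimal sketch with assumed types; the real command presumably writes the same JSON payload the final summary uses):

```swift
/// Process items one by one, flushing accumulated results to `persist` every
/// `n` items so a mid-run crash loses at most n-1 results (illustrative only).
func runWithCheckpoints<T>(
    items: [T], every n: Int,
    process: (T) -> String,
    persist: ([String]) -> Void
) -> [String] {
    var results: [String] = []
    for (i, item) in items.enumerated() {
        results.append(process(item))
        if (i + 1) % n == 0 { persist(results) }  // periodic checkpoint
    }
    persist(results)  // final flush
    return results
}
```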

ModelNames.swift

  • New Repo.cohereTranscribeCoreml case pointing at
    FluidInference/cohere-transcribe-03-2026-coreml/q8.
  • New ModelNames.CohereTranscribe enum with encoder,
    decoderCacheExternal, vocab and the corresponding .mlmodelc paths.

Documentation

  • Documentation/ASR/Cohere.md — architecture, API, CLI, LibriSpeech +
    FLEURS results, upstream config provenance (max_audio_clip_s,
    overlap_chunk_second), comparison vs Cohere's Figure 4 reference
    numbers, caveats.

FLEURS coverage

  • Extends FleursBenchmark.supportedLanguages with the 6 non-European
    Cohere languages (pt_br, ar_eg, ja_jp, cmn_hans_cn, ko_kr,
    vi_vn).

LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0)

Full split, all 2,620 utterances, single-chunk.

| Subset | Samples | WER | CER | RTFx (per-file mean) | RTFx (total audio/compute) |
|---|---|---|---|---|---|
| test-clean | 2,620 | 1.77% | 0.60% | 2.04× | 1.72× |

5h 24m audio processed in 3h 09m compute (3h 12m wall time including
one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT
0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%).

FLEURS results (full splits, single-chunk)

M4 Pro / Tahoe 26.0, 9,911 samples total.

| FLEURS code | Language | Samples | WER | CER | RTFx |
|---|---|---|---|---|---|
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LATAM) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | 4.03% | 2.04% | 3.15× |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†Japanese and Mandarin are written without word boundaries, so WER on the
raw hypothesis is a tokenization artifact; CER is the real accuracy metric.
Cohere's own Figure 4 uses CER for zh/ja/ko for the same reason.
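
The WER-vs-CER point can be made concrete with plain Levenshtein distance computed at the word and at the character level (an illustrative sketch; the Japanese strings are made-up examples, and the real benchmark's scoring may normalize differently):

```swift
/// Plain Levenshtein edit distance over any Equatable sequence.
func levenshtein<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    var prev = Array(0...b.count)
    for (i, x) in a.enumerated() {
        var cur = [i + 1]
        for (j, y) in b.enumerated() {
            cur.append(min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (x == y ? 0 : 1)))
        }
        prev = cur
    }
    return prev[b.count]
}

let ref = "今日は晴れです"  // no spaces: splits into a single "word"
let hyp = "今日は曇りです"  // two characters differ out of seven

// Word-level: both strings are one token, so any mismatch scores 100% WER.
let refWords = ref.split(separator: " ").map(String.init)
let hypWords = hyp.split(separator: " ").map(String.init)
let wer = Double(levenshtein(refWords, hypWords)) / Double(max(refWords.count, 1))

// Character-level: 2 substitutions over 7 characters, a meaningful score.
let cer = Double(levenshtein(Array(ref), Array(hyp))) / Double(ref.count)
```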

Usage

```swift
let models = try await CoherePipeline.loadModels(
    encoderDir: q8Dir,
    decoderDir: q8Dir,
    vocabDir: q8Dir
)
let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,        // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
```

```bash
# Single file
swift run -c release fluidaudiocli cohere-transcribe audio.wav --language en

# LibriSpeech
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset librispeech --subset test-clean \
    --model-dir /path/to/q8 --auto-download

# FLEURS
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset fleurs --languages en_us,fr_fr --auto-download
```


Notes

  • 35 s single-chunk limit is baked into the upstream model
    (max_audio_clip_s: 35 in cohere-pytorch/config.json). Upstream
    Python also supports >35 s via 5 s-overlap chunking
    (overlap_chunk_second: 5); this port does not implement that wrapper
    yet and skips longer utterances with a warning.
  • Cache-external decoder stays FP16: INT8 decoder quantization
    regresses quality significantly in testing and is not shipped.
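
For reference, the upstream overlap chunking amounts to stepping a 35 s window by 30 s (35 minus the 5 s overlap). A sketch of just the window arithmetic, using the two upstream config values quoted above (stitching the overlapping hypotheses back together is the hard part and is omitted):

```swift
/// Chunk sample ranges for long audio, mirroring upstream
/// max_audio_clip_s = 35 and overlap_chunk_second = 5.
func chunkRanges(totalSamples: Int, sampleRate: Int = 16_000,
                 windowSeconds: Int = 35, overlapSeconds: Int = 5) -> [Range<Int>] {
    let window = windowSeconds * sampleRate
    let step = (windowSeconds - overlapSeconds) * sampleRate
    guard totalSamples > window else { return [0..<totalSamples] }
    var ranges: [Range<Int>] = []
    var start = 0
    while start < totalSamples {
        ranges.append(start..<min(start + window, totalSamples))
        if start + window >= totalSamples { break }
        start += step
    }
    return ranges
}
```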

Test plan

  • Library + CLI release build clean
  • Single-file transcription via `cohere-transcribe`
  • FLEURS en_us sanity (5.63% WER)
  • Full 14-language FLEURS benchmark (9,911 samples)
  • Full LibriSpeech test-clean benchmark (2,620 samples, WER 1.77%)
  • CJK CER validated (word-boundary-agnostic metric for ja/zh)
  • Checkpoint-every survives kill mid-run
  • `printFinalSummary` no longer aborts on macOS 26

Add Cohere Transcribe CoreML ASR implementation supporting 14 languages:
- English, French, German, Spanish, Italian, Portuguese, Dutch, Polish
- Greek, Arabic, Japanese, Chinese, Korean, Vietnamese

Features:
- Core ASR manager with stateful decoder
- Mel spectrogram preprocessing compatible with Cohere models
- CLI transcription command with language selection
- Benchmark command supporting LibriSpeech and FLEURS datasets
- INT8 quantized models for efficient inference

Usage:
  swift run fluidaudiocli cohere-transcribe audio.wav --language ja_jp
  swift run fluidaudiocli cohere-benchmark --dataset fleurs --languages en_us,fr_fr
  swift run fluidaudiocli download --dataset fleurs

Models: FluidInference/cohere-transcribe-03-2026-coreml
Add HuggingFace integration for Cohere Transcribe CoreML models with INT8 quantization support.

Changes:
- Add CohereTranscribe model names enum with encoder, decoder, and vocab
- Add Cohere repository definitions (FP16 and INT8 variants)
- Update CohereAsrModels to use stateful decoder from HuggingFace
- Support automatic download from FluidInference/cohere-transcribe-03-2026-coreml

Model details:
- 35-second window architecture (3500 frames → 438 encoder outputs)
- INT8 W8A16 quantization (~2.0 GB vs ~4.2 GB FP16)
- 14-language support with token primer system
- Quality: 16.44% WER on LibriSpeech test-clean (INT8)
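
The window constants above fit together as follows (the 160-sample hop is implied by 3500 frames covering 35 s at 16 kHz; the 8x encoder downsampling is an inference from 3500 frames mapping to 438 outputs, not a documented value):

```swift
import Foundation

let sampleRate = 16_000
let hopLength = 160                                   // 10 ms per mel frame
let melFrames = 3_500
let maxSeconds = melFrames * hopLength / sampleRate   // 35 s window
let maxSamples = maxSeconds * sampleRate              // 560,000 samples
// Assumed 8x temporal downsampling in the encoder: ceil(3500 / 8) = 438.
let encoderOutputs = Int((Double(melFrames) / 8).rounded(.up))
```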
@Alex-Wengg Alex-Wengg force-pushed the feat/cohere-transcribe-int8-integration branch from 00d3e72 to 4eb8c0e Compare April 6, 2026 21:40

github-actions Bot commented Apr 6, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 644.1x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 661.4x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%


github-actions Bot commented Apr 6, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 11.83x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 12.972 | 14.6 | Fetching diarization models |
| Model Compile | 5.559 | 6.3 | CoreML compilation |
| Audio Load | 0.039 | 0.0 | Loading audio file |
| Segmentation | 23.306 | 26.3 | VAD + speech detection |
| Embedding | 88.464 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.117 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 88.737 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 111.9s processing • Test runtime: 1m 57s • 04/23/2026, 11:02 AM EST


github-actions Bot commented Apr 6, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
|---|---|---|
| Median RTFx | 0.06x | ~2.5x |
| Overall RTFx | 0.06x | ~2.5x |

Runtime: 3m56s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.


github-actions Bot commented Apr 6, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.77x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 69.7s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.070s | Average chunk processing time |
| Max Chunk Time | 0.139s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m21s • 04/23/2026, 11:02 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O


github-actions Bot commented Apr 6, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.01x | |
| test-other | 1.35% | 0.00% | 3.40x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.14x | |
| test-other | 1.22% | 0.00% | 2.41x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.48x | Streaming real-time factor |
| Avg Chunk Time | 2.106s | Average time to process each chunk |
| Max Chunk Time | 4.287s | Maximum chunk processing time |
| First Token | 2.495s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.46x | Streaming real-time factor |
| Avg Chunk Time | 1.913s | Average time to process each chunk |
| Max Chunk Time | 2.733s | Maximum chunk processing time |
| First Token | 2.145s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 8m1s • 04/23/2026, 11:11 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard


github-actions Bot commented Apr 6, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (202.5 KB)

Runtime: 0m41s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.


github-actions Bot commented Apr 6, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.


github-actions Bot commented Apr 6, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 20.11x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 9.273 | 17.8 | Fetching diarization models |
| Model Compile | 3.974 | 7.6 | CoreML compilation |
| Audio Load | 0.077 | 0.1 | Loading audio file |
| Segmentation | 15.641 | 30.0 | Detecting speech regions |
| Embedding | 26.068 | 50.0 | Extracting speaker voices |
| Clustering | 10.427 | 20.0 | Grouping same speakers |
| Total | 52.181 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 52.1s diarization time • Test runtime: 2m 10s • 04/23/2026, 11:09 AM EST


github-actions Bot commented Apr 6, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.2% | - | - |
| Speaker Error | 8.8% | - | - |
| RTFx | 4.5x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 5m 48s • 2026-04-23T15:18:45.259Z

devin-ai-integration[bot]

This comment was marked as resolved.

Fixes 4 critical issues identified in PR #487 review:

1. **KV cache buffer overflow** (CohereAsrManager.swift:197):
   - Bound decode loop with min(maxNewTokens, maxSeqLen=108)
   - Prevents out-of-bounds cache access when step >= 108

2. **Unsafe pointer rebound** (CohereMelSpectrogram.swift:174-178):
   - Move vDSP_ctoz call inside withMemoryRebound closure
   - Fixes undefined behavior from escaped pointer

3. **Division by zero** (CohereBenchmark.swift:229, 393-394):
   - Add empty array checks before computing averages
   - Prevents NaN when all transcriptions fail

4. **Missing unit tests**:
   - Add CohereAsrConfigTests (config validation, special tokens, languages)
   - Add CohereMelSpectrogramTests (mel computation, padding, edge cases)
   - Add CohereTokenConversionTests (token-to-text, special token filtering)

All fixes follow project coding standards and ensure memory safety.
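
Fix 1 above, bounding the decode loop by the KV-cache capacity, looks roughly like this (the names and the step callback are illustrative, not the actual CohereAsrManager API):

```swift
/// Autoregressive decode bounded by both the token budget and the fixed
/// 108-slot KV-cache capacity, so `step` can never index past the cache.
let maxSeqLen = 108

func decode(maxNewTokens: Int, promptCount: Int,
            stepFn: (Int) -> Int?, eos: Int) -> [Int] {
    var tokens: [Int] = []
    let limit = min(promptCount + maxNewTokens, maxSeqLen)  // the fix
    for step in promptCount..<limit {
        guard let next = stepFn(step), next != eos else { break }
        tokens.append(next)
    }
    return tokens
}
```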
devin-ai-integration[bot]

This comment was marked as resolved.

Implements the Parakeet pattern for cache-external decoding of Cohere
Transcribe models. Cache is managed in Swift and passed to/from CoreML
as inputs/outputs each step.

Key features:
- CohereDecoderState: Manages 16 KV cache arrays (8 layers × 2)
- CohereModelInference: Executes decoder with cache-external pattern
- CohereStatelessManager: Stateless O(n²) decoder (simpler alternative)
- Correct EOS token (3, not 151643) verified from model config

Implementation:
- Cache-external achieves O(n) complexity with 11.95% WER
- Growing attention mask: [1,1,1,1] → [1,1,1,108]
- Compatible with .mlmodelc compiled models for faster loading
- Tested and verified in mobius (see commit 5d12a80)

Files:
- CohereDecoderState.swift - Cache state management
- CohereModelInference.swift - Decoder execution
- CohereStatelessManager.swift - Stateless alternative (EOS fixed)
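
The fixed-capacity cache implies an additive mask over the 108 slots that grows with the number of valid positions, as in the [1,1,1,1] → [1,1,1,108] progression above. A minimal sketch (the -10,000 bias is an assumed FP16-safe stand-in for negative infinity, not necessarily the shipped value):

```swift
/// Additive attention mask for a fixed-size KV cache: already-filled slots
/// contribute 0, unused slots a large negative bias that survives FP16.
func cacheMask(validLength: Int, capacity: Int = 108) -> [Float] {
    (0..<capacity).map { i -> Float in i < validLength ? 0 : -10_000 }
}
```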
devin-ai-integration[bot]

This comment was marked as resolved.

… Cohere ASR

Three fixes for Cohere ASR compatibility:

1. **Mel padding**: 3001 → 3500 frames to match encoder input shape
   - CohereAsrManager.swift: All 3001 references changed to 3500
   - CohereStatelessManager.swift: All 3001 references changed to 3500

2. **Encoder output name**: encoder_outputs → hidden_states
   - Matches the actual encoder model export (see mobius export scripts)

3. **Explicit self capture**: maxSeqLen in closure
   - CohereStatelessManager.swift: Added explicit self.maxSeqLen

These align with the encoder/decoder models exported in mobius.

Note: Full WER benchmark requires matching decoder models. The current
auto-downloaded stateful decoder has a different interface than the
cache-external decoder implemented in CohereDecoderState/CohereModelInference.
devin-ai-integration[bot]

This comment was marked as resolved.

After extensive testing with FLEURS multilingual dataset, the Cohere
Transcribe cache-external decoder only works reliably for Spanish
(18-24% WER). Other languages hallucinate with >50% WER, producing
Arabic/Polish/wrong-language output.

## Test Results (10 samples per language)

- Spanish: 18.6% WER ✅ Production ready
- English: 57.5% WER ❌ Hallucinating
- French: 88.0% WER ❌ Hallucinating
- Chinese: 113.5% WER ❌ Hallucinating

## Attempted Fixes (All Failed)

1. Language token prompts (10-token sequence) - Made it worse (142% WER)
2. Language embeddings in decoder V2 - No improvement (57.5% WER)
3. Multilingual encoder (traced with 4 languages) - No improvement

## Root Cause

The encoder outputs language-agnostic hidden states that don't preserve
which language was spoken. The decoder's language conditioning cannot
override the encoder's lost language information. This is a fundamental
issue with the CoreML export process.

## Changes

- Add warning in CohereAsrManager.transcribe() for non-Spanish languages
- Document limitation in CohereAsrConfig, CohereAsrModels docstrings
- Add language parameter support (full prompt sequence implementation)
- Update FLEURS benchmark to support language parameter

## Recommendation

For multilingual ASR, use Whisper or Qwen3 models instead. Cache-external
decoder should only be deployed for Spanish-language transcription.

Related investigation files (in mobius/):
- CACHE_EXTERNAL_ANALYSIS.md - Python vs Swift comparison
- MULTILINGUAL_INVESTIGATION_FINAL.md - Comprehensive test results
devin-ai-integration[bot]

This comment was marked as resolved.

Added language enum and configuration to support multilingual ASR testing.
After extensive investigation (see mobius/models/stt/cohere-transcribe-03-2026/coreml/RESEARCH_REPORT.md),
confirmed that cache-external decoder only works reliably for Spanish.

Changes:
- CohereAsrConfig: Added Language enum with 14 languages and token IDs
- CohereAsrConfig: Added promptSequence() method for language-specific prompts
- CohereAsrManager: Added language parameter to transcribe()
- CohereAsrManager: Added warning logs for non-Spanish languages
- CohereAsrModels: Added DecoderType detection (stateful vs cache-external)

Language support tested on FLEURS dataset (40 samples):
- Spanish: 18.6% WER ✅ (production ready)
- English: 57.5% WER ❌ (hallucinating)
- French: 88.0% WER ❌ (hallucinating)
- Chinese: 113.5% WER ❌ (hallucinating)

Recommendation: Deploy for Spanish-only. For multilingual, use Whisper or Qwen3.

See research report in mobius repo for full investigation details.
- Add CohereFixedPipeline: self-contained INT8-encoder + FP16-decoder
  pipeline with fp16-safe cross-attention mask (vImage), repetition
  penalty, no-repeat-ngram, and SentencePiece byte-fallback detok.
- Add cohere-mixed CLI command to exercise the mixed pipeline on a
  single audio file with per-language config.
- Add cohere-mixed-benchmark CLI command: 14-language FLEURS benchmark
  with per-language WER/CER/RTFx, JSON output, and --auto-download.
- Fix CohereAsrManager macOS SDK 26.4 compatibility: use Swift-refined
  makeState() (newState is NS_REFINED_FOR_SWIFT / macOS-unavailable)
  and gate decodeStateful with @available(macOS 15, iOS 18, *) so
  transcribe() remains usable on macOS 14 / iOS 17.

Verified end-to-end on english_original.wav and multilingual FLEURS
samples (en_us, fr_fr, cmn_hans_cn); all decode correctly.
devin-ai-integration[bot]

This comment was marked as resolved.

Extends FLEURSBenchmark.supportedLanguages with the 6 non-European
languages required to cover the 14-language Cohere Transcribe matrix
(pt_br, ar_eg, ja_jp, cmn_hans_cn, ko_kr, vi_vn). The 8 European
languages Cohere supports were already in the map.

Adds two standalone scripts under Scripts/ for running the hybrid
INT8-encoder + FP16-decoder benchmark one language at a time:

  - run_cohere_per_lang.sh     resumable per-language runner (each
                               language writes its own JSON so the
                               run survives interruption / cleanup
                               segfaults that happen after results
                               are persisted)
  - fetch_fleurs_from_google.py adapter that pulls the 5 languages
                               not yet in FluidInference's FLEURS
                               mirror (ar_eg, ja_jp, cmn_hans_cn,
                               ko_kr, vi_vn) from google/fleurs on
                               HuggingFace and materialises them in
                               the cache layout expected by the CLI.

Whitelists both new scripts in .gitignore alongside the existing
parakeet/diarizer benchmark helpers.
Empty MAX_ARGS=() array expanded to "${MAX_ARGS[@]}" triggered an
unbound-variable error under set -u on some bash/zsh versions, which
broke the uncapped full-splits run. Use the defensive
${MAX_ARGS[@]+"${MAX_ARGS[@]}"} expansion so the runner works
both with and without MAX_FILES set.
…ults

Full FLEURS benchmark numbers for the INT8 encoder + FP16 cache-external
decoder hybrid across all 14 supported languages (9,911 samples total),
measured via the per-language runner on M4 Pro, Tahoe 26.0.

Also documents:
- CohereFixedPipeline Swift API (load + transcribe)
- cohere-mixed / cohere-mixed-benchmark CLI surface
- Approximate comparison vs Cohere's Figure 4 reference (with the caveat
  that Cohere's numbers are averaged across FLEURS + Common Voice 17.0 +
  MLS + Wenet, not FLEURS-only)
- Why Japanese/Mandarin WER is meaningless (no word boundaries) and CER
  should be read instead
- Single-chunk (35 s) and language-hint requirements
- Add Δ column and per-language sample context
- Explain the two gap sources (dataset mix vs INT8 quantization)
- Flag the ja CER win and ko CER outlier explicitly
devin-ai-integration[bot]

This comment was marked as resolved.

- Guard CohereMelSpectrogram.compute against audio shorter than nFFT/2+1
  (prevents OOB crash in reflectionPad)
- Fix CohereAsrModels.modelsExist to check the cache-external decoder that
  load() actually consumes, and accept either .mlmodelc or .mlpackage so the
  local HF cache isn't re-downloaded on every run
- Correct CohereAsrConfig.maxAudioSeconds (30s -> 35s) and maxSamples
  (480k -> 560k) to match the [1,128,3500] encoder input
- Switch the four Cohere library loggers to AppLogger per CLAUDE.md
- Update tests to match new fail-safe short-audio semantics
- Fix a pre-existing Double/Float type error in
  testComputeWithSineWaveProducesNonZeroMel
devin-ai-integration[bot]

This comment was marked as low quality.

- CohereMelSpectrogram: split DC and Nyquist out of vDSP packed-format
  bin 0 so the last bin holds the correct Nyquist power.
- CohereBenchmark: scale WER/CER from fractions to percentages to match
  the rest of the CLI (output was displaying "WER: 0.06%" instead of
  "WER: 5.63%").
- CohereTranscribeCommand: parse --language/-l so users can actually
  transcribe non-English audio; plumb the value through to
  manager.transcribe() and document it in the help text.
- CohereFixedPipeline: during the prompt-feeding phase, record the
  actually-consumed prompt token in the repetition-penalty history
  instead of the model's discarded prediction, so noRepeatNgram no
  longer suppresses valid output tokens based on phantom predictions.
- FluidAudioCLI: list cohere-mixed / cohere-mixed-benchmark in the
  command help and drop the bogus ja_jp example (raw values are 2-letter
  codes).
Tier A — fully dead code (zero callers anywhere in the tree):
- CohereStatelessManager.swift
- CohereModelInference.swift
- CohereDecoderState.swift (only referenced by CohereModelInference)

Tier B — "original buggy" pipeline that CohereFixedPipeline was written
to replace (per its own header comment), kept in parallel until now:
- CohereAsrManager.swift
- CohereAsrModels.swift (only consumer was CohereAsrManager + its CLI)
- CohereMelSpectrogram.swift (CohereFixedPipeline has its own internal
  CohereFixedMelSpectrogram; only the buggy manager + dead stateless
  manager + tests used this one)
- CohereTranscribeCommand.swift / CohereBenchmark.swift (only entries
  to the buggy manager)
- CohereMelSpectrogramTests.swift / CohereTokenConversionTests.swift

Lifted CohereAsrError into CohereFixedPipeline.swift (the lone surviving
consumer). Updated ModelNames.CohereTranscribe.requiredModels to point
at the cache-external compiled artifacts that actually ship; removed
the dangling decoderFile alias. Trimmed cohere-transcribe and
cohere-benchmark from the CLI dispatch and help text — cohere-mixed
and cohere-mixed-benchmark (which run the canonical CohereFixedPipeline,
the source of the published FLEURS numbers) remain.

Net: -2,584 deleted, ~28 added. swift build clean.
devin-ai-integration[bot]

This comment was marked as resolved.

…ipelines are gone

The 'Fixed' in CohereFixedPipeline only made sense as a contrast with the
buggy CohereAsrManager (deleted in the previous commit), and 'Mixed' in
the CLI commands referred to mixed-precision contrasting with the
single-precision FP16 path (also gone). With the parallel pipelines
removed, the qualifiers are noise.

Renames:
- CohereFixedPipeline       -> CoherePipeline
- CohereFixedMelSpectrogram -> CohereMelSpectrogram (the previous owner
                                of that name was deleted, so it is free)
- CohereMixedCommand        -> CohereTranscribeCommand
- CohereMixedBenchmark      -> CohereBenchmark
- CohereMixedBenchmarkResult -> CohereBenchmarkResult
- cohere-mixed              -> cohere-transcribe (CLI command)
- cohere-mixed-benchmark    -> cohere-benchmark  (CLI command)
- logger category 'CohereFixedPipeline' -> 'CoherePipeline'

Scripts/run_cohere_per_lang.sh and Documentation/ASR/Cohere.md updated
to use the new command names. swift build clean.
devin-ai-integration[bot]

This comment was marked as resolved.

These two scripts were PR-local helpers, not generally-useful FluidAudio
benchmark tooling:
- Scripts/run_cohere_per_lang.sh wraps the cohere-benchmark CLI in a
  per-language loop. Anyone reproducing the FLEURS table can invoke
  cohere-benchmark directly per the docs.
- Scripts/fetch_fleurs_from_google.py mirrors a 5-language slice of the
  google/fleurs dataset; the cohere-benchmark --auto-download flag
  already pulls the FluidInference FLEURS subset.

Also drops the two new !Scripts/ exceptions added by this PR and the
dangling docs reference to run_cohere_per_lang.sh.
The CoherePipeline integration only ever exposes the INT8-encoder + FP16
cache-external-decoder hybrid. Carrying a separate cohereTranscribeCoreml
(f16) Repo case alongside cohereTranscribeCoremlInt8 was dead surface:
nothing in Sources/ or Tests/ references either case explicitly.

- Collapse the two enum cases into a single .cohereTranscribeCoreml
  pointing at FluidInference/cohere-transcribe-03-2026-coreml/q8.
- Drop the unused decoderStateful / encoderFile / decoderCacheExternalFile
  (.mlpackage) entries from ModelNames.CohereTranscribe — the stateful
  decoder pipeline was already removed in 65487ec, and the runtime
  loader only consumes .mlmodelc compiled artifacts.
@Alex-Wengg Alex-Wengg changed the title feat(asr): Add Cohere Transcribe with INT8 support feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) Apr 23, 2026
1. CohereAsrConfig.MelSpec.nFFT was 1024 but the actual FFT used by
   CohereMelSpectrogram is nextPowerOfTwo(winLength=400) = 512
   (CoherePipeline.swift:88). The header comment at CoherePipeline.swift:6
   already states n_fft=512. Anyone using the public constant for buffer
   sizing or frequency-bin math would get wrong results.

2. Decoder loop was missing the first real output token from the
   penalty-history buffer. At step == prompt.count - 1, the previous
   conditional appended currentToken (last prompt token) and then rotated
   nextToken (the first output token) into currentToken; on the next
   iteration it appended nextToken (the SECOND output) instead — so the
   first output never appeared in allTokens. applyRepetitionPenalty and
   applyNoRepeatNgram could not penalise repeats of the first output or
   detect n-grams beginning with it.

   Replace the conditional with the unified `allTokens.append(currentToken)`
   so we always record what was actually consumed at this step. The first
   output is then recorded on the iteration after it is generated, once it
   has rotated into currentToken.

   Also update the test that asserted the wrong nFFT value.
devin-ai-integration[bot]

This comment was marked as resolved.

Cohere transcribe benchmark previously only ran on FLEURS. Add a
`--dataset librispeech|fleurs` switch (default: fleurs) and a
`--subset` flag for LibriSpeech (default: test-clean).

LibriSpeech path reuses Parakeet's `ASRBenchmark.downloadLibriSpeech`
+ `getLibriSpeechDirectory()` for cache layout, walks the `*.trans.txt`
files under the subset directory, and routes through the same per-file
inference loop as FLEURS (now extracted into a shared `transcribeFiles`
helper). Cohere is single-chunk (35s max) so files exceeding the limit
are skipped with a warning rather than silently failing.

Renamed the default output JSON to `cohere_benchmark_results.json` and
updated `printUsage` + the summary header now that this is no longer
FLEURS-only.
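
Each `*.trans.txt` line pairs an utterance ID with its transcript, separated by the first space. A sketch of the per-line parse (not the actual helper from the commit; the sample line in the test is illustrative):

```swift
/// Parse one LibriSpeech transcript line of the form "<utterance-id> <TEXT>".
/// Returns nil for malformed lines so callers can skip them.
func parseTransLine(_ line: String) -> (id: String, text: String)? {
    let parts = line.split(separator: " ", maxSplits: 1)
    guard parts.count == 2 else { return nil }
    return (String(parts[0]), String(parts[1]))
}
```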
`String(format: "%-14s ...", swiftString)` is fatal on macOS 26: Swift's
String maps to %@, the format specifier says %s (a C string), and the
Foundation runtime now aborts on the mismatch. The benchmark would write
its JSON output successfully and then crash in the summary print right
before exit (SIGABRT, exit 139), making the run look failed even though
results were good.

Replace the format-string column layout with a small `row(...)` helper
plus a `String.leftPad(to:)` extension so column widths and decimal
formatting stay readable without going through `%s`.
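
The described replacement amounts to padding columns manually instead of going through `%s`. A sketch of such a helper (names follow the commit message, but the real implementation may differ):

```swift
extension String {
    /// Pad with spaces on the left to a fixed column width
    /// (no-op if the string is already at least that wide).
    func leftPad(to width: Int) -> String {
        count >= width ? self : String(repeating: " ", count: width - count) + self
    }
}

/// Join (value, columnWidth) pairs into one right-aligned table row.
func row(_ columns: [(String, Int)]) -> String {
    columns.map { $0.0.leftPad(to: $0.1) }.joined(separator: "  ")
}
```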
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 new potential issues.

View 16 additional findings in Devin Review.


Comment thread Sources/FluidAudioCLI/Commands/ASR/Cohere/CohereBenchmark.swift Outdated
Comment on lines +6 to +153
final class CohereAsrConfigTests: XCTestCase {

    // MARK: - Config Constants

    func testSampleRateIs16kHz() {
        XCTAssertEqual(CohereAsrConfig.sampleRate, 16000)
    }

    func testMaxAudioDurationIs35Seconds() {
        // Matches the encoder mel input [1, 128, 3500] (3500 * 160 / 16000 = 35s).
        XCTAssertEqual(CohereAsrConfig.maxAudioSeconds, 35.0)
    }

    func testMaxSamplesMatchesDurationAndSampleRate() {
        let expectedSamples = Int(CohereAsrConfig.maxAudioSeconds * Float(CohereAsrConfig.sampleRate))
        XCTAssertEqual(CohereAsrConfig.maxSamples, expectedSamples)
        XCTAssertEqual(CohereAsrConfig.maxSamples, 560_000)
    }

    func testVocabSizeIs16384() {
        XCTAssertEqual(CohereAsrConfig.vocabSize, 16_384)
    }

    func testMaxSeqLenIs108() {
        // KV cache capacity
        XCTAssertEqual(CohereAsrConfig.maxSeqLen, 108)
    }

    func testHeadDimMatchesDecoderDimension() {
        let expectedHeadDim = CohereAsrConfig.decoderHiddenSize / CohereAsrConfig.numDecoderHeads
        XCTAssertEqual(CohereAsrConfig.headDim, expectedHeadDim)
        XCTAssertEqual(CohereAsrConfig.headDim, 128)
    }

    // MARK: - Special Tokens

    func testSpecialTokenIdsAreInRange() {
        let vocabSize = CohereAsrConfig.vocabSize
        let tokenIds = [
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ]

        for tokenId in tokenIds {
            XCTAssertGreaterThanOrEqual(tokenId, 0, "Token ID \(tokenId) should be non-negative")
            XCTAssertLessThan(tokenId, vocabSize, "Token ID \(tokenId) should be < vocabSize (\(vocabSize))")
        }
    }

    func testSpecialTokensAreUnique() {
        let tokens = Set([
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ])
        XCTAssertEqual(tokens.count, 5, "Special tokens should be unique")
    }

    func testEosTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.eosToken, 3)
    }

    func testStartTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.startToken, 4)
    }

    // MARK: - Mel Spectrogram Parameters

    func testMelSpecParametersAreValid() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.nFFT, 512)
        XCTAssertEqual(CohereAsrConfig.MelSpec.hopLength, 160)
        XCTAssertEqual(CohereAsrConfig.MelSpec.nMels, 128)
        XCTAssertEqual(CohereAsrConfig.numMelBins, 128)
    }

    func testMelSpecFrequencyRange() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMin, 0.0)
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMax, 8000.0)
        XCTAssertLessThanOrEqual(
            CohereAsrConfig.MelSpec.fMax,
            Float(CohereAsrConfig.sampleRate) / 2.0,
            "fMax should not exceed Nyquist frequency"
        )
    }

    func testPreemphasisIsValid() {
        XCTAssertGreaterThan(CohereAsrConfig.MelSpec.preemphasis, 0.0)
        XCTAssertLessThanOrEqual(CohereAsrConfig.MelSpec.preemphasis, 1.0)
    }

    func testNFFTIsPowerOfTwo() {
        let nFFT = CohereAsrConfig.MelSpec.nFFT
        XCTAssertTrue(nFFT > 0 && (nFFT & (nFFT - 1)) == 0, "nFFT should be a power of 2")
    }

    // MARK: - Language

    func testLanguageRawValuesAreIsoCodes() {
        XCTAssertEqual(CohereAsrConfig.Language.english.rawValue, "en")
        XCTAssertEqual(CohereAsrConfig.Language.french.rawValue, "fr")
        XCTAssertEqual(CohereAsrConfig.Language.german.rawValue, "de")
        XCTAssertEqual(CohereAsrConfig.Language.spanish.rawValue, "es")
        XCTAssertEqual(CohereAsrConfig.Language.italian.rawValue, "it")
        XCTAssertEqual(CohereAsrConfig.Language.portuguese.rawValue, "pt")
        XCTAssertEqual(CohereAsrConfig.Language.dutch.rawValue, "nl")
        XCTAssertEqual(CohereAsrConfig.Language.polish.rawValue, "pl")
        XCTAssertEqual(CohereAsrConfig.Language.greek.rawValue, "el")
        XCTAssertEqual(CohereAsrConfig.Language.arabic.rawValue, "ar")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.rawValue, "ja")
        XCTAssertEqual(CohereAsrConfig.Language.chinese.rawValue, "zh")
        XCTAssertEqual(CohereAsrConfig.Language.vietnamese.rawValue, "vi")
        XCTAssertEqual(CohereAsrConfig.Language.korean.rawValue, "ko")
    }

    func testAllLanguagesHaveEnglishNames() {
        for language in CohereAsrConfig.Language.allCases {
            XCTAssertFalse(language.englishName.isEmpty, "\(language) should have a non-empty English name")
        }
    }

    func testLanguageCount() {
        XCTAssertEqual(CohereAsrConfig.Language.allCases.count, 14, "Cohere supports 14 languages")
    }

    func testEnglishNameExamples() {
        XCTAssertEqual(CohereAsrConfig.Language.english.englishName, "English")
        XCTAssertEqual(CohereAsrConfig.Language.french.englishName, "French")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.englishName, "Japanese")
    }

    // MARK: - Model Architecture

    func testEncoderParameters() {
        XCTAssertEqual(CohereAsrConfig.encoderHiddenSize, 1280)
        XCTAssertEqual(CohereAsrConfig.numEncoderLayers, 48)
    }

    func testDecoderParameters() {
        XCTAssertEqual(CohereAsrConfig.decoderHiddenSize, 1024)
        XCTAssertEqual(CohereAsrConfig.numDecoderLayers, 8)
        XCTAssertEqual(CohereAsrConfig.numDecoderHeads, 8)
    }
}
Contributor


🟡 Missing unit tests for CoherePipeline and CohereMelSpectrogram utility functions

AGENTS.md mandates "Add unit tests when writing new code." The PR adds ~800 lines of pipeline logic in CoherePipeline.swift containing many testable pure functions (applyRepetitionPenalty, applyNoRepeatNgram, argmax, convertTokensToText, parseByteFallback, encoderValidFrames, buildCrossAttentionMask, copyLogitsFloat32, zeroFill) and CohereMelSpectrogram (compute, validFrameCount, padOrTruncate, slaneyMelFilter), but only provides config constant tests in CohereAsrConfigTests.swift. Other ASR modules in the repo test their utility functions (e.g., AudioMelSpectrogramTests, TdtDecoderTests, Qwen3RoPETests).

Prompt for agents
AGENTS.md requires unit tests for new code. The PR adds CoherePipeline with many testable pure static functions but only tests CohereAsrConfig constants. Add a new test file Tests/FluidAudioTests/ASR/Cohere/CoherePipelineTests.swift with tests for at minimum: applyRepetitionPenalty (penalty > 1 reduces positive logits, amplifies negative), applyNoRepeatNgram (forbids completing seen n-grams), argmax (returns index of max value), convertTokensToText (handles byte-fallback tokens, special token filtering, Unicode replacement character), parseByteFallback (parses <0xHH> patterns, rejects invalid), encoderValidFrames (ceiling division, clamping), and padOrTruncate (truncation, padding, passthrough). These are all static/class methods on CoherePipeline and CohereMelSpectrogram that can be tested without loading CoreML models.
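As an illustration of the kind of pure function the review is asking to cover, here is a hypothetical reimplementation of a `<0xHH>` byte-fallback parser; the real `CoherePipeline.parseByteFallback` may differ in signature and edge-case handling:

```swift
// Parse a SentencePiece byte-fallback token of the exact form "<0xHH>"
// into its byte value; return nil for anything else.
func parseByteFallback(_ token: String) -> UInt8? {
    guard token.hasPrefix("<0x"), token.hasSuffix(">"), token.count == 6 else {
        return nil
    }
    let hex = token.dropFirst(3).dropLast(1)
    return UInt8(hex, radix: 16)
}
```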


Long LibriSpeech runs (2620 files in test-clean) take 10+ hours wall
time. A crash near the end loses everything because saveResults only
ran on success. Add --checkpoint-every <n> (default 100) so every N
successful transcriptions persist the running results array to the
output JSON. On crash the user keeps the last multiple-of-N results.
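The checkpointing behavior can be sketched as follows, with `transcribe` and `save` standing in for the real per-file inference and JSON-writing code (names are illustrative):

```swift
// Persist the running results array after every N successful
// transcriptions, plus a final save on success. On a crash mid-run, the
// last multiple-of-N checkpoint survives on disk.
func runWithCheckpoints<T>(items: [T], checkpointEvery: Int,
                           transcribe: (T) -> String?,
                           save: ([String]) -> Void) -> [String] {
    var results: [String] = []
    for item in items {
        guard let text = transcribe(item) else { continue }
        results.append(text)
        if results.count % checkpointEvery == 0 {
            save(results)
        }
    }
    save(results)
    return results
}
```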
The 35-second per-call limit traces directly to the upstream
cohere-pytorch config (`max_audio_clip_s: 35`; 100 fps × 35 = 3500 mel
frames). Document that provenance, along with the upstream
`overlap_chunk_second: 5` chunking strategy we don't yet implement in
Swift, so users understand why long-form audio is skipped rather than
stitched.
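The 35 s cap and the 3500-frame mel width are two views of the same number; a quick check of the arithmetic from the constants quoted above:

```swift
// 16 kHz audio with a 160-sample hop gives 100 frames per second, so the
// 35 s upstream cap lands exactly on the encoder's [1, 128, 3500] input.
let sampleRate = 16_000
let hopLength = 160                       // 16000 / 160 = 100 fps
let maxSeconds = 35
let maxSamples = maxSeconds * sampleRate  // 560_000 samples
let melFrames = maxSamples / hopLength    // 3_500 frames
```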

Full 2620-utterance run on M4 Pro: WER 1.77%, CER 0.60%, total RTFx
1.72x. Competitive with Parakeet TDT 0.6B v3 (~1.7%) and Whisper
large-v3 (~1.8%) while running ~1.7x faster than real time.
- Move the duplicated `generateInlineDiff` edit-distance diff renderer
  (byte-identical copies in AsrBenchmark.swift and FleursBenchmark.swift,
  ~110 lines each) to a shared `InlineDiff.generate(reference:hypothesis:)`
  utility in `FluidAudioCLI/Utils/InlineDiff.swift`.
- Fix Devin Review finding in CohereBenchmark.swift line 263: when
  `--max-files` was omitted for FLEURS auto-download,
  `samplesPerLanguage: maxFiles ?? 100` silently capped each language at
  100 samples. Switch to `Int.max`, which is FLEURSBenchmark's documented
  sentinel for "download all available".
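A minimal sketch of the sentinel fix, with the function name invented for illustration:

```swift
// Missing --max-files now means "all available" via Int.max, the
// documented FLEURSBenchmark sentinel, instead of silently capping at 100.
func samplesPerLanguage(maxFiles: Int?) -> Int {
    maxFiles ?? Int.max  // previously `maxFiles ?? 100`
}
```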

- CohereTranscribeCommand.swift:162 — guard RTFx division against
  zero-duration totalSeconds with `max(_, 1e-9)`, matching the
  convention already used in CohereBenchmark.swift:463.
- CoherePipeline.swift:596-599 — flatten nested if in decoder loop
  to comply with the AGENTS.md "avoid nested ifs" rule.

Both are no-ops on the correct-input happy path.
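The division guard can be sketched like this (which quantity sits in the denominator is an assumption here; the names are illustrative):

```swift
// Clamp the denominator so a zero-length or mis-measured run can't divide
// by zero, mirroring the convention noted for CohereBenchmark.swift.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / max(processingSeconds, 1e-9)
}
```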
@Alex-Wengg Alex-Wengg merged commit b10bdcb into main Apr 23, 2026
12 checks passed
@Alex-Wengg Alex-Wengg deleted the feat/cohere-transcribe-int8-integration branch April 23, 2026 14:59