
feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder)#487

Merged
Alex-Wengg merged 31 commits into main from feat/cohere-transcribe-int8-integration
Apr 23, 2026

Conversation

Member

@Alex-Wengg Alex-Wengg commented Apr 6, 2026

Summary

Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8-encoder +
FP16 cache-external-decoder hybrid (CoherePipeline). Includes one CLI command
for single-file transcription and one for dataset benchmarking (FLEURS and
LibriSpeech).

Languages

English, French, German, Spanish, Italian, Portuguese, Dutch, Polish,
Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese.

What's added

Library (Sources/FluidAudio/ASR/Cohere/)

  • CoherePipeline — encoder + cache-external decoder runner. Allocates
    the K/V cache host-side (no CoreML State API; iOS 17+), applies the
    additive cross-attention mask, and detokenizes via SentencePiece byte
    fallback so CJK comes out as real characters. Accepts separate
    encoderDir / decoderDir to support the q8/f16 split.
  • CohereAsrConfig — per-language prompt sequences and token IDs;
    shared 35 s / 3500-frame audio window and 108-token decoder cache window
    constants. The 35 s cap traces directly to upstream max_audio_clip_s: 35.
  • CohereMelSpectrogram — 128-mel front-end matching the reference
    model (preemph, Slaney mel, CMVN).
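
For intuition, the preemphasis and CMVN stages of a mel front-end like this can be sketched as follows (a simplified illustration, not the shipped CohereMelSpectrogram; the 0.97 coefficient and per-band normalization are common defaults assumed here, and the vDSP-accelerated FFT and Slaney mel-filterbank stages are omitted):

```swift
import Foundation

/// Pre-emphasis: y[n] = x[n] - k * x[n-1]. The coefficient k = 0.97 is a
/// common default and an assumption here, not a value taken from the model.
func preemphasis(_ x: [Float], k: Float = 0.97) -> [Float] {
    guard let first = x.first else { return [] }
    var y = [first]
    for n in 1..<x.count { y.append(x[n] - k * x[n - 1]) }
    return y
}

/// Per-band cepstral mean and variance normalization (CMVN) over time.
func cmvn(_ mel: [[Float]]) -> [[Float]] {
    mel.map { band in
        let mean = band.reduce(0, +) / Float(band.count)
        let varSum = band.reduce(0) { $0 + ($1 - mean) * ($1 - mean) }
        let std = max(sqrt(varSum / Float(band.count)), 1e-8)
        return band.map { ($0 - mean) / std }
    }
}
```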

CLI (Sources/FluidAudioCLI/Commands/ASR/Cohere/)

  • fluidaudiocli cohere-transcribe <audio> --language <lang> — single-file
    transcription. Accepts either --model-dir (single dir with both
    encoder and decoder) or --encoder-dir + --decoder-dir for the q8/f16
    split.
  • fluidaudiocli cohere-benchmark — dataset benchmark with
    --dataset fleurs|librispeech, --subset for LibriSpeech splits,
    --languages for FLEURS codes, --auto-download, and
    --checkpoint-every N (default 100) so long runs persist partial
    results and survive mid-run crashes.
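
The `--checkpoint-every N` behaviour amounts to persisting partial results every N files (a minimal sketch with assumed types; the real command presumably writes the same JSON payload the final summary uses):

```swift
/// Process items one by one, flushing accumulated results to `persist` every
/// `n` items so a mid-run crash loses at most n-1 results (illustrative only).
func runWithCheckpoints<T>(
    items: [T], every n: Int,
    process: (T) -> String,
    persist: ([String]) -> Void
) -> [String] {
    var results: [String] = []
    for (i, item) in items.enumerated() {
        results.append(process(item))
        if (i + 1) % n == 0 { persist(results) }  // periodic checkpoint
    }
    persist(results)  // final flush
    return results
}
```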

ModelNames.swift

  • New Repo.cohereTranscribeCoreml case pointing at
    FluidInference/cohere-transcribe-03-2026-coreml/q8.
  • New ModelNames.CohereTranscribe enum with encoder,
    decoderCacheExternal, vocab and the corresponding .mlmodelc paths.

Documentation

  • Documentation/ASR/Cohere.md — architecture, API, CLI, LibriSpeech +
    FLEURS results, upstream config provenance (max_audio_clip_s,
    overlap_chunk_second), comparison vs Cohere's Figure 4 reference
    numbers, caveats.

FLEURS coverage

  • Extends FleursBenchmark.supportedLanguages with the 6 non-European
    Cohere languages (pt_br, ar_eg, ja_jp, cmn_hans_cn, ko_kr,
    vi_vn).

LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0)

Full split, all 2,620 utterances, single-chunk.

| Subset | Samples | WER | CER | RTFx (per-file mean) | RTFx (total audio/compute) |
|---|---|---|---|---|---|
| test-clean | 2,620 | 1.77% | 0.60% | 2.04× | 1.72× |

5h 24m audio processed in 3h 09m compute (3h 12m wall time including
one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT
0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%).

FLEURS results (full splits, single-chunk)

M4 Pro / Tahoe 26.0, 9,911 samples total.

| FLEURS code | Language | Samples | WER | CER | RTFx |
|---|---|---|---|---|---|
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LATAM) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | 4.03% | 2.04% | 3.15× |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†Japanese and Mandarin are written without word boundaries, so WER on the
raw hypothesis is a tokenization artifact; CER is the real accuracy metric.
Cohere's own Figure 4 uses CER for zh/ja/ko for the same reason.
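
The WER-vs-CER point can be made concrete with plain Levenshtein distance computed at the word and at the character level (an illustrative sketch; the Japanese strings are made-up examples, and the real benchmark's scoring may normalize differently):

```swift
/// Plain Levenshtein edit distance over any Equatable sequence.
func levenshtein<T: Equatable>(_ a: [T], _ b: [T]) -> Int {
    var prev = Array(0...b.count)
    for (i, x) in a.enumerated() {
        var cur = [i + 1]
        for (j, y) in b.enumerated() {
            cur.append(min(prev[j + 1] + 1, cur[j] + 1, prev[j] + (x == y ? 0 : 1)))
        }
        prev = cur
    }
    return prev[b.count]
}

let ref = "今日は晴れです"  // no spaces: splits into a single "word"
let hyp = "今日は曇りです"  // two characters differ out of seven

// Word-level: both strings are one token, so any mismatch scores 100% WER.
let refWords = ref.split(separator: " ").map(String.init)
let hypWords = hyp.split(separator: " ").map(String.init)
let wer = Double(levenshtein(refWords, hypWords)) / Double(max(refWords.count, 1))

// Character-level: 2 substitutions over 7 characters, a meaningful score.
let cer = Double(levenshtein(Array(ref), Array(hyp))) / Double(ref.count)
```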

Usage

```swift
let models = try await CoherePipeline.loadModels(
    encoderDir: q8Dir,
    decoderDir: q8Dir,
    vocabDir: q8Dir
)
let pipeline = CoherePipeline()
let result = try await pipeline.transcribe(
    audio: samples,        // 16 kHz mono Float32, up to 35 s
    models: models,
    language: .english
)
```

```bash
# Single file
swift run -c release fluidaudiocli cohere-transcribe audio.wav --language en

# LibriSpeech
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset librispeech --subset test-clean \
    --model-dir /path/to/q8 --auto-download

# FLEURS
swift run -c release fluidaudiocli cohere-benchmark \
    --dataset fleurs --languages en_us,fr_fr --auto-download
```


Notes

  • 35 s single-chunk limit is baked into the upstream model
    (max_audio_clip_s: 35 in cohere-pytorch/config.json). Upstream
    Python also supports >35 s via 5 s-overlap chunking
    (overlap_chunk_second: 5); this port does not implement that wrapper
    yet and skips longer utterances with a warning.
  • Cache-external decoder stays FP16: INT8 decoder quantization
    regresses quality significantly in testing and is not shipped.
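
For reference, the upstream overlap chunking amounts to stepping a 35 s window by 30 s (35 minus the 5 s overlap). A sketch of just the window arithmetic, using the two upstream config values quoted above (stitching the overlapping hypotheses back together is the hard part and is omitted):

```swift
/// Chunk sample ranges for long audio, mirroring upstream
/// max_audio_clip_s = 35 and overlap_chunk_second = 5.
func chunkRanges(totalSamples: Int, sampleRate: Int = 16_000,
                 windowSeconds: Int = 35, overlapSeconds: Int = 5) -> [Range<Int>] {
    let window = windowSeconds * sampleRate
    let step = (windowSeconds - overlapSeconds) * sampleRate
    guard totalSamples > window else { return [0..<totalSamples] }
    var ranges: [Range<Int>] = []
    var start = 0
    while start < totalSamples {
        ranges.append(start..<min(start + window, totalSamples))
        if start + window >= totalSamples { break }
        start += step
    }
    return ranges
}
```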

Test plan

  • Library + CLI release build clean
  • Single-file transcription via `cohere-transcribe`
  • FLEURS en_us sanity (5.63% WER)
  • Full 14-language FLEURS benchmark (9,911 samples)
  • Full LibriSpeech test-clean benchmark (2,620 samples, WER 1.77%)
  • CJK CER validated (word-boundary-agnostic metric for ja/zh)
  • Checkpoint-every survives kill mid-run
  • `printFinalSummary` no longer aborts on macOS 26

Add Cohere Transcribe CoreML ASR implementation supporting 14 languages:
- English, French, German, Spanish, Italian, Portuguese, Dutch, Polish
- Greek, Arabic, Japanese, Chinese, Korean, Vietnamese

Features:
- Core ASR manager with stateful decoder
- Mel spectrogram preprocessing compatible with Cohere models
- CLI transcription command with language selection
- Benchmark command supporting LibriSpeech and FLEURS datasets
- INT8 quantized models for efficient inference

Usage:
  swift run fluidaudiocli cohere-transcribe audio.wav --language ja_jp
  swift run fluidaudiocli cohere-benchmark --dataset fleurs --languages en_us,fr_fr
  swift run fluidaudiocli download --dataset fleurs

Models: FluidInference/cohere-transcribe-03-2026-coreml
Add HuggingFace integration for Cohere Transcribe CoreML models with INT8 quantization support.

Changes:
- Add CohereTranscribe model names enum with encoder, decoder, and vocab
- Add Cohere repository definitions (FP16 and INT8 variants)
- Update CohereAsrModels to use stateful decoder from HuggingFace
- Support automatic download from FluidInference/cohere-transcribe-03-2026-coreml

Model details:
- 35-second window architecture (3500 frames → 438 encoder outputs)
- INT8 W8A16 quantization (~2.0 GB vs ~4.2 GB FP16)
- 14-language support with token primer system
- Quality: 16.44% WER on LibriSpeech test-clean (INT8)
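
The window constants above fit together as follows (the 160-sample hop is implied by 3500 frames covering 35 s at 16 kHz; the 8x encoder downsampling is an inference from 3500 frames mapping to 438 outputs, not a documented value):

```swift
import Foundation

let sampleRate = 16_000
let hopLength = 160                                   // 10 ms per mel frame
let melFrames = 3_500
let maxSeconds = melFrames * hopLength / sampleRate   // 35 s window
let maxSamples = maxSeconds * sampleRate              // 560,000 samples
// Assumed 8x temporal downsampling in the encoder: ceil(3500 / 8) = 438.
let encoderOutputs = Int((Double(melFrames) / 8).rounded(.up))
```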
@Alex-Wengg Alex-Wengg force-pushed the feat/cohere-transcribe-int8-integration branch from 00d3e72 to 4eb8c0e Compare April 6, 2026 21:40

github-actions Bot commented Apr 6, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 644.1x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 661.4x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%


github-actions Bot commented Apr 6, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 11.83x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 12.972 | 14.6 | Fetching diarization models |
| Model Compile | 5.559 | 6.3 | CoreML compilation |
| Audio Load | 0.039 | 0.0 | Loading audio file |
| Segmentation | 23.306 | 26.3 | VAD + speech detection |
| Embedding | 88.464 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.117 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 88.737 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 111.9s processing • Test runtime: 1m 57s • 04/23/2026, 11:02 AM EST


github-actions Bot commented Apr 6, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
|---|---|---|
| Median RTFx | 0.06x | ~2.5x |
| Overall RTFx | 0.06x | ~2.5x |

Runtime: 3m56s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.


github-actions Bot commented Apr 6, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.77x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 69.7s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.070s | Average chunk processing time |
| Max Chunk Time | 0.139s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m21s • 04/23/2026, 11:02 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O


github-actions Bot commented Apr 6, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.01x | |
| test-other | 1.35% | 0.00% | 3.40x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.14x | |
| test-other | 1.22% | 0.00% | 2.41x | |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.48x | Streaming real-time factor |
| Avg Chunk Time | 2.106s | Average time to process each chunk |
| Max Chunk Time | 4.287s | Maximum chunk processing time |
| First Token | 2.495s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.46x | Streaming real-time factor |
| Avg Chunk Time | 1.913s | Average time to process each chunk |
| Max Chunk Time | 2.733s | Maximum chunk processing time |
| First Token | 2.145s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 8m1s • 04/23/2026, 11:11 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard


github-actions Bot commented Apr 6, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (202.5 KB)

Runtime: 0m41s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.


github-actions Bot commented Apr 6, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m51s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.


github-actions Bot commented Apr 6, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 20.11x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 9.273 | 17.8 | Fetching diarization models |
| Model Compile | 3.974 | 7.6 | CoreML compilation |
| Audio Load | 0.077 | 0.1 | Loading audio file |
| Segmentation | 15.641 | 30.0 | Detecting speech regions |
| Embedding | 26.068 | 50.0 | Extracting speaker voices |
| Clustering | 10.427 | 20.0 | Grouping same speakers |
| Total | 52.181 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 52.1s diarization time • Test runtime: 2m 10s • 04/23/2026, 11:09 AM EST


github-actions Bot commented Apr 6, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.2% | - | - |
| Speaker Error | 8.8% | - | - |
| RTFx | 4.5x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 5m 48s • 2026-04-23T15:18:45.259Z

devin-ai-integration[bot]

This comment was marked as resolved.

Fixes 4 critical issues identified in PR #487 review:

1. **KV cache buffer overflow** (CohereAsrManager.swift:197):
   - Bound decode loop with min(maxNewTokens, maxSeqLen=108)
   - Prevents out-of-bounds cache access when step >= 108

2. **Unsafe pointer rebound** (CohereMelSpectrogram.swift:174-178):
   - Move vDSP_ctoz call inside withMemoryRebound closure
   - Fixes undefined behavior from escaped pointer

3. **Division by zero** (CohereBenchmark.swift:229, 393-394):
   - Add empty array checks before computing averages
   - Prevents NaN when all transcriptions fail

4. **Missing unit tests**:
   - Add CohereAsrConfigTests (config validation, special tokens, languages)
   - Add CohereMelSpectrogramTests (mel computation, padding, edge cases)
   - Add CohereTokenConversionTests (token-to-text, special token filtering)

All fixes follow project coding standards and ensure memory safety.
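
Fix 1 above, bounding the decode loop by the KV-cache capacity, looks roughly like this (the names and the step callback are illustrative, not the actual CohereAsrManager API):

```swift
/// Autoregressive decode bounded by both the token budget and the fixed
/// 108-slot KV-cache capacity, so `step` can never index past the cache.
let maxSeqLen = 108

func decode(maxNewTokens: Int, promptCount: Int,
            stepFn: (Int) -> Int?, eos: Int) -> [Int] {
    var tokens: [Int] = []
    let limit = min(promptCount + maxNewTokens, maxSeqLen)  // the fix
    for step in promptCount..<limit {
        guard let next = stepFn(step), next != eos else { break }
        tokens.append(next)
    }
    return tokens
}
```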
devin-ai-integration[bot]

This comment was marked as resolved.

Implements the Parakeet pattern for cache-external decoding of Cohere
Transcribe models. Cache is managed in Swift and passed to/from CoreML
as inputs/outputs each step.

Key features:
- CohereDecoderState: Manages 16 KV cache arrays (8 layers × 2)
- CohereModelInference: Executes decoder with cache-external pattern
- CohereStatelessManager: Stateless O(n²) decoder (simpler alternative)
- Correct EOS token (3, not 151643) verified from model config

Implementation:
- Cache-external achieves O(n) complexity with 11.95% WER
- Growing attention mask: [1,1,1,1] → [1,1,1,108]
- Compatible with .mlmodelc compiled models for faster loading
- Tested and verified in mobius (see commit 5d12a80)

Files:
- CohereDecoderState.swift - Cache state management
- CohereModelInference.swift - Decoder execution
- CohereStatelessManager.swift - Stateless alternative (EOS fixed)
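
The fixed-capacity cache implies an additive mask over the 108 slots that grows with the number of valid positions, as in the [1,1,1,1] → [1,1,1,108] progression above. A minimal sketch (the -10,000 bias is an assumed FP16-safe stand-in for negative infinity, not necessarily the shipped value):

```swift
/// Additive attention mask for a fixed-size KV cache: already-filled slots
/// contribute 0, unused slots a large negative bias that survives FP16.
func cacheMask(validLength: Int, capacity: Int = 108) -> [Float] {
    (0..<capacity).map { i -> Float in i < validLength ? 0 : -10_000 }
}
```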
devin-ai-integration[bot]

This comment was marked as resolved.

… Cohere ASR

Three fixes for Cohere ASR compatibility:

1. **Mel padding**: 3001 → 3500 frames to match encoder input shape
   - CohereAsrManager.swift: All 3001 references changed to 3500
   - CohereStatelessManager.swift: All 3001 references changed to 3500

2. **Encoder output name**: encoder_outputs → hidden_states
   - Matches the actual encoder model export (see mobius export scripts)

3. **Explicit self capture**: maxSeqLen in closure
   - CohereStatelessManager.swift: Added explicit self.maxSeqLen

These align with the encoder/decoder models exported in mobius.

Note: Full WER benchmark requires matching decoder models. The current
auto-downloaded stateful decoder has a different interface than the
cache-external decoder implemented in CohereDecoderState/CohereModelInference.
devin-ai-integration[bot]

This comment was marked as resolved.

After extensive testing with FLEURS multilingual dataset, the Cohere
Transcribe cache-external decoder only works reliably for Spanish
(18-24% WER). Other languages hallucinate with >50% WER, producing
Arabic/Polish/wrong-language output.

## Test Results (10 samples per language)

- Spanish: 18.6% WER ✅ Production ready
- English: 57.5% WER ❌ Hallucinating
- French: 88.0% WER ❌ Hallucinating
- Chinese: 113.5% WER ❌ Hallucinating

## Attempted Fixes (All Failed)

1. Language token prompts (10-token sequence) - Made it worse (142% WER)
2. Language embeddings in decoder V2 - No improvement (57.5% WER)
3. Multilingual encoder (traced with 4 languages) - No improvement

## Root Cause

The encoder outputs language-agnostic hidden states that don't preserve
which language was spoken. The decoder's language conditioning cannot
override the encoder's lost language information. This is a fundamental
issue with the CoreML export process.

## Changes

- Add warning in CohereAsrManager.transcribe() for non-Spanish languages
- Document limitation in CohereAsrConfig, CohereAsrModels docstrings
- Add language parameter support (full prompt sequence implementation)
- Update FLEURS benchmark to support language parameter

## Recommendation

For multilingual ASR, use Whisper or Qwen3 models instead. Cache-external
decoder should only be deployed for Spanish-language transcription.

Related investigation files (in mobius/):
- CACHE_EXTERNAL_ANALYSIS.md - Python vs Swift comparison
- MULTILINGUAL_INVESTIGATION_FINAL.md - Comprehensive test results
devin-ai-integration[bot]

This comment was marked as resolved.

Added language enum and configuration to support multilingual ASR testing.
After extensive investigation (see mobius/models/stt/cohere-transcribe-03-2026/coreml/RESEARCH_REPORT.md),
confirmed that cache-external decoder only works reliably for Spanish.

Changes:
- CohereAsrConfig: Added Language enum with 14 languages and token IDs
- CohereAsrConfig: Added promptSequence() method for language-specific prompts
- CohereAsrManager: Added language parameter to transcribe()
- CohereAsrManager: Added warning logs for non-Spanish languages
- CohereAsrModels: Added DecoderType detection (stateful vs cache-external)

Language support tested on FLEURS dataset (40 samples):
- Spanish: 18.6% WER ✅ (production ready)
- English: 57.5% WER ❌ (hallucinating)
- French: 88.0% WER ❌ (hallucinating)
- Chinese: 113.5% WER ❌ (hallucinating)

Recommendation: Deploy for Spanish-only. For multilingual, use Whisper or Qwen3.

See research report in mobius repo for full investigation details.
- Add CohereFixedPipeline: self-contained INT8-encoder + FP16-decoder
  pipeline with fp16-safe cross-attention mask (vImage), repetition
  penalty, no-repeat-ngram, and SentencePiece byte-fallback detok.
- Add cohere-mixed CLI command to exercise the mixed pipeline on a
  single audio file with per-language config.
- Add cohere-mixed-benchmark CLI command: 14-language FLEURS benchmark
  with per-language WER/CER/RTFx, JSON output, and --auto-download.
- Fix CohereAsrManager macOS SDK 26.4 compatibility: use Swift-refined
  makeState() (newState is NS_REFINED_FOR_SWIFT / macOS-unavailable)
  and gate decodeStateful with @available(macOS 15, iOS 18, *) so
  transcribe() remains usable on macOS 14 / iOS 17.

Verified end-to-end on english_original.wav and multilingual FLEURS
samples (en_us, fr_fr, cmn_hans_cn); all decode correctly.
devin-ai-integration[bot]

This comment was marked as resolved.

Extends FLEURSBenchmark.supportedLanguages with the 6 non-European
languages required to cover the 14-language Cohere Transcribe matrix
(pt_br, ar_eg, ja_jp, cmn_hans_cn, ko_kr, vi_vn). The 8 European
languages Cohere supports were already in the map.

Adds two standalone scripts under Scripts/ for running the hybrid
INT8-encoder + FP16-decoder benchmark one language at a time:

  - run_cohere_per_lang.sh     resumable per-language runner (each
                               language writes its own JSON so the
                               run survives interruption / cleanup
                               segfaults that happen after results
                               are persisted)
  - fetch_fleurs_from_google.py adapter that pulls the 5 languages
                               not yet in FluidInference's FLEURS
                               mirror (ar_eg, ja_jp, cmn_hans_cn,
                               ko_kr, vi_vn) from google/fleurs on
                               HuggingFace and materialises them in
                               the cache layout expected by the CLI.

Whitelists both new scripts in .gitignore alongside the existing
parakeet/diarizer benchmark helpers.
Empty MAX_ARGS=() array expanded to "${MAX_ARGS[@]}" triggered an
unbound-variable error under set -u on some bash/zsh versions, which
broke the uncapped full-splits run. Use the defensive
${MAX_ARGS[@]+"${MAX_ARGS[@]}"} expansion so the runner works
both with and without MAX_FILES set.
…ults

Full FLEURS benchmark numbers for the INT8 encoder + FP16 cache-external
decoder hybrid across all 14 supported languages (9,911 samples total),
measured via the per-language runner on M4 Pro, Tahoe 26.0.

Also documents:
- CohereFixedPipeline Swift API (load + transcribe)
- cohere-mixed / cohere-mixed-benchmark CLI surface
- Approximate comparison vs Cohere's Figure 4 reference (with the caveat
  that Cohere's numbers are averaged across FLEURS + Common Voice 17.0 +
  MLS + Wenet, not FLEURS-only)
- Why Japanese/Mandarin WER is meaningless (no word boundaries) and CER
  should be read instead
- Single-chunk (35 s) and language-hint requirements
- Add Δ column and per-language sample context
- Explain the two gap sources (dataset mix vs INT8 quantization)
- Flag the ja CER win and ko CER outlier explicitly
devin-ai-integration[bot]

This comment was marked as resolved.

- Guard CohereMelSpectrogram.compute against audio shorter than nFFT/2+1
  (prevents OOB crash in reflectionPad)
- Fix CohereAsrModels.modelsExist to check the cache-external decoder that
  load() actually consumes, and accept either .mlmodelc or .mlpackage so the
  local HF cache isn't re-downloaded on every run
- Correct CohereAsrConfig.maxAudioSeconds (30s -> 35s) and maxSamples
  (480k -> 560k) to match the [1,128,3500] encoder input
- Switch the four Cohere library loggers to AppLogger per CLAUDE.md
- Update tests to match new fail-safe short-audio semantics
- Fix a pre-existing Double/Float type error in
  testComputeWithSineWaveProducesNonZeroMel
devin-ai-integration[bot]

This comment was marked as low quality.

- CohereMelSpectrogram: split DC and Nyquist out of vDSP packed-format
  bin 0 so the last bin holds the correct Nyquist power.
- CohereBenchmark: scale WER/CER from fractions to percentages to match
  the rest of the CLI (output was displaying "WER: 0.06%" instead of
  "WER: 5.63%").
- CohereTranscribeCommand: parse --language/-l so users can actually
  transcribe non-English audio; plumb the value through to
  manager.transcribe() and document it in the help text.
- CohereFixedPipeline: during the prompt-feeding phase, record the
  actually-consumed prompt token in the repetition-penalty history
  instead of the model's discarded prediction, so noRepeatNgram no
  longer suppresses valid output tokens based on phantom predictions.
- FluidAudioCLI: list cohere-mixed / cohere-mixed-benchmark in the
  command help and drop the bogus ja_jp example (raw values are 2-letter
  codes).
Tier A — fully dead code (zero callers anywhere in the tree):
- CohereStatelessManager.swift
- CohereModelInference.swift
- CohereDecoderState.swift (only referenced by CohereModelInference)

Tier B — "original buggy" pipeline that CohereFixedPipeline was written
to replace (per its own header comment), kept in parallel until now:
- CohereAsrManager.swift
- CohereAsrModels.swift (only consumer was CohereAsrManager + its CLI)
- CohereMelSpectrogram.swift (CohereFixedPipeline has its own internal
  CohereFixedMelSpectrogram; only the buggy manager + dead stateless
  manager + tests used this one)
- CohereTranscribeCommand.swift / CohereBenchmark.swift (only entries
  to the buggy manager)
- CohereMelSpectrogramTests.swift / CohereTokenConversionTests.swift

Lifted CohereAsrError into CohereFixedPipeline.swift (the lone surviving
consumer). Updated ModelNames.CohereTranscribe.requiredModels to point
at the cache-external compiled artifacts that actually ship; removed
the dangling decoderFile alias. Trimmed cohere-transcribe and
cohere-benchmark from the CLI dispatch and help text — cohere-mixed
and cohere-mixed-benchmark (which run the canonical CohereFixedPipeline,
the source of the published FLEURS numbers) remain.

Net: -2,584 deleted, ~28 added. swift build clean.
devin-ai-integration[bot]

This comment was marked as resolved.

…ipelines are gone

The 'Fixed' in CohereFixedPipeline only made sense as a contrast with the
buggy CohereAsrManager (deleted in the previous commit), and 'Mixed' in
the CLI commands referred to mixed-precision contrasting with the
single-precision FP16 path (also gone). With the parallel pipelines
removed, the qualifiers are noise.

Renames:
- CohereFixedPipeline       -> CoherePipeline
- CohereFixedMelSpectrogram -> CohereMelSpectrogram (the previous owner
                                of that name was deleted, so it is free)
- CohereMixedCommand        -> CohereTranscribeCommand
- CohereMixedBenchmark      -> CohereBenchmark
- CohereMixedBenchmarkResult -> CohereBenchmarkResult
- cohere-mixed              -> cohere-transcribe (CLI command)
- cohere-mixed-benchmark    -> cohere-benchmark  (CLI command)
- logger category 'CohereFixedPipeline' -> 'CoherePipeline'

Scripts/run_cohere_per_lang.sh and Documentation/ASR/Cohere.md updated
to use the new command names. swift build clean.
devin-ai-integration[bot]

This comment was marked as resolved.

These two scripts were PR-local helpers, not generally-useful FluidAudio
benchmark tooling:
- Scripts/run_cohere_per_lang.sh wraps the cohere-benchmark CLI in a
  per-language loop. Anyone reproducing the FLEURS table can invoke
  cohere-benchmark directly per the docs.
- Scripts/fetch_fleurs_from_google.py mirrors a 5-language slice of the
  google/fleurs dataset; the cohere-benchmark --auto-download flag
  already pulls the FluidInference FLEURS subset.

Also drops the two new !Scripts/ exceptions added by this PR and the
dangling docs reference to run_cohere_per_lang.sh.
The CoherePipeline integration only ever exposes the INT8-encoder + FP16
cache-external-decoder hybrid. Carrying a separate cohereTranscribeCoreml
(f16) Repo case alongside cohereTranscribeCoremlInt8 was dead surface:
nothing in Sources/ or Tests/ references either case explicitly.

- Collapse the two enum cases into a single .cohereTranscribeCoreml
  pointing at FluidInference/cohere-transcribe-03-2026-coreml/q8.
- Drop the unused decoderStateful / encoderFile / decoderCacheExternalFile
  (.mlpackage) entries from ModelNames.CohereTranscribe — the stateful
  decoder pipeline was already removed in 65487ec, and the runtime
  loader only consumes .mlmodelc compiled artifacts.
@Alex-Wengg Alex-Wengg changed the title feat(asr): Add Cohere Transcribe with INT8 support feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) Apr 23, 2026
1. CohereAsrConfig.MelSpec.nFFT was 1024 but the actual FFT used by
   CohereMelSpectrogram is nextPowerOfTwo(winLength=400) = 512
   (CoherePipeline.swift:88). The header comment at CoherePipeline.swift:6
   already states n_fft=512. Anyone using the public constant for buffer
   sizing or frequency-bin math would get wrong results.

2. Decoder loop was missing the first real output token from the
   penalty-history buffer. At step == prompt.count - 1, the previous
   conditional appended currentToken (last prompt token) and then rotated
   nextToken (the first output token) into currentToken; on the next
   iteration it appended nextToken (the SECOND output) instead — so the
   first output never appeared in allTokens. applyRepetitionPenalty and
   applyNoRepeatNgram could not penalise repeats of the first output or
   detect n-grams beginning with it.

   Replace the conditional with the unified `allTokens.append(currentToken)`
   so we always record what was actually consumed at this step. The first
   output is then recorded on the iteration after it is generated, once it
   has rotated into currentToken.

   Also update the test that asserted the wrong nFFT value.
devin-ai-integration[bot]

This comment was marked as resolved.

Cohere transcribe benchmark previously only ran on FLEURS. Add a
`--dataset librispeech|fleurs` switch (default: fleurs) and a
`--subset` flag for LibriSpeech (default: test-clean).

LibriSpeech path reuses Parakeet's `ASRBenchmark.downloadLibriSpeech`
+ `getLibriSpeechDirectory()` for cache layout, walks the `*.trans.txt`
files under the subset directory, and routes through the same per-file
inference loop as FLEURS (now extracted into a shared `transcribeFiles`
helper). Cohere is single-chunk (35s max) so files exceeding the limit
are skipped with a warning rather than silently failing.

Renamed the default output JSON to `cohere_benchmark_results.json` and
updated `printUsage` + the summary header now that this is no longer
FLEURS-only.
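
Each `*.trans.txt` line pairs an utterance ID with its transcript, separated by the first space. A sketch of the per-line parse (not the actual helper from the commit; the sample line in the test is illustrative):

```swift
/// Parse one LibriSpeech transcript line of the form "<utterance-id> <TEXT>".
/// Returns nil for malformed lines so callers can skip them.
func parseTransLine(_ line: String) -> (id: String, text: String)? {
    let parts = line.split(separator: " ", maxSplits: 1)
    guard parts.count == 2 else { return nil }
    return (String(parts[0]), String(parts[1]))
}
```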
`String(format: "%-14s ...", swiftString)` is fatal on macOS 26: Swift's
String maps to %@, the format specifier says %s (a C string), and the
Foundation runtime now aborts on the mismatch. The benchmark would write
its JSON output successfully and then crash in the summary print right
before exit (SIGABRT, exit 139), making the run look failed even though
results were good.

Replace the format-string column layout with a small `row(...)` helper
plus a `String.leftPad(to:)` extension so column widths and decimal
formatting stay readable without going through `%s`.
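
The described replacement amounts to padding columns manually instead of going through `%s`. A sketch of such a helper (names follow the commit message, but the real implementation may differ):

```swift
extension String {
    /// Pad with spaces on the left to a fixed column width
    /// (no-op if the string is already at least that wide).
    func leftPad(to width: Int) -> String {
        count >= width ? self : String(repeating: " ", count: width - count) + self
    }
}

/// Join (value, columnWidth) pairs into one right-aligned table row.
func row(_ columns: [(String, Int)]) -> String {
    columns.map { $0.0.leftPad(to: $0.1) }.joined(separator: "  ")
}
```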
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 2 new potential issues.

View 16 additional findings in Devin Review.


Comment thread Sources/FluidAudioCLI/Commands/ASR/Cohere/CohereBenchmark.swift Outdated
Comment on lines +6 to +153
final class CohereAsrConfigTests: XCTestCase {

    // MARK: - Config Constants

    func testSampleRateIs16kHz() {
        XCTAssertEqual(CohereAsrConfig.sampleRate, 16000)
    }

    func testMaxAudioDurationIs35Seconds() {
        // Matches the encoder mel input [1, 128, 3500] (3500 * 160 / 16000 = 35s).
        XCTAssertEqual(CohereAsrConfig.maxAudioSeconds, 35.0)
    }

    func testMaxSamplesMatchesDurationAndSampleRate() {
        let expectedSamples = Int(CohereAsrConfig.maxAudioSeconds * Float(CohereAsrConfig.sampleRate))
        XCTAssertEqual(CohereAsrConfig.maxSamples, expectedSamples)
        XCTAssertEqual(CohereAsrConfig.maxSamples, 560_000)
    }

    func testVocabSizeIs16384() {
        XCTAssertEqual(CohereAsrConfig.vocabSize, 16_384)
    }

    func testMaxSeqLenIs108() {
        // KV cache capacity
        XCTAssertEqual(CohereAsrConfig.maxSeqLen, 108)
    }

    func testHeadDimMatchesDecoderDimension() {
        let expectedHeadDim = CohereAsrConfig.decoderHiddenSize / CohereAsrConfig.numDecoderHeads
        XCTAssertEqual(CohereAsrConfig.headDim, expectedHeadDim)
        XCTAssertEqual(CohereAsrConfig.headDim, 128)
    }

    // MARK: - Special Tokens

    func testSpecialTokenIdsAreInRange() {
        let vocabSize = CohereAsrConfig.vocabSize
        let tokenIds = [
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ]

        for tokenId in tokenIds {
            XCTAssertGreaterThanOrEqual(tokenId, 0, "Token ID \(tokenId) should be non-negative")
            XCTAssertLessThan(tokenId, vocabSize, "Token ID \(tokenId) should be < vocabSize (\(vocabSize))")
        }
    }

    func testSpecialTokensAreUnique() {
        let tokens = Set([
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ])
        XCTAssertEqual(tokens.count, 5, "Special tokens should be unique")
    }

    func testEosTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.eosToken, 3)
    }

    func testStartTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.startToken, 4)
    }

    // MARK: - Mel Spectrogram Parameters

    func testMelSpecParametersAreValid() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.nFFT, 512)
        XCTAssertEqual(CohereAsrConfig.MelSpec.hopLength, 160)
        XCTAssertEqual(CohereAsrConfig.MelSpec.nMels, 128)
        XCTAssertEqual(CohereAsrConfig.numMelBins, 128)
    }

    func testMelSpecFrequencyRange() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMin, 0.0)
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMax, 8000.0)
        XCTAssertLessThanOrEqual(
            CohereAsrConfig.MelSpec.fMax,
            Float(CohereAsrConfig.sampleRate) / 2.0,
            "fMax should not exceed Nyquist frequency"
        )
    }

    func testPreemphasisIsValid() {
        XCTAssertGreaterThan(CohereAsrConfig.MelSpec.preemphasis, 0.0)
        XCTAssertLessThanOrEqual(CohereAsrConfig.MelSpec.preemphasis, 1.0)
    }

    func testNFFTIsPowerOfTwo() {
        let nFFT = CohereAsrConfig.MelSpec.nFFT
        XCTAssertTrue(nFFT > 0 && (nFFT & (nFFT - 1)) == 0, "nFFT should be a power of 2")
    }

    // MARK: - Language

    func testLanguageRawValuesAreIsoCodes() {
        XCTAssertEqual(CohereAsrConfig.Language.english.rawValue, "en")
        XCTAssertEqual(CohereAsrConfig.Language.french.rawValue, "fr")
        XCTAssertEqual(CohereAsrConfig.Language.german.rawValue, "de")
        XCTAssertEqual(CohereAsrConfig.Language.spanish.rawValue, "es")
        XCTAssertEqual(CohereAsrConfig.Language.italian.rawValue, "it")
        XCTAssertEqual(CohereAsrConfig.Language.portuguese.rawValue, "pt")
        XCTAssertEqual(CohereAsrConfig.Language.dutch.rawValue, "nl")
        XCTAssertEqual(CohereAsrConfig.Language.polish.rawValue, "pl")
        XCTAssertEqual(CohereAsrConfig.Language.greek.rawValue, "el")
        XCTAssertEqual(CohereAsrConfig.Language.arabic.rawValue, "ar")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.rawValue, "ja")
        XCTAssertEqual(CohereAsrConfig.Language.chinese.rawValue, "zh")
        XCTAssertEqual(CohereAsrConfig.Language.vietnamese.rawValue, "vi")
        XCTAssertEqual(CohereAsrConfig.Language.korean.rawValue, "ko")
    }

    func testAllLanguagesHaveEnglishNames() {
        for language in CohereAsrConfig.Language.allCases {
            XCTAssertFalse(language.englishName.isEmpty, "\(language) should have a non-empty English name")
        }
    }

    func testLanguageCount() {
        XCTAssertEqual(CohereAsrConfig.Language.allCases.count, 14, "Cohere supports 14 languages")
    }

    func testEnglishNameExamples() {
        XCTAssertEqual(CohereAsrConfig.Language.english.englishName, "English")
        XCTAssertEqual(CohereAsrConfig.Language.french.englishName, "French")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.englishName, "Japanese")
    }

    // MARK: - Model Architecture

    func testEncoderParameters() {
        XCTAssertEqual(CohereAsrConfig.encoderHiddenSize, 1280)
        XCTAssertEqual(CohereAsrConfig.numEncoderLayers, 48)
    }

    func testDecoderParameters() {
        XCTAssertEqual(CohereAsrConfig.decoderHiddenSize, 1024)
        XCTAssertEqual(CohereAsrConfig.numDecoderLayers, 8)
        XCTAssertEqual(CohereAsrConfig.numDecoderHeads, 8)
    }
}
Contributor


🟡 Missing unit tests for CoherePipeline and CohereMelSpectrogram utility functions

AGENTS.md mandates "Add unit tests when writing new code." The PR adds ~800 lines of pipeline logic in CoherePipeline.swift containing many testable pure functions (applyRepetitionPenalty, applyNoRepeatNgram, argmax, convertTokensToText, parseByteFallback, encoderValidFrames, buildCrossAttentionMask, copyLogitsFloat32, zeroFill) and CohereMelSpectrogram (compute, validFrameCount, padOrTruncate, slaneyMelFilter), but only provides config constant tests in CohereAsrConfigTests.swift. Other ASR modules in the repo test their utility functions (e.g., AudioMelSpectrogramTests, TdtDecoderTests, Qwen3RoPETests).

Prompt for agents
AGENTS.md requires unit tests for new code. The PR adds CoherePipeline with many testable pure static functions but only tests CohereAsrConfig constants. Add a new test file Tests/FluidAudioTests/ASR/Cohere/CoherePipelineTests.swift with tests for at minimum: applyRepetitionPenalty (penalty > 1 reduces positive logits, amplifies negative), applyNoRepeatNgram (forbids completing seen n-grams), argmax (returns index of max value), convertTokensToText (handles byte-fallback tokens, special token filtering, Unicode replacement character), parseByteFallback (parses <0xHH> patterns, rejects invalid), encoderValidFrames (ceiling division, clamping), and padOrTruncate (truncation, padding, passthrough). These are all static/class methods on CoherePipeline and CohereMelSpectrogram that can be tested without loading CoreML models.
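As an illustration of the kind of pure function the review is asking to cover, here is a hypothetical reimplementation of a `<0xHH>` byte-fallback parser; the real `CoherePipeline.parseByteFallback` may differ in signature and edge-case handling:

```swift
// Parse a SentencePiece byte-fallback token of the exact form "<0xHH>"
// into its byte value; return nil for anything else.
func parseByteFallback(_ token: String) -> UInt8? {
    guard token.hasPrefix("<0x"), token.hasSuffix(">"), token.count == 6 else {
        return nil
    }
    let hex = token.dropFirst(3).dropLast(1)
    return UInt8(hex, radix: 16)
}
```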


Long LibriSpeech runs (2620 files in test-clean) take 10+ hours wall
time. A crash near the end loses everything because saveResults only
ran on success. Add --checkpoint-every <n> (default 100) so every N
successful transcriptions persist the running results array to the
output JSON. On crash the user keeps the last multiple-of-N results.
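The checkpointing behavior can be sketched as follows, with `transcribe` and `save` standing in for the real per-file inference and JSON-writing code (names are illustrative):

```swift
// Persist the running results array after every N successful
// transcriptions, plus a final save on success. On a crash mid-run, the
// last multiple-of-N checkpoint survives on disk.
func runWithCheckpoints<T>(items: [T], checkpointEvery: Int,
                           transcribe: (T) -> String?,
                           save: ([String]) -> Void) -> [String] {
    var results: [String] = []
    for item in items {
        guard let text = transcribe(item) else { continue }
        results.append(text)
        if results.count % checkpointEvery == 0 {
            save(results)
        }
    }
    save(results)
    return results
}
```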
The 35-second per-call limit traces directly to the upstream
cohere-pytorch config (`max_audio_clip_s: 35`; 100 fps × 35 = 3500 mel
frames). Document that provenance, along with the upstream
`overlap_chunk_second: 5` chunking strategy we don't yet implement in
Swift, so users understand why long-form audio is skipped rather than
stitched.
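The 35 s cap and the 3500-frame mel width are two views of the same number; a quick check of the arithmetic from the constants quoted above:

```swift
// 16 kHz audio with a 160-sample hop gives 100 frames per second, so the
// 35 s upstream cap lands exactly on the encoder's [1, 128, 3500] input.
let sampleRate = 16_000
let hopLength = 160                       // 16000 / 160 = 100 fps
let maxSeconds = 35
let maxSamples = maxSeconds * sampleRate  // 560_000 samples
let melFrames = maxSamples / hopLength    // 3_500 frames
```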

Full 2620-utterance run on M4 Pro: WER 1.77%, CER 0.60%, total RTFx
1.72x. Competitive with Parakeet TDT 0.6B v3 (~1.7%) and Whisper
large-v3 (~1.8%) while running ~1.7x faster than real time.
- Move the duplicated `generateInlineDiff` edit-distance diff renderer
  (byte-identical copies in AsrBenchmark.swift and FleursBenchmark.swift,
  ~110 lines each) to a shared `InlineDiff.generate(reference:hypothesis:)`
  utility in `FluidAudioCLI/Utils/InlineDiff.swift`.
- Fix Devin Review finding in CohereBenchmark.swift line 263: when
  `--max-files` was omitted for FLEURS auto-download,
  `samplesPerLanguage: maxFiles ?? 100` silently capped each language at
  100 samples. Switch to `Int.max`, which is FLEURSBenchmark's documented
  sentinel for "download all available".
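A minimal sketch of the sentinel fix, with the function name invented for illustration:

```swift
// Missing --max-files now means "all available" via Int.max, the
// documented FLEURSBenchmark sentinel, instead of silently capping at 100.
func samplesPerLanguage(maxFiles: Int?) -> Int {
    maxFiles ?? Int.max  // previously `maxFiles ?? 100`
}
```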

- CohereTranscribeCommand.swift:162 — guard RTFx division against
  zero-duration totalSeconds with `max(_, 1e-9)`, matching the
  convention already used in CohereBenchmark.swift:463.
- CoherePipeline.swift:596-599 — flatten nested if in decoder loop
  to comply with the AGENTS.md "avoid nested ifs" rule.

Both are no-ops on the correct-input happy path.
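The division guard can be sketched like this (which quantity sits in the denominator is an assumption here; the names are illustrative):

```swift
// Clamp the denominator so a zero-length or mis-measured run can't divide
// by zero, mirroring the convention noted for CohereBenchmark.swift.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / max(processingSeconds, 1e-9)
}
```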
@Alex-Wengg Alex-Wengg merged commit b10bdcb into main Apr 23, 2026
12 checks passed
@Alex-Wengg Alex-Wengg deleted the feat/cohere-transcribe-int8-integration branch April 23, 2026 14:59