feat(asr): add Cohere Transcribe (INT8 encoder + FP16 cache-external decoder) #487
Add Cohere Transcribe CoreML ASR implementation supporting 14 languages:
- English, French, German, Spanish, Italian, Portuguese, Dutch, Polish
- Greek, Arabic, Japanese, Chinese, Korean, Vietnamese

Features:
- Core ASR manager with stateful decoder
- Mel spectrogram preprocessing compatible with Cohere models
- CLI transcription command with language selection
- Benchmark command supporting LibriSpeech and FLEURS datasets
- INT8 quantized models for efficient inference

Usage:
  swift run fluidaudiocli cohere-transcribe audio.wav --language ja_jp
  swift run fluidaudiocli cohere-benchmark --dataset fleurs --languages en_us,fr_fr
  swift run fluidaudiocli download --dataset fleurs

Models: FluidInference/cohere-transcribe-03-2026-coreml
Add HuggingFace integration for Cohere Transcribe CoreML models with INT8 quantization support.

Changes:
- Add CohereTranscribe model names enum with encoder, decoder, and vocab
- Add Cohere repository definitions (FP16 and INT8 variants)
- Update CohereAsrModels to use stateful decoder from HuggingFace
- Support automatic download from FluidInference/cohere-transcribe-03-2026-coreml

Model details:
- 35-second window architecture (3500 frames → 438 encoder outputs)
- INT8 W8A16 quantization (~2.0 GB vs ~4.2 GB FP16)
- 14-language support with token primer system
- Quality: 16.44% WER on LibriSpeech test-clean (INT8)
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 111.9s processing • Test runtime: 1m 57s • 04/23/2026, 11:02 AM EST
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 3m56s
Note: The CI VM lacks a physical GPU; the CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m21s • 04/23/2026, 11:02 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 8m1s • 04/23/2026, 11:11 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows the HuggingFace Open ASR Leaderboard
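The RTFx figure these reports quote is just total audio duration over total processing time; a minimal sketch (in Python for illustration, with the same divide-by-zero guard convention the later CLI commits mention):

```python
def rtfx(total_audio_seconds: float, total_processing_seconds: float) -> float:
    """Real-Time Factor: audio duration / processing time. Higher is better;
    ~28x means an hour of audio decodes in roughly two minutes."""
    # Guard against zero-duration runs (mirrors the max(_, 1e-9) convention).
    return total_audio_seconds / max(total_processing_seconds, 1e-9)
```

For example, the diarization run above (1049.0 s of audio in 52.1 s of processing) works out to about 20x.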
PocketTTS Smoke Test ✅
Runtime: 0m41s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU; audio quality and performance may differ from Apple Silicon.
Kokoro TTS Smoke Test ✅
Runtime: 0m51s
Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. The CI VM lacks a physical ANE; performance may differ from Apple Silicon.
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 52.1s diarization time • Test runtime: 2m 10s • 04/23/2026, 11:09 AM EST
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 5m 48s • 2026-04-23T15:18:45.259Z
Fixes 4 critical issues identified in PR #487 review:

1. **KV cache buffer overflow** (CohereAsrManager.swift:197):
   - Bound decode loop with min(maxNewTokens, maxSeqLen=108)
   - Prevents out-of-bounds cache access when step >= 108
2. **Unsafe pointer rebound** (CohereMelSpectrogram.swift:174-178):
   - Move vDSP_ctoz call inside withMemoryRebound closure
   - Fixes undefined behavior from escaped pointer
3. **Division by zero** (CohereBenchmark.swift:229, 393-394):
   - Add empty array checks before computing averages
   - Prevents NaN when all transcriptions fail
4. **Missing unit tests**:
   - Add CohereAsrConfigTests (config validation, special tokens, languages)
   - Add CohereMelSpectrogramTests (mel computation, padding, edge cases)
   - Add CohereTokenConversionTests (token-to-text, special token filtering)

All fixes follow project coding standards and ensure memory safety.
Implements the Parakeet pattern for cache-external decoding of Cohere Transcribe models. The cache is managed in Swift and passed to/from CoreML as inputs/outputs each step.

Key features:
- CohereDecoderState: manages 16 KV cache arrays (8 layers × 2)
- CohereModelInference: executes the decoder with the cache-external pattern
- CohereStatelessManager: stateless O(n²) decoder (simpler alternative)
- Correct EOS token (3, not 151643) verified from model config

Implementation:
- Cache-external achieves O(n) complexity with 11.95% WER
- Growing attention mask: [1,1,1,1] → [1,1,1,108]
- Compatible with .mlmodelc compiled models for faster loading
- Tested and verified in mobius (see commit 5d12a80)

Files:
- CohereDecoderState.swift: cache state management
- CohereModelInference.swift: decoder execution
- CohereStatelessManager.swift: stateless alternative (EOS fixed)
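In outline, the cache-external pattern keeps the KV cache in host memory, threads it through every decoder call as explicit inputs/outputs, and bounds the loop by the cache window. A Python sketch of the control flow (names, shapes, and the step function are illustrative assumptions, not the actual Swift API):

```python
NUM_LAYERS, MAX_SEQ_LEN = 8, 108

def make_cache():
    # 16 cache buffers: 8 layers x (key, value), pre-sized to the 108-token window.
    return [[0.0] * MAX_SEQ_LEN for _ in range(NUM_LAYERS * 2)]

def decode(step_fn, start_token, eos_token, max_new_tokens):
    """step_fn(token, cache, mask) stands in for one CoreML decoder call that
    consumes and returns the host-managed cache each step."""
    cache, token, out = make_cache(), start_token, []
    for step in range(min(max_new_tokens, MAX_SEQ_LEN)):  # never index past the cache
        mask = [1] * (step + 1)  # growing attention mask: [1,1,1,1] -> ... -> [1]*108
        token, cache = step_fn(token, cache, mask)
        if token == eos_token:
            break
        out.append(token)
    return out
```

The min(maxNewTokens, maxSeqLen) bound is the same fix the review commit later describes for the KV cache buffer overflow.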
… Cohere ASR

Three fixes for Cohere ASR compatibility:

1. **Mel padding**: 3001 → 3500 frames to match encoder input shape
   - CohereAsrManager.swift: all 3001 references changed to 3500
   - CohereStatelessManager.swift: all 3001 references changed to 3500
2. **Encoder output name**: encoder_outputs → hidden_states
   - Matches the actual encoder model export (see mobius export scripts)
3. **Explicit self capture**: maxSeqLen in closure
   - CohereStatelessManager.swift: added explicit self.maxSeqLen

These align with the encoder/decoder models exported in mobius.

Note: A full WER benchmark requires matching decoder models. The current auto-downloaded stateful decoder has a different interface than the cache-external decoder implemented in CohereDecoderState/CohereModelInference.
After extensive testing with the FLEURS multilingual dataset, the Cohere Transcribe cache-external decoder only works reliably for Spanish (18-24% WER). Other languages hallucinate with >50% WER, producing Arabic/Polish/wrong-language output.

## Test Results (10 samples per language)
- Spanish: 18.6% WER ✅ Production ready
- English: 57.5% WER ❌ Hallucinating
- French: 88.0% WER ❌ Hallucinating
- Chinese: 113.5% WER ❌ Hallucinating

## Attempted Fixes (All Failed)
1. Language token prompts (10-token sequence) - made it worse (142% WER)
2. Language embeddings in decoder V2 - no improvement (57.5% WER)
3. Multilingual encoder (traced with 4 languages) - no improvement

## Root Cause
The encoder outputs language-agnostic hidden states that don't preserve which language was spoken. The decoder's language conditioning cannot override the encoder's lost language information. This is a fundamental issue with the CoreML export process.

## Changes
- Add warning in CohereAsrManager.transcribe() for non-Spanish languages
- Document limitation in CohereAsrConfig, CohereAsrModels docstrings
- Add language parameter support (full prompt sequence implementation)
- Update FLEURS benchmark to support language parameter

## Recommendation
For multilingual ASR, use Whisper or Qwen3 models instead. The cache-external decoder should only be deployed for Spanish-language transcription.

Related investigation files (in mobius/):
- CACHE_EXTERNAL_ANALYSIS.md - Python vs Swift comparison
- MULTILINGUAL_INVESTIGATION_FINAL.md - comprehensive test results
Added language enum and configuration to support multilingual ASR testing. After extensive investigation (see mobius/models/stt/cohere-transcribe-03-2026/coreml/RESEARCH_REPORT.md), confirmed that the cache-external decoder only works reliably for Spanish.

Changes:
- CohereAsrConfig: added Language enum with 14 languages and token IDs
- CohereAsrConfig: added promptSequence() method for language-specific prompts
- CohereAsrManager: added language parameter to transcribe()
- CohereAsrManager: added warning logs for non-Spanish languages
- CohereAsrModels: added DecoderType detection (stateful vs cache-external)

Language support tested on FLEURS dataset (40 samples):
- Spanish: 18.6% WER ✅ (production ready)
- English: 57.5% WER ❌ (hallucinating)
- French: 88.0% WER ❌ (hallucinating)
- Chinese: 113.5% WER ❌ (hallucinating)

Recommendation: deploy for Spanish only. For multilingual, use Whisper or Qwen3. See the research report in the mobius repo for full investigation details.
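For context on the percentages above: WER is word-level edit distance divided by reference length, so a hallucinating model can exceed 100% when the hypothesis needs more edits than the reference has words. A Python sketch of the metric (an illustration, not the benchmark's actual scorer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edit distance / reference word count (may exceed 1.0)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic one-row dynamic-programming edit distance over words.
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                      # deletion
                       d[j - 1] + 1,                  # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution/match
            prev = cur
    return d[len(hyp)] / max(len(ref), 1)
```

A one-word reference against a three-word hallucination already scores 300%, which is how Chinese lands at 113.5% here.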
- Add CohereFixedPipeline: self-contained INT8-encoder + FP16-decoder pipeline with fp16-safe cross-attention mask (vImage), repetition penalty, no-repeat-ngram, and SentencePiece byte-fallback detok.
- Add cohere-mixed CLI command to exercise the mixed pipeline on a single audio file with per-language config.
- Add cohere-mixed-benchmark CLI command: 14-language FLEURS benchmark with per-language WER/CER/RTFx, JSON output, and --auto-download.
- Fix CohereAsrManager macOS SDK 26.4 compatibility: use the Swift-refined makeState() (newState is NS_REFINED_FOR_SWIFT / macOS-unavailable) and gate decodeStateful with @available(macOS 15, iOS 18, *) so transcribe() remains usable on macOS 14 / iOS 17.

Verified end-to-end on english_original.wav and multilingual FLEURS samples (en_us, fr_fr, cmn_hans_cn); all decode correctly.
Extends FLEURSBenchmark.supportedLanguages with the 6 non-European
languages required to cover the 14-language Cohere Transcribe matrix
(pt_br, ar_eg, ja_jp, cmn_hans_cn, ko_kr, vi_vn). The 8 European
languages Cohere supports were already in the map.
Adds two standalone scripts under Scripts/ for running the hybrid INT8-encoder + FP16-decoder benchmark one language at a time:

- run_cohere_per_lang.sh: resumable per-language runner (each language writes its own JSON, so the run survives interruption / cleanup segfaults that happen after results are persisted)
- fetch_fleurs_from_google.py: adapter that pulls the 5 languages not yet in FluidInference's FLEURS mirror (ar_eg, ja_jp, cmn_hans_cn, ko_kr, vi_vn) from google/fleurs on HuggingFace and materialises them in the cache layout expected by the CLI

Whitelists both new scripts in .gitignore alongside the existing parakeet/diarizer benchmark helpers.
An empty MAX_ARGS=() array expanded as "${MAX_ARGS[@]}" triggered an unbound-variable error under set -u on some bash/zsh versions (bash before 4.4 treats an empty array expansion as unset), which broke the uncapped full-splits run. Use the defensive ${MAX_ARGS[@]+"${MAX_ARGS[@]}"} expansion so the runner works both with and without MAX_FILES set.
…ults

Full FLEURS benchmark numbers for the INT8 encoder + FP16 cache-external decoder hybrid across all 14 supported languages (9,911 samples total), measured via the per-language runner on M4 Pro, Tahoe 26.0.

Also documents:
- CohereFixedPipeline Swift API (load + transcribe)
- cohere-mixed / cohere-mixed-benchmark CLI surface
- Approximate comparison vs Cohere's Figure 4 reference (with the caveat that Cohere's numbers are averaged across FLEURS + Common Voice 17.0 + MLS + Wenet, not FLEURS-only)
- Why Japanese/Mandarin WER is meaningless (no word boundaries) and CER should be read instead
- Single-chunk (35 s) and language-hint requirements
- Add Δ column and per-language sample context
- Explain the two gap sources (dataset mix vs INT8 quantization)
- Flag the ja CER win and ko CER outlier explicitly
- Guard CohereMelSpectrogram.compute against audio shorter than nFFT/2+1 (prevents an out-of-bounds crash in reflectionPad)
- Fix CohereAsrModels.modelsExist to check the cache-external decoder that load() actually consumes, and accept either .mlmodelc or .mlpackage so the local HF cache isn't re-downloaded on every run
- Correct CohereAsrConfig.maxAudioSeconds (30s -> 35s) and maxSamples (480k -> 560k) to match the [1,128,3500] encoder input
- Switch the four Cohere library loggers to AppLogger per CLAUDE.md
- Update tests to match the new fail-safe short-audio semantics
- Fix a pre-existing Double/Float type error in testComputeWithSineWaveProducesNonZeroMel
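The corrected constants follow directly from the [1, 128, 3500] encoder input and the 160-sample hop; a quick arithmetic consistency check (Python, purely illustrative):

```python
SAMPLE_RATE = 16_000       # Hz
HOP_LENGTH = 160           # samples per mel frame -> 100 frames per second
ENCODER_MEL_FRAMES = 3_500 # from the [1, 128, 3500] encoder input shape

# 3500 frames * 160 samples / 16000 Hz = 35 s (not the old 30 s constant)
max_audio_seconds = ENCODER_MEL_FRAMES * HOP_LENGTH / SAMPLE_RATE
# 35 s * 16000 Hz = 560,000 samples (not the old 480k)
max_samples = int(max_audio_seconds * SAMPLE_RATE)
```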
- CohereMelSpectrogram: split DC and Nyquist out of vDSP packed-format bin 0 so the last bin holds the correct Nyquist power.
- CohereBenchmark: scale WER/CER from fractions to percentages to match the rest of the CLI (output was displaying "WER: 0.06%" instead of "WER: 5.63%").
- CohereTranscribeCommand: parse --language/-l so users can actually transcribe non-English audio; plumb the value through to manager.transcribe() and document it in the help text.
- CohereFixedPipeline: during the prompt-feeding phase, record the actually-consumed prompt token in the repetition-penalty history instead of the model's discarded prediction, so noRepeatNgram no longer suppresses valid output tokens based on phantom predictions.
- FluidAudioCLI: list cohere-mixed / cohere-mixed-benchmark in the command help and drop the bogus ja_jp example (raw values are 2-letter codes).
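For reference, the usual repetition-penalty rule such a history feeds (the CTRL-style divide-positive/multiply-negative form) can be sketched as follows; this shows the general technique, and is not a transcription of the Swift implementation:

```python
def apply_repetition_penalty(logits, history, penalty=1.3):
    """Discourage tokens already in the history: shrink their positive logits,
    push their negative logits further down."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

This is also why the history bug mattered: whichever token IDs sit in the history are exactly the ones that get penalised, so recording a phantom prediction instead of the consumed prompt token penalises the wrong vocabulary entries.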
Tier A: fully dead code (zero callers anywhere in the tree):
- CohereStatelessManager.swift
- CohereModelInference.swift
- CohereDecoderState.swift (only referenced by CohereModelInference)

Tier B: the "original buggy" pipeline that CohereFixedPipeline was written to replace (per its own header comment), kept in parallel until now:
- CohereAsrManager.swift
- CohereAsrModels.swift (only consumer was CohereAsrManager + its CLI)
- CohereMelSpectrogram.swift (CohereFixedPipeline has its own internal CohereFixedMelSpectrogram; only the buggy manager + dead stateless manager + tests used this one)
- CohereTranscribeCommand.swift / CohereBenchmark.swift (only entries to the buggy manager)
- CohereMelSpectrogramTests.swift / CohereTokenConversionTests.swift

Lifted CohereAsrError into CohereFixedPipeline.swift (the lone surviving consumer). Updated ModelNames.CohereTranscribe.requiredModels to point at the cache-external compiled artifacts that actually ship; removed the dangling decoderFile alias.

Trimmed cohere-transcribe and cohere-benchmark from the CLI dispatch and help text; cohere-mixed and cohere-mixed-benchmark (which run the canonical CohereFixedPipeline, the source of the published FLEURS numbers) remain.

Net: -2,584 deleted, ~28 added. swift build clean.
…ipelines are gone
The 'Fixed' in CohereFixedPipeline only made sense as a contrast with the
buggy CohereAsrManager (deleted in the previous commit), and 'Mixed' in
the CLI commands referred to mixed-precision contrasting with the
single-precision FP16 path (also gone). With the parallel pipelines
removed, the qualifiers are noise.
Renames:
- CohereFixedPipeline -> CoherePipeline
- CohereFixedMelSpectrogram -> CohereMelSpectrogram (the previous owner
of that name was deleted, so it is free)
- CohereMixedCommand -> CohereTranscribeCommand
- CohereMixedBenchmark -> CohereBenchmark
- CohereMixedBenchmarkResult -> CohereBenchmarkResult
- cohere-mixed -> cohere-transcribe (CLI command)
- cohere-mixed-benchmark -> cohere-benchmark (CLI command)
- logger category 'CohereFixedPipeline' -> 'CoherePipeline'
Scripts/run_cohere_per_lang.sh and Documentation/ASR/Cohere.md updated
to use the new command names. swift build clean.
These two scripts were PR-local helpers, not generally useful FluidAudio benchmark tooling:

- Scripts/run_cohere_per_lang.sh wraps the cohere-benchmark CLI in a per-language loop. Anyone reproducing the FLEURS table can invoke cohere-benchmark directly per the docs.
- Scripts/fetch_fleurs_from_google.py mirrors a 5-language slice of the google/fleurs dataset; the cohere-benchmark --auto-download flag already pulls the FluidInference FLEURS subset.

Also drops the two new !Scripts/ exceptions added by this PR and the dangling docs reference to run_cohere_per_lang.sh.
The CoherePipeline integration only ever exposes the INT8-encoder + FP16 cache-external-decoder hybrid. Carrying a separate cohereTranscribeCoreml (f16) Repo case alongside cohereTranscribeCoremlInt8 was dead surface: nothing in Sources/ or Tests/ references either case explicitly.

- Collapse the two enum cases into a single .cohereTranscribeCoreml pointing at FluidInference/cohere-transcribe-03-2026-coreml/q8.
- Drop the unused decoderStateful / encoderFile / decoderCacheExternalFile (.mlpackage) entries from ModelNames.CohereTranscribe; the stateful decoder pipeline was already removed in 65487ec, and the runtime loader only consumes .mlmodelc compiled artifacts.
1. CohereAsrConfig.MelSpec.nFFT was 1024, but the actual FFT used by CohereMelSpectrogram is nextPowerOfTwo(winLength=400) = 512 (CoherePipeline.swift:88). The header comment at CoherePipeline.swift:6 already states n_fft=512. Anyone using the public constant for buffer sizing or frequency-bin math would get wrong results.

2. The decoder loop was missing the first real output token from the penalty-history buffer. At step == prompt.count - 1, the previous conditional appended currentToken (the last prompt token) and then rotated nextToken (the first output token) into currentToken; on the next iteration it appended nextToken (the SECOND output) instead, so the first output never appeared in allTokens. applyRepetitionPenalty and applyNoRepeatNgram could not penalise repeats of the first output or detect n-grams beginning with it. Replace the conditional with the unified `allTokens.append(currentToken)` so we always record what was actually consumed at this step. The first output is then recorded on the iteration after it is generated, once it has rotated into currentToken.

Also update the test that asserted the wrong nFFT value.
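The nFFT derivation in point 1 is easy to check; a minimal sketch of the rounding rule:

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p <<= 1
    return p
```

next_power_of_two(400) is 512, matching the n_fft=512 header comment; the stale 1024 constant would only be right for window lengths above 512 samples.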
The Cohere transcribe benchmark previously only ran on FLEURS. Add a `--dataset librispeech|fleurs` switch (default: fleurs) and a `--subset` flag for LibriSpeech (default: test-clean).

The LibriSpeech path reuses Parakeet's `ASRBenchmark.downloadLibriSpeech` + `getLibriSpeechDirectory()` for cache layout, walks the `*.trans.txt` files under the subset directory, and routes through the same per-file inference loop as FLEURS (now extracted into a shared `transcribeFiles` helper). Cohere is single-chunk (35s max), so files exceeding the limit are skipped with a warning rather than silently failing.

Renamed the default output JSON to `cohere_benchmark_results.json` and updated `printUsage` + the summary header now that this is no longer FLEURS-only.
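LibriSpeech `*.trans.txt` files are plain text with one `<utterance-id> <TRANSCRIPT>` pair per line, so walking them amounts to splitting each line on its first space. A minimal Python sketch of the format (illustrative, not the Swift helper):

```python
def parse_trans_file(text: str) -> dict[str, str]:
    """Map utterance IDs to reference transcripts for one *.trans.txt file,
    e.g. '1089-134686-0000 HE HOPED THERE WOULD BE STEW FOR DINNER'."""
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        utt_id, _, transcript = line.partition(" ")
        out[utt_id] = transcript
    return out
```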
`String(format: "%-14s ...", swiftString)` is fatal on macOS 26: Swift's String maps to %@, the format specifier says %s (a C string), and the Foundation runtime now aborts on the mismatch. The benchmark would write its JSON output successfully and then crash in the summary print right before exit (SIGABRT, exit 139), making the run look failed even though results were good.

Replace the format-string column layout with a small `row(...)` helper plus a `String.leftPad(to:)` extension so column widths and decimal formatting stay readable without going through `%s`.
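The replacement layout, manual padding instead of printf-style `%s` columns, is simple; a Python sketch of the idea (the Swift version's `row(...)` / `leftPad(to:)` signatures are only described, not shown, in the commit):

```python
def left_pad(s: str, width: int) -> str:
    # Pad with trailing spaces so each column occupies a fixed width.
    return s + " " * max(width - len(s), 0)

def row(cols, widths):
    """Render one summary-table row without any printf format specifiers."""
    return "".join(left_pad(c, w) for c, w in zip(cols, widths))
```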
```swift
final class CohereAsrConfigTests: XCTestCase {

    // MARK: - Config Constants

    func testSampleRateIs16kHz() {
        XCTAssertEqual(CohereAsrConfig.sampleRate, 16000)
    }

    func testMaxAudioDurationIs35Seconds() {
        // Matches the encoder mel input [1, 128, 3500] (3500 * 160 / 16000 = 35s).
        XCTAssertEqual(CohereAsrConfig.maxAudioSeconds, 35.0)
    }

    func testMaxSamplesMatchesDurationAndSampleRate() {
        let expectedSamples = Int(CohereAsrConfig.maxAudioSeconds * Float(CohereAsrConfig.sampleRate))
        XCTAssertEqual(CohereAsrConfig.maxSamples, expectedSamples)
        XCTAssertEqual(CohereAsrConfig.maxSamples, 560_000)
    }

    func testVocabSizeIs16384() {
        XCTAssertEqual(CohereAsrConfig.vocabSize, 16_384)
    }

    func testMaxSeqLenIs108() {
        // KV cache capacity
        XCTAssertEqual(CohereAsrConfig.maxSeqLen, 108)
    }

    func testHeadDimMatchesDecoderDimension() {
        let expectedHeadDim = CohereAsrConfig.decoderHiddenSize / CohereAsrConfig.numDecoderHeads
        XCTAssertEqual(CohereAsrConfig.headDim, expectedHeadDim)
        XCTAssertEqual(CohereAsrConfig.headDim, 128)
    }

    // MARK: - Special Tokens

    func testSpecialTokenIdsAreInRange() {
        let vocabSize = CohereAsrConfig.vocabSize
        let tokenIds = [
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ]

        for tokenId in tokenIds {
            XCTAssertGreaterThanOrEqual(tokenId, 0, "Token ID \(tokenId) should be non-negative")
            XCTAssertLessThan(tokenId, vocabSize, "Token ID \(tokenId) should be < vocabSize (\(vocabSize))")
        }
    }

    func testSpecialTokensAreUnique() {
        let tokens = Set([
            CohereAsrConfig.SpecialTokens.unkToken,
            CohereAsrConfig.SpecialTokens.noSpeechToken,
            CohereAsrConfig.SpecialTokens.padToken,
            CohereAsrConfig.SpecialTokens.eosToken,
            CohereAsrConfig.SpecialTokens.startToken,
        ])
        XCTAssertEqual(tokens.count, 5, "Special tokens should be unique")
    }

    func testEosTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.eosToken, 3)
    }

    func testStartTokenId() {
        XCTAssertEqual(CohereAsrConfig.SpecialTokens.startToken, 4)
    }

    // MARK: - Mel Spectrogram Parameters

    func testMelSpecParametersAreValid() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.nFFT, 512)
        XCTAssertEqual(CohereAsrConfig.MelSpec.hopLength, 160)
        XCTAssertEqual(CohereAsrConfig.MelSpec.nMels, 128)
        XCTAssertEqual(CohereAsrConfig.numMelBins, 128)
    }

    func testMelSpecFrequencyRange() {
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMin, 0.0)
        XCTAssertEqual(CohereAsrConfig.MelSpec.fMax, 8000.0)
        XCTAssertLessThanOrEqual(
            CohereAsrConfig.MelSpec.fMax,
            Float(CohereAsrConfig.sampleRate) / 2.0,
            "fMax should not exceed Nyquist frequency"
        )
    }

    func testPreemphasisIsValid() {
        XCTAssertGreaterThan(CohereAsrConfig.MelSpec.preemphasis, 0.0)
        XCTAssertLessThanOrEqual(CohereAsrConfig.MelSpec.preemphasis, 1.0)
    }

    func testNFFTIsPowerOfTwo() {
        let nFFT = CohereAsrConfig.MelSpec.nFFT
        XCTAssertTrue(nFFT > 0 && (nFFT & (nFFT - 1)) == 0, "nFFT should be a power of 2")
    }

    // MARK: - Language

    func testLanguageRawValuesAreIsoCodes() {
        XCTAssertEqual(CohereAsrConfig.Language.english.rawValue, "en")
        XCTAssertEqual(CohereAsrConfig.Language.french.rawValue, "fr")
        XCTAssertEqual(CohereAsrConfig.Language.german.rawValue, "de")
        XCTAssertEqual(CohereAsrConfig.Language.spanish.rawValue, "es")
        XCTAssertEqual(CohereAsrConfig.Language.italian.rawValue, "it")
        XCTAssertEqual(CohereAsrConfig.Language.portuguese.rawValue, "pt")
        XCTAssertEqual(CohereAsrConfig.Language.dutch.rawValue, "nl")
        XCTAssertEqual(CohereAsrConfig.Language.polish.rawValue, "pl")
        XCTAssertEqual(CohereAsrConfig.Language.greek.rawValue, "el")
        XCTAssertEqual(CohereAsrConfig.Language.arabic.rawValue, "ar")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.rawValue, "ja")
        XCTAssertEqual(CohereAsrConfig.Language.chinese.rawValue, "zh")
        XCTAssertEqual(CohereAsrConfig.Language.vietnamese.rawValue, "vi")
        XCTAssertEqual(CohereAsrConfig.Language.korean.rawValue, "ko")
    }

    func testAllLanguagesHaveEnglishNames() {
        for language in CohereAsrConfig.Language.allCases {
            XCTAssertFalse(language.englishName.isEmpty, "\(language) should have a non-empty English name")
        }
    }

    func testLanguageCount() {
        XCTAssertEqual(CohereAsrConfig.Language.allCases.count, 14, "Cohere supports 14 languages")
    }

    func testEnglishNameExamples() {
        XCTAssertEqual(CohereAsrConfig.Language.english.englishName, "English")
        XCTAssertEqual(CohereAsrConfig.Language.french.englishName, "French")
        XCTAssertEqual(CohereAsrConfig.Language.japanese.englishName, "Japanese")
    }

    // MARK: - Model Architecture

    func testEncoderParameters() {
        XCTAssertEqual(CohereAsrConfig.encoderHiddenSize, 1280)
        XCTAssertEqual(CohereAsrConfig.numEncoderLayers, 48)
    }

    func testDecoderParameters() {
        XCTAssertEqual(CohereAsrConfig.decoderHiddenSize, 1024)
        XCTAssertEqual(CohereAsrConfig.numDecoderLayers, 8)
        XCTAssertEqual(CohereAsrConfig.numDecoderHeads, 8)
    }
}
```
🟡 Missing unit tests for CoherePipeline and CohereMelSpectrogram utility functions
AGENTS.md mandates "Add unit tests when writing new code." The PR adds ~800 lines of pipeline logic in CoherePipeline.swift containing many testable pure functions (applyRepetitionPenalty, applyNoRepeatNgram, argmax, convertTokensToText, parseByteFallback, encoderValidFrames, buildCrossAttentionMask, copyLogitsFloat32, zeroFill) and CohereMelSpectrogram (compute, validFrameCount, padOrTruncate, slaneyMelFilter), but only provides config constant tests in CohereAsrConfigTests.swift. Other ASR modules in the repo test their utility functions (e.g., AudioMelSpectrogramTests, TdtDecoderTests, Qwen3RoPETests).
Prompt for agents
AGENTS.md requires unit tests for new code. The PR adds CoherePipeline with many testable pure static functions but only tests CohereAsrConfig constants. Add a new test file Tests/FluidAudioTests/ASR/Cohere/CoherePipelineTests.swift with tests for at minimum: applyRepetitionPenalty (penalty > 1 reduces positive logits, amplifies negative), applyNoRepeatNgram (forbids completing seen n-grams), argmax (returns index of max value), convertTokensToText (handles byte-fallback tokens, special token filtering, Unicode replacement character), parseByteFallback (parses <0xHH> patterns, rejects invalid), encoderValidFrames (ceiling division, clamping), and padOrTruncate (truncation, padding, passthrough). These are all static/class methods on CoherePipeline and CohereMelSpectrogram that can be tested without loading CoreML models.
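As an illustration of one of the functions the review names: SentencePiece byte-fallback pieces have the fixed form `<0xHH>`, and runs of them decode as UTF-8 (which is how CJK output becomes real characters). A Python sketch of the expected behavior; the function names and signatures here are assumptions, not copied from the PR:

```python
def parse_byte_fallback(token: str):
    """Return the byte value for a '<0xHH>' piece, or None for anything else."""
    if len(token) == 6 and token.startswith("<0x") and token.endswith(">"):
        try:
            return int(token[3:5], 16)
        except ValueError:
            return None  # e.g. '<0xZZ>' is not a valid hex byte
    return None

def decode_byte_run(tokens):
    """Decode a run of byte-fallback pieces as UTF-8; invalid sequences fall
    back to the Unicode replacement character."""
    raw = bytes(b for b in map(parse_byte_fallback, tokens) if b is not None)
    return raw.decode("utf-8", errors="replace")
```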
Long LibriSpeech runs (2,620 files in test-clean) take 10+ hours of wall time, and a crash near the end used to lose everything because saveResults only ran on success. Add --checkpoint-every <n> (default 100) so the running results array is persisted to the output JSON after every N successful transcriptions; on a crash, the user keeps the last multiple-of-N results.
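The checkpoint rule is simple enough to sketch; this Python version mirrors the described behavior (persist every N successes so a crash keeps the last multiple of N), with hypothetical names:

```python
import json

def maybe_checkpoint(results: list, path: str, checkpoint_every: int = 100) -> bool:
    """Write the running results to disk every `checkpoint_every` successes."""
    if results and len(results) % checkpoint_every == 0:
        with open(path, "w") as f:
            json.dump(results, f)
        return True
    return False
```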
The 35-second per-call limit traces directly to the upstream cohere-pytorch config (max_audio_clip_s: 35, 100 fps × 35 = 3500 mel frames). Document the provenance plus the upstream's overlap_chunk_second: 5 chunking strategy that we don't yet wrap in Swift, so users understand why long-form audio is skipped instead of stitched.
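The upstream chunking strategy this port doesn't yet wrap amounts to successive 35 s windows advancing by window minus overlap (30 s). A Python sketch of those config values in action; this illustrates the upstream scheme, not shipped Swift code:

```python
def chunk_bounds(total_s: float, window_s: float = 35.0, overlap_s: float = 5.0):
    """(start, end) times for overlapped chunks covering `total_s` seconds,
    per the upstream max_audio_clip_s / overlap_chunk_second values."""
    step = window_s - overlap_s  # 30 s advance per 35 s window
    start, out = 0.0, []
    while True:
        out.append((start, min(start + window_s, total_s)))
        if start + window_s >= total_s:
            return out
        start += step
```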
….72x) Full 2620-utterance run on M4 Pro: WER 1.77%, CER 0.60%, total RTFx 1.72x. Competitive with Parakeet TDT 0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%) while running ~1.7x faster than real time.
- Move the duplicated `generateInlineDiff` edit-distance diff renderer (byte-identical copies in AsrBenchmark.swift and FleursBenchmark.swift, ~110 lines each) to a shared `InlineDiff.generate(reference:hypothesis:)` utility in `FluidAudioCLI/Utils/InlineDiff.swift`.
- Fix the Devin Review finding in CohereBenchmark.swift line 263: when `--max-files` was omitted for FLEURS auto-download, `samplesPerLanguage: maxFiles ?? 100` silently capped each language at 100 samples. Switch to `Int.max`, which is FLEURSBenchmark's documented sentinel for "download all available".
- CohereTranscribeCommand.swift:162: guard the RTFx division against zero-duration totalSeconds with `max(_, 1e-9)`, matching the convention already used in CohereBenchmark.swift:463.
- CoherePipeline.swift:596-599: flatten a nested if in the decoder loop to comply with the AGENTS.md "avoid nested ifs" rule.

Both are no-ops on the correct-input happy path.
Summary
Adds Cohere Transcribe ASR for 14 languages, shipped as an INT8 encoder + FP16 cache-external decoder hybrid (CoherePipeline). One CLI for single-file transcription, one CLI for dataset benchmarking (FLEURS and LibriSpeech).
Languages
English, French, German, Spanish, Italian, Portuguese, Dutch, Polish,
Greek, Arabic, Japanese, Chinese (Simplified), Korean, Vietnamese.
What's added
Library (`Sources/FluidAudio/ASR/Cohere/`)

- `CoherePipeline`: encoder + cache-external decoder runner. Allocates the K/V cache host-side (no CoreML State API; iOS 17+), applies the additive cross-attention mask, and detokenizes via SentencePiece byte fallback so CJK comes out as real characters. Accepts separate `encoderDir`/`decoderDir` to support the q8/f16 split.
- `CohereAsrConfig`: per-language prompt sequences and token IDs; shared 35 s / 3500-frame audio window and 108-token decoder cache window constants. The 35 s cap traces directly to upstream `max_audio_clip_s: 35`.
- `CohereMelSpectrogram`: 128-mel front-end matching the reference model (preemph, Slaney mel, CMVN).
CLI (`Sources/FluidAudioCLI/Commands/ASR/Cohere/`)

- `fluidaudiocli cohere-transcribe <audio> --language <lang>`: single-file transcription. Accepts either `--model-dir` (a single dir with both encoder and decoder) or `--encoder-dir` + `--decoder-dir` for the q8/f16 split.
- `fluidaudiocli cohere-benchmark`: dataset benchmark with `--dataset fleurs|librispeech`, `--subset` for LibriSpeech splits, `--languages` for FLEURS codes, `--auto-download`, and `--checkpoint-every N` (default 100) so long runs persist partial results and survive mid-run crashes.
ModelNames.swift

- `Repo.cohereTranscribeCoreml` → `FluidInference/cohere-transcribe-03-2026-coreml/q8`.
- `ModelNames.CohereTranscribe` enum with `encoder`, `decoderCacheExternal`, `vocab` and the corresponding `.mlmodelc` paths.

Documentation

- `Documentation/ASR/Cohere.md`: architecture, API, CLI, LibriSpeech + FLEURS results, upstream config provenance (`max_audio_clip_s`, `overlap_chunk_second`), comparison vs Cohere's Figure 4 reference numbers, caveats.
FLEURS coverage

- Extends `FleursBenchmark.supportedLanguages` with the 6 non-European Cohere languages (`pt_br`, `ar_eg`, `ja_jp`, `cmn_hans_cn`, `ko_kr`, `vi_vn`).

LibriSpeech test-clean (Apple M2 2022, Tahoe 26.0)
Full split, all 2,620 utterances, single-chunk.
5h 24m audio processed in 3h 09m compute (3h 12m wall time including
one-time ~6 min ANE cold-start compile). Competitive with Parakeet TDT
0.6B v3 (~1.7%) and Whisper large-v3 (~1.8%).
FLEURS results (full splits, single-chunk)
M4 Pro / Tahoe 26.0, 9,911 samples total.
†Japanese and Mandarin are written without word boundaries, so WER on the
raw hypothesis is a tokenization artifact — CER is the real accuracy
metric. Cohere's own Figure 4 uses CER for zh/ja/ko for the same reason.
Usage
HuggingFace

https://huggingface.co/FluidInference/cohere-transcribe-03-2026-coreml (subdir `q8/`)

Notes
- The 35 s single-chunk cap comes from the upstream config (`max_audio_clip_s: 35` in `cohere-pytorch/config.json`). Upstream Python also supports >35 s via 5 s-overlap chunking (`overlap_chunk_second: 5`); this port does not implement that wrapper yet and skips longer utterances with a warning.
regresses quality significantly in testing and is not shipped.
Test plan