
feat(cohere-transcribe): CoreML export + host-side pipeline fix + q8 hybrid #41

Open
Alex-Wengg wants to merge 21 commits into main from docs/cohere-transcribe-coreml-decoder-fix

Conversation


@Alex-Wengg Alex-Wengg commented Apr 6, 2026

Summary

CoreML conversion pipeline for Cohere Transcribe 03-2026 (14-language encoder-decoder ASR), plus the host-side preprocessing and decode fixes required for the shipped weights to actually produce correct output.

Scope is intentionally limited to models/stt/cohere-transcribe-03-2026/coreml/ — no changes to VAD, Qwen3, or other model directories.

What's in here

Exports (exports/)

  • export-encoder.py — 48-layer Conformer encoder, fixed mel input [1, 128, 3500], output [1, 438, 1024]. Uses the correct length masking parameter so padded frames are ignored internally.
  • export-decoder-cache-external.py — 8-layer decoder with Parakeet-style external KV cache (108-token window). This is the shipped decoder path; the stateful-decoder variant was dropped after the cache-external version landed.

Tools (tools/)

  • cohere_features_v2.py — canonical numpy port of FilterbankFeatures (see host-side fixes below).
  • quantize_to_int8.py — W8A16 quantization for the encoder.
  • compile_encoder_to_mlmodelc.py — .mlpackage → .mlmodelc compile step with ANE targeting.
  • download-fleurs-for-swift.py — FLEURS slice downloader used by the Swift benchmark.

Host-side pipeline fixes (the important part)

The shipped cohere_mel_spectrogram.py did not match processing_cohere_asr.py::FilterbankFeatures on any parameter that matters: wrong n_fft, wrong window, wrong mel normalization, wrong log base, and no per-feature CMVN at all. Without CMVN, every utterance's features drift by tens of dB per bin, the encoder is fed out-of-distribution data, and the decoder emits whatever language cluster happened to be nearest — Arabic for French, Polish for Chinese, etc.
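The missing CMVN is the dominant mismatch. As a minimal numpy sketch of the per-feature normalization (utterance-level, ddof=1 and ε=1e-5 per the fix list below; `apply_cmvn` is an illustrative name, not the shipped code):

```python
import numpy as np

def apply_cmvn(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-feature (per-mel-bin) CMVN over one utterance.

    features: [n_mels, n_frames] log-mel features.
    Each bin is normalized over time with an unbiased std (ddof=1),
    pinning every utterance to the distribution the encoder was
    trained on instead of letting absolute levels drift per clip.
    """
    mean = features.mean(axis=1, keepdims=True)        # [n_mels, 1]
    std = features.std(axis=1, ddof=1, keepdims=True)  # unbiased estimator
    return (features - mean) / (std + eps)
```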

Four host-only fixes (no retraining, same weights) make the failures disappear:

  1. tools/cohere_features_v2.py — faithful numpy port of FilterbankFeatures: n_fft=512, Hann(400) zero-padded to 512, preemph=0.97, Slaney mel, natural log + 2^-24 guard, per-feature CMVN ddof=1 ε=1e-5, mag_power=2.0. Verified against AutoFeatureExtractor.from_pretrained(..., trust_remote_code=True) on real samples × 4 languages via tests/test-feature-parity.py — residual is within HF's own dither variance.
  2. Cross-attention masking — the encoder always emits 438 frames but only ceil(feature_length * 438/3500) correspond to real audio. Padded frames are masked with -1e4.
  3. Repetition penalty + no-repeat-ngram in greedy decode (defaults 1.1 and 3).
  4. SentencePiece byte-fallback detok — CJK characters emit as <0xHH> triples; tokens_to_text buffers and flushes via bytes(...).decode("utf-8", errors="replace").
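The byte-fallback buffering in fix 4 can be sketched as follows (illustrative helper, not the shipped tokens_to_text; assumes SentencePiece-style `<0xHH>` pieces and the `▁` word marker):

```python
import re

_BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def detokenize(pieces):
    """Join SentencePiece pieces, buffering <0xHH> byte-fallback tokens
    so multi-byte UTF-8 characters (e.g. CJK) decode as one unit
    instead of per-token replacement characters."""
    out, byte_buf = [], bytearray()
    for piece in pieces:
        m = _BYTE_TOKEN.match(piece)
        if m:
            byte_buf.append(int(m.group(1), 16))
            continue
        if byte_buf:  # flush buffered bytes before a normal piece
            out.append(bytes(byte_buf).decode("utf-8", errors="replace"))
            byte_buf.clear()
        out.append(piece.replace("\u2581", " "))  # SentencePiece word marker
    if byte_buf:  # flush a trailing byte run
        out.append(bytes(byte_buf).decode("utf-8", errors="replace"))
    return "".join(out)
```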

FLEURS impact (3 samples × 4 languages, same CoreML model files)

| Language | Metric | OLD | NEW | Δ |
| --- | --- | --- | --- | --- |
| en_us | WER | 55.3% | 10.6% | −44.6 pp |
| es_419 | WER | 11.3% | 4.9% | −6.4 pp |
| fr_fr | WER | 92.1% | 16.8% | −75.2 pp |
| cmn_hans_cn | CER | 261.7% | 14.1% | −247.6 pp |

Q8 findings

The shipped q8 decoder has a failure mode orthogonal to the host-side bugs: over-generation. It produces a correct transcript, then keeps generating past the real EOS. Instrumented logging shows EOS consistently at rank 1 or 2 with a ~2-logit gap to the winner — textbook weight-only INT8 error on the final classifier.
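The rank/gap measurement described above can be sketched like this (illustrative, not tests/probe-q8-eos.py itself; assumes raw logits for one decode step and EOS id 3):

```python
import numpy as np

def eos_rank_and_gap(logits: np.ndarray, eos_id: int = 3):
    """For one decode step's logits [vocab], return the EOS token's
    rank (0 = it would be picked by argmax) and its logit gap to the
    winning token. A small gap at the true utterance boundary means
    quantization noise can push EOS below the winner."""
    order = np.argsort(-logits)                      # descending token order
    rank = int(np.where(order == eos_id)[0][0])
    gap = float(logits[order[0]] - logits[eos_id])
    return rank, gap
```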

Re-quantization experiments (keeping the decoder lm_head at FP16 via tied-const handling, per-tensor variants, threshold skipping) never matched FP16; the q8 quality loss is distributed across many layers, not localized to lm_head. A proper fix would need calibration-aware quantization with end-of-utterance frames, or mixed precision keeping the attention output projections at FP16 — both outside the coremltools.optimize.coreml op-level API.

Recommendation adopted downstream: ship INT8 encoder (W8A16) + FP16 cache-external decoder as the hybrid default. The encoder INT8 is lossless (±0.5% WER) and the FP16 decoder sidesteps the EOS regression entirely. See docs/HOST_SIDE_FIXES.md and docs/Q8_EOS_BIAS.md for full detail.

Downstream validation (FluidAudio#487)

Swift integration with the INT8 encoder + FP16 cache-external decoder is live in FluidAudio#487. Benchmarks using the exports and feature extractor from this PR:

LibriSpeech test-clean (full split, Apple M2 2022)

| Samples | WER | CER | RTFx (per-file mean) | RTFx (total) |
| --- | --- | --- | --- | --- |
| 2,620 | 1.77% | 0.60% | 2.04× | 1.72× |

FLEURS (full splits, 14 languages, M4 Pro)

| Code | Lang | Samples | WER | CER | RTFx |
| --- | --- | --- | --- | --- | --- |
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LatAm) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | 4.03% | 2.04% | 3.15× |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin (Simp) | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†ja/zh are written without word boundaries, so WER is a tokenization artifact; CER is the real accuracy metric for those languages.
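To see why, a minimal WER/CER sketch (plain Levenshtein over words vs. characters, not the benchmark's actual scorer): an unsegmented ja/zh sentence is one "word" to `str.split()`, so a single character error scores 100% WER while CER stays proportional.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over arbitrary sequences (words or chars)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over whitespace-split tokens."""
    ref_w = ref.split()
    return edit_distance(ref_w, hyp.split()) / max(len(ref_w), 1)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters, spaces dropped."""
    ref_c = ref.replace(" ", "")
    return edit_distance(ref_c, hyp.replace(" ", "")) / max(len(ref_c), 1)
```

With these definitions, a five-character Mandarin reference with one wrong character scores WER 1.0 but CER 0.2.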

Comparison vs. Cohere's technical-report Figure 4 (FLEURS+CV17+MLS+Wenet avg, FP16 PyTorch): ours lands ~1–3 pp higher on most languages, consistent with (a) FLEURS-only being harder than a 4-corpus average and (b) the INT8 encoder. Japanese CER is actually below Cohere's reported number.

Docs landed in this PR

  • coreml/README.md — end-to-end CoreML conversion walkthrough.
  • docs/HOST_SIDE_FIXES.md — the four host-side fixes and FLEURS A/B tables.
  • docs/Q8_EOS_BIAS.md — EOS-bias analysis and re-quantization experiments.
  • docs/CACHE_EXTERNAL_ANALYSIS.md, CACHE_EXTERNAL_DELIVERED.md, CACHE_INVESTIGATION_SUMMARY.md — cache-external decoder path rationale and delivery notes.
  • docs/COHERE_ARCHITECTURE_ANALYSIS.md — upstream model walkthrough and CoreML mapping.

Test plan

  • Encoder export with correct length masking
  • Cache-external decoder export (external KV cache, 108-token window)
  • INT8 W8A16 encoder quantization
  • cohere_features_v2.py parity vs HF AutoFeatureExtractor (tests/test-feature-parity.py)
  • FLEURS 4-language A/B (broken vs fixed host pipeline)
  • HuggingFace uploads (f16 and q8 hybrid)
  • Swift integration landed in FluidAudio#487
  • LibriSpeech test-clean full benchmark (2,620 samples, WER 1.77%)
  • Full 14-language FLEURS benchmark

Known limitations

  • Fully-q8 decoder is unusable without a runtime EOS bias or re-quantization with EOU calibration; ship INT8-encoder + FP16-decoder hybrid instead.
  • INT4 encoder rejected (293% avg WER).
  • Swift CoreML pipeline is single-chunk (≤35 s per call); >35 s audio requires the upstream 5-s-overlap sliding window, not yet ported.
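A hedged sketch of what that sliding window would look like (window/overlap values from the limitation above; `chunk_schedule` is a hypothetical helper, not ported code):

```python
def chunk_schedule(n_samples: int, sr: int = 16000,
                   window_s: float = 35.0, overlap_s: float = 5.0):
    """Return (start, end) sample ranges covering the audio with
    35 s windows overlapping by 5 s, so each chunk fits the
    single-call <=35 s CoreML pipeline."""
    win = int(window_s * sr)
    hop = int((window_s - overlap_s) * sr)  # 30 s stride between windows
    chunks, start = [], 0
    while True:
        end = min(start + win, n_samples)
        chunks.append((start, end))
        if end >= n_samples:
            break
        start += hop
    return chunks
```

Merging the overlapping transcripts (the upstream approach) is the part that still needs porting; the schedule itself is trivial.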

@Alex-Wengg Alex-Wengg changed the title from "fix(cohere): Implement stateless decoder to fix cache repetition bug" to "feat(cohere): Add Cohere Transcribe CoreML conversion pipeline with stateful decoder" on Apr 6, 2026
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 3 critical correctness issues identified in PR #41 reviews:

1. **Language Token IDs Completely Broken** (f16/example_inference.py, q8/example_inference.py):
   - Fix LANGUAGE_PROMPTS dictionary with correct language token IDs
   - Position 4-5: Use correct language tokens (e.g., 169 for Spanish, not hardcoded 62)
   - Position 9: Use 13 (<|nodiarize|>) for all languages, not 14-26
   - Language tokens from vocab.json: en=62, es=169, fr=69, de=76, it=97, pt=149, pl=148, nl=60, sv=173, tr=186, ru=155, zh=50, ja=98, ko=110
   - Impact: Non-English transcription was silently producing English output

2. **Encoder Parameter Name Typo** (exports/export-encoder.py, export-encoder-ios18.py):
   - Fix encoder call from `lengths=feature_length` to `length=feature_length`
   - Since encoder accepts **kwargs, the typo was silently ignored
   - Impact: Feature length masking was never applied, causing incorrect attention for shorter audio

3. **pyproject.toml Name Field** (pyproject.toml):
   - Fix copy-paste error: "parakeet-coreml" → "cohere-transcribe-coreml"
   - Update description to match project purpose
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 3 test-related issues identified in PR #41 reviews:

1. **Wrong EOS Token Fallback** (tests/benchmark-models.py:46):
   - Fix fallback EOS token: 2 (PAD) → 3 (actual EOS)
   - Impact: Decoder will stop at correct token when processor unavailable

2. **Mel Padding Frame Mismatch** (tests/*.py):
   - Fix padding: 3001 frames → 3500 frames (35-second window)
   - Files: benchmark-models.py, compare-models.py, measure-memory.py
   - Impact: Prevents dimension mismatches and crashes on longer audio

3. **Operator Precedence Bug** (tests/compare-models.py:164, 166):
   - Add parentheses to fix condition parsing
   - Before: `len(...) == 4 and 'cache_k' in key or key == 'new_cache_k'`
   - After: `len(...) == 4 and ('cache_k' in key or key == 'new_cache_k')`
   - Impact: Cache assignments now correctly check tensor dimensions
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 2 decoder-related issues identified in PR #41 reviews:

1. **Stateful Decoder Missing log_softmax** (exports/export-decoder-stateful.py:148):
   - Add torch.log_softmax() after lm_head projection
   - Before: Returned raw logits from Linear layer
   - After: Returns log-probabilities
   - Impact: Beam search and probability-based decoding now work correctly
   - Greedy decoding unaffected (argmax works on both logits and log-probs)

2. **Multi-Step Validation Feeds Same Token** (exports/export-decoder-stateful.py:407-414):
   - Fix autoregressive validation loop to feed predicted tokens
   - Before: Fed start token (4) at every step
   - After: Feeds previous step's predicted token (current_token = next_token)
   - Impact: Validation can now detect autoregressive generation bugs
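The corrected validation loop from item 2 can be sketched as (illustrative; `step_fn` stands in for one decoder invocation returning the next token id):

```python
def greedy_validate(step_fn, start_token: int, eos_token: int,
                    max_steps: int = 32):
    """Autoregressive validation loop: each step feeds the PREVIOUS
    step's prediction forward, so cache/state bugs that only appear
    under real generation are actually exercised (feeding the start
    token at every step, the pre-fix behavior, hides them)."""
    current = start_token
    tokens = []
    for _ in range(max_steps):
        nxt = step_fn(current)
        if nxt == eos_token:
            break
        tokens.append(nxt)
        current = nxt  # feed the prediction, not the start token
    return tokens
```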
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes issue identified in PR #41 reviews:

- Remove uv.lock from .gitignore
- Commit uv.lock to ensure reproducible dependency versions
- Compliance with AGENTS.md requirement for self-contained directories

Impact: Contributors now get consistent dependency versions across environments
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere): Add Cohere Transcribe CoreML conversion pipeline with stateful decoder" to "feat(cohere): Add Cohere Transcribe CoreML conversion with critical fixes" on Apr 7, 2026
@Alex-Wengg Alex-Wengg marked this pull request as draft April 8, 2026 19:15
@BrandonWeng BrandonWeng requested review from BrandonWeng and removed request for BrandonWeng April 12, 2026 15:29
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere): Add Cohere Transcribe CoreML conversion with critical fixes" to "fix(cohere): correct host-side mel features + CJK detokenization (resolves 71% FLEURS failure)" on Apr 21, 2026
@Alex-Wengg Alex-Wengg changed the title from "fix(cohere): correct host-side mel features + CJK detokenization (resolves 71% FLEURS failure)" to "feat(cohere-transcribe): CoreML conversion + host-side pipeline fix + q8 findings" on Apr 22, 2026
Add the export + quantization pipeline for the Cohere Command-A Transcribe
03-2026 model, targeting on-device CoreML inference:

  - exports/: stateful / stateless / encoder export scripts (the
    stateful decoder is the one shipped to HF).
  - tools/: compile_encoder_to_mlmodelc.py, compile_q8_to_mlmodelc.py,
    quantize_to_int8.py for the INT8 weight-quantization pass.
  - quantize_encoder_to_int4.py for the INT4 encoder experiment.
  - Root export-*.py scripts covering earlier cache-external and
    per-language decoder iterations (kept for reference).
  - hf-upload/ bundles + f16/q8 package templates are prepared for upload
    to FluidInference/cohere-transcribe-03-2026-coreml on HuggingFace
    (README, example, requirements, tokenizer.model).
  - cohere-pytorch/ vendors the HF pytorch reference (modeling,
    configuration, processing, tokenizer) for reproducibility and as the
    export source of truth.
  - docs/ and top-level .md files document architecture analysis, cache
    strategies, research notes, and known implementation issues.
  - uv.lock / pyproject.toml pin the Python toolchain used for export.

This commit is pipeline + artifacts only; the host-side fix and q8
investigation land in the two follow-up commits.
…oss-attn masking

The shipped hf-upload example code had three host-side bugs that caused
drastically degraded transcription even when the CoreML weights were
correct. This commit adds the fixed reference implementation (f16/ and
q8/ packages), the parity tools that prove the diagnosis, and the
benchmarks that quantify the impact.

What was broken:

  1. Mel spectrogram preprocessing differed from HF CohereAsrFeatureExtractor
     (different window, fft size rounding, normalization). Every frame
     was slightly off, which compounded across the encoder.
  2. cross_attention_mask was set to all-ones instead of masking the
     padded encoder frames, letting the decoder attend to zeros.
  3. CJK languages were detokenized token-by-token, dropping the
     SentencePiece byte-fallback triples (<0xE4><0xB8><0xAD>) that
     encode Chinese/Japanese/Korean characters, producing gibberish.

Fixes, all host-side (no weight re-export required):

  - tools/cohere_features_v2.py: numpy port of CohereAsrFeatureExtractor
    with bit-close parity to the HF implementation.
  - f16/example_inference.py, f16/quickstart.py, q8/example_inference.py,
    q8/quickstart.py: use cohere_features_v2, set cross_attention_mask
    from encoder lengths, buffer byte-fallback tokens for CJK.
  - f16/ and q8/ packages (vocab, requirements, pyproject, README) are
    the canonical uploads for HF.
  - tests/test-feature-parity.py: parity test vs HF extractor.
  - tests/diagnose-feature-diff.py: per-frame diff tool.
  - tests/bench-fix-vs-broken.py: f16 FLEURS broken-vs-fixed comparison.
  - tests/benchmark-librispeech.py, tests/benchmark-cjk-cer.py: canonical
    FLEURS + LibriSpeech benchmarks (with --normalize + CER for CJK).

Impact (from docs/FP16_VS_INT8_FLEURS_COMPARISON.md): the FLEURS WER /
CER numbers move from near-unusable to close to PyTorch reference across
all 14 tested languages on f16. Full impact table in the PR body.
After the host-side fix (prior commit), the HF-shipped q8 stateful
decoder still over-generates on short utterances (EN/FR/ES), running
past the true sentence boundary. The root cause is INT8 weight
quantization noise on the EOS logit: at the true end-of-utterance the
EOS token sits at rank 1-2 with only a ~2 logit gap to the top token,
well within the quantization noise budget.

Diagnostic + fix in three scripts:

  - tests/probe-q8-eos.py: instrument the q8 stateful decoder and dump
    per-step top-5 tokens, EOS (id=3) logit, EOS rank, and the
    cumulative hypothesis. Confirms EOS is rank 1-2 with 2-3 unit gap
    at true boundary.
  - tests/bench-q8-fleurs.py: FLEURS benchmark of the HF-shipped q8
    stateful decoder on the 3-samples-per-language slice used by
    bench-fix-vs-broken. Establishes the q8 baseline.
  - tests/bench-q8-eosboost.py: run fixed q8 pipeline with EOS logit
    boosted by 0.0 / +2.0 / +4.0 on EN/FR/ES slices. +4.0 recovers most
    of the f16 quality without retraining or re-quantization.

No new export / re-quantization was needed; this is a pure inference-side
patch (logit_bias_eos=+4.0) that host code can apply when shipping the
q8 weights downloaded from
FluidInference/cohere-transcribe-03-2026-coreml.
@Alex-Wengg Alex-Wengg force-pushed the docs/cohere-transcribe-coreml-decoder-fix branch from b401305 to 8013c05 on April 22, 2026 22:12
…e-shot scripts

Promote the findings previously embedded in one-shot experimental
scripts to proper docs, and remove the scripts now that the findings are
captured:

  - docs/HOST_SIDE_FIXES.md documents the three host-side inference bugs
    (mel spectrogram drift vs HF extractor, all-ones cross_attention_mask,
    token-by-token CJK detokenization), with exact fix locations (python
    and Swift) and a reproduction recipe.
  - docs/Q8_EOS_BIAS.md documents the q8 over-generation diagnosis
    (EOS at rank 1-2, ~2-3 logit gap), the +4.0 EOS-bias fix, and the
    independence of this fix from the host-side fixes.
  - docs/FP16_VS_INT8_FLEURS_COMPARISON.md is flagged as historical —
    its 200-500% WER numbers predate the host-side fixes and reflect
    the bugs, not the weights.

Removed scripts:

  - tests/bench-fix-vs-broken.py (368 lines) — broken-vs-fixed demo on
    the cache-external decoder (not the shipped stateful architecture).
    Finding: the three host-side bugs are the dominant error source.
    Captured in HOST_SIDE_FIXES.md.
  - tests/bench-q8-eosboost.py (244 lines) — +0/+2/+4 bias sweep on
    EN/FR/ES/zh. Captured in Q8_EOS_BIAS.md.
  - tests/probe-q8-eos.py (172 lines) — per-step top-5 logit dump.
    Captured in Q8_EOS_BIAS.md.
  - tests/diagnose-feature-diff.py (63 lines) — per-frame diff tool;
    superseded by tests/test-feature-parity.py which covers this.

Remaining runnable tests (4):

  - tests/benchmark-librispeech.py — parameterized LibriSpeech / FLEURS
    benchmark for any precision.
  - tests/benchmark-cjk-cer.py — CJK CER benchmark.
  - tests/bench-q8-fleurs.py — FLEURS bench against the HF-shipped q8
    stateful decoder.
  - tests/test-feature-parity.py — regression test for
    tools/cohere_features_v2.py vs HF CohereAsrFeatureExtractor.
The f16/ and q8/ directories were staging copies of the HF upload
bundle that ships alongside the CoreML weights in
FluidInference/cohere-transcribe-03-2026-coreml. They duplicated
content now available canonically from HF:

- vocab.json (identical)
- cohere_mel_spectrogram.py (byte-identical to tools/cohere_features_v2.py)
- example_inference.py / quickstart.py / pyproject.toml / requirements.txt
  (identical between f16/ and q8/)

Net effect: ~35K lines removed, zero duplication, one canonical
feature-extractor source (tools/cohere_features_v2.py).

Updates in-repo tests to source the feature extractor from tools/ and
accept a --models-dir flag so local reproduction works by pointing at
a HF-downloaded snapshot:

    huggingface-cli download FluidInference/cohere-transcribe-03-2026-coreml \
        --local-dir ./cohere-models
    python tests/benchmark-librispeech.py --precision fp16 \
        --models-dir ./cohere-models/f16

tools/quantize_to_int8.py and tools/compile_*.py still reference f16/
and q8/ as local working dirs; those are populated on demand by the
quantization pipeline and are not tracked in git.

Also drops the OLD-vs-NEW comparison block in test-feature-parity.py
since the former "old broken" extractor (cohere_mel_spectrogram.py)
has been replaced with the fixed v2 port.
Strips superseded experiments, vendored HF source, transient progress
notes, and abandoned variant staging. PR diff drops from 78 -> 21 files
and from ~135K -> ~9K added lines.

Removed:
- cohere-pytorch/ (19 files, ~115K lines): vendored clone of
  CohereLabs/cohere-transcribe-03-2026. Reproducibility convention is to
  reference HF, not vendor. test-feature-parity.py now downloads the
  reference on demand and accepts --pytorch-dir for offline use.
- 14 root *_STATUS.md / *_COMPLETE.md / *_FAILURE.md (~3K lines):
  transient progress narratives whose durable findings are already in
  docs/HOST_SIDE_FIXES.md and docs/Q8_EOS_BIAS.md. Some pairs (e.g.
  MLMODELC_LIMITATION + MLMODELC_VERIFIED) directly contradicted each
  other.
- 8 root export-*.py / quantize_encoder_to_int4.py (~2K lines):
  superseded by exports/{export-encoder,export-decoder-stateful,
  export-decoder-stateless}.py + tools/quantize_to_int8.py.
- hf-upload/ (8 files): staging dir for the abandoned cache-external
  decoder variant.
- 3 root benchmark JSON caches (~850 lines): captured outputs from
  pre-fix runs.
- 7 docs/ pre-fix investigation notes (~2.4K lines):
  CACHE_INVESTIGATION_SUMMARY, DECODER_CACHE_FIX, OFFICIAL_USAGE_ANALYSIS,
  QWEN3_VS_COHERE_STATEFUL_CACHE, RESEARCH_INSIGHTS, REVERSE_ENGINEERING,
  FP16_VS_INT8_FLEURS_COMPARISON. The host-side fixes invalidated the
  WER/CER measurements these were built on; superseded by HOST_SIDE_FIXES.

Kept (canonical surface):
- README.md
- docs/{HOST_SIDE_FIXES,Q8_EOS_BIAS,STATELESS_VS_STATEFUL,
  COHERE_ARCHITECTURE_ANALYSIS}.md
- tools/{cohere_features_v2,quantize_to_int8,compile_*}.py
- exports/{export-encoder,export-decoder-stateful,export-decoder-stateless}.py
- tests/{benchmark-librispeech,benchmark-cjk-cer,bench-q8-fleurs,
  test-feature-parity}.py
- download-fleurs-for-swift.py, pyproject.toml, uv.lock, .gitignore

Also updates .gitignore to prevent regressions: ignores f16/, q8/,
benchmark_*.json, *_fleurs_*.json, *_cache_external*.json, and the
external cohere-pytorch/ snapshot path.
Earlier cleanup (f2cdf4e) deleted the cache-external decoder export and
its supporting docs / hf-upload bundle on the assumption it was an
abandoned variant. That was wrong: FluidAudio PR 487 ships
cache-external as the canonical decoder (HF repos
`cohere-transcribe-cache-external-coreml` and
`cohere-transcribe-q8-cache-external-coreml`). Stateful and stateless
are the real research variants.

Restored from f2cdf4e^ (+3,824 lines, 18 files):

Export scripts (only place these exist):
  - export-decoder-cache-external-v2.py  (canonical, language-conditioned)
  - export-decoder-cache-external.py     (earlier variant)

HF-upload bundle (mirror of the published model card):
  - hf-upload/cohere-transcribe-cache-external-coreml/{README,example,
    requirements,tokenizer.model,wer_results_cache_external.json,
    .gitattributes}
  - hf-upload/{README_UPLOAD,UPLOAD_INSTRUCTIONS}.md

Decision / status docs:
  - CACHE_EXTERNAL_ANALYSIS.md
  - CACHE_EXTERNAL_DELIVERED.md

Investigation docs in docs/:
  - CACHE_INVESTIGATION_SUMMARY.md   (why cache-external)
  - DECODER_CACHE_FIX.md             (concise rationale)
  - FP16_VS_INT8_FLEURS_COMPARISON.md (numbers behind PR 487's table)
  - RESEARCH_INSIGHTS.md

Result caches:
  - python_cache_external_full.json
  - python_cache_external_test.json

README.md, docs/STATELESS_VS_STATEFUL.md, tools/quantize_to_int8.py,
tools/compile_q8_to_mlmodelc.py, exports/export-decoder-{stateful,
stateless}.py, and tests/{benchmark-librispeech,benchmark-cjk-cer,
bench-q8-fleurs}.py still describe / operate on the stateful variant
and need a follow-up reconciliation pass to make cache-external the
documented and tooled canonical path.
…ches

Pure data dumps from one-off bench runs; regenerable from
tests/bench-q8-fleurs.py. Not referenced by any script.
The cache-external model card, example.py, requirements, and tokenizer
already live canonically at huggingface.co/FluidInference/cohere-
transcribe-cache-external-coreml. Maintaining a parallel copy here
invited drift; the conversion repo's job is the export pipeline, not
the deployment artifact.
README_UPLOAD.md and UPLOAD_INSTRUCTIONS.md referenced the
cache-external bundle that was just removed. With no bundle to
upload from this repo, the instructions are stale.
Consolidate the cache-external decision/status docs alongside the rest
of the investigation notes. README.md stays at root as the entry point.
Removes four docs that misrepresent the canonical pipeline now that
cache-external (with host-side fixes) is the shipped variant:

- DECODER_CACHE_FIX.md (113): claims stateless is the cache-bug fix.
  Cache-external is the actual fix; stateless never shipped.
- FP16_VS_INT8_FLEURS_COMPARISON.md (435): self-flagged as superseded
  in its header — all numbers predate HOST_SIDE_FIXES.md.
- STATELESS_VS_STATEFUL.md (358): frames decoder choice as a 2-way
  bake-off; both options are non-canonical research variants.
- RESEARCH_INSIGHTS.md (494): generic Cohere-architecture essay; no
  actionable export-pipeline content.
…ports/

Move both cache-external decoder export scripts into exports/ alongside
the encoder + stateful + stateless variants, so all decoder export
pipelines live in one directory.
It's a utility script (FLEURS dataset fetcher), not a top-level entry
point. Belongs alongside the other tools/ scripts.
Removes three decoder export variants that don't produce the artifact
FluidAudio actually loads:

- export-decoder-stateful.py (436): research variant using CoreML
  State API. Doesn't match CohereFixedPipeline's I/O. No consumer.
- export-decoder-stateless.py (296): unverified Parakeet-style
  variant. README itself flagged it as broken (icon-repetition,
  10x slower). No consumer.
- export-decoder-cache-external-v2.py (342): adds a language_id
  input that FluidAudio never passes; the artifact would fail to
  load in CohereFixedPipeline. Language-conditioning experiment
  that didn't ship.

exports/ now holds only the two scripts that produce the artifacts on
HuggingFace (cohere-transcribe-cache-external-coreml /
cohere-transcribe-q8-cache-external-coreml): export-encoder.py and
export-decoder-cache-external.py.

README.md and docs/CACHE_INVESTIGATION_SUMMARY.md still reference the
deleted stateless export by name; those will be reconciled in the
README rewrite.
All three Python bench scripts loaded cohere_decoder_stateful.mlpackage
and used the stateful CoreML State API (make_state, input_id, per-step
attention_mask). With export-decoder-stateful.py deleted and the
canonical HF artifacts (cohere-transcribe-cache-external-coreml /
cohere-transcribe-q8-cache-external-coreml) shipping cache-external
instead, these scripts cannot run.

Removed:
- bench-q8-fleurs.py (301)
- benchmark-librispeech.py (336)
- benchmark-cjk-cer.py (419)

The actual benchmark numbers in FluidAudio PR 487 come from the Swift
CLI (Scripts/run_cohere_per_lang.sh + CohereMixedBenchmark.swift), not
from these Python harnesses, so no benchmark capability is lost.

tests/ now keeps just test-feature-parity.py (mel-extractor parity
against the HF reference, decoder-independent).
- Delete compile_q8_to_mlmodelc.py (loaded deleted cohere_decoder_stateful.mlpackage)
- Strip decoder block from quantize_to_int8.py; encoder quantization remains valid

The cache-external Q8 decoder is already published on HF
(FluidInference/cohere-transcribe-q8-cache-external-coreml); regenerating
a stateful-decoder Q8 artifact is no longer in this pipeline's scope.
… with cache-external pipeline

- README: rewrite around the cache-external decoder (canonical), drop
  stateful/stateless framing, document the actual decoder I/O contract
  consumed by FluidAudio's CohereFixedPipeline, and point at the shipped
  HF artifacts.
- CACHE_INVESTIGATION_SUMMARY: replace dangling references to deleted
  export-decoder-stateless.py / test scripts with a pointer to the
  cache-external pipeline; rewrite the conclusion to explain why the
  stateless stepping stone was abandoned in favor of moving cache
  management into the host.
@Alex-Wengg Alex-Wengg marked this pull request as ready for review April 23, 2026 13:40
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere-transcribe): CoreML conversion + host-side pipeline fix + q8 findings" to "feat(cohere-transcribe): CoreML export + host-side pipeline fix + q8 hybrid" on Apr 23, 2026
`compile_encoder_to_mlmodelc.py` ended with a printed warning that
"the decoder MUST remain .mlpackage (State API requirement)". That
claim is stale: this PR ships the cache-external decoder, which uses
no State API and compiles to .mlmodelc like the encoder. Update the
closing message so someone running the script end-to-end doesn't
come away thinking the decoder has a packaging restriction it
doesn't actually have.
Addresses Devin review #4163032119 on PR #41:

- export-encoder.py: add convert_to="mlprogram" and
  compute_units=ct.ComputeUnit.CPU_ONLY (was relying on ALL default),
  matching parakeet-tdt-v3-0.6b/coreml/individual_components.py and
  the CLAUDE.md "Trace with .CpuOnly" constraint.
- export-encoder.py: fix stale hardcoded output-shape print
  (1, 376, 1024) -> (1, 438, 1024).
- export-decoder-cache-external.py: change ct.ComputeUnit.ALL ->
  ct.ComputeUnit.CPU_ONLY to match the same convention.
Addresses Devin review finding on PR #41: the decoder export script's
argparse only accepts --model-id and --output-dir, so the README's
`--precision float16` caused argparse to exit with
"unrecognized arguments". The decoder intentionally uses the
coremltools default precision — drop the flag from the documented
command instead of adding an unused argument.