feat(cohere-transcribe): CoreML export + host-side pipeline fix + q8 hybrid#41
Alex-Wengg wants to merge 21 commits into main
Conversation
Alex-Wengg added a commit that referenced this pull request on Apr 7, 2026
Fixes 3 critical correctness issues identified in PR #41 reviews:

1. **Language Token IDs Completely Broken** (f16/example_inference.py, q8/example_inference.py):
   - Fix LANGUAGE_PROMPTS dictionary with correct language token IDs
   - Position 4-5: Use correct language tokens (e.g., 169 for Spanish, not hardcoded 62)
   - Position 9: Use 13 (<|nodiarize|>) for all languages, not 14-26
   - Language tokens from vocab.json: en=62, es=169, fr=69, de=76, it=97, pt=149, pl=148, nl=60, sv=173, tr=186, ru=155, zh=50, ja=98, ko=110
   - Impact: Non-English transcription was silently producing English output

2. **Encoder Parameter Name Typo** (exports/export-encoder.py, export-encoder-ios18.py):
   - Fix encoder call from `lengths=feature_length` to `length=feature_length`
   - Since the encoder accepts **kwargs, the typo was silently ignored
   - Impact: Feature length masking was never applied, causing incorrect attention for shorter audio

3. **pyproject.toml Name Field** (pyproject.toml):
   - Fix copy-paste error: "parakeet-coreml" → "cohere-transcribe-coreml"
   - Update description to match project purpose
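The per-language token fix above can be sketched as follows. The token IDs (en=62, es=169, ...), the shared `<|nodiarize|>` id 13, and the position-4/5 and position-9 slots come from the commit message; the surrounding prompt layout, the `build_prompt` helper, and the start-token value are illustrative assumptions, not the repo's actual `LANGUAGE_PROMPTS` code.

```python
# Hypothetical sketch of the corrected language-prompt table. The IDs below
# are the vocab.json values listed in the commit; the prompt layout around
# them is illustrative only.
LANGUAGE_TOKENS = {
    "en": 62, "es": 169, "fr": 69, "de": 76, "it": 97, "pt": 149,
    "pl": 148, "nl": 60, "sv": 173, "tr": 186, "ru": 155,
    "zh": 50, "ja": 98, "ko": 110,
}
NODIARIZE_TOKEN = 13  # shared across all languages (was wrongly 14-26)

def build_prompt(lang: str, start_token: int = 4) -> list[int]:
    """Assemble a 10-slot decoder prompt: language token at positions 4-5,
    <|nodiarize|> at position 9 (other slots are placeholder structure)."""
    lang_id = LANGUAGE_TOKENS[lang]  # KeyError for unsupported codes
    return [start_token, 0, 0, 0, lang_id, lang_id, 0, 0, 0, NODIARIZE_TOKEN]
```

With a hardcoded 62 in the language slot, every prompt requested English, which is exactly the silent-English failure described above.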
Alex-Wengg added a commit that referenced this pull request on Apr 7, 2026
Fixes 3 test-related issues identified in PR #41 reviews:

1. **Wrong EOS Token Fallback** (tests/benchmark-models.py:46):
   - Fix fallback EOS token: 2 (PAD) → 3 (actual EOS)
   - Impact: Decoder will stop at correct token when processor unavailable

2. **Mel Padding Frame Mismatch** (tests/*.py):
   - Fix padding: 3001 frames → 3500 frames (35-second window)
   - Files: benchmark-models.py, compare-models.py, measure-memory.py
   - Impact: Prevents dimension mismatches and crashes on longer audio

3. **Operator Precedence Bug** (tests/compare-models.py:164, 166):
   - Add parentheses to fix condition parsing
   - Before: `len(...) == 4 and 'cache_k' in key or key == 'new_cache_k'`
   - After: `len(...) == 4 and ('cache_k' in key or key == 'new_cache_k')`
   - Impact: Cache assignments now correctly check tensor dimensions
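The precedence bug in item 3 is easy to reproduce in isolation: Python binds `and` tighter than `or`, so the unparenthesized condition skips the dimension check whenever the key matches the second alternative. A standalone demonstration (function names and the `ndim` stand-in for `len(shape)` are illustrative):

```python
# `A and B or C` parses as `(A and B) or C`, not `A and (B or C)`.
def is_cache_key_buggy(ndim: int, key: str) -> bool:
    # Parsed as (ndim == 4 and 'cache_k' in key) or (key == 'new_cache_k')
    return ndim == 4 and 'cache_k' in key or key == 'new_cache_k'

def is_cache_key_fixed(ndim: int, key: str) -> bool:
    return ndim == 4 and ('cache_k' in key or key == 'new_cache_k')

# A 2-D tensor named 'new_cache_k' slips through the buggy check:
assert is_cache_key_buggy(2, 'new_cache_k') is True   # dimension check ignored
assert is_cache_key_fixed(2, 'new_cache_k') is False  # correct
```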
Alex-Wengg added a commit that referenced this pull request on Apr 7, 2026
Fixes 2 decoder-related issues identified in PR #41 reviews:

1. **Stateful Decoder Missing log_softmax** (exports/export-decoder-stateful.py:148):
   - Add torch.log_softmax() after the lm_head projection
   - Before: Returned raw logits from the Linear layer
   - After: Returns log-probabilities
   - Impact: Beam search and probability-based decoding now work correctly
   - Greedy decoding unaffected (argmax works on both logits and log-probs)

2. **Multi-Step Validation Feeds Same Token** (exports/export-decoder-stateful.py:407-414):
   - Fix autoregressive validation loop to feed predicted tokens
   - Before: Fed start token (4) at every step
   - After: Feeds previous step's predicted token (current_token = next_token)
   - Impact: Validation can now detect autoregressive generation bugs
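The "greedy unaffected, beam search broken" distinction in item 1 follows from log-softmax being a monotone transform. A numpy stand-in for `torch.log_softmax` makes both halves of the claim checkable:

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax (numpy stand-in for torch.log_softmax)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

logits = np.array([2.0, 0.5, -1.0, 3.5])
logp = log_softmax(logits)

# Greedy decoding is unaffected: log-softmax preserves the argmax.
assert logp.argmax() == logits.argmax()

# Beam search sums per-step scores, which is only meaningful for
# log-probabilities: these must exponentiate back to a distribution.
assert abs(np.exp(logp).sum() - 1.0) < 1e-9
```

Summing raw logits across steps (as the pre-fix export forced) has no probabilistic interpretation, which is why beam hypotheses were mis-ranked while greedy output looked fine.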
Alex-Wengg added a commit that referenced this pull request on Apr 7, 2026
Fixes issue identified in PR #41 reviews:

- Remove uv.lock from .gitignore
- Commit uv.lock to ensure reproducible dependency versions
- Compliance with the AGENTS.md requirement for self-contained directories

Impact: Contributors now get consistent dependency versions across environments
BrandonWeng reviewed on Apr 8, 2026
Add the export + quantization pipeline for the Cohere Command-A Transcribe
03-2026 model, targeting on-device CoreML inference:
- exports/: stateful / stateless / encoder export scripts (the stateful
  decoder is the one shipped to HF).
- tools/: compile_encoder_to_mlmodelc.py, compile_q8_to_mlmodelc.py,
quantize_to_int8.py for the INT8 weight-quantization pass.
- quantize_encoder_to_int4.py for the INT4 encoder experiment.
- Root export-*.py scripts covering earlier cache-external and
per-language decoder iterations (kept for reference).
- hf-upload/ bundles + f16/q8 package templates are prepared for upload
to FluidInference/cohere-transcribe-03-2026-coreml on HuggingFace
(README, example, requirements, tokenizer.model).
- cohere-pytorch/ vendors the HF pytorch reference (modeling,
configuration, processing, tokenizer) for reproducibility and as the
export source of truth.
- docs/ and top-level .md files document architecture analysis, cache
strategies, research notes, and known implementation issues.
- uv.lock / pyproject.toml pin the Python toolchain used for export.
This commit is pipeline + artifacts only; the host-side fix and q8
investigation land in the two follow-up commits.
…oss-attn masking
The shipped hf-upload example code had three host-side bugs that caused
drastically degraded transcription even when the CoreML weights were
correct. This commit adds the fixed reference implementation (f16/ and
q8/ packages), the parity tools that prove the diagnosis, and the
benchmarks that quantify the impact.
What was broken:
1. Mel spectrogram preprocessing differed from HF CohereAsrFeatureExtractor
(different window, fft size rounding, normalization). Every frame
was slightly off, which compounded across the encoder.
2. cross_attention_mask was set to all-ones instead of masking the
padded encoder frames, letting the decoder attend to zeros.
3. CJK languages were detokenized token-by-token, dropping the
SentencePiece byte-fallback triples (<0xE4><0xB8><0xAD>) that
encode Chinese/Japanese/Korean characters, producing gibberish.
Fixes, all host-side (no weight re-export required):
- tools/cohere_features_v2.py: numpy port of CohereAsrFeatureExtractor
with bit-close parity to the HF implementation.
- f16/example_inference.py, f16/quickstart.py, q8/example_inference.py,
q8/quickstart.py: use cohere_features_v2, set cross_attention_mask
from encoder lengths, buffer byte-fallback tokens for CJK.
- f16/ and q8/ packages (vocab, requirements, pyproject, README) are
the canonical uploads for HF.
- tests/test-feature-parity.py: parity test vs HF extractor.
- tests/diagnose-feature-diff.py: per-frame diff tool.
- tests/bench-fix-vs-broken.py: f16 FLEURS broken-vs-fixed comparison.
- tests/benchmark-librispeech.py, tests/benchmark-cjk-cer.py: canonical
FLEURS + LibriSpeech benchmarks (with --normalize + CER for CJK).
Impact (from docs/FP16_VS_INT8_FLEURS_COMPARISON.md): the FLEURS WER /
CER numbers move from near-unusable to close to PyTorch reference across
all 14 tested languages on f16. Full impact table in the PR body.
After the host-side fix (prior commit), the HF-shipped q8 stateful
decoder still over-generates on short utterances (EN/FR/ES), running
past the true sentence boundary. The root cause is INT8 weight
quantization noise on the EOS logit: at the true end-of-utterance the
EOS token sits at rank 1-2 with only a ~2 logit gap to the top token,
well within the quantization noise budget.
Diagnostic + fix in three scripts:
- tests/probe-q8-eos.py: instrument the q8 stateful decoder and dump
per-step top-5 tokens, EOS (id=3) logit, EOS rank, and the
cumulative hypothesis. Confirms EOS is rank 1-2 with 2-3 unit gap
at true boundary.
- tests/bench-q8-fleurs.py: FLEURS benchmark of the HF-shipped q8
stateful decoder on the 3-samples-per-language slice used by
bench-fix-vs-broken. Establishes the q8 baseline.
- tests/bench-q8-eosboost.py: run fixed q8 pipeline with EOS logit
boosted by 0.0 / +2.0 / +4.0 on EN/FR/ES slices. +4.0 recovers most
of the f16 quality without retraining or re-quantization.
No new export / re-quantization was needed; this is a pure inference-side
patch (logit_bias_eos=+4.0) that host code can apply when shipping the
q8 weights downloaded from
FluidInference/cohere-transcribe-03-2026-coreml.
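The `logit_bias_eos=+4.0` patch is a one-line change at decode time. A toy numpy sketch of the idea (EOS id 3 is from the probe script above; the logit values here are illustrative, chosen to mimic the rank-2, ~2-unit-gap situation the probe observed):

```python
import numpy as np

EOS_ID = 3  # EOS token id, per tests/probe-q8-eos.py

def next_token(logits: np.ndarray, logit_bias_eos: float = 0.0) -> int:
    """Greedy step with an additive EOS bias applied before argmax."""
    biased = logits.copy()  # don't mutate the caller's logits
    biased[EOS_ID] += logit_bias_eos
    return int(biased.argmax())

# Toy logits at a true utterance boundary: EOS at rank 2, ~2 units behind
# the winner -- inside the INT8 quantization noise budget.
logits = np.zeros(16)
logits[7] = 10.0       # spurious continuation token
logits[EOS_ID] = 8.0   # the token that should win here

assert next_token(logits) == 7                            # q8 over-generates
assert next_token(logits, logit_bias_eos=4.0) == EOS_ID   # +4.0 bias stops it
```

Because the bias is applied host-side after the CoreML call, it works against the already-published q8 weights with no re-export.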
…e-shot scripts
Promote the findings previously embedded in one-shot experimental
scripts to proper docs, and remove the scripts now that the findings are
captured:
- docs/HOST_SIDE_FIXES.md documents the three host-side inference bugs
(mel spectrogram drift vs HF extractor, all-ones cross_attention_mask,
token-by-token CJK detokenization), with exact fix locations (python
and Swift) and a reproduction recipe.
- docs/Q8_EOS_BIAS.md documents the q8 over-generation diagnosis
(EOS at rank 1-2, ~2-3 logit gap), the +4.0 EOS-bias fix, and the
independence of this fix from the host-side fixes.
- docs/FP16_VS_INT8_FLEURS_COMPARISON.md is flagged as historical —
its 200-500% WER numbers predate the host-side fixes and reflect
the bugs, not the weights.
Removed scripts:
- tests/bench-fix-vs-broken.py (368 lines) — broken-vs-fixed demo on
the cache-external decoder (not the shipped stateful architecture).
Finding: the three host-side bugs are the dominant error source.
Captured in HOST_SIDE_FIXES.md.
- tests/bench-q8-eosboost.py (244 lines) — +0/+2/+4 bias sweep on
EN/FR/ES/zh. Captured in Q8_EOS_BIAS.md.
- tests/probe-q8-eos.py (172 lines) — per-step top-5 logit dump.
Captured in Q8_EOS_BIAS.md.
- tests/diagnose-feature-diff.py (63 lines) — per-frame diff tool;
superseded by tests/test-feature-parity.py which covers this.
Remaining runnable tests (4):
- tests/benchmark-librispeech.py — parameterized LibriSpeech / FLEURS
benchmark for any precision.
- tests/benchmark-cjk-cer.py — CJK CER benchmark.
- tests/bench-q8-fleurs.py — FLEURS bench against the HF-shipped q8
stateful decoder.
- tests/test-feature-parity.py — regression test for
tools/cohere_features_v2.py vs HF CohereAsrFeatureExtractor.
The f16/ and q8/ directories were staging copies of the HF upload
bundle that ships alongside the CoreML weights in
FluidInference/cohere-transcribe-03-2026-coreml. They duplicated
content now available canonically from HF:
- vocab.json (identical)
- cohere_mel_spectrogram.py (byte-identical to tools/cohere_features_v2.py)
- example_inference.py / quickstart.py / pyproject.toml / requirements.txt
(identical between f16/ and q8/)
Net effect: ~35K lines removed, zero duplication, one canonical
feature-extractor source (tools/cohere_features_v2.py).
Updates in-repo tests to source the feature extractor from tools/ and
accept a --models-dir flag so local reproduction works by pointing at
a HF-downloaded snapshot:
  huggingface-cli download FluidInference/cohere-transcribe-03-2026-coreml \
    --local-dir ./cohere-models
  python tests/benchmark-librispeech.py --precision fp16 \
    --models-dir ./cohere-models/f16
tools/quantize_to_int8.py and tools/compile_*.py still reference f16/
and q8/ as local working dirs; those are populated on demand by the
quantization pipeline and are not tracked in git.
Also drops the OLD-vs-NEW comparison block in test-feature-parity.py
since the former "old broken" extractor (cohere_mel_spectrogram.py)
has been replaced with the fixed v2 port.
Strips superseded experiments, vendored HF source, transient progress
notes, and abandoned variant staging. PR diff drops from 78 -> 21 files
and from ~135K -> ~9K added lines.
Removed:
- cohere-pytorch/ (19 files, ~115K lines): vendored clone of
CohereLabs/cohere-transcribe-03-2026. Reproducibility convention is to
reference HF, not vendor. test-feature-parity.py now downloads the
reference on demand and accepts --pytorch-dir for offline use.
- 14 root *_STATUS.md / *_COMPLETE.md / *_FAILURE.md (~3K lines):
transient progress narratives whose durable findings are already in
docs/HOST_SIDE_FIXES.md and docs/Q8_EOS_BIAS.md. Some pairs (e.g.
MLMODELC_LIMITATION + MLMODELC_VERIFIED) directly contradicted each
other.
- 8 root export-*.py / quantize_encoder_to_int4.py (~2K lines):
superseded by exports/{export-encoder,export-decoder-stateful,
export-decoder-stateless}.py + tools/quantize_to_int8.py.
- hf-upload/ (8 files): staging dir for the abandoned cache-external
decoder variant.
- 3 root benchmark JSON caches (~850 lines): captured outputs from
pre-fix runs.
- 7 docs/ pre-fix investigation notes (~2.4K lines):
CACHE_INVESTIGATION_SUMMARY, DECODER_CACHE_FIX, OFFICIAL_USAGE_ANALYSIS,
QWEN3_VS_COHERE_STATEFUL_CACHE, RESEARCH_INSIGHTS, REVERSE_ENGINEERING,
FP16_VS_INT8_FLEURS_COMPARISON. The host-side fixes invalidated the
WER/CER measurements these were built on; superseded by HOST_SIDE_FIXES.
Kept (canonical surface):
- README.md
- docs/{HOST_SIDE_FIXES,Q8_EOS_BIAS,STATELESS_VS_STATEFUL,
COHERE_ARCHITECTURE_ANALYSIS}.md
- tools/{cohere_features_v2,quantize_to_int8,compile_*}.py
- exports/{export-encoder,export-decoder-stateful,export-decoder-stateless}.py
- tests/{benchmark-librispeech,benchmark-cjk-cer,bench-q8-fleurs,
test-feature-parity}.py
- download-fleurs-for-swift.py, pyproject.toml, uv.lock, .gitignore
Also updates .gitignore to prevent regressions: ignores f16/, q8/,
benchmark_*.json, *_fleurs_*.json, *_cache_external*.json, and the
external cohere-pytorch/ snapshot path.
Earlier cleanup (f2cdf4e) deleted the cache-external decoder export and its supporting docs / hf-upload bundle on the assumption it was an abandoned variant. That was wrong: FluidAudio PR 487 ships cache-external as the canonical decoder (HF repos `cohere-transcribe-cache-external-coreml` and `cohere-transcribe-q8-cache-external-coreml`). Stateful and stateless are the real research variants.

Restored from f2cdf4e^ (+3,824 lines, 18 files):

Export scripts (only place these exist):
- export-decoder-cache-external-v2.py (canonical, language-conditioned)
- export-decoder-cache-external.py (earlier variant)

HF-upload bundle (mirror of the published model card):
- hf-upload/cohere-transcribe-cache-external-coreml/{README,example,
  requirements,tokenizer.model,wer_results_cache_external.json,
  .gitattributes}
- hf-upload/{README_UPLOAD,UPLOAD_INSTRUCTIONS}.md

Decision / status docs:
- CACHE_EXTERNAL_ANALYSIS.md
- CACHE_EXTERNAL_DELIVERED.md

Investigation docs in docs/:
- CACHE_INVESTIGATION_SUMMARY.md (why cache-external)
- DECODER_CACHE_FIX.md (concise rationale)
- FP16_VS_INT8_FLEURS_COMPARISON.md (numbers behind PR 487's table)
- RESEARCH_INSIGHTS.md

Result caches:
- python_cache_external_full.json
- python_cache_external_test.json

README.md, docs/STATELESS_VS_STATEFUL.md, tools/quantize_to_int8.py, tools/compile_q8_to_mlmodelc.py, exports/export-decoder-{stateful,stateless}.py, and tests/{benchmark-librispeech,benchmark-cjk-cer,bench-q8-fleurs}.py still describe / operate on the stateful variant and need a follow-up reconciliation pass to make cache-external the documented and tooled canonical path.
…ches

Pure data dumps from one-off bench runs; regenerable from tests/bench-q8-fleurs.py. Not referenced by any script.
The cache-external model card, example.py, requirements, and tokenizer already live canonically at huggingface.co/FluidInference/cohere-transcribe-cache-external-coreml. Maintaining a parallel copy here invited drift; the conversion repo's job is the export pipeline, not the deployment artifact.
README_UPLOAD.md and UPLOAD_INSTRUCTIONS.md referenced the cache-external bundle that was just removed. With no bundle to upload from this repo, the instructions are stale.
Consolidate the cache-external decision/status docs alongside the rest of the investigation notes. README.md stays at root as the entry point.
Removes four docs that misrepresent the canonical pipeline now that cache-external (with host-side fixes) is the shipped variant:
- DECODER_CACHE_FIX.md (113): claims stateless is the cache-bug fix. Cache-external is the actual fix; stateless never shipped.
- FP16_VS_INT8_FLEURS_COMPARISON.md (435): self-flagged as superseded in its header — all numbers predate HOST_SIDE_FIXES.md.
- STATELESS_VS_STATEFUL.md (358): frames the decoder choice as a 2-way bake-off; both options are non-canonical research variants.
- RESEARCH_INSIGHTS.md (494): generic Cohere-architecture essay; no actionable export-pipeline content.
…ports/

Move both cache-external decoder export scripts into exports/ alongside the encoder + stateful + stateless variants, so all decoder export pipelines live in one directory.
It's a utility script (FLEURS dataset fetcher), not a top-level entry point. Belongs alongside the other tools/ scripts.
Removes three decoder export variants that don't produce the artifact FluidAudio actually loads:
- export-decoder-stateful.py (436): research variant using the CoreML State API. Doesn't match CohereFixedPipeline's I/O. No consumer.
- export-decoder-stateless.py (296): unverified Parakeet-style variant. The README itself flagged it as broken (token-repetition, 10x slower). No consumer.
- export-decoder-cache-external-v2.py (342): adds a language_id input that FluidAudio never passes; the artifact would fail to load in CohereFixedPipeline. Language-conditioning experiment that didn't ship.

exports/ now holds only the two scripts that produce the artifacts on HuggingFace (cohere-transcribe-cache-external-coreml / cohere-transcribe-q8-cache-external-coreml): export-encoder.py and export-decoder-cache-external.py.

README.md and docs/CACHE_INVESTIGATION_SUMMARY.md still reference the deleted stateless export by name; those will be reconciled in the README rewrite.
All three Python bench scripts loaded cohere_decoder_stateful.mlpackage and used the stateful CoreML State API (make_state, input_id, per-step attention_mask). With export-decoder-stateful.py deleted and the canonical HF artifacts (cohere-transcribe-cache-external-coreml / cohere-transcribe-q8-cache-external-coreml) shipping cache-external instead, these scripts cannot run.

Removed:
- bench-q8-fleurs.py (301)
- benchmark-librispeech.py (336)
- benchmark-cjk-cer.py (419)

The actual benchmark numbers in FluidAudio PR 487 come from the Swift CLI (Scripts/run_cohere_per_lang.sh + CohereMixedBenchmark.swift), not from these Python harnesses, so no benchmark capability is lost.

tests/ now keeps just test-feature-parity.py (mel-extractor parity against the HF reference, decoder-independent).
- Delete compile_q8_to_mlmodelc.py (loaded the deleted cohere_decoder_stateful.mlpackage)
- Strip the decoder block from quantize_to_int8.py; encoder quantization remains valid

The cache-external Q8 decoder is already published on HF (FluidInference/cohere-transcribe-q8-cache-external-coreml); regenerating a stateful-decoder Q8 artifact is no longer in this pipeline's scope.
… with cache-external pipeline

- README: rewrite around the cache-external decoder (canonical), drop the stateful/stateless framing, document the actual decoder I/O contract consumed by FluidAudio's CohereFixedPipeline, and point at the shipped HF artifacts.
- CACHE_INVESTIGATION_SUMMARY: replace dangling references to the deleted export-decoder-stateless.py / test scripts with a pointer to the cache-external pipeline; rewrite the conclusion to explain why the stateless stepping stone was abandoned in favor of moving cache management into the host.
`compile_encoder_to_mlmodelc.py` ended with a printed warning that "the decoder MUST remain .mlpackage (State API requirement)". That claim is stale: this PR ships the cache-external decoder, which uses no State API and compiles to .mlmodelc like the encoder. Update the closing message so someone running the script end-to-end doesn't come away thinking the decoder has a packaging restriction it doesn't actually have.
Addresses Devin review #4163032119 on PR #41:

- export-encoder.py: add convert_to="mlprogram" and compute_units=ct.ComputeUnit.CPU_ONLY (was relying on the ALL default), matching parakeet-tdt-v3-0.6b/coreml/individual_components.py and the CLAUDE.md "Trace with .CpuOnly" constraint.
- export-encoder.py: fix a stale hardcoded output-shape print (1, 376, 1024) → (1, 438, 1024).
- export-decoder-cache-external.py: change ct.ComputeUnit.ALL → ct.ComputeUnit.CPU_ONLY to match the same convention.
Addresses Devin review finding on PR #41: the decoder export script's argparse only accepts --model-id and --output-dir, so the README's `--precision float16` caused argparse to exit with "unrecognized arguments". The decoder intentionally uses the coremltools default precision — drop the flag from the documented command instead of adding an unused argument.
Summary
CoreML conversion pipeline for Cohere Transcribe 03-2026 (14-language encoder-decoder ASR), plus the host-side preprocessing and decode fixes required for the shipped weights to actually produce correct output.
Scope is intentionally limited to `models/stt/cohere-transcribe-03-2026/coreml/` — no changes to VAD, Qwen3, or other model directories.

What's in here
Exports (`exports/`)

- `export-encoder.py` — 48-layer Conformer encoder, fixed mel input `[1, 128, 3500]`, output `[1, 438, 1024]`. Uses the correct `length` masking parameter so padded frames are ignored internally.
- `export-decoder-cache-external.py` — 8-layer decoder with Parakeet-style external KV cache (108-token window). This is the shipped decoder path; the stateful-decoder variant was dropped after the cache-external version landed.

Tools (`tools/`)

- `cohere_features_v2.py` — canonical numpy port of `FilterbankFeatures` (see host-side fixes below).
- `quantize_to_int8.py` — W8A16 quantization for the encoder.
- `compile_encoder_to_mlmodelc.py` — `.mlpackage` → `.mlmodelc` compile step with ANE targeting.
- `download-fleurs-for-swift.py` — FLEURS slice downloader used by the Swift benchmark.

Host-side pipeline fixes (the important part)
The shipped `cohere_mel_spectrogram.py` did not match `processing_cohere_asr.py::FilterbankFeatures` on any parameter that matters: wrong `n_fft`, wrong window, wrong mel normalization, wrong log base, and no per-feature CMVN at all. Without CMVN, every utterance's features drift by tens of dB per bin, the encoder is fed out-of-distribution data, and the decoder emits whatever language cluster happened to be nearest — Arabic for French, Polish for Chinese, etc.

Four host-only fixes (no retraining, same weights) make the failures disappear:

- `tools/cohere_features_v2.py` — faithful numpy port of `FilterbankFeatures`: `n_fft=512`, Hann(400) zero-padded to 512, preemph=0.97, Slaney mel, natural log + `2^-24` guard, per-feature CMVN ddof=1 ε=1e-5, `mag_power=2.0`. Verified against `AutoFeatureExtractor.from_pretrained(..., trust_remote_code=True)` on real samples × 4 languages via `tests/test-feature-parity.py` — residual is within HF's own dither variance.
- `cross_attention_mask` set from encoder lengths: only the first `ceil(feature_length * 438/3500)` encoder frames correspond to real audio. Padded frames are masked with `-1e4`.
- CJK byte-fallback buffering: SentencePiece emits `<0xHH>` triples; `tokens_to_text` buffers and flushes via `bytes(...).decode("utf-8", errors="replace")`.
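Two of these fixes are small enough to sketch directly. The constants (per-feature CMVN with ddof=1 and ε=1e-5, the 3500-mel-frame → 438-encoder-frame ratio, the `-1e4` mask fill) come from this PR; the function names and the standalone shapes are illustrative, not the actual `cohere_features_v2.py` API.

```python
import math
import numpy as np

def cmvn_per_feature(feats: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-feature mean/variance normalization over time, using the
    ddof=1, eps=1e-5 convention described above. feats: [n_mels, n_frames]."""
    mean = feats.mean(axis=1, keepdims=True)
    std = feats.std(axis=1, keepdims=True, ddof=1)
    return (feats - mean) / (std + eps)

def cross_attention_mask(feature_length: int,
                         mel_frames: int = 3500,
                         enc_frames: int = 438) -> np.ndarray:
    """Additive mask over encoder frames: 0 for real audio, -1e4 for padding,
    using the ceil(feature_length * 438/3500) valid-frame count."""
    valid = math.ceil(feature_length * enc_frames / mel_frames)
    mask = np.full(enc_frames, -1e4, dtype=np.float32)
    mask[:valid] = 0.0
    return mask

mask = cross_attention_mask(1750)   # a half-full 35 s window
assert (mask == 0).sum() == 219     # ceil(1750 * 438/3500) = 219
assert mask[219] == -1e4            # first padded encoder frame is masked
```

The broken pipeline was equivalent to `cross_attention_mask(3500)` for every utterance (all-ones attention) and skipped `cmvn_per_feature` entirely.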
Q8 findings
The shipped q8 decoder has a separate, orthogonal failure mode from the host-side bugs: over-generation. It produces a correct transcript, then keeps going past the real EOS. Instrumented logging shows EOS is consistently rank 1 or 2 with a ~2 logit gap to the winner — textbook weight-only INT8 error on the final classifier.
Re-quantization experiments (kept decoder lm_head at FP16 via tied-const handling, per-tensor variants, threshold skipping) never matched FP16; the q8 quality loss is distributed across many layers, not localized to lm_head. A proper fix would need calibration-aware quantization with end-of-utterance frames, or mixed-precision keeping attention output projections at FP16 — both outside the `coremltools.optimize.coreml` op-level API.

Recommendation adopted downstream: ship INT8 encoder (W8A16) + FP16 cache-external decoder as the hybrid default. The encoder INT8 is lossless (±0.5% WER) and the FP16 decoder sidesteps the EOS regression entirely. See `docs/HOST_SIDE_FIXES.md` and `docs/Q8_EOS_BIAS.md` for full detail.

Downstream validation (FluidAudio#487)
Swift integration with the INT8 encoder + FP16 cache-external decoder is live in FluidAudio#487. Benchmarks using the exports and feature extractor from this PR:
LibriSpeech test-clean (full split, Apple M2 2022)
FLEURS (full splits, 14 languages, M4 Pro)
†ja/zh written without word boundaries → WER is tokenization artifact; CER is the real accuracy metric for those languages.
Comparison vs. Cohere's technical-report Figure 4 (FLEURS+CV17+MLS+Wenet avg, FP16 PyTorch): ours lands ~1–3 pp higher on most languages, consistent with (a) FLEURS-only being harder than a 4-corpus average and (b) the INT8 encoder. Japanese CER is actually below Cohere's reported number.
Docs landed in this PR
- `coreml/README.md` — end-to-end CoreML conversion walkthrough.
- `docs/HOST_SIDE_FIXES.md` — the four host-side fixes and FLEURS A/B tables.
- `docs/Q8_EOS_BIAS.md` — EOS-bias analysis and re-quantization experiments.
- `docs/CACHE_EXTERNAL_ANALYSIS.md`, `CACHE_EXTERNAL_DELIVERED.md`, `CACHE_INVESTIGATION_SUMMARY.md` — cache-external decoder path rationale and delivery notes.
- `docs/COHERE_ARCHITECTURE_ANALYSIS.md` — upstream model walkthrough and CoreML mapping.

Test plan
- `length` masking
- `cohere_features_v2.py` parity vs HF `AutoFeatureExtractor` (`tests/test-feature-parity.py`)

Known limitations