
feat(cohere-transcribe): CoreML export + host-side pipeline fix + q8 hybrid #41

Open
Alex-Wengg wants to merge 21 commits into main from docs/cohere-transcribe-coreml-decoder-fix

Conversation


@Alex-Wengg Alex-Wengg commented Apr 6, 2026

Summary

CoreML conversion pipeline for Cohere Transcribe 03-2026 (14-language encoder-decoder ASR), plus the host-side preprocessing and decode fixes required for the shipped weights to actually produce correct output.

Scope is intentionally limited to models/stt/cohere-transcribe-03-2026/coreml/ — no changes to VAD, Qwen3, or other model directories.

What's in here

Exports (exports/)

  • export-encoder.py — 48-layer Conformer encoder, fixed mel input [1, 128, 3500], output [1, 438, 1024]. Uses the correct length masking parameter so padded frames are ignored internally.
  • export-decoder-cache-external.py — 8-layer decoder with Parakeet-style external KV cache (108-token window). This is the shipped decoder path; the stateful-decoder variant was dropped after the cache-external version landed.

Tools (tools/)

  • cohere_features_v2.py — canonical numpy port of FilterbankFeatures (see host-side fixes below).
  • quantize_to_int8.py — W8A16 quantization for the encoder.
  • compile_encoder_to_mlmodelc.py — .mlpackage → .mlmodelc compile step with ANE targeting.
  • download-fleurs-for-swift.py — FLEURS slice downloader used by the Swift benchmark.

Host-side pipeline fixes (the important part)

The shipped cohere_mel_spectrogram.py did not match processing_cohere_asr.py::FilterbankFeatures on any parameter that matters: wrong n_fft, wrong window, wrong mel normalization, wrong log base, and no per-feature CMVN at all. Without CMVN, every utterance's features drift by tens of dB per bin, the encoder is fed out-of-distribution data, and the decoder emits whatever language cluster happened to be nearest — Arabic for French, Polish for Chinese, etc.
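The missing CMVN is the dominant mismatch. As a minimal numpy sketch of the per-feature normalization (utterance-level, ddof=1 and ε=1e-5 per the fix list below; `apply_cmvn` is an illustrative name, not the shipped code):

```python
import numpy as np

def apply_cmvn(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-feature (per-mel-bin) CMVN over one utterance.

    features: [n_mels, n_frames] log-mel features.
    Each bin is normalized over time with an unbiased std (ddof=1),
    pinning every utterance to the distribution the encoder was
    trained on instead of letting absolute levels drift per clip.
    """
    mean = features.mean(axis=1, keepdims=True)        # [n_mels, 1]
    std = features.std(axis=1, ddof=1, keepdims=True)  # unbiased estimator
    return (features - mean) / (std + eps)
```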

Four host-only fixes (no retraining, same weights) make the failures disappear:

  1. tools/cohere_features_v2.py — faithful numpy port of FilterbankFeatures: n_fft=512, Hann(400) zero-padded to 512, preemph=0.97, Slaney mel, natural log + 2^-24 guard, per-feature CMVN ddof=1 ε=1e-5, mag_power=2.0. Verified against AutoFeatureExtractor.from_pretrained(..., trust_remote_code=True) on real samples × 4 languages via tests/test-feature-parity.py — residual is within HF's own dither variance.
  2. Cross-attention masking — the encoder always emits 438 frames but only ceil(feature_length * 438/3500) correspond to real audio. Padded frames are masked with -1e4.
  3. Repetition penalty + no-repeat-ngram in greedy decode (defaults 1.1 and 3).
  4. SentencePiece byte-fallback detok — CJK characters emit as <0xHH> triples; tokens_to_text buffers and flushes via bytes(...).decode("utf-8", errors="replace").
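The byte-fallback buffering in fix 4 can be sketched as follows (illustrative helper, not the shipped tokens_to_text; assumes SentencePiece-style `<0xHH>` pieces and the `▁` word marker):

```python
import re

_BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def detokenize(pieces):
    """Join SentencePiece pieces, buffering <0xHH> byte-fallback tokens
    so multi-byte UTF-8 characters (e.g. CJK) decode as one unit
    instead of per-token replacement characters."""
    out, byte_buf = [], bytearray()
    for piece in pieces:
        m = _BYTE_TOKEN.match(piece)
        if m:
            byte_buf.append(int(m.group(1), 16))
            continue
        if byte_buf:  # flush buffered bytes before a normal piece
            out.append(bytes(byte_buf).decode("utf-8", errors="replace"))
            byte_buf.clear()
        out.append(piece.replace("\u2581", " "))  # SentencePiece word marker
    if byte_buf:  # flush a trailing byte run
        out.append(bytes(byte_buf).decode("utf-8", errors="replace"))
    return "".join(out)
```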

FLEURS impact (3 samples × 4 languages, same CoreML model files)

| Language | Metric | OLD | NEW | Δ |
| --- | --- | --- | --- | --- |
| en_us | WER | 55.3% | 10.6% | −44.6 pp |
| es_419 | WER | 11.3% | 4.9% | −6.4 pp |
| fr_fr | WER | 92.1% | 16.8% | −75.2 pp |
| cmn_hans_cn | CER | 261.7% | 14.1% | −247.6 pp |

Q8 findings

The shipped q8 decoder has a failure mode orthogonal to the host-side bugs: over-generation. It produces a correct transcript, then keeps generating past the real EOS. Instrumented logging shows EOS consistently at rank 1 or 2 with a ~2-logit gap to the winner — textbook weight-only INT8 error on the final classifier.
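The rank/gap measurement described above can be sketched like this (illustrative, not tests/probe-q8-eos.py itself; assumes raw logits for one decode step and EOS id 3):

```python
import numpy as np

def eos_rank_and_gap(logits: np.ndarray, eos_id: int = 3):
    """For one decode step's logits [vocab], return the EOS token's
    rank (0 = it would be picked by argmax) and its logit gap to the
    winning token. A small gap at the true utterance boundary means
    quantization noise can push EOS below the winner."""
    order = np.argsort(-logits)                      # descending token order
    rank = int(np.where(order == eos_id)[0][0])
    gap = float(logits[order[0]] - logits[eos_id])
    return rank, gap
```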

Re-quantization experiments (keeping the decoder lm_head at FP16 via tied-const handling, per-tensor variants, threshold skipping) never matched FP16; the q8 quality loss is distributed across many layers, not localized to lm_head. A proper fix would need calibration-aware quantization with end-of-utterance frames, or mixed precision keeping the attention output projections at FP16 — both outside the coremltools.optimize.coreml op-level API.

Recommendation adopted downstream: ship INT8 encoder (W8A16) + FP16 cache-external decoder as the hybrid default. The encoder INT8 is lossless (±0.5% WER) and the FP16 decoder sidesteps the EOS regression entirely. See docs/HOST_SIDE_FIXES.md and docs/Q8_EOS_BIAS.md for full detail.

Downstream validation (FluidAudio#487)

Swift integration with the INT8 encoder + FP16 cache-external decoder is live in FluidAudio#487. Benchmarks using the exports and feature extractor from this PR:

LibriSpeech test-clean (full split, Apple M2 2022)

| Samples | WER | CER | RTFx (per-file mean) | RTFx (total) |
| --- | --- | --- | --- | --- |
| 2,620 | 1.77% | 0.60% | 2.04× | 1.72× |

FLEURS (full splits, 14 languages, M4 Pro)

| Code | Lang | Samples | WER | CER | RTFx |
| --- | --- | --- | --- | --- | --- |
| en_us | English | 647 | 5.63% | 3.19% | 2.49× |
| fr_fr | French | 676 | 6.22% | 3.11% | 2.21× |
| de_de | German | 862 | 5.84% | 2.83% | 1.98× |
| es_419 | Spanish (LatAm) | 908 | 4.53% | 2.40% | 1.34× |
| it_it | Italian | 865 | 4.03% | 2.04% | 3.15× |
| pt_br | Portuguese (BR) | 919 | 6.44% | 3.38% | 2.79× |
| nl_nl | Dutch | 364 | 8.07% | 4.14% | 2.04× |
| pl_pl | Polish | 758 | 7.49% | 3.23% | 1.98× |
| el_gr | Greek | 650 | 11.50% | 5.45% | 2.00× |
| ar_eg | Arabic (EG) | 428 | 18.46% | 6.71% | 2.06× |
| ja_jp | Japanese | 650 | 60.13%† | 6.25% | 2.23× |
| cmn_hans_cn | Mandarin (Simp) | 945 | 98.52%† | 12.01% | 1.85× |
| ko_kr | Korean | 382 | 16.39% | 6.67% | 1.84× |
| vi_vn | Vietnamese | 857 | 9.55% | 6.87% | 1.55× |

†ja/zh are written without word boundaries, so WER is a tokenization artifact; CER is the real accuracy metric for those languages.
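To see why, a minimal WER/CER sketch (plain Levenshtein over words vs. characters, not the benchmark's actual scorer): an unsegmented ja/zh sentence is one "word" to `str.split()`, so a single character error scores 100% WER while CER stays proportional.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over arbitrary sequences (words or chars)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over whitespace-split tokens."""
    ref_w = ref.split()
    return edit_distance(ref_w, hyp.split()) / max(len(ref_w), 1)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters, spaces dropped."""
    ref_c = ref.replace(" ", "")
    return edit_distance(ref_c, hyp.replace(" ", "")) / max(len(ref_c), 1)
```

With these definitions, a five-character Mandarin reference with one wrong character scores WER 1.0 but CER 0.2.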

Comparison vs. Cohere's technical-report Figure 4 (FLEURS+CV17+MLS+Wenet avg, FP16 PyTorch): ours lands ~1–3 pp higher on most languages, consistent with (a) FLEURS-only being harder than a 4-corpus average and (b) the INT8 encoder. Japanese CER is actually below Cohere's reported number.

Docs landed in this PR

  • coreml/README.md — end-to-end CoreML conversion walkthrough.
  • docs/HOST_SIDE_FIXES.md — the four host-side fixes and FLEURS A/B tables.
  • docs/Q8_EOS_BIAS.md — EOS-bias analysis and re-quantization experiments.
  • docs/CACHE_EXTERNAL_ANALYSIS.md, CACHE_EXTERNAL_DELIVERED.md, CACHE_INVESTIGATION_SUMMARY.md — cache-external decoder path rationale and delivery notes.
  • docs/COHERE_ARCHITECTURE_ANALYSIS.md — upstream model walkthrough and CoreML mapping.

Test plan

  • Encoder export with correct length masking
  • Cache-external decoder export (external KV cache, 108-token window)
  • INT8 W8A16 encoder quantization
  • cohere_features_v2.py parity vs HF AutoFeatureExtractor (tests/test-feature-parity.py)
  • FLEURS 4-language A/B (broken vs fixed host pipeline)
  • HuggingFace uploads (f16 and q8 hybrid)
  • Swift integration landed in FluidAudio#487
  • LibriSpeech test-clean full benchmark (2,620 samples, WER 1.77%)
  • Full 14-language FLEURS benchmark

Known limitations

  • Fully-q8 decoder is unusable without a runtime EOS bias or re-quantization with EOU calibration; ship INT8-encoder + FP16-decoder hybrid instead.
  • INT4 encoder rejected (293% avg WER).
  • Swift CoreML pipeline is single-chunk (≤35 s per call); >35 s audio requires the upstream 5-s-overlap sliding window, not yet ported.
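A hedged sketch of what that sliding window would look like (window/overlap values from the limitation above; `chunk_schedule` is a hypothetical helper, not ported code):

```python
def chunk_schedule(n_samples: int, sr: int = 16000,
                   window_s: float = 35.0, overlap_s: float = 5.0):
    """Return (start, end) sample ranges covering the audio with
    35 s windows overlapping by 5 s, so each chunk fits the
    single-call <=35 s CoreML pipeline."""
    win = int(window_s * sr)
    hop = int((window_s - overlap_s) * sr)  # 30 s stride between windows
    chunks, start = [], 0
    while True:
        end = min(start + win, n_samples)
        chunks.append((start, end))
        if end >= n_samples:
            break
        start += hop
    return chunks
```

Merging the overlapping transcripts (the upstream approach) is the part that still needs porting; the schedule itself is trivial.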

@Alex-Wengg Alex-Wengg changed the title from "fix(cohere): Implement stateless decoder to fix cache repetition bug" to "feat(cohere): Add Cohere Transcribe CoreML conversion pipeline with stateful decoder" on Apr 6, 2026
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 3 critical correctness issues identified in PR #41 reviews:

1. **Language Token IDs Completely Broken** (f16/example_inference.py, q8/example_inference.py):
   - Fix LANGUAGE_PROMPTS dictionary with correct language token IDs
   - Position 4-5: Use correct language tokens (e.g., 169 for Spanish, not hardcoded 62)
   - Position 9: Use 13 (<|nodiarize|>) for all languages, not 14-26
   - Language tokens from vocab.json: en=62, es=169, fr=69, de=76, it=97, pt=149, pl=148, nl=60, sv=173, tr=186, ru=155, zh=50, ja=98, ko=110
   - Impact: Non-English transcription was silently producing English output

2. **Encoder Parameter Name Typo** (exports/export-encoder.py, export-encoder-ios18.py):
   - Fix encoder call from `lengths=feature_length` to `length=feature_length`
   - Since encoder accepts **kwargs, the typo was silently ignored
   - Impact: Feature length masking was never applied, causing incorrect attention for shorter audio

3. **pyproject.toml Name Field** (pyproject.toml):
   - Fix copy-paste error: "parakeet-coreml" → "cohere-transcribe-coreml"
   - Update description to match project purpose
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 3 test-related issues identified in PR #41 reviews:

1. **Wrong EOS Token Fallback** (tests/benchmark-models.py:46):
   - Fix fallback EOS token: 2 (PAD) → 3 (actual EOS)
   - Impact: Decoder will stop at correct token when processor unavailable

2. **Mel Padding Frame Mismatch** (tests/*.py):
   - Fix padding: 3001 frames → 3500 frames (35-second window)
   - Files: benchmark-models.py, compare-models.py, measure-memory.py
   - Impact: Prevents dimension mismatches and crashes on longer audio

3. **Operator Precedence Bug** (tests/compare-models.py:164, 166):
   - Add parentheses to fix condition parsing
   - Before: `len(...) == 4 and 'cache_k' in key or key == 'new_cache_k'`
   - After: `len(...) == 4 and ('cache_k' in key or key == 'new_cache_k')`
   - Impact: Cache assignments now correctly check tensor dimensions
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes 2 decoder-related issues identified in PR #41 reviews:

1. **Stateful Decoder Missing log_softmax** (exports/export-decoder-stateful.py:148):
   - Add torch.log_softmax() after lm_head projection
   - Before: Returned raw logits from Linear layer
   - After: Returns log-probabilities
   - Impact: Beam search and probability-based decoding now work correctly
   - Greedy decoding unaffected (argmax works on both logits and log-probs)

2. **Multi-Step Validation Feeds Same Token** (exports/export-decoder-stateful.py:407-414):
   - Fix autoregressive validation loop to feed predicted tokens
   - Before: Fed start token (4) at every step
   - After: Feeds previous step's predicted token (current_token = next_token)
   - Impact: Validation can now detect autoregressive generation bugs
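The corrected validation loop from item 2 can be sketched as (illustrative; `step_fn` stands in for one decoder invocation returning the next token id):

```python
def greedy_validate(step_fn, start_token: int, eos_token: int,
                    max_steps: int = 32):
    """Autoregressive validation loop: each step feeds the PREVIOUS
    step's prediction forward, so cache/state bugs that only appear
    under real generation are actually exercised (feeding the start
    token at every step, the pre-fix behavior, hides them)."""
    current = start_token
    tokens = []
    for _ in range(max_steps):
        nxt = step_fn(current)
        if nxt == eos_token:
            break
        tokens.append(nxt)
        current = nxt  # feed the prediction, not the start token
    return tokens
```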
Alex-Wengg added a commit that referenced this pull request Apr 7, 2026
Fixes issue identified in PR #41 reviews:

- Remove uv.lock from .gitignore
- Commit uv.lock to ensure reproducible dependency versions
- Compliance with AGENTS.md requirement for self-contained directories

Impact: Contributors now get consistent dependency versions across environments
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere): Add Cohere Transcribe CoreML conversion pipeline with stateful decoder" to "feat(cohere): Add Cohere Transcribe CoreML conversion with critical fixes" on Apr 7, 2026
@Alex-Wengg Alex-Wengg marked this pull request as draft April 8, 2026 19:15
@BrandonWeng BrandonWeng requested review from BrandonWeng and removed request for BrandonWeng April 12, 2026 15:29
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere): Add Cohere Transcribe CoreML conversion with critical fixes" to "fix(cohere): correct host-side mel features + CJK detokenization (resolves 71% FLEURS failure)" on Apr 21, 2026
@Alex-Wengg Alex-Wengg changed the title from "fix(cohere): correct host-side mel features + CJK detokenization (resolves 71% FLEURS failure)" to "feat(cohere-transcribe): CoreML conversion + host-side pipeline fix + q8 findings" on Apr 22, 2026
Add the export + quantization pipeline for the Cohere Command-A Transcribe
03-2026 model, targeting on-device CoreML inference:

  - exports/: stateful / stateless / encoder export scripts (the
    stateful decoder is the one shipped to HF).
  - tools/: compile_encoder_to_mlmodelc.py, compile_q8_to_mlmodelc.py,
    quantize_to_int8.py for the INT8 weight-quantization pass.
  - quantize_encoder_to_int4.py for the INT4 encoder experiment.
  - Root export-*.py scripts covering earlier cache-external and
    per-language decoder iterations (kept for reference).
  - hf-upload/ bundles + f16/q8 package templates are prepared for upload
    to FluidInference/cohere-transcribe-03-2026-coreml on HuggingFace
    (README, example, requirements, tokenizer.model).
  - cohere-pytorch/ vendors the HF pytorch reference (modeling,
    configuration, processing, tokenizer) for reproducibility and as the
    export source of truth.
  - docs/ and top-level .md files document architecture analysis, cache
    strategies, research notes, and known implementation issues.
  - uv.lock / pyproject.toml pin the Python toolchain used for export.

This commit is pipeline + artifacts only; the host-side fix and q8
investigation land in the two follow-up commits.
…oss-attn masking

The shipped hf-upload example code had three host-side bugs that caused
drastically degraded transcription even when the CoreML weights were
correct. This commit adds the fixed reference implementation (f16/ and
q8/ packages), the parity tools that prove the diagnosis, and the
benchmarks that quantify the impact.

What was broken:

  1. Mel spectrogram preprocessing differed from HF CohereAsrFeatureExtractor
     (different window, fft size rounding, normalization). Every frame
     was slightly off, which compounded across the encoder.
  2. cross_attention_mask was set to all-ones instead of masking the
     padded encoder frames, letting the decoder attend to zeros.
  3. CJK languages were detokenized token-by-token, dropping the
     SentencePiece byte-fallback triples (<0xE4><0xB8><0xAD>) that
     encode Chinese/Japanese/Korean characters, producing gibberish.

Fixes, all host-side (no weight re-export required):

  - tools/cohere_features_v2.py: numpy port of CohereAsrFeatureExtractor
    with bit-close parity to the HF implementation.
  - f16/example_inference.py, f16/quickstart.py, q8/example_inference.py,
    q8/quickstart.py: use cohere_features_v2, set cross_attention_mask
    from encoder lengths, buffer byte-fallback tokens for CJK.
  - f16/ and q8/ packages (vocab, requirements, pyproject, README) are
    the canonical uploads for HF.
  - tests/test-feature-parity.py: parity test vs HF extractor.
  - tests/diagnose-feature-diff.py: per-frame diff tool.
  - tests/bench-fix-vs-broken.py: f16 FLEURS broken-vs-fixed comparison.
  - tests/benchmark-librispeech.py, tests/benchmark-cjk-cer.py: canonical
    FLEURS + LibriSpeech benchmarks (with --normalize + CER for CJK).

Impact (from docs/FP16_VS_INT8_FLEURS_COMPARISON.md): the FLEURS WER /
CER numbers move from near-unusable to close to PyTorch reference across
all 14 tested languages on f16. Full impact table in the PR body.
After the host-side fix (prior commit), the HF-shipped q8 stateful
decoder still over-generates on short utterances (EN/FR/ES), running
past the true sentence boundary. The root cause is INT8 weight
quantization noise on the EOS logit: at the true end-of-utterance the
EOS token sits at rank 1-2 with only a ~2 logit gap to the top token,
well within the quantization noise budget.

Diagnostic + fix in three scripts:

  - tests/probe-q8-eos.py: instrument the q8 stateful decoder and dump
    per-step top-5 tokens, EOS (id=3) logit, EOS rank, and the
    cumulative hypothesis. Confirms EOS is rank 1-2 with 2-3 unit gap
    at true boundary.
  - tests/bench-q8-fleurs.py: FLEURS benchmark of the HF-shipped q8
    stateful decoder on the 3-samples-per-language slice used by
    bench-fix-vs-broken. Establishes the q8 baseline.
  - tests/bench-q8-eosboost.py: run fixed q8 pipeline with EOS logit
    boosted by 0.0 / +2.0 / +4.0 on EN/FR/ES slices. +4.0 recovers most
    of the f16 quality without retraining or re-quantization.

No new export / re-quantization was needed; this is a pure inference-side
patch (logit_bias_eos=+4.0) that host code can apply when shipping the
q8 weights downloaded from
FluidInference/cohere-transcribe-03-2026-coreml.
@Alex-Wengg Alex-Wengg force-pushed the docs/cohere-transcribe-coreml-decoder-fix branch from b401305 to 8013c05 on April 22, 2026 22:12
…e-shot scripts

Promote the findings previously embedded in one-shot experimental
scripts to proper docs, and remove the scripts now that the findings are
captured:

  - docs/HOST_SIDE_FIXES.md documents the three host-side inference bugs
    (mel spectrogram drift vs HF extractor, all-ones cross_attention_mask,
    token-by-token CJK detokenization), with exact fix locations (python
    and Swift) and a reproduction recipe.
  - docs/Q8_EOS_BIAS.md documents the q8 over-generation diagnosis
    (EOS at rank 1-2, ~2-3 logit gap), the +4.0 EOS-bias fix, and the
    independence of this fix from the host-side fixes.
  - docs/FP16_VS_INT8_FLEURS_COMPARISON.md is flagged as historical —
    its 200-500% WER numbers predate the host-side fixes and reflect
    the bugs, not the weights.

Removed scripts:

  - tests/bench-fix-vs-broken.py (368 lines) — broken-vs-fixed demo on
    the cache-external decoder (not the shipped stateful architecture).
    Finding: the three host-side bugs are the dominant error source.
    Captured in HOST_SIDE_FIXES.md.
  - tests/bench-q8-eosboost.py (244 lines) — +0/+2/+4 bias sweep on
    EN/FR/ES/zh. Captured in Q8_EOS_BIAS.md.
  - tests/probe-q8-eos.py (172 lines) — per-step top-5 logit dump.
    Captured in Q8_EOS_BIAS.md.
  - tests/diagnose-feature-diff.py (63 lines) — per-frame diff tool;
    superseded by tests/test-feature-parity.py which covers this.

Remaining runnable tests (4):

  - tests/benchmark-librispeech.py — parameterized LibriSpeech / FLEURS
    benchmark for any precision.
  - tests/benchmark-cjk-cer.py — CJK CER benchmark.
  - tests/bench-q8-fleurs.py — FLEURS bench against the HF-shipped q8
    stateful decoder.
  - tests/test-feature-parity.py — regression test for
    tools/cohere_features_v2.py vs HF CohereAsrFeatureExtractor.
The f16/ and q8/ directories were staging copies of the HF upload
bundle that ships alongside the CoreML weights in
FluidInference/cohere-transcribe-03-2026-coreml. They duplicated
content now available canonically from HF:

- vocab.json (identical)
- cohere_mel_spectrogram.py (byte-identical to tools/cohere_features_v2.py)
- example_inference.py / quickstart.py / pyproject.toml / requirements.txt
  (identical between f16/ and q8/)

Net effect: ~35K lines removed, zero duplication, one canonical
feature-extractor source (tools/cohere_features_v2.py).

Updates in-repo tests to source the feature extractor from tools/ and
accept a --models-dir flag so local reproduction works by pointing at
a HF-downloaded snapshot:

    huggingface-cli download FluidInference/cohere-transcribe-03-2026-coreml \
        --local-dir ./cohere-models
    python tests/benchmark-librispeech.py --precision fp16 \
        --models-dir ./cohere-models/f16

tools/quantize_to_int8.py and tools/compile_*.py still reference f16/
and q8/ as local working dirs; those are populated on demand by the
quantization pipeline and are not tracked in git.

Also drops the OLD-vs-NEW comparison block in test-feature-parity.py
since the former "old broken" extractor (cohere_mel_spectrogram.py)
has been replaced with the fixed v2 port.
Strips superseded experiments, vendored HF source, transient progress
notes, and abandoned variant staging. PR diff drops from 78 -> 21 files
and from ~135K -> ~9K added lines.

Removed:
- cohere-pytorch/ (19 files, ~115K lines): vendored clone of
  CohereLabs/cohere-transcribe-03-2026. Reproducibility convention is to
  reference HF, not vendor. test-feature-parity.py now downloads the
  reference on demand and accepts --pytorch-dir for offline use.
- 14 root *_STATUS.md / *_COMPLETE.md / *_FAILURE.md (~3K lines):
  transient progress narratives whose durable findings are already in
  docs/HOST_SIDE_FIXES.md and docs/Q8_EOS_BIAS.md. Some pairs (e.g.
  MLMODELC_LIMITATION + MLMODELC_VERIFIED) directly contradicted each
  other.
- 8 root export-*.py / quantize_encoder_to_int4.py (~2K lines):
  superseded by exports/{export-encoder,export-decoder-stateful,
  export-decoder-stateless}.py + tools/quantize_to_int8.py.
- hf-upload/ (8 files): staging dir for the abandoned cache-external
  decoder variant.
- 3 root benchmark JSON caches (~850 lines): captured outputs from
  pre-fix runs.
- 7 docs/ pre-fix investigation notes (~2.4K lines):
  CACHE_INVESTIGATION_SUMMARY, DECODER_CACHE_FIX, OFFICIAL_USAGE_ANALYSIS,
  QWEN3_VS_COHERE_STATEFUL_CACHE, RESEARCH_INSIGHTS, REVERSE_ENGINEERING,
  FP16_VS_INT8_FLEURS_COMPARISON. The host-side fixes invalidated the
  WER/CER measurements these were built on; superseded by HOST_SIDE_FIXES.

Kept (canonical surface):
- README.md
- docs/{HOST_SIDE_FIXES,Q8_EOS_BIAS,STATELESS_VS_STATEFUL,
  COHERE_ARCHITECTURE_ANALYSIS}.md
- tools/{cohere_features_v2,quantize_to_int8,compile_*}.py
- exports/{export-encoder,export-decoder-stateful,export-decoder-stateless}.py
- tests/{benchmark-librispeech,benchmark-cjk-cer,bench-q8-fleurs,
  test-feature-parity}.py
- download-fleurs-for-swift.py, pyproject.toml, uv.lock, .gitignore

Also updates .gitignore to prevent regressions: ignores f16/, q8/,
benchmark_*.json, *_fleurs_*.json, *_cache_external*.json, and the
external cohere-pytorch/ snapshot path.
Earlier cleanup (f2cdf4e) deleted the cache-external decoder export and
its supporting docs / hf-upload bundle on the assumption it was an
abandoned variant. That was wrong: FluidAudio PR 487 ships
cache-external as the canonical decoder (HF repos
`cohere-transcribe-cache-external-coreml` and
`cohere-transcribe-q8-cache-external-coreml`). Stateful and stateless
are the real research variants.

Restored from f2cdf4e^ (+3,824 lines, 18 files):

Export scripts (only place these exist):
  - export-decoder-cache-external-v2.py  (canonical, language-conditioned)
  - export-decoder-cache-external.py     (earlier variant)

HF-upload bundle (mirror of the published model card):
  - hf-upload/cohere-transcribe-cache-external-coreml/{README,example,
    requirements,tokenizer.model,wer_results_cache_external.json,
    .gitattributes}
  - hf-upload/{README_UPLOAD,UPLOAD_INSTRUCTIONS}.md

Decision / status docs:
  - CACHE_EXTERNAL_ANALYSIS.md
  - CACHE_EXTERNAL_DELIVERED.md

Investigation docs in docs/:
  - CACHE_INVESTIGATION_SUMMARY.md   (why cache-external)
  - DECODER_CACHE_FIX.md             (concise rationale)
  - FP16_VS_INT8_FLEURS_COMPARISON.md (numbers behind PR 487's table)
  - RESEARCH_INSIGHTS.md

Result caches:
  - python_cache_external_full.json
  - python_cache_external_test.json

README.md, docs/STATELESS_VS_STATEFUL.md, tools/quantize_to_int8.py,
tools/compile_q8_to_mlmodelc.py, exports/export-decoder-{stateful,
stateless}.py, and tests/{benchmark-librispeech,benchmark-cjk-cer,
bench-q8-fleurs}.py still describe / operate on the stateful variant
and need a follow-up reconciliation pass to make cache-external the
documented and tooled canonical path.
…ches

Pure data dumps from one-off bench runs; regenerable from
tests/bench-q8-fleurs.py. Not referenced by any script.
The cache-external model card, example.py, requirements, and tokenizer
already live canonically at huggingface.co/FluidInference/cohere-
transcribe-cache-external-coreml. Maintaining a parallel copy here
invited drift; the conversion repo's job is the export pipeline, not
the deployment artifact.
README_UPLOAD.md and UPLOAD_INSTRUCTIONS.md referenced the
cache-external bundle that was just removed. With no bundle to
upload from this repo, the instructions are stale.
Consolidate the cache-external decision/status docs alongside the rest
of the investigation notes. README.md stays at root as the entry point.
Removes four docs that misrepresent the canonical pipeline now that
cache-external (with host-side fixes) is the shipped variant:

- DECODER_CACHE_FIX.md (113): claims stateless is the cache-bug fix.
  Cache-external is the actual fix; stateless never shipped.
- FP16_VS_INT8_FLEURS_COMPARISON.md (435): self-flagged as superseded
  in its header — all numbers predate HOST_SIDE_FIXES.md.
- STATELESS_VS_STATEFUL.md (358): frames decoder choice as a 2-way
  bake-off; both options are non-canonical research variants.
- RESEARCH_INSIGHTS.md (494): generic Cohere-architecture essay; no
  actionable export-pipeline content.
…ports/

Move both cache-external decoder export scripts into exports/ alongside
the encoder + stateful + stateless variants, so all decoder export
pipelines live in one directory.
It's a utility script (FLEURS dataset fetcher), not a top-level entry
point. Belongs alongside the other tools/ scripts.
Removes three decoder export variants that don't produce the artifact
FluidAudio actually loads:

- export-decoder-stateful.py (436): research variant using CoreML
  State API. Doesn't match CohereFixedPipeline's I/O. No consumer.
- export-decoder-stateless.py (296): unverified Parakeet-style
  variant. README itself flagged it as broken (icon-repetition,
  10x slower). No consumer.
- export-decoder-cache-external-v2.py (342): adds a language_id
  input that FluidAudio never passes; the artifact would fail to
  load in CohereFixedPipeline. Language-conditioning experiment
  that didn't ship.

exports/ now holds only the two scripts that produce the artifacts on
HuggingFace (cohere-transcribe-cache-external-coreml /
cohere-transcribe-q8-cache-external-coreml): export-encoder.py and
export-decoder-cache-external.py.

README.md and docs/CACHE_INVESTIGATION_SUMMARY.md still reference the
deleted stateless export by name; those will be reconciled in the
README rewrite.
All three Python bench scripts loaded cohere_decoder_stateful.mlpackage
and used the stateful CoreML State API (make_state, input_id, per-step
attention_mask). With export-decoder-stateful.py deleted and the
canonical HF artifacts (cohere-transcribe-cache-external-coreml /
cohere-transcribe-q8-cache-external-coreml) shipping cache-external
instead, these scripts cannot run.

Removed:
- bench-q8-fleurs.py (301)
- benchmark-librispeech.py (336)
- benchmark-cjk-cer.py (419)

The actual benchmark numbers in FluidAudio PR 487 come from the Swift
CLI (Scripts/run_cohere_per_lang.sh + CohereMixedBenchmark.swift), not
from these Python harnesses, so no benchmark capability is lost.

tests/ now keeps just test-feature-parity.py (mel-extractor parity
against the HF reference, decoder-independent).
- Delete compile_q8_to_mlmodelc.py (loaded deleted cohere_decoder_stateful.mlpackage)
- Strip decoder block from quantize_to_int8.py; encoder quantization remains valid

The cache-external Q8 decoder is already published on HF
(FluidInference/cohere-transcribe-q8-cache-external-coreml); regenerating
a stateful-decoder Q8 artifact is no longer in this pipeline's scope.
… with cache-external pipeline

- README: rewrite around the cache-external decoder (canonical), drop
  stateful/stateless framing, document the actual decoder I/O contract
  consumed by FluidAudio's CohereFixedPipeline, and point at the shipped
  HF artifacts.
- CACHE_INVESTIGATION_SUMMARY: replace dangling references to deleted
  export-decoder-stateless.py / test scripts with a pointer to the
  cache-external pipeline; rewrite the conclusion to explain why the
  stateless stepping stone was abandoned in favor of moving cache
  management into the host.
@Alex-Wengg Alex-Wengg marked this pull request as ready for review April 23, 2026 13:40
@Alex-Wengg Alex-Wengg changed the title from "feat(cohere-transcribe): CoreML conversion + host-side pipeline fix + q8 findings" to "feat(cohere-transcribe): CoreML export + host-side pipeline fix + q8 hybrid" on Apr 23, 2026
`compile_encoder_to_mlmodelc.py` ended with a printed warning that
"the decoder MUST remain .mlpackage (State API requirement)". That
claim is stale: this PR ships the cache-external decoder, which uses
no State API and compiles to .mlmodelc like the encoder. Update the
closing message so someone running the script end-to-end doesn't
come away thinking the decoder has a packaging restriction it
doesn't actually have.
Addresses Devin review #4163032119 on PR #41:

- export-encoder.py: add convert_to="mlprogram" and
  compute_units=ct.ComputeUnit.CPU_ONLY (was relying on ALL default),
  matching parakeet-tdt-v3-0.6b/coreml/individual_components.py and
  the CLAUDE.md "Trace with .CpuOnly" constraint.
- export-encoder.py: fix stale hardcoded output-shape print
  (1, 376, 1024) -> (1, 438, 1024).
- export-decoder-cache-external.py: change ct.ComputeUnit.ALL ->
  ct.ComputeUnit.CPU_ONLY to match the same convention.
Addresses Devin review finding on PR #41: the decoder export script's
argparse only accepts --model-id and --output-dir, so the README's
`--precision float16` caused argparse to exit with
"unrecognized arguments". The decoder intentionally uses the
coremltools default precision — drop the flag from the documented
command instead of adding an unused argument.