diff --git a/models/tts/magpie/SWIFT_PORT_FINDINGS.md b/models/tts/magpie/SWIFT_PORT_FINDINGS.md new file mode 100644 index 0000000..61494ba --- /dev/null +++ b/models/tts/magpie/SWIFT_PORT_FINDINGS.md @@ -0,0 +1,195 @@ +# Magpie Swift Port — Findings & Platform Quirks + +Notes from porting Magpie TTS Multilingual 357M from `generate_coreml.py` (Python +reference) to the FluidAudio Swift package. Documents what doesn't transfer +cleanly, ANE compile behavior, and the perf optimizations that materially +moved the needle. Future agents (Swift or Python side) — read before changing +the conversion scripts. + +## TL;DR + +1. **`decoder_step.mlmodelc` cannot use ANE in production.** ANE compile + reliably fails at runtime (`MILCompilerForANE: ANECCompile() FAILED`) + under real synth, even though it succeeds with zeroed dummy inputs. + Pin Swift consumers to `.cpuAndGPU` for that model. +2. **`decoder_prefill.mlmodelc` gives a 3.8× end-to-end speedup over the + 110-step prefill fallback.** Worth keeping in the canonical HF artifact set. +3. **fp16 host non-determinism is real and compounds with depth.** Stage-by- + stage Swift ↔ Python parity: text_encoder ~50 dB SNR, prefill L0=64 dB → + L11=44 dB, AR replay ~40 dB. Below ~50 dB you get audible single-word drift. +4. **`MLComputePlan.load(...)` (macOS 14.4+) SIGBUSes on every Magpie + `.mlmodelc`** — can't introspect device assignment via the public API. + +## Per-model compute placement (verified end-to-end) + +Measured on M-series, real synth (8-word EN sentence, warm), Swift consumer: + +| Model | `.cpuOnly` | `.cpuAndGPU` | `.cpuAndNeuralEngine` | Recommendation | +| ------------------ | ---------- | ------------ | ---------------------- | -------------- | +| `text_encoder` | 42 ms | 43 ms | **12 ms** | ANE | +| `decoder_prefill` | 56 ms | 23 ms | **18 ms** | ANE | +| `decoder_step` | 31 ms\* | **22 ms** | 10 ms\*\* | **GPU** (see below) | +| `nanocodec_decoder`| — | — | runs on ANE | ANE | + +\* dummy-input single-call benchmark; real synth is 96 s warm. +\*\* dummy-input speedup is misleading — see decoder_step section. + +### decoder_step ANE failure mode (the trap) + +`coreml-cli` and the dummy-input benchmark both report ANE works on +`decoder_step`. **In real synth it does not.** Stack: + +- Single-call benchmark with `position = 0`: ANE compile succeeds, runs in + 10 ms → looks 3× faster than CPU. +- Real synth (incrementing `position` 0…N, real KV cache state): ANE + recompile is triggered per call and **fails** with `MILCompilerForANE: + ANECCompile() FAILED` (visible in stderr), then falls back to CPU at + hundreds-of-ms per call. End-to-end this is 7% slower than `.cpuAndGPU` + (103 s vs 96 s warm) and 34% slower cold. + +The likely cause is the rank-4 split-K/V scatter pattern (`cache_k_i` / +`cache_v_i` are `[1, 512, H, D]` fp16 with `position` advancing per step). +ANEF can compile the topology against a static input but bails when actual +gather indices vary at runtime. + +**Action items if you revisit `convert_decoder_step.py`:** + +- Try a single rank-3 K/V layout (`[512, H*D]` fp16) instead of split rank-4. +- Try `position` as `int32` instead of `float16`. +- Try eliminating the scatter by writing the new K/V row as a separate output + and letting the host concatenate (already what Swift's `MagpieKvCache` does + conceptually). +- Verify with **real incrementing positions**, not zeros. + +If ANE remains broken, the current `.cpuAndGPU` pin is correct. 
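+
+A minimal Python harness for the "real incrementing positions" check could look
+like this (sketch — it assumes the rank-4 converter's input/output names from
+`convert_decoder_step.py` / `generate_coreml.py` and a local
+`build/decoder_step.mlpackage`; watch stderr for `ANECCompile() FAILED` as the
+positions advance):
+
+```python
+import numpy as np
+import coremltools as ct
+
+from generate_coreml import DECODER_CACHE_K_OUT_KEYS, DECODER_CACHE_V_OUT_KEYS
+
+N_LAYERS, MAX_SEQ, H, D, D_MODEL, T_ENC = 12, 512, 12, 64, 768, 256
+model = ct.models.MLModel(
+    "build/decoder_step.mlpackage",
+    compute_units=ct.ComputeUnit.CPU_AND_NE,  # the unit under test
+)
+
+inputs = {
+    "audio_embed": np.random.randn(1, 1, D_MODEL).astype(np.float32),
+    "encoder_output": np.random.randn(1, T_ENC, D_MODEL).astype(np.float32),
+    "encoder_mask": np.ones((1, T_ENC), dtype=np.float32),
+}
+for i in range(N_LAYERS):
+    inputs[f"cache_k{i}"] = np.zeros((1, MAX_SEQ, H, D), dtype=np.float32)
+    inputs[f"cache_v{i}"] = np.zeros((1, MAX_SEQ, H, D), dtype=np.float32)
+
+for step in range(32):  # real incrementing positions, carried cache state
+    for i in range(N_LAYERS):
+        inputs[f"position{i}"] = np.array([float(step)], dtype=np.float32)
+    out = model.predict(inputs)
+    for i in range(N_LAYERS):
+        inputs[f"cache_k{i}"] = out[DECODER_CACHE_K_OUT_KEYS[i]]
+        inputs[f"cache_v{i}"] = out[DECODER_CACHE_V_OUT_KEYS[i]]
+```
+
+If the loop finishes without the compile error and per-call latency stays flat,
+the layout change is worth re-benchmarking end-to-end.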
Document +the failure in `coreml/convert_decoder_step.py` so the next person doesn't +chase the dummy-input ghost. + +## decoder_prefill is essential, not optional + +The repo README marks `decoder_prefill.mlmodelc` as "optional" with the +fallback being 110 sequential `decoder_step` calls. Reality: + +| Path | Wall (warm) | +| -------------------------------- | ----------- | +| 110-step prefill fallback | ~420 s | +| `decoder_prefill.mlmodelc` (1×) | ~110 s | + +That's a **3.8× speedup** from a single batched call. Without it the Swift +port is unshippable. Ensure `convert_decoder_prefill.py` runs in CI / the +canonical HF asset upload. + +### Prefill output naming + +`decoder_prefill.mlmodelc` emits 12 outputs as anonymous CoreML var IDs: + +``` +var_208, var_374, var_540, var_706, var_872, var_1038, +var_1204, var_1370, var_1536, var_1702, var_1868, var_1958 +``` + +Each is `[2, 1, 512, H, D]` fp16 with axis 0 = `[K_stacked, V_stacked]`. +Swift slices them with two `memcpy`s into the per-layer K/V cache. **Don't +rename these without bumping the Swift port.** Or — better — explicitly name +them `prefill_kv_layer_{0..11}` in `convert_decoder_prefill.py` so the Swift +binding is robust to recompiles. + +## fp16 host non-determinism: what we measured + +Swift CoreML and Python+coremltools CoreML produce **bit-different** fp16 +outputs on the same inputs, same `.mlmodelc`, same M-series host. This is +documented Apple behavior (CoreML doesn't guarantee bit-exact reproducibility +across processes / load configurations). Magnitude: + +- **text_encoder**: SNR(Swift, Python) ≈ 50.6 dB +- **decoder_prefill** per layer: L0 SNR 64 dB → L11 SNR 44 dB + (compounds geometrically through the 12-layer cache) +- **decoder_step AR loop**: post-12-layer cache → ~40 dB after 100 steps + +40 dB SNR is below the threshold where the top-k=80 sampler trajectory +diverges, which manifests as a single trailing word in the audio (e.g. +"…seashore, **and**" instead of "…seashore."). 4/5 speakers are unaffected +because their sampler trajectories are more stable; speaker 0 in our test +set consistently hits the drift. + +**This is not a Swift bug.** Verified by: + +1. Python+CoreML matches NeMo PyTorch. +2. Swift+CoreML's text_encoder output already differs from Python+CoreML at + 50 dB (no Swift-side math involved — it's just `MLModel.prediction`). +3. Swift's LocalTransformer (Accelerate+BNNS) matches a fp64 NumPy reference + to >120 dB in isolation — so the post-CoreML Swift path is clean. + +**If you want bit-identical Swift↔Python parity**, the only paths are: +- Force fp32 weights in the `.mlpackage` (size + perf cost — probably not + worth it for ~1 word at the end of an utterance). +- Accept perceptual parity (current state). + +## MLComputePlan crashes on Magpie `.mlmodelc` + +Tried `MLComputePlan.load(contentsOf: url, configuration: cfg)` on macOS 14.5 +across all four Magpie models, all three compute units. Every call SIGBUSes +(exit 138). Cannot be used for device-assignment introspection. The Swift CLI +falls back to a timing-based probe: + +``` +swift run fluidaudiocli magpie compute-plan +``` + +— loads each model under `.cpuOnly` / `.cpuAndGPU` / `.cpuAndNeuralEngine`, +runs 1 warmup + 3 timed iters, infers ANE usage from the speedup ratio +(>1.3× cpuOnly → ANE active). Hacky but works. + +**`coreml-cli` from `tools/coreml-cli/` may have the same issue.** Test +before relying on its `--fallback` analysis for Magpie models. 
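+
+As a stopgap, the same timing heuristic is easy to reproduce directly with
+`coremltools` (sketch — the model path, input shapes, and iteration counts are
+illustrative; remember the decoder_step caveat above, dummy-input timings can
+lie for that model):
+
+```python
+import time
+
+import coremltools as ct
+import numpy as np
+
+
+def probe(path: str, inputs: dict) -> None:
+    """Time one model under each compute unit; a large speedup vs cpuOnly implies ANE."""
+    timings = {}
+    for name, unit in [
+        ("cpuOnly", ct.ComputeUnit.CPU_ONLY),
+        ("cpuAndGPU", ct.ComputeUnit.CPU_AND_GPU),
+        ("cpuAndNeuralEngine", ct.ComputeUnit.CPU_AND_NE),
+    ]:
+        model = ct.models.MLModel(path, compute_units=unit)
+        model.predict(inputs)  # 1 warmup
+        t0 = time.perf_counter()
+        for _ in range(3):  # 3 timed iterations
+            model.predict(inputs)
+        timings[name] = (time.perf_counter() - t0) / 3
+    ratio = timings["cpuOnly"] / timings["cpuAndNeuralEngine"]
+    verdict = "ANE likely active" if ratio > 1.3 else "ANE likely inactive"
+    print(timings, verdict)
+
+
+# probe("build/text_encoder.mlpackage",
+#       {"text_tokens": np.zeros((1, 256), dtype=np.int32),
+#        "text_mask": np.ones((1, 256), dtype=np.float32)})
+```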
If it +crashes, file a follow-up to check whether it's an Apple issue or a +property of the conversion (e.g. `coremltools` version, MIL ops used). + +## Suggested mobius-side follow-ups + +1. **`convert_decoder_step.py`**: experiment with rank-3 K/V layout + + int32 position, validate ANE compile under real position values. +2. **`convert_decoder_prefill.py`**: name the 12 outputs explicitly. +3. **`prepare_hf_upload.py`**: ensure `decoder_prefill.mlmodelc` is always + included (not gated on availability — Swift port treats it as required). +4. **`generate_coreml.py`**: add a `--dump-intermediates` flag that writes + the per-stage tensors (`encoder_output`, the 12 prefill K/V outputs, + per-step decoder hidden, sampled codes, audio) to an `.npz`. Used by the + Swift `MagpieParityCommand` and `MagpieProbeCommand` for stage-by-stage + parity. Currently a manual modification each time. +5. **Documentation**: add a "Known Limitations" section to the main README + noting decoder_step ANE failure and the fp16 host drift. + +## Verified Swift performance budget (post-optimization) + +Reference: 8-word English sentence, M-series, warm process. + +``` +text_encoder 12 ms (ANE) — 1× +decoder_prefill 18 ms (ANE) — 1× +decoder_step ~22 ms (GPU/MPS) — ~80–200× per utterance (this is the AR loop) +nanocodec_decoder ~50 ms (ANE) — 1× +LocalTransformer ~3-5 ms/step (CPU/Accelerate+BNNS) +───────────────────────────────────── +Wall (warm) ~96 s for ~3 s of audio at 22 kHz +RTFx ~0.04× (sub-realtime) +``` + +The bottleneck is the AR loop (`decoder_step` × ~120 + LocalTransformer +sample × 8 codebooks per step). To beat realtime we need either: + +- ANE on `decoder_step` (blocked by the compile failure documented above). +- A drastically faster LocalTransformer (MLX backend candidate). +- Speculative decoding / parallel sampling (architectural change). + +## File-level cross-reference (Swift side) + +| Concern | Swift file | +| -------------------------------- | --------------------------------------------------------------------------------------- | +| Per-model compute units | `Sources/FluidAudio/TTS/Magpie/Assets/MagpieModelStore.swift` | +| Prefill batched call + KV seed | `Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpiePrefill.swift`, `MagpieKvCache.swift` | +| Stage-by-stage parity probe | `Sources/FluidAudioCLI/Commands/MagpieProbeCommand.swift` | +| Compute-device probe | `Sources/FluidAudioCLI/Commands/MagpieComputePlanCommand.swift` | +| LocalTransformer (Accelerate) | `Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformer.swift` | +| fp64 LocalTransformer reference | `Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformerDouble.swift` | +| Documentation | `Documentation/TTS/Magpie.md` | diff --git a/models/tts/magpie/coreml/build_manifest.py b/models/tts/magpie/coreml/build_manifest.py new file mode 100644 index 0000000..f2b80ca --- /dev/null +++ b/models/tts/magpie/coreml/build_manifest.py @@ -0,0 +1,380 @@ +"""Build manifest.json for the Magpie TTS hf-upload directory. + +The manifest is a machine-readable index of every artifact in the upload +(models in both .mlmodelc + .mlpackage form, constants, per-language +tokenizer files), along with shapes, sizes, and SHA-256 digests. The Swift +port's MagpieResourceDownloader consumes it to know what to fetch and how +to verify integrity. 
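+
+Shape of the emitted file, trimmed to one entry per section (byte counts,
+digests, dtypes, and shapes below are illustrative placeholders, not real
+values):
+
+    {
+      "schema_version": "1.0",
+      "repo_id": "FluidInference/magpie-tts-multilingual-357m-coreml",
+      "models": {"text_encoder": {"compiled": {"path": "text_encoder.mlmodelc", "bytes": 0}, ...}},
+      "constants": {"npy": [{"path": "constants/speaker_0.npy", "bytes": 0, "sha256": "...", "dtype": "...", "shape": [...]}]},
+      "languages": {"english": {"tokenizer_kind": "phoneme", "files": [...], "bytes": 0}},
+      "totals": {"bytes": 0, "human": "0.00 GB"}
+    }
+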
+""" + +from __future__ import annotations + +import hashlib +import json +import struct +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +ROOT = Path(__file__).resolve().parent / "hf-upload" +SCHEMA_VERSION = "1.0" +REPO_ID = "FluidInference/magpie-tts-multilingual-357m-coreml" + + +def sha256_file(path: Path) -> str: + h = hashlib.sha256() + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + return h.hexdigest() + + +def dir_size(path: Path) -> int: + return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) + + +def file_count(path: Path) -> int: + return sum(1 for p in path.rglob("*") if p.is_file()) + + +def parse_npy_header(path: Path) -> dict[str, Any]: + """Read the v1/v2 .npy header and return shape + dtype.""" + with path.open("rb") as f: + magic = f.read(6) + if magic != b"\x93NUMPY": + raise ValueError(f"not an npy file: {path}") + major = f.read(1)[0] + f.read(1) # minor + if major == 1: + (header_len,) = struct.unpack(" dict[str, Any]: + p = ROOT / rel + info = parse_npy_header(p) + return { + "path": rel, + "bytes": p.stat().st_size, + "sha256": sha256_file(p), + "dtype": info["dtype"], + "shape": info["shape"], + } + + +def json_entry(rel: str) -> dict[str, Any]: + p = ROOT / rel + return { + "path": rel, + "bytes": p.stat().st_size, + "sha256": sha256_file(p), + } + + +def model_pair_entry(name: str, io: dict[str, Any]) -> dict[str, Any]: + mlmodelc = ROOT / f"{name}.mlmodelc" + mlpackage = ROOT / f"{name}.mlpackage" + return { + "name": name, + "compiled": { + "path": f"{name}.mlmodelc", + "bytes": dir_size(mlmodelc), + "files": file_count(mlmodelc), + }, + "package": { + "path": f"{name}.mlpackage", + "bytes": dir_size(mlpackage), + "files": file_count(mlpackage), + }, + "io": io, + } + + +# ---------- model io specs ---------------------------------------------------- + +# These specs were captured by inspecting the converted .mlpackage descriptions +# during convert_*.py runs (see generate_coreml.py for runtime keys). 
+ +MODEL_IO: dict[str, dict[str, Any]] = { + "text_encoder": { + "inputs": [ + {"name": "text_tokens", "dtype": "int32", "shape": [1, 256]}, + {"name": "text_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + "outputs": [ + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + }, + "decoder_prefill": { + "inputs": [ + {"name": "input", "dtype": "fp16", "shape": [1, 110, 768]}, + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + "outputs": [ + {"name": "hidden_states", "dtype": "fp16", "shape": [1, 110, 768]}, + { + "name": "cache_*", + "dtype": "fp16", + "shape": [2, 1, 512, 12, 64], + "count": 12, + "note": "12 KV-cache outputs for the 12 decoder layers", + }, + { + "name": "position_*", + "dtype": "int32", + "shape": [], + "count": 12, + "note": "scalar position counter per layer", + }, + ], + }, + "decoder_step": { + "inputs": [ + {"name": "input", "dtype": "fp16", "shape": [1, 1, 768]}, + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + { + "name": "cache_*", + "dtype": "fp16", + "shape": [2, 1, 512, 12, 64], + "count": 12, + }, + {"name": "position_*", "dtype": "int32", "shape": [], "count": 12}, + ], + "outputs": [ + { + "name": "var_2201", + "dtype": "fp16", + "shape": [1, 1, 16192], + "note": "logits, reshape to (1, 1, 8, 2024) for 8 codebooks", + }, + {"name": "new_cache_*", "dtype": "fp16", "shape": [2, 1, 512, 12, 64], "count": 12}, + {"name": "var_*", "dtype": "int32", "shape": [], "count": 12, "note": "advanced positions"}, + ], + }, + "nanocodec_decoder": { + "inputs": [ + {"name": "tokens", "dtype": "int32", "shape": [1, 8, 256]}, + ], + "outputs": [ + {"name": "audio", "dtype": "fp32", "shape": [1, 262144], "note": "256 frames * 1024 samples = 11.89s @ 22050 Hz"}, + ], + "limits": {"max_frames": 256, "max_audio_seconds": 11.89}, + }, +} + + +# ---------- constants files --------------------------------------------------- + +CONSTANTS_NPY = [ + "constants/audio_embedding_0.npy", + "constants/audio_embedding_1.npy", + "constants/audio_embedding_2.npy", + "constants/audio_embedding_3.npy", + "constants/audio_embedding_4.npy", + "constants/audio_embedding_5.npy", + "constants/audio_embedding_6.npy", + "constants/audio_embedding_7.npy", + "constants/speaker_0.npy", + "constants/speaker_1.npy", + "constants/speaker_2.npy", + "constants/speaker_3.npy", + "constants/speaker_4.npy", + "constants/speaker_embeddings_raw.npy", + "constants/text_embedding.npy", +] + +CONSTANTS_JSON = [ + "constants/constants.json", + "constants/speaker_info.json", + "constants/tokenizer_info.json", + "constants/tokenizer_metadata.json", + "constants/tokenizer_references.json", +] + +LOCAL_TRANSFORMER_NPY = [ + "constants/local_transformer/in_proj_weight.npy", + "constants/local_transformer/in_proj_bias.npy", + "constants/local_transformer/pos_emb.npy", + "constants/local_transformer/norm1_weight.npy", + "constants/local_transformer/norm2_weight.npy", + "constants/local_transformer/sa_qkv_weight.npy", + "constants/local_transformer/sa_o_weight.npy", + "constants/local_transformer/ffn_conv1_weight.npy", + "constants/local_transformer/ffn_conv2_weight.npy", +] + [ + f"constants/local_transformer/out_proj_{i}_{kind}.npy" + for i in range(8) + for kind in ("weight", "bias") +] + + +# ---------- per-language tokenizer files 
-------------------------------------- + +# Mirrors MagpieLanguage in the Swift port. Languages with no entries use +# ByT5 byte-level tokenization (algorithmic, no lookup files). + +LANGUAGE_FILES: dict[str, dict[str, Any]] = { + "english": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/english_phoneme_token2id.json", + "tokenizer/english_phoneme_phoneme_dict.json", + "tokenizer/english_phoneme_heteronyms.json", + ], + }, + "spanish": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/spanish_phoneme_token2id.json", + "tokenizer/spanish_phoneme_phoneme_dict.json", + ], + }, + "german": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/german_phoneme_token2id.json", + "tokenizer/german_phoneme_phoneme_dict.json", + "tokenizer/german_phoneme_heteronyms.json", + ], + }, + "hindi": { + "tokenizer_kind": "char", + "files": [ + "tokenizer/hindi_chartokenizer_token2id.json", + ], + }, + "mandarin": { + "tokenizer_kind": "phoneme+jieba+pypinyin", + "files": [ + "tokenizer/mandarin_phoneme_token2id.json", + "tokenizer/mandarin_phoneme_phoneme_dict.json", + "tokenizer/mandarin_phoneme_pinyin_dict.json", + "tokenizer/mandarin_phoneme_tone_dict.json", + "tokenizer/mandarin_phoneme_ascii_letter_dict.json", + "tokenizer/mandarin_pypinyin_char_dict.json", + "tokenizer/mandarin_pypinyin_phrase_dict.json", + "tokenizer/mandarin_jieba_dict.json", + ], + }, + "french": {"tokenizer_kind": "byt5", "files": []}, + "italian": {"tokenizer_kind": "byt5", "files": []}, + "vietnamese": {"tokenizer_kind": "byt5", "files": []}, +} + + +def build_manifest() -> dict[str, Any]: + models = { + "text_encoder": model_pair_entry("text_encoder", MODEL_IO["text_encoder"]), + "decoder_prefill": model_pair_entry("decoder_prefill", MODEL_IO["decoder_prefill"]), + "decoder_step": model_pair_entry("decoder_step", MODEL_IO["decoder_step"]), + "nanocodec_decoder": model_pair_entry("nanocodec_decoder", MODEL_IO["nanocodec_decoder"]), + } + + constants = { + "json": [json_entry(p) for p in CONSTANTS_JSON], + "npy": [npy_entry(p) for p in CONSTANTS_NPY], + "local_transformer": [npy_entry(p) for p in LOCAL_TRANSFORMER_NPY], + } + + languages = {} + for lang, spec in LANGUAGE_FILES.items(): + entries = [json_entry(p) for p in spec["files"]] + languages[lang] = { + "tokenizer_kind": spec["tokenizer_kind"], + "files": entries, + "bytes": sum(e["bytes"] for e in entries), + } + + # Top-level summary + total_bytes = ( + sum(m["compiled"]["bytes"] + m["package"]["bytes"] for m in models.values()) + + sum(e["bytes"] for e in constants["json"]) + + sum(e["bytes"] for e in constants["npy"]) + + sum(e["bytes"] for e in constants["local_transformer"]) + + sum(lang["bytes"] for lang in languages.values()) + ) + + manifest = { + "schema_version": SCHEMA_VERSION, + "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), + "repo_id": REPO_ID, + "model": { + "name": "Magpie TTS Multilingual", + "params_million": 357, + "sample_rate": 22050, + "codec_samples_per_frame": 1024, + "frames_per_second": 22050.0 / 1024.0, + "max_decoder_steps": 500, + "max_decoder_seconds": 500 * 1024 / 22050.0, + "max_nanocodec_frames": 256, + "max_nanocodec_seconds": 256 * 1024 / 22050.0, + "embedding_dim": 768, + "num_audio_codebooks": 8, + "codebook_size": 2024, + "audio_bos_id": 2016, + "audio_eos_id": 2017, + "forbidden_token_ids": [2016, 2018, 2019, 2020, 2021, 2022, 2023], + "num_speakers": 5, + "speaker_names": ["John", "Sofia", "Aria", "Jason", "Leo"], + "speaker_context_length": 110, + "max_text_tokens": 256, + 
"supported_languages": list(LANGUAGE_FILES.keys()), + "supported_features": [ + "ipa_override", + "deterministic_g2p", + "classifier_free_guidance", + ], + "japanese": { + "supported": False, + "note": "Japanese deferred — needs OpenJTalk + MeCab dict (separate follow-up).", + }, + "streaming_nanocodec": { + "supported": False, + "note": ( + "NanoCodec is exported as a fixed-window batch decoder (max_frames=256). " + "True streaming requires MLState conv-cache integration; tested overlap " + "warmup yields <15 dB SNR and is unviable as a fallback." + ), + }, + }, + "models": models, + "constants": constants, + "languages": languages, + "totals": { + "bytes": total_bytes, + "human": f"{total_bytes / 1_000_000_000:.2f} GB", + }, + "notes": [ + "Both .mlmodelc (compiled, ready-to-run) and .mlpackage (portable source) are shipped.", + "Swift consumers should prefer .mlmodelc; .mlpackage is provided for inspection / re-targeting.", + "Per-language tokenizer files under tokenizer/ are lazy: download only the languages you need.", + "constants/local_transformer/*.npy are loaded once into a Swift fp32 cache — see MagpieLocalTransformerWeights.swift.", + ], + } + return manifest + + +def main() -> None: + manifest = build_manifest() + out = ROOT / "manifest.json" + out.write_text(json.dumps(manifest, indent=2) + "\n") + print(f"wrote {out} ({out.stat().st_size:,} bytes)") + print(f"total assets: {manifest['totals']['human']}") + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/convert_decoder_step.py b/models/tts/magpie/coreml/convert_decoder_step.py index 5b136fe..e32f2f8 100644 --- a/models/tts/magpie/coreml/convert_decoder_step.py +++ b/models/tts/magpie/coreml/convert_decoder_step.py @@ -48,20 +48,25 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, encoder_output = torch.randn(B, T_enc, d_model) encoder_mask = torch.ones(B, T_enc, dtype=torch.bool) - # Flat cache and position args - caches = [] + # Flat split-K/V cache + position args (rank-4 — ANE-friendly). + cache_ks = [] + cache_vs = [] positions = [] for i in range(n_layers): - cache = torch.zeros(2, B, max_seq_len, H, D) + ck = torch.zeros(B, max_seq_len, H, D) + cv = torch.zeros(B, max_seq_len, H, D) # Simulate some prefilled context - cache[:, :, :10, :, :] = torch.randn(2, B, 10, H, D) * 0.1 - caches.append(cache) + ck[:, :10, :, :] = torch.randn(B, 10, H, D) * 0.1 + cv[:, :10, :, :] = torch.randn(B, 10, H, D) * 0.1 + cache_ks.append(ck) + cache_vs.append(cv) positions.append(torch.tensor([10.0])) - # Build flat argument tuple + # Build flat argument tuple: (audio_embed, encoder_output, encoder_mask, + # ck0, cv0, p0, ck1, cv1, p1, ...). 
example_inputs = (audio_embed, encoder_output, encoder_mask) for i in range(n_layers): - example_inputs = example_inputs + (caches[i], positions[i]) + example_inputs = example_inputs + (cache_ks[i], cache_vs[i], positions[i]) # Trace print("Tracing model...") @@ -76,7 +81,8 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, ct.TensorType(name="encoder_mask", shape=(1, T_enc), dtype=np.bool_), ] for i in range(n_layers): - inputs.append(ct.TensorType(name=f"cache{i}", shape=(2, 1, max_seq_len, H, D))) + inputs.append(ct.TensorType(name=f"cache_k{i}", shape=(1, max_seq_len, H, D))) + inputs.append(ct.TensorType(name=f"cache_v{i}", shape=(1, max_seq_len, H, D))) inputs.append(ct.TensorType(name=f"position{i}", shape=(1,))) mlmodel = ct.convert( @@ -109,7 +115,8 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, "encoder_mask": np.ones((1, T_enc), dtype=np.float32), } for i in range(n_layers): - test_inputs[f"cache{i}"] = np.zeros((2, 1, max_seq_len, H, D), dtype=np.float32) + test_inputs[f"cache_k{i}"] = np.zeros((1, max_seq_len, H, D), dtype=np.float32) + test_inputs[f"cache_v{i}"] = np.zeros((1, max_seq_len, H, D), dtype=np.float32) test_inputs[f"position{i}"] = np.array([0.0], dtype=np.float32) out = coreml_model.predict(test_inputs) diff --git a/models/tts/magpie/coreml/convert_decoder_step_stateful.py b/models/tts/magpie/coreml/convert_decoder_step_stateful.py new file mode 100644 index 0000000..6f5aad9 --- /dev/null +++ b/models/tts/magpie/coreml/convert_decoder_step_stateful.py @@ -0,0 +1,158 @@ +"""EXPERIMENTAL — DO NOT USE IN PRODUCTION. + +Convert decoder step model to CoreML — STATEFUL variant (MLState). + +Kept as a documented dead-end. Benchmark on Apple M2 / macOS 26.5 / 146-step +real loop showed this variant runs at ~212 ms/step vs ~96 ms/step for the +production rank-4 split-K/V graph (2.2× regression). See sibling file +``traceable/traceable_decoder_step_stateful.py`` for full rationale. + +KV caches are managed as on-device state buffers via ``ct.StateType`` instead of +being passed in/out of the graph as 36 input/output tensors per step. The model +exposes a tiny IO surface (4 inputs, 2 outputs). + +Caveat: stateful CoreML graphs do not target ANE. We force CPU+GPU at runtime, +which is exactly why this variant loses for Magpie (rank-4 production already +gets 97.3% on ANE). + +Usage: + python convert_decoder_step_stateful.py [--nemo-path /path/to/model.nemo] +""" +import argparse +import os +import sys + +import coremltools as ct +import numpy as np +import torch + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) +from traceable.traceable_decoder_step_stateful import StatefulDecoderStep + + +def convert_decoder_step_stateful(nemo_path=None, max_seq_len=512, max_text_len=256, + output_path="build/decoder_step_stateful.mlpackage"): + print("Loading MagpieTTS model...") + from nemo.collections.tts.models import MagpieTTSModel + if nemo_path: + model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu") + else: + model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m") + model.eval() + + cfg = model.cfg + dec_cfg = dict(cfg.decoder) + d_model = dec_cfg["d_model"] + n_layers = dec_cfg["n_layers"] + sa_n_heads = dec_cfg["sa_n_heads"] + d_head = d_model // sa_n_heads + + print("Creating stateful traceable decoder step...") + decoder = StatefulDecoderStep.from_magpie(model) + decoder.eval() + decoder.reset_state() + + # Example inputs. Position is a 1-elem int32 scalar. 
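+    # (The production rank-4 graph feeds one float position per layer,
+    # position0..position11; the stateful variant keeps the caches in MLState
+    # and only needs this single shared int32 counter.)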
+ B = 1 + T_enc = max_text_len + + audio_embed = torch.randn(B, 1, d_model) + encoder_output = torch.randn(B, T_enc, d_model) + encoder_mask = torch.ones(B, T_enc, dtype=torch.bool) + position = torch.tensor([0], dtype=torch.int32) + + example_inputs = (audio_embed, encoder_output, encoder_mask, position) + + print("Tracing model...") + with torch.no_grad(): + traced = torch.jit.trace(decoder, example_inputs, strict=False) + + print("Converting to CoreML (stateful)...") + inputs = [ + ct.TensorType(name="audio_embed", shape=(1, 1, d_model)), + ct.TensorType(name="encoder_output", shape=(1, T_enc, d_model)), + ct.TensorType(name="encoder_mask", shape=(1, T_enc), dtype=np.bool_), + ct.TensorType(name="position", shape=(1,), dtype=np.int32), + ] + + states = [] + for i in range(n_layers): + states.append(ct.StateType( + wrapped_type=ct.TensorType( + shape=(1, max_seq_len, sa_n_heads, d_head), + dtype=np.float16, + ), + name=f"k_cache_{i}", + )) + states.append(ct.StateType( + wrapped_type=ct.TensorType( + shape=(1, max_seq_len, sa_n_heads, d_head), + dtype=np.float16, + ), + name=f"v_cache_{i}", + )) + + mlmodel = ct.convert( + traced, + inputs=inputs, + states=states, + convert_to="mlprogram", + compute_precision=ct.precision.FLOAT16, + compute_units=ct.ComputeUnit.CPU_AND_GPU, + minimum_deployment_target=ct.target.macOS15, + ) + + os.makedirs(os.path.dirname(output_path), exist_ok=True) + mlmodel.save(output_path) + print(f"Saved to {output_path}") + + spec = mlmodel.get_spec() + print("\n=== INPUTS ===") + for inp in spec.description.input: + if inp.type.HasField("multiArrayType"): + shape = list(inp.type.multiArrayType.shape) + print(f" {inp.name}: {shape}") + print("\n=== OUTPUTS ===") + for out in spec.description.output: + if out.type.HasField("multiArrayType"): + shape = list(out.type.multiArrayType.shape) + print(f" {out.name}: {shape}") + print("\n=== STATES ===") + if hasattr(spec.description, "state"): + for s in spec.description.state: + # State features use the ``stateType`` oneof, which wraps an + # ``arrayType`` (multiArrayType-equivalent) on the inside. 
+ try: + shape = list(s.type.stateType.arrayType.shape) + print(f" {s.name}: {shape}") + except Exception as exc: # pragma: no cover - inspection only + print(f" {s.name}: ") + + print("\nTesting CoreML model with state...") + coreml_model = ct.models.MLModel(output_path, compute_units=ct.ComputeUnit.CPU_AND_GPU) + state = coreml_model.make_state() + + test_inputs = { + "audio_embed": np.random.randn(1, 1, d_model).astype(np.float32), + "encoder_output": np.random.randn(1, T_enc, d_model).astype(np.float32), + "encoder_mask": np.ones((1, T_enc), dtype=np.float32), + "position": np.array([0], dtype=np.int32), + } + + out = coreml_model.predict(test_inputs, state=state) + print(f"Output keys: {len(out)}") + for k, v in sorted(out.items()): + if isinstance(v, np.ndarray): + print(f" {k}: shape={v.shape}") + print("Done!") + return output_path + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--nemo-path", type=str, default=None) + parser.add_argument("--max-seq-len", type=int, default=512) + parser.add_argument("--max-text-len", type=int, default=256) + parser.add_argument("--output", type=str, default="build/decoder_step_stateful.mlpackage") + args = parser.parse_args() + convert_decoder_step_stateful(args.nemo_path, args.max_seq_len, args.max_text_len, args.output) diff --git a/models/tts/magpie/coreml/emit_parity_fixture.py b/models/tts/magpie/coreml/emit_parity_fixture.py new file mode 100644 index 0000000..ab0aa58 --- /dev/null +++ b/models/tts/magpie/coreml/emit_parity_fixture.py @@ -0,0 +1,332 @@ +"""Emit intermediate-tensor fixtures for cross-implementation parity testing. + +Runs the Magpie CoreML pipeline for a fixed (text, speaker, language, seed) +and dumps intermediate tensors so the Swift port (or any other +implementation) can replay each stage and diff against this ground truth. + +Two output modes: + +- ``--mode full`` (default): runs the full pipeline and saves an ``.npz`` with + text tokens, encoder output, post-prefill KV caches, per-step decoder + hidden states, per-step sampled codes, the final ``(8, N)`` codes matrix, + and the decoded PCM. +- ``--mode tokenizer``: tokenizes only and saves a ``.json`` mapping + ``{text, speaker, language, token_ids}`` — cheap to diff against the Swift + ``MagpieTokenizer`` output without requiring CoreML at all. + +Example: + + python emit_parity_fixture.py "Hello world." \\ + --speaker 0 --language en --seed 42 \\ + --output fixture_en_s0.npz + + python emit_parity_fixture.py "Hello world." \\ + --speaker 0 --language en --mode tokenizer \\ + --output fixture_en_s0_tokens.json +""" +from __future__ import annotations + +import argparse +import json +import math +import os +import time +from typing import Any + +import coremltools as ct +import numpy as np +import soundfile as sf + +# Re-use everything from the main script so we never drift from the reference. 
+from generate_coreml import ( # noqa: E402 + BUILD_DIR, + DECODER_CACHE_OUT_KEYS, + DECODER_HIDDEN_KEY, + DECODER_POSITION_KEYS, + _tokenize_text, + embed_audio_codes, + load_audio_embeddings, + load_constants, + load_local_transformer, + load_speaker_embedding, + local_transformer_sample, +) + + +def _make_caches(n_layers: int, max_seq_len: int, n_heads: int, d_head: int): + c, p = {}, {} + for i in range(n_layers): + c[f"cache{i}"] = np.zeros( + (2, 1, max_seq_len, n_heads, d_head), dtype=np.float32 + ) + p[f"position{i}"] = np.array([0.0], dtype=np.float32) + return c, p + + +def emit_tokenizer_fixture( + text: str, + speaker: int, + language: str, + output_path: str, +) -> None: + constants = load_constants() + token_ids = _tokenize_text(text, language, constants).tolist() + fixture = { + "text": text, + "speakerIndex": speaker, + "languageCode": language, + "expectedTokenIds": token_ids, + } + with open(output_path, "w") as f: + json.dump(fixture, f, indent=2, ensure_ascii=False) + print(f"Wrote tokenizer fixture → {output_path} ({len(token_ids)} tokens)") + + +def emit_full_fixture( + text: str, + speaker: int, + language: str, + output_path: str, + temperature: float, + topk: int, + max_steps: int, + seed: int, + use_cfg: bool, + cfg_scale: float, +) -> None: + np.random.seed(seed) + constants = load_constants() + + num_codebooks = constants["num_audio_codebooks"] + audio_bos_id = constants["special_tokens"]["audio_bos_id"] + audio_eos_id = constants["special_tokens"]["audio_eos_id"] + sample_rate = constants["output_sample_rate"] + d_model = constants["decoder"]["d_model"] + n_layers = constants["decoder"]["n_layers"] + sa_n_heads = constants["decoder"]["sa_n_heads"] + d_head = d_model // sa_n_heads + max_text_len = 256 + max_seq_len = 512 + min_frames = constants["inference"].get("min_generated_frames", 4) + + # --- 1. Tokenize --- + text_tokens = _tokenize_text(text, language, constants) + T_text = int(len(text_tokens)) + text_tokens_padded = np.zeros(max_text_len, dtype=np.int32) + text_tokens_padded[:T_text] = text_tokens + text_mask = np.zeros(max_text_len, dtype=np.float32) + text_mask[:T_text] = 1.0 + + # --- 2. Load models --- + text_encoder = ct.models.MLModel( + os.path.join(BUILD_DIR, "text_encoder.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + nanocodec = ct.models.MLModel( + os.path.join(BUILD_DIR, "nanocodec_decoder.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + + # --- 3. Encode text --- + enc_out = text_encoder.predict({ + "text_tokens": text_tokens_padded[np.newaxis, :], + "text_mask": text_mask[np.newaxis, :], + }) + encoder_output = np.asarray(enc_out["encoder_output"], dtype=np.float32) + + if use_cfg: + uncond_encoder_output = np.zeros_like(encoder_output) + uncond_text_mask = np.zeros_like(text_mask) + uncond_text_mask[0] = 1.0 + + # --- 4. 
Load embeddings + LT weights --- + speaker_emb = load_speaker_embedding(speaker) + T_ctx = int(speaker_emb.shape[0]) + audio_emb_tables = load_audio_embeddings(constants) + lt_weights = load_local_transformer() + + caches, positions = _make_caches(n_layers, max_seq_len, sa_n_heads, d_head) + if use_cfg: + u_caches, u_positions = _make_caches(n_layers, max_seq_len, sa_n_heads, d_head) + + def _run_step(audio_embed, enc_out_np, mask_np, cache_dict, pos_dict): + inputs: dict[str, Any] = { + "audio_embed": audio_embed.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + } + inputs.update(cache_dict) + inputs.update(pos_dict) + out = decoder_step.predict(inputs) + for i in range(n_layers): + cache_dict[f"cache{i}"] = out[DECODER_CACHE_OUT_KEYS[i]] + pos_dict[f"position{i}"] = out[DECODER_POSITION_KEYS[i]] + return np.asarray(out[DECODER_HIDDEN_KEY], dtype=np.float32) + + # --- 5. Prefill --- + uncond_ctx = np.zeros((1, 1, d_model), dtype=np.float32) + for t in range(T_ctx): + ctx = speaker_emb[np.newaxis, np.newaxis, t, :] + _run_step(ctx, encoder_output, text_mask, caches, positions) + if use_cfg: + _run_step(uncond_ctx, uncond_encoder_output, uncond_text_mask, + u_caches, u_positions) + + # Snapshot KV caches after prefill (deep-copied so later rotation doesn't + # mutate the fixture). + prefill_caches = {k: v.copy() for k, v in caches.items()} + prefill_positions = {k: v.copy() for k, v in positions.items()} + + # --- 6. AR loop --- + current_codes = np.full(num_codebooks, audio_bos_id, dtype=np.int32) + per_step_hidden: list[np.ndarray] = [] + per_step_codes: list[np.ndarray] = [] + + gen_start = time.time() + for step in range(max_steps): + audio_embed = embed_audio_codes(current_codes, audio_emb_tables, num_codebooks) + cond_hidden = _run_step(audio_embed, encoder_output, text_mask, caches, positions) + + if use_cfg: + uncond_hidden = _run_step( + audio_embed, uncond_encoder_output, uncond_text_mask, + u_caches, u_positions, + ) + uncond_dec_hidden = uncond_hidden[0, 0] + else: + uncond_dec_hidden = None + + decoder_hidden = cond_hidden[0, 0] + per_step_hidden.append(decoder_hidden.copy()) + + forbid_eos = step < min_frames + next_codes = local_transformer_sample( + decoder_hidden, lt_weights, audio_emb_tables, + num_codebooks, temperature, topk, forbid_eos, + uncond_decoder_hidden=uncond_dec_hidden, + cfg_scale=cfg_scale if use_cfg else 1.0, + ) + + is_eos = bool(np.any(next_codes == audio_eos_id)) + if is_eos and step >= min_frames: + per_step_codes.append(next_codes.copy()) + break + per_step_codes.append(next_codes.copy()) + current_codes = next_codes + + gen_time = time.time() - gen_start + + predicted_codes_full = np.stack(per_step_codes, axis=1) # (8, N) + + # --- 7. NanoCodec decode --- + max_frames = 256 + T_total = min(predicted_codes_full.shape[1], max_frames) + padded = np.zeros((num_codebooks, max_frames), dtype=np.int32) + padded[:, :T_total] = predicted_codes_full[:, :T_total] + codec_out = nanocodec.predict({ + "tokens": padded[np.newaxis, :, :].astype(np.int32), + }) + audio = np.asarray(codec_out["audio"], dtype=np.float32) + if audio.ndim > 1: + audio = audio.flatten() + expected_samples = T_total * constants["codec_samples_per_frame"] + audio = audio[:expected_samples] + peak = float(np.abs(audio).max()) + if peak > 0: + audio = audio / peak * 0.9 + + # --- 8. 
Pack fixture --- + fixture: dict[str, Any] = { + # Config + "text": np.array(text), + "speakerIndex": np.int32(speaker), + "languageCode": np.array(language), + "seed": np.int32(seed), + "useCfg": np.bool_(use_cfg), + "cfgScale": np.float32(cfg_scale), + "temperature": np.float32(temperature), + "topk": np.int32(topk), + "sampleRate": np.int32(sample_rate), + "minFrames": np.int32(min_frames), + # Stage 1: tokenizer + "textTokens": text_tokens.astype(np.int32), + "textTokensPadded": text_tokens_padded.astype(np.int32), + "textMask": text_mask.astype(np.float32), + # Stage 2: text encoder + "encoderOutput": encoder_output.astype(np.float32), + # Stage 3: post-prefill caches + **{f"prefillCache{i}": prefill_caches[f"cache{i}"].astype(np.float32) + for i in range(n_layers)}, + **{f"prefillPosition{i}": prefill_positions[f"position{i}"].astype(np.float32) + for i in range(n_layers)}, + # Stage 4: per-step AR trace + "perStepDecoderHidden": np.stack(per_step_hidden, axis=0).astype(np.float32), + "perStepCodes": np.stack(per_step_codes, axis=0).astype(np.int32), + "predictedCodes": predicted_codes_full.astype(np.int32), + # Stage 5: audio + "audioPcm": audio.astype(np.float32), + "audioSamples": np.int32(len(audio)), + "genTimeSeconds": np.float32(gen_time), + } + + np.savez_compressed(output_path, **fixture) + + duration = len(audio) / sample_rate if sample_rate > 0 else 0.0 + rtf = gen_time / duration if duration > 0 else math.inf + print(f"Wrote full fixture → {output_path}") + print(f" tokens={T_text} frames={predicted_codes_full.shape[1]} " + f"duration={duration:.2f}s rtf={rtf:.2f}x") + + wav_path = os.path.splitext(output_path)[0] + ".wav" + sf.write(wav_path, audio, sample_rate) + print(f" reference audio → {wav_path}") + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Emit Magpie TTS parity fixtures for cross-impl testing.", + ) + parser.add_argument("text", type=str, help="Text to synthesize") + parser.add_argument("--mode", choices=["full", "tokenizer"], default="full", + help="'full' dumps .npz of all intermediates; " + "'tokenizer' dumps a small .json of token ids") + parser.add_argument("--speaker", type=int, default=0) + parser.add_argument("--language", type=str, default="en") + parser.add_argument("--output", type=str, required=True, + help="Output path (.npz for full, .json for tokenizer)") + parser.add_argument("--seed", type=int, default=42) + parser.add_argument("--temperature", type=float, default=0.6) + parser.add_argument("--topk", type=int, default=80) + parser.add_argument("--max-steps", type=int, default=500) + parser.add_argument("--no-cfg", action="store_true") + parser.add_argument("--cfg-scale", type=float, default=2.5) + args = parser.parse_args() + + if args.mode == "tokenizer": + emit_tokenizer_fixture( + text=args.text, + speaker=args.speaker, + language=args.language, + output_path=args.output, + ) + else: + emit_full_fixture( + text=args.text, + speaker=args.speaker, + language=args.language, + output_path=args.output, + temperature=args.temperature, + topk=args.topk, + max_steps=args.max_steps, + seed=args.seed, + use_cfg=not args.no_cfg, + cfg_scale=args.cfg_scale, + ) + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/generate_coreml.py b/models/tts/magpie/coreml/generate_coreml.py index e786ca8..8bbae3c 100644 --- a/models/tts/magpie/coreml/generate_coreml.py +++ b/models/tts/magpie/coreml/generate_coreml.py @@ -25,18 +25,37 @@ CONST_DIR = os.path.join(SCRIPT_DIR, "constants") BUILD_DIR = 
os.path.join(SCRIPT_DIR, "build") -# Decoder step output key names (from CoreML model spec) -DECODER_LOGITS_KEY = "var_2201" +# EXPERIMENTAL: Stateful (MLState) decoder path. Off by default. +# +# Enable by setting MAGPIE_STATEFUL=1. Kept for reference only — benchmarks +# (Apple M2, 146-step real loop) showed ~212 ms/step vs ~96 ms/step for the +# rank-4 production path: 2.2× regression. Stateful graphs run on CPU+GPU +# only (ANE rejects them); the IO-marshaling savings from collapsing 36 cache +# tensors don't compensate for losing ANE acceleration. See +# ``traceable/traceable_decoder_step_stateful.py`` for full rationale. +STATEFUL = bool(os.environ.get("MAGPIE_STATEFUL", "")) + +# Decoder step output key names (from CoreML model spec — rank-4 split-K/V) +DECODER_LOGITS_KEY = "var_2129" DECODER_HIDDEN_KEY = "input" -# Output cache keys (input keys are cache0..cache11) -DECODER_CACHE_OUT_KEYS = [ - "new_cache_1", "new_cache_3", "new_cache_5", "new_cache_7", - "new_cache_9", "new_cache_11", "new_cache_13", "new_cache_15", - "new_cache_17", "new_cache_19", "new_cache_21", "new_cache", + +# Stateful model uses a different logits key (re-traced graph reorders ops). +DECODER_LOGITS_KEY_STATEFUL = "var_2124" +# Per-layer K and V output keys (12 layers each). +# Input keys are cache_k0..cache_k11 / cache_v0..cache_v11 / position0..position11. +DECODER_CACHE_K_OUT_KEYS = [ + "new_k_1", "new_k_3", "new_k_5", "new_k_7", + "new_k_9", "new_k_11", "new_k_13", "new_k_15", + "new_k_17", "new_k_19", "new_k_21", "new_k", +] +DECODER_CACHE_V_OUT_KEYS = [ + "new_v_1", "new_v_3", "new_v_5", "new_v_7", + "new_v_9", "new_v_11", "new_v_13", "new_v_15", + "new_v_17", "new_v_19", "new_v_21", "new_v", ] DECODER_POSITION_KEYS = [ - "var_169", "var_346", "var_523", "var_700", "var_877", "var_1054", - "var_1231", "var_1408", "var_1585", "var_1762", "var_1939", "var_2116", + "var_169", "var_339", "var_509", "var_679", "var_849", "var_1019", + "var_1189", "var_1359", "var_1529", "var_1699", "var_1869", "var_2039", ] # Forbidden token IDs (special tokens that should never be sampled) @@ -291,10 +310,17 @@ def generate( os.path.join(BUILD_DIR, "text_encoder.mlpackage"), compute_units=ct.ComputeUnit.CPU_AND_GPU, ) - decoder_step = ct.models.MLModel( - os.path.join(BUILD_DIR, "decoder_step.mlpackage"), - compute_units=ct.ComputeUnit.CPU_AND_GPU, - ) + if STATEFUL: + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step_stateful.mlpackage"), + # Stateful graphs are not ANE-compatible; CPU+GPU only. + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + else: + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step.mlpackage"), + compute_units=ct.ComputeUnit.ALL, # rank-4 split-K/V — ANE compiles for some ops + ) nanocodec = ct.models.MLModel( os.path.join(BUILD_DIR, "nanocodec_decoder.mlpackage"), compute_units=ct.ComputeUnit.CPU_AND_GPU, @@ -322,32 +348,59 @@ def generate( audio_emb_tables = load_audio_embeddings(constants) lt_weights = load_local_transformer() - # 5. 
Initialize KV caches (conditional) - def make_caches(): - c, p = {}, {} - for i in range(n_layers): - c[f"cache{i}"] = np.zeros((2, 1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) - p[f"position{i}"] = np.array([0.0], dtype=np.float32) - return c, p - - caches, positions = make_caches() - if use_cfg: - uncond_caches, uncond_positions = make_caches() - - def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): - step_inputs = { - "audio_embed": audio_embed_np.astype(np.float32), - "encoder_output": enc_out_np.astype(np.float32), - "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), - } - step_inputs.update(cache_dict) - step_inputs.update(pos_dict) - step_out = decoder_step.predict(step_inputs) - for i in range(n_layers): - # Output cache keys differ from input keys after scatter-based cache rewrite - cache_dict[f"cache{i}"] = step_out[DECODER_CACHE_OUT_KEYS[i]] - pos_dict[f"position{i}"] = step_out[DECODER_POSITION_KEYS[i]] - return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) — decoder hidden + # 5. Initialize KV caches. + # Stateful path: caches live on the model's MLState; we just track a + # per-stream scalar position. Non-stateful path: explicit numpy buffers. + if STATEFUL: + # State buffers are owned by the CoreML runtime — we just need a + # position counter per stream. Use a 1-element list so the closure + # can mutate it. Alias to ``caches``/``positions`` so the prefill + + # generation call sites work unchanged for both code paths. + caches = decoder_step.make_state() + positions = [0] + if use_cfg: + uncond_caches = decoder_step.make_state() + uncond_positions = [0] + else: + def make_caches(): + c, p = {}, {} + for i in range(n_layers): + c[f"cache_k{i}"] = np.zeros((1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) + c[f"cache_v{i}"] = np.zeros((1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) + p[f"position{i}"] = np.array([0.0], dtype=np.float32) + return c, p + + caches, positions = make_caches() + if use_cfg: + uncond_caches, uncond_positions = make_caches() + + if STATEFUL: + def run_decoder_step(audio_embed_np, enc_out_np, mask_np, state, pos_box): + step_inputs = { + "audio_embed": audio_embed_np.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + "position": np.array([pos_box[0]], dtype=np.int32), + } + step_out = decoder_step.predict(step_inputs, state=state) + pos_box[0] += 1 + return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) + else: + def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): + step_inputs = { + "audio_embed": audio_embed_np.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + } + step_inputs.update(cache_dict) + step_inputs.update(pos_dict) + step_out = decoder_step.predict(step_inputs) + for i in range(n_layers): + # Output cache keys differ from input keys due to torch trace renaming. + cache_dict[f"cache_k{i}"] = step_out[DECODER_CACHE_K_OUT_KEYS[i]] + cache_dict[f"cache_v{i}"] = step_out[DECODER_CACHE_V_OUT_KEYS[i]] + pos_dict[f"position{i}"] = step_out[DECODER_POSITION_KEYS[i]] + return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) — decoder hidden # 6. 
Prefill context # Conditional path: real speaker context + real encoder output @@ -361,7 +414,8 @@ def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): run_decoder_step(uncond_ctx_token, uncond_encoder_output, uncond_text_mask, uncond_caches, uncond_positions) if (t + 1) % 50 == 0: print(f" Prefilled {t + 1}/{T_ctx}") - print(f" Prefill done. Position: {positions['position0'][0]:.0f}") + final_pos = positions[0] if STATEFUL else float(positions['position0'][0]) + print(f" Prefill done. Position: {final_pos:.0f}") # 7. Autoregressive generation with local transformer print(f"\nGenerating (max {max_steps} steps)...") diff --git a/models/tts/magpie/coreml/prepare_hf_upload.py b/models/tts/magpie/coreml/prepare_hf_upload.py new file mode 100644 index 0000000..829ae2b --- /dev/null +++ b/models/tts/magpie/coreml/prepare_hf_upload.py @@ -0,0 +1,493 @@ +"""Stage a HuggingFace-ready directory for Magpie TTS Multilingual 357M. + +The mobius exporters and converters write into two local directories: + +- ``build/`` — compiled ``.mlpackage`` bundles (and, after + ``compile_mlmodelc.py``, matching ``.mlmodelc`` bundles). +- ``constants/`` — ``.npy`` tensors, ``*.json`` config, the + ``local_transformer/`` subtree, **and** the per-language tokenizer + JSONs. + +The FluidAudio Swift port and the target HF repo +(``FluidInference/magpie-tts-multilingual-357m-coreml``) expect a slightly +different layout: CoreML models at the root, tokenizer JSONs in a +dedicated ``tokenizer/`` folder, everything else in ``constants/``. This +script assembles that layout into ``hf-upload/`` (configurable), writes a +model card + ``.gitattributes``, validates that nothing required is +missing, and prints the exact ``huggingface-cli upload`` commands for +the user to run. + +It does **not** upload anything. Per project policy, HF uploads are +performed manually by the maintainers. + +Usage: + + # After running the converter + compiler + constants exporters + python prepare_hf_upload.py + + # Custom paths / output + python prepare_hf_upload.py \\ + --build-dir build \\ + --constants-dir constants \\ + --output-dir hf-upload \\ + --repo-id FluidInference/magpie-tts-multilingual-357m-coreml +""" +from __future__ import annotations + +import argparse +import json +import os +import shutil +import sys +from dataclasses import dataclass, field + +SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) + +# Core models expected at the repo root. +REQUIRED_MODELS = [ + "text_encoder.mlmodelc", + "decoder_step.mlmodelc", + "nanocodec_decoder.mlmodelc", +] +OPTIONAL_MODELS = [ + "decoder_prefill.mlmodelc", +] + +# Keys that MUST survive in constants/. Anything not in this allow-list that +# also isn't a per-language tokenizer file will be flagged as unknown. +CONSTANTS_KEEP_FILES = { + "constants.json", + "speaker_info.json", + "tokenizer_info.json", + "tokenizer_metadata.json", + "tokenizer_references.json", + "text_embedding.npy", + "speaker_embeddings_raw.npy", +} +CONSTANTS_KEEP_PREFIXES = ( + "speaker_", # speaker_0.npy .. speaker_N.npy + "audio_embedding_", # audio_embedding_0.npy .. audio_embedding_7.npy +) +CONSTANTS_KEEP_DIRS = {"local_transformer"} + +# Mirror of MagpieTokenizerFiles.files(for:) in the Swift port. 
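+# Keep this table in sync with the Swift enum and with LANGUAGE_FILES in
+# build_manifest.py; any file listed here that is missing from constants/
+# shows up as MISS in the prep report.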
+PER_LANGUAGE_TOKENIZER_FILES = { + "english": [ + "english_phoneme_token2id.json", + "english_phoneme_phoneme_dict.json", + ], + "spanish": [ + "spanish_phoneme_token2id.json", + "spanish_phoneme_phoneme_dict.json", + ], + "italian": [ + "italian_phoneme_token2id.json", + "italian_phoneme_phoneme_dict.json", + ], + "vietnamese": [ + "vietnamese_phoneme_token2id.json", + "vietnamese_phoneme_phoneme_dict.json", + ], + "german": [ + "german_phoneme_token2id.json", + "german_phoneme_phoneme_dict.json", + "german_phoneme_heteronyms.json", + ], + "french": [ + "french_chartokenizer_token2id.json", + ], + "hindi": [ + "hindi_chartokenizer_token2id.json", + ], + "mandarin": [ + "mandarin_phoneme_token2id.json", + "mandarin_phoneme_pinyin_dict.json", + "mandarin_phoneme_tone_dict.json", + "mandarin_phoneme_ascii_letter_dict.json", + "mandarin_pypinyin_char_dict.json", + "mandarin_pypinyin_phrase_dict.json", + "mandarin_jieba_dict.json", + ], +} + +ALL_TOKENIZER_FILES = { + fname for files in PER_LANGUAGE_TOKENIZER_FILES.values() for fname in files +} + + +GITATTRIBUTES = """\ +*.mlmodelc filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.mlpackage filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +""" + + +README_TEMPLATE = """\ +--- +license: cc-by-4.0 +language: + - en + - es + - de + - fr + - it + - vi + - zh + - hi +tags: + - text-to-speech + - coreml + - apple-silicon + - magpie +library_name: coreml +base_model: nvidia/magpie_tts_multilingual_357m +--- + +# Magpie TTS Multilingual 357M (CoreML) + +CoreML export of NVIDIA's [Magpie TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m), optimized for on-device inference on Apple Silicon. Ships as `.mlmodelc` bundles compiled for macOS 14+ / iOS 17+. + +Converted with [FluidInference/mobius](https://github.com/FluidInference/mobius). Consumed by the Swift port in [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio) (see `Sources/FluidAudio/TTS/Magpie/`). + +## Languages + +English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi. Japanese is not yet included. + +## Contents + +``` +├── text_encoder.mlmodelc/ # Text → (B, 256, 768) encoder output +├── decoder_step.mlmodelc/ # 12-layer AR decoder (stateful KV cache) +├── decoder_prefill.mlmodelc/ # (optional) batched prefill fast path +├── nanocodec_decoder.mlmodelc/ # 8-codebook → PCM vocoder (22050 Hz) +├── constants/ +│ ├── constants.json # d_model, n_layers, EOS ids, ... +│ ├── speaker_info.json # speaker names + context shape +│ ├── tokenizer_metadata.json # tokenizer-agnostic EOS + special tokens +│ ├── speaker_0.npy .. speaker_4.npy +│ ├── audio_embedding_0.npy .. 
audio_embedding_7.npy +│ └── local_transformer/ # 1-layer transformer weights (Swift reads .npy) +└── tokenizer/ + ├── english_phoneme_*.json + ├── spanish_phoneme_*.json + ├── german_phoneme_*.json + ├── french_chartokenizer_*.json + ├── italian_phoneme_*.json + ├── vietnamese_phoneme_*.json + ├── mandarin_*.json + └── hindi_chartokenizer_*.json +``` + +## Usage (Swift) + +```swift +import FluidAudio + +let manager = try await MagpieTtsManager.downloadAndCreate( + languages: [.english, .spanish] +) +let result = try await manager.synthesize( + text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.", + speaker: .john, + language: .english +) +let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate) +try wav.write(to: URL(fileURLWithPath: "hello.wav")) +``` + +The manager lazy-downloads everything in this repo on first use. + +## Inline IPA override + +Text enclosed in `|...|` is passed straight to the tokenizer as whitespace-separated IPA tokens: + +``` +"Hello | ˈ n ɛ m o ʊ | world" +``` + +## License + +- CoreML export: CC-BY-4.0 (inherits from the upstream NeMo model). +- Upstream weights: see [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m). +""" + + +@dataclass +class PrepReport: + copied_models: list[str] = field(default_factory=list) + missing_required_models: list[str] = field(default_factory=list) + missing_optional_models: list[str] = field(default_factory=list) + copied_constants: list[str] = field(default_factory=list) + missing_constants: list[str] = field(default_factory=list) + copied_tokenizer_files: dict[str, list[str]] = field(default_factory=dict) + missing_tokenizer_files: dict[str, list[str]] = field(default_factory=dict) + unknown_files: list[str] = field(default_factory=list) + + def has_errors(self) -> bool: + return bool(self.missing_required_models or self.missing_constants) + + +def _copy_tree(src: str, dst: str) -> None: + if os.path.isdir(src): + if os.path.exists(dst): + shutil.rmtree(dst) + shutil.copytree(src, dst) + else: + os.makedirs(os.path.dirname(dst), exist_ok=True) + shutil.copy2(src, dst) + + +def _copy_models(build_dir: str, output_dir: str, report: PrepReport) -> None: + for model in REQUIRED_MODELS: + src = os.path.join(build_dir, model) + if not os.path.exists(src): + report.missing_required_models.append(model) + continue + dst = os.path.join(output_dir, model) + _copy_tree(src, dst) + report.copied_models.append(model) + + for model in OPTIONAL_MODELS: + src = os.path.join(build_dir, model) + if not os.path.exists(src): + report.missing_optional_models.append(model) + continue + dst = os.path.join(output_dir, model) + _copy_tree(src, dst) + report.copied_models.append(model) + + +def _copy_constants(constants_dir: str, output_dir: str, report: PrepReport) -> None: + dst_constants = os.path.join(output_dir, "constants") + os.makedirs(dst_constants, exist_ok=True) + + required = {"constants.json", "speaker_info.json", "tokenizer_metadata.json"} + required |= {f"audio_embedding_{i}.npy" for i in range(8)} + required |= {f"speaker_{i}.npy" for i in range(5)} + # local_transformer/ is a dir — enumerate expected files separately. 
+ local_transformer_files = { + "in_proj_weight.npy", + "in_proj_bias.npy", + "pos_emb.npy", + "norm1_weight.npy", + "sa_qkv_weight.npy", + "sa_o_weight.npy", + "norm2_weight.npy", + "ffn_conv1_weight.npy", + "ffn_conv2_weight.npy", + } + for i in range(8): + local_transformer_files.add(f"out_proj_{i}_weight.npy") + local_transformer_files.add(f"out_proj_{i}_bias.npy") + + for entry in sorted(os.listdir(constants_dir)): + src = os.path.join(constants_dir, entry) + + # Tokenizer files are moved out to tokenizer/ — skip here. + if entry in ALL_TOKENIZER_FILES: + continue + + # Known constants files or dirs. + is_keep_file = entry in CONSTANTS_KEEP_FILES + is_keep_prefix = any(entry.startswith(p) for p in CONSTANTS_KEEP_PREFIXES) + is_keep_dir = entry in CONSTANTS_KEEP_DIRS and os.path.isdir(src) + + if is_keep_file or is_keep_prefix or is_keep_dir: + dst = os.path.join(dst_constants, entry) + _copy_tree(src, dst) + report.copied_constants.append(entry) + else: + report.unknown_files.append(os.path.relpath(src, constants_dir)) + + copied_set = set(report.copied_constants) + for req in sorted(required): + if req not in copied_set: + report.missing_constants.append(f"constants/{req}") + + lt_src = os.path.join(constants_dir, "local_transformer") + if os.path.isdir(lt_src): + present = set(os.listdir(lt_src)) + for req in sorted(local_transformer_files): + if req not in present: + report.missing_constants.append(f"constants/local_transformer/{req}") + else: + report.missing_constants.append("constants/local_transformer/") + + +def _copy_tokenizer(constants_dir: str, output_dir: str, report: PrepReport) -> None: + dst_tokenizer = os.path.join(output_dir, "tokenizer") + os.makedirs(dst_tokenizer, exist_ok=True) + + for language, files in PER_LANGUAGE_TOKENIZER_FILES.items(): + copied: list[str] = [] + missing: list[str] = [] + for fname in files: + src = os.path.join(constants_dir, fname) + if not os.path.exists(src): + missing.append(fname) + continue + dst = os.path.join(dst_tokenizer, fname) + shutil.copy2(src, dst) + copied.append(fname) + if copied: + report.copied_tokenizer_files[language] = copied + if missing: + report.missing_tokenizer_files[language] = missing + + +def _write_metadata(output_dir: str, report: PrepReport, repo_id: str) -> None: + with open(os.path.join(output_dir, ".gitattributes"), "w") as f: + f.write(GITATTRIBUTES) + + with open(os.path.join(output_dir, "README.md"), "w") as f: + f.write(README_TEMPLATE) + + # Machine-readable prep report for auditability. 
+ summary = { + "repoId": repo_id, + "copiedModels": report.copied_models, + "missingRequiredModels": report.missing_required_models, + "missingOptionalModels": report.missing_optional_models, + "copiedConstants": sorted(report.copied_constants), + "missingConstants": report.missing_constants, + "copiedTokenizerFiles": report.copied_tokenizer_files, + "missingTokenizerFiles": report.missing_tokenizer_files, + "unknownFiles": report.unknown_files, + } + with open(os.path.join(output_dir, "_prep_report.json"), "w") as f: + json.dump(summary, f, indent=2) + + +def _print_report(report: PrepReport, output_dir: str, repo_id: str) -> int: + print("") + print("=" * 72) + print(f"HF upload staging → {output_dir}") + print(f"Target repo: {repo_id}") + print("=" * 72) + + print("\nCoreML models:") + for m in report.copied_models: + print(f" OK {m}") + for m in report.missing_required_models: + print(f" MISS {m} (REQUIRED — re-run convert_*.py + compile_mlmodelc.py)") + for m in report.missing_optional_models: + print(f" skip {m} (optional)") + + print("\nconstants/:") + for c in sorted(report.copied_constants): + print(f" OK {c}") + for c in report.missing_constants: + print(f" MISS {c}") + + print("\ntokenizer/:") + for lang, files in sorted(report.copied_tokenizer_files.items()): + print(f" [{lang}] {len(files)} file(s) copied") + for lang, files in sorted(report.missing_tokenizer_files.items()): + for fname in files: + print(f" MISS tokenizer/{fname} ({lang})") + + if report.unknown_files: + print("\nUnknown files under constants/ (not copied — review):") + for u in report.unknown_files: + print(f" ?? {u}") + + print("") + if report.has_errors(): + print("Staging completed WITH ERRORS — see MISS entries above.") + print("Re-run the relevant exporter/converter and re-run this script.") + return 1 + + print("Staging OK. Upload with one of:") + print("") + print(f" huggingface-cli upload {repo_id} {output_dir} . \\") + print(" --repo-type model --commit-message 'upload Magpie TTS CoreML export'") + print("") + print("Or, if the repo does not exist yet:") + print("") + print(f" huggingface-cli repo create {repo_id} --type model") + print(f" huggingface-cli upload {repo_id} {output_dir} . 
--repo-type model") + print("") + print("Verify from Swift:") + print(" swift run fluidaudiocli magpie download --languages en") + print("") + return 0 + + +def prepare( + build_dir: str, + constants_dir: str, + output_dir: str, + repo_id: str, + clean: bool, +) -> int: + build_dir = os.path.abspath(build_dir) + constants_dir = os.path.abspath(constants_dir) + output_dir = os.path.abspath(output_dir) + + if not os.path.isdir(build_dir): + print(f"error: build dir not found: {build_dir}", file=sys.stderr) + return 2 + if not os.path.isdir(constants_dir): + print(f"error: constants dir not found: {constants_dir}", file=sys.stderr) + return 2 + + if clean and os.path.exists(output_dir): + shutil.rmtree(output_dir) + os.makedirs(output_dir, exist_ok=True) + + report = PrepReport() + _copy_models(build_dir, output_dir, report) + _copy_constants(constants_dir, output_dir, report) + _copy_tokenizer(constants_dir, output_dir, report) + _write_metadata(output_dir, report, repo_id) + + return _print_report(report, output_dir, repo_id) + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Stage a HuggingFace-ready directory for Magpie TTS CoreML.", + ) + parser.add_argument( + "--build-dir", + default=os.path.join(SCRIPT_DIR, "build"), + help="Directory with compiled .mlmodelc bundles (default: ./build)", + ) + parser.add_argument( + "--constants-dir", + default=os.path.join(SCRIPT_DIR, "constants"), + help="Directory with exported constants + tokenizer files (default: ./constants)", + ) + parser.add_argument( + "--output-dir", + default=os.path.join(SCRIPT_DIR, "hf-upload"), + help="Staging directory to populate (default: ./hf-upload)", + ) + parser.add_argument( + "--repo-id", + default="FluidInference/magpie-tts-multilingual-357m-coreml", + help="Target HF repo id (only used in the printed upload command)", + ) + parser.add_argument( + "--clean", + action="store_true", + help="Remove the output dir before staging (fresh build).", + ) + args = parser.parse_args() + + rc = prepare( + build_dir=args.build_dir, + constants_dir=args.constants_dir, + output_dir=args.output_dir, + repo_id=args.repo_id, + clean=args.clean, + ) + sys.exit(rc) + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/traceable/traceable_decoder_step.py b/models/tts/magpie/coreml/traceable/traceable_decoder_step.py index 3cfe896..d814ec4 100644 --- a/models/tts/magpie/coreml/traceable/traceable_decoder_step.py +++ b/models/tts/magpie/coreml/traceable/traceable_decoder_step.py @@ -1,23 +1,36 @@ -"""Traceable decoder step wrapper for CoreML conversion. +"""Traceable decoder step wrapper for CoreML conversion (rank-4 ANE-friendly). -The decoder is a causal transformer with cross-attention to the encoder output. -For CoreML, we implement it as a single-step model with explicit KV cache I/O, -following the PocketTTS pattern. +Each layer's KV cache is split into separate rank-4 K and V tensors so the ANE +backend can compile the model. The previous rank-5 single-tensor cache +``(2, B, max_seq, H, D)`` was rejected by ``ANECompile`` and forced the model +onto the GPU at ~64 ms/step. + +Key changes vs. the original: + * Per-layer state is ``(cache_k, cache_v, position)`` — three rank-4/scalar + tensors instead of one rank-5 plus a scalar. + * Causal mask ``-1e9`` -> ``-3e4`` (fp16 max is ±65504; ``-1e9`` overflows + to ``-inf`` and the ANE compiler tends to reject out-of-range constants). 
+ * Cross-attention's memory mask is added (instead of ``masked_fill``) using + the same fp16-safe constant so the cross-attn step is also ANE-friendly. Each step: -1. Takes one audio embedding token + encoder output -2. Runs through all decoder layers with causal self-attention + cross-attention -3. Returns logits for next token + updated KV caches +1. Embed one audio token + receive encoder output. +2. Run 12 decoder layers (causal self-attn + cross-attn + FFN). +3. Return logits for next token + updated K/V/positions per layer. """ import torch import torch.nn as nn import torch.nn.functional as F -import math -from typing import Tuple, List + + +# fp16 max is ±65504; use a safely-representable negative value for masked +# softmax positions. -3e4 stays well within fp16 range and gives ~exp(-30000) +# ≈ 0 after softmax so behaviour is numerically identical to -1e9. +MASK_NEG = -3.0e4 class TraceableCausalSelfAttention(nn.Module): - """Single-step causal self-attention with KV cache.""" + """Single-step causal self-attention with rank-4 split K/V cache.""" def __init__(self, d_model, n_heads, d_head=None): super().__init__() @@ -27,60 +40,49 @@ def __init__(self, d_model, n_heads, d_head=None): self.qkv_proj = nn.Linear(d_model, 3 * n_heads * self.d_head, bias=False) self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) - def forward(self, x, kv_cache, position): + def forward(self, x, kv_k, kv_v, position): """ - Args: - x: (B, 1, d_model) - single token embedding - kv_cache: (2, B, max_seq, H, D) - [key, value] cache - position: (1,) - current write position in cache - Returns: - output: (B, 1, d_model) - new_kv_cache: (2, B, max_seq, H, D) - updated cache - new_position: (1,) - incremented position + x: (B, 1, d_model) + kv_k: (B, max_seq, H, D) + kv_v: (B, max_seq, H, D) + position: (1,) """ - B, T, _ = x.shape # T=1 for single step - max_seq = kv_cache.shape[2] + B, T, _ = x.shape # T = 1 (single step) + max_seq = kv_k.shape[1] qkv = self.qkv_proj(x) qkv = qkv.view(B, T, 3, self.n_heads, self.d_head) - q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] - - # Write new k,v to cache using scatter (CoreML-compatible) - # Build one-hot mask for position to avoid advanced indexing - pos_idx = position.to(torch.long) - one_hot = torch.zeros(max_seq, dtype=x.dtype, device=x.device) - one_hot[pos_idx] = 1.0 - # one_hot: (max_seq,) → broadcast to (1, B, max_seq, H, D) - mask = one_hot.view(1, 1, max_seq, 1, 1) - - k_new = k.squeeze(1).unsqueeze(0).unsqueeze(2) # (1, B, 1, H, D) → broadcast to (1, B, max_seq, H, D) - v_new = v.squeeze(1).unsqueeze(0).unsqueeze(2) - - # new_cache = (1-mask)*old_cache + mask*new_kv - new_cache_k = kv_cache[0:1] * (1.0 - mask) + k_new * mask # (1, B, max_seq, H, D) - new_cache_v = kv_cache[1:2] * (1.0 - mask) + v_new * mask - new_cache = torch.cat([new_cache_k, new_cache_v], dim=0) # (2, B, max_seq, H, D) - - # Attend to all positions with a causal mask (positions > pos_idx are masked out) - # Build mask: 1 for positions <= pos_idx, 0 for positions > pos_idx + q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] # each (B, 1, H, D) + + # One-hot write mask along ``max_seq`` (rank-4 broadcast). + # Use arange/equal compare instead of ``one_hot[pos_idx] = 1.0`` so the + # graph lowers to elementwise ops (no ``scatter_nd`` — ANE rejects it). 
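+        # Example: position = 3 → mask is one-hot at index 3 with shape
+        # (1, max_seq, 1, 1), so the blend below rewrites exactly one cache row.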
positions_range = torch.arange(max_seq, dtype=x.dtype, device=x.device) - causal_mask = (positions_range <= position).float() # (max_seq,) - causal_mask = causal_mask.view(1, 1, 1, max_seq) # (1, 1, 1, max_seq) + mask = (positions_range == position).to(x.dtype).view(1, max_seq, 1, 1) + + # Broadcast new (B, 1, H, D) → (B, max_seq, H, D); then blend with old cache. + k_new = k.expand(B, max_seq, self.n_heads, self.d_head) + v_new = v.expand(B, max_seq, self.n_heads, self.d_head) + new_k = kv_k * (1.0 - mask) + k_new * mask + new_v = kv_v * (1.0 - mask) + v_new * mask - q = q.transpose(1, 2) # (B, H, 1, D) - k_full = new_cache[0].transpose(1, 2) # (B, H, max_seq, D) - v_full = new_cache[1].transpose(1, 2) + # Causal mask: keep positions ≤ current `position`, drop the rest. + causal_mask = (positions_range <= position).to(x.dtype).view(1, 1, 1, max_seq) - attn = torch.matmul(q, k_full.transpose(-2, -1)) * self.scale # (B, H, 1, max_seq) - attn = attn + (1.0 - causal_mask) * (-1e9) # mask future positions + q4 = q.transpose(1, 2) # (B, H, 1, D) + k4 = new_k.permute(0, 2, 1, 3) # (B, H, max_seq, D) + v4 = new_v.permute(0, 2, 1, 3) # (B, H, max_seq, D) + + attn = torch.matmul(q4, k4.transpose(-2, -1)) * self.scale + attn = attn + (1.0 - causal_mask) * MASK_NEG attn = F.softmax(attn, dim=-1) - out = torch.matmul(attn, v_full) + out = torch.matmul(attn, v4) # (B, H, 1, D) out = out.transpose(1, 2).reshape(B, 1, -1) out = self.o_proj(out) new_position = position + 1.0 - return out, new_cache, new_position + return out, new_k, new_v, new_position class TraceableCrossAttention(nn.Module): @@ -96,14 +98,6 @@ def __init__(self, d_model, n_heads, d_memory, d_head=None): self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) def forward(self, x, memory, memory_mask=None): - """ - Args: - x: (B, 1, d_model) - query - memory: (B, T_enc, d_memory) - encoder output - memory_mask: (B, T_enc) bool - True=keep - Returns: - output: (B, 1, d_model) - """ B, T_q, _ = x.shape T_m = memory.shape[1] @@ -114,8 +108,9 @@ def forward(self, x, memory, memory_mask=None): attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale if memory_mask is not None: - attn_mask = memory_mask.unsqueeze(1).unsqueeze(2) # (B, 1, 1, T_m) - attn = attn.masked_fill(~attn_mask, float("-inf")) + # Add fp16-safe penalty instead of `masked_fill(-inf)` for ANE. 
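+            # (1 - mask) is 0 for kept positions and 1 for padded ones, so padded
+            # logits pick up MASK_NEG and contribute ~zero weight after softmax.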
+ mem_mask_f = memory_mask.to(x.dtype).unsqueeze(1).unsqueeze(2) # (B, 1, 1, T_m) + attn = attn + (1.0 - mem_mask_f) * MASK_NEG attn = F.softmax(attn, dim=-1) out = torch.matmul(attn, v) @@ -124,7 +119,7 @@ def forward(self, x, memory, memory_mask=None): class TraceableFFN(nn.Module): - """Positionwise feed-forward for decoder.""" + """Position-wise feed-forward for decoder (kernel_size=1 ⇒ matmul + GELU).""" def __init__(self, d_model, d_ffn, kernel_size=1): super().__init__() @@ -142,7 +137,8 @@ def forward(self, x): class TraceableDecoderLayer(nn.Module): """Single decoder transformer layer with self-attn, cross-attn, and FFN.""" - def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, kernel_size=1, xa_d_head=None): + def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + kernel_size=1, xa_d_head=None): super().__init__() self.norm_sa = nn.LayerNorm(d_model, bias=False) self.self_attn = TraceableCausalSelfAttention(d_model, sa_n_heads) @@ -156,20 +152,14 @@ def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, kernel_s self.norm_ff = nn.LayerNorm(d_model, bias=False) self.ffn = TraceableFFN(d_model, d_ffn, kernel_size) - def forward(self, x, kv_cache, position, encoder_output=None, encoder_mask=None): - """ - Returns: - x: (B, 1, d_model) - new_kv_cache: updated cache - new_position: incremented position - """ - # Self-attention + def forward(self, x, kv_k, kv_v, position, encoder_output=None, encoder_mask=None): + # Self-attention. residual = x x_norm = self.norm_sa(x) - sa_out, new_kv_cache, new_position = self.self_attn(x_norm, kv_cache, position) + sa_out, new_k, new_v, new_position = self.self_attn(x_norm, kv_k, kv_v, position) x = residual + sa_out - # Cross-attention + # Cross-attention. if self.has_xattn and encoder_output is not None: residual = x q_norm = self.norm_xa_query(x) @@ -177,22 +167,20 @@ def forward(self, x, kv_cache, position, encoder_output=None, encoder_mask=None) xa_out = self.cross_attn(q_norm, m_norm, encoder_mask) x = residual + xa_out - # FFN + # FFN. residual = x x = self.norm_ff(x) x = self.ffn(x) x = residual + x - return x, new_kv_cache, new_position + return x, new_k, new_v, new_position class TraceableDecoderStep(nn.Module): - """Complete single-step decoder for CoreML. + """Complete single-step decoder with rank-4 split K/V caches. - Takes one audio token embedding, runs through all decoder layers with - KV cache, and outputs logits for next codebook tokens. - - The KV caches are passed as flat arguments (not lists) for torch.jit.trace. + For each of ``n_layers`` decoder layers the model takes THREE state tensors + (``cache_k``, ``cache_v``, ``position``) and returns three updated outputs. 
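+    With the default 12 layers that is 3 + 36 = 39 flat inputs and 2 + 36 = 38
+    outputs, i.e. the flattened signature ``torch.jit.trace`` requires (no list args).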
""" def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, @@ -219,80 +207,59 @@ def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory for _ in range(n_layers) ]) - self.norm_out = nn.Identity() # May be replaced if model uses apply_norm_out - - # Final projection: decoder hidden → codebook logits + self.norm_out = nn.Identity() # may be replaced by a LayerNorm in `from_magpie` self.final_proj = nn.Linear( - d_model, - num_codebooks * num_tokens_per_codebook * frame_stacking_factor, - ) + d_model, num_codebooks * num_tokens_per_codebook * frame_stacking_factor) def forward(self, audio_embed, encoder_output, encoder_mask, - # Flat KV cache args (one pair per layer) - cache0, pos0, cache1, pos1, cache2, pos2, - cache3, pos3, cache4, pos4, cache5, pos5, - cache6, pos6, cache7, pos7, cache8, pos8, - cache9, pos9, cache10, pos10, cache11, pos11): + # 12 layers × (cache_k, cache_v, position) = 36 flat state args. + ck0, cv0, p0, ck1, cv1, p1, ck2, cv2, p2, + ck3, cv3, p3, ck4, cv4, p4, ck5, cv5, p5, + ck6, cv6, p6, ck7, cv7, p7, ck8, cv8, p8, + ck9, cv9, p9, ck10, cv10, p10, ck11, cv11, p11): """ Args: - audio_embed: (B, 1, d_model) - embedded audio token(s) - encoder_output: (B, T_enc, d_model) - text encoder output - encoder_mask: (B, T_enc) - text mask - cache{i}: (2, B, max_seq, H, D) - KV cache per layer - pos{i}: (1,) - current position per layer - - Returns: - logits: (B, 1, num_codebooks * num_tokens * frame_stacking) - decoder_hidden: (B, 1, d_model) - for local transformer - new_cache{i}: updated caches - new_pos{i}: updated positions + audio_embed: (B, 1, d_model) + encoder_output: (B, T_enc, d_model) + encoder_mask: (B, T_enc) bool + ck{i}, cv{i}: (B, max_seq, H, D) per layer + p{i}: (1,) scalar position per layer + + Returns flat tuple: + logits, decoder_hidden, + new_ck0, new_cv0, new_p0, …, new_ck11, new_cv11, new_p11 """ - caches = [cache0, cache1, cache2, cache3, cache4, cache5, - cache6, cache7, cache8, cache9, cache10, cache11] - positions = [pos0, pos1, pos2, pos3, pos4, pos5, - pos6, pos7, pos8, pos9, pos10, pos11] + cks = [ck0, ck1, ck2, ck3, ck4, ck5, ck6, ck7, ck8, ck9, ck10, ck11] + cvs = [cv0, cv1, cv2, cv3, cv4, cv5, cv6, cv7, cv8, cv9, cv10, cv11] + ps = [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11] x = audio_embed - - # Add positional embedding if self.use_pos_emb: - pos_idx = positions[0].to(torch.long) + pos_idx = ps[0].to(torch.long) x = x + self.position_embeddings(pos_idx).unsqueeze(0) - new_caches = [] - new_positions = [] - + new_ks, new_vs, new_ps = [], [], [] for i, layer in enumerate(self.layers): - x, new_cache, new_pos = layer( - x, caches[i], positions[i], + x, nk, nv, np_ = layer( + x, cks[i], cvs[i], ps[i], encoder_output=encoder_output, encoder_mask=encoder_mask, ) - new_caches.append(new_cache) - new_positions.append(new_pos) - - x = self.norm_out(x) - decoder_hidden = x - - logits = self.final_proj(x) - - return (logits, decoder_hidden, - new_caches[0], new_positions[0], - new_caches[1], new_positions[1], - new_caches[2], new_positions[2], - new_caches[3], new_positions[3], - new_caches[4], new_positions[4], - new_caches[5], new_positions[5], - new_caches[6], new_positions[6], - new_caches[7], new_positions[7], - new_caches[8], new_positions[8], - new_caches[9], new_positions[9], - new_caches[10], new_positions[10], - new_caches[11], new_positions[11]) + new_ks.append(nk) + new_vs.append(nv) + new_ps.append(np_) + + decoder_hidden = self.norm_out(x) + logits = self.final_proj(decoder_hidden) 
+ + outs = [logits, decoder_hidden] + for i in range(self.n_layers): + outs += [new_ks[i], new_vs[i], new_ps[i]] + return tuple(outs) @classmethod def from_magpie(cls, model): - """Create from a loaded MagpieTTSModel.""" + """Create from a loaded MagpieTTSModel and copy over weights.""" cfg = model.cfg dec_cfg = dict(cfg.decoder) @@ -313,46 +280,35 @@ def from_magpie(cls, model): frame_stacking_factor=model.frame_stacking_factor, ) - # Copy positional embeddings if wrapper.use_pos_emb and model.decoder.position_embeddings is not None: - wrapper.position_embeddings.weight.data.copy_(model.decoder.position_embeddings.weight.data) - - # Copy decoder layers - # NeMo TransformerLayer attr names: - # self_attention (SelfAttention) with qkv_net, o_net - # cross_attention (CrossAttention) with q_net, kv_net, o_net - # norm_self, norm_xattn_query, norm_xattn_memory, norm_pos_ff (LayerNorm, bias=False) - # pos_ff (PositionwiseConvFF) with proj.conv, o_net.conv (Conv1d) - for i, (src_layer, dst_layer) in enumerate(zip(model.decoder.layers, wrapper.layers)): - # Self-attention + wrapper.position_embeddings.weight.data.copy_( + model.decoder.position_embeddings.weight.data) + + for src_layer, dst_layer in zip(model.decoder.layers, wrapper.layers): + # Self-attention. dst_layer.self_attn.qkv_proj.weight.data.copy_(src_layer.self_attention.qkv_net.weight.data) dst_layer.self_attn.o_proj.weight.data.copy_(src_layer.self_attention.o_net.weight.data) - - # Self-attn norm (bias=False in NeMo) dst_layer.norm_sa.weight.data.copy_(src_layer.norm_self.weight.data) - # Cross-attention (if present) + # Cross-attention. if dst_layer.has_xattn and hasattr(src_layer, "cross_attention"): dst_layer.cross_attn.q_proj.weight.data.copy_(src_layer.cross_attention.q_net.weight.data) dst_layer.cross_attn.kv_proj.weight.data.copy_(src_layer.cross_attention.kv_net.weight.data) dst_layer.cross_attn.o_proj.weight.data.copy_(src_layer.cross_attention.o_net.weight.data) - dst_layer.norm_xa_query.weight.data.copy_(src_layer.norm_xattn_query.weight.data) dst_layer.norm_xa_memory.weight.data.copy_(src_layer.norm_xattn_memory.weight.data) - # FFN norm (bias=False in NeMo) + # FFN. dst_layer.norm_ff.weight.data.copy_(src_layer.norm_pos_ff.weight.data) - - # FFN (Conv1d via PositionwiseConvFF, bias=False) dst_layer.ffn.conv1.weight.data.copy_(src_layer.pos_ff.proj.conv.weight.data) dst_layer.ffn.conv2.weight.data.copy_(src_layer.pos_ff.o_net.conv.weight.data) - # Output norm + # Optional output norm. if hasattr(model.decoder, "norm_out") and isinstance(model.decoder.norm_out, nn.LayerNorm): wrapper.norm_out = nn.LayerNorm(dec_cfg["d_model"], bias=False) wrapper.norm_out.weight.data.copy_(model.decoder.norm_out.weight.data) - # Final projection + # Final projection. wrapper.final_proj.weight.data.copy_(model.final_proj.weight.data) wrapper.final_proj.bias.data.copy_(model.final_proj.bias.data) diff --git a/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py b/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py new file mode 100644 index 0000000..dd9eba6 --- /dev/null +++ b/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py @@ -0,0 +1,353 @@ +"""EXPERIMENTAL — DO NOT USE IN PRODUCTION. + +Stateful (MLState) variant of ``traceable_decoder_step.py``. Kept as a +documented dead-end so future agents don't repeat the experiment. 
+ +Benchmark result (Apple M2, macOS 26.5, 146-step real loop): + rank-4 production (this file's non-stateful sibling): ~96 ms/step (97.3% ANE) + this stateful variant (CPU_AND_GPU only): ~212 ms/step + → 2.2× regression. Rejected. + +Why it loses for Magpie (vs CosyVoice3 where MLState gave ~3× speedup): + Magpie's rank-4 decoder_step already lands 97.3% of cost on ANE. MLState + graphs are ANE-incompatible, so they force CPU_AND_GPU. The IO-marshaling + savings from collapsing 39 inputs / 38 outputs to 4 / 2 are dwarfed by the + loss of ANE acceleration. + +Variant of ``traceable_decoder_step.py`` that uses CoreML ``MLState`` (stateful +buffers) instead of passing 36 KV+position tensors through the model interface +on every step. + +Differences vs. ``traceable_decoder_step.TraceableDecoderStep``: + * Per-layer K and V caches are ``register_buffer``-ed (24 buffers total) and + mutated in place via slice assignment. + * Forward signature shrinks to 4 inputs: (audio_embed, encoder_output, + encoder_mask, position). Position is a single shared scalar — all layers + advance in lockstep so we don't statefy 12 copies of it. + * Outputs shrink to 2: (logits, decoder_hidden). Cache updates are side + effects on the state buffers. + * Cross-attention path and fp16-safe ``MASK_NEG`` constant are unchanged. +""" +import torch +import torch.nn as nn +import torch.nn.functional as F + + +# fp16 max is ±65504; -3e4 is safely representable and gives ~exp(-30000) ≈ 0 +# after softmax. Identical numerical behaviour to -1e9 without the overflow. +MASK_NEG = -3.0e4 + + +class StatefulCausalSelfAttention(nn.Module): + """Single-step causal self-attention with state-buffer K/V caches. + + The K and V caches are owned by the parent ``StatefulDecoderLayer`` (so all + buffers live on a single module for clean ``ct.StateType`` registration). + This module receives the buffers by reference and mutates them in place. + """ + + def __init__(self, d_model, n_heads, d_head=None): + super().__init__() + self.d_head = d_head or d_model // n_heads + self.n_heads = n_heads + self.scale = self.d_head ** -0.5 + self.qkv_proj = nn.Linear(d_model, 3 * n_heads * self.d_head, bias=False) + self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) + + def forward(self, x, k_cache, v_cache, position): + """ + x: (B, 1, d_model) + k_cache: (B, max_seq, H, D) — mutated in place + v_cache: (B, max_seq, H, D) — mutated in place + position: (1,) scalar — current write index (also used for causal mask) + """ + B, T, _ = x.shape # T = 1 + max_seq = k_cache.shape[1] + + qkv = self.qkv_proj(x) + qkv = qkv.view(B, T, 3, self.n_heads, self.d_head) + q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] # each (B, 1, H, D) + + # In-place slice write — pure indexed_update (no scatter_nd). + # Cast position to int via clamp for use as a slice bound. + pos_int = position.to(torch.int32) + # Slice bounds need to be Python ints during tracing; we materialize via + # ``.item()``-equivalent through a 1-element tensor. CoreML's tracer will + # capture the dynamic write index as a runtime variable. + start = pos_int[0] + end = start + 1 + + # Cast new K/V to match buffer dtype (fp16 for the converted graph). + k_cache[:, start:end, :, :] = k.to(k_cache.dtype) + v_cache[:, start:end, :, :] = v.to(v_cache.dtype) + + # Reshape for batched matmul. 
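+        # A single query row attends over the full cache: scores are (B, H, 1, max_seq).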
+ q4 = q.transpose(1, 2) # (B, H, 1, D) + k4 = k_cache.permute(0, 2, 1, 3) # (B, H, max_seq, D) + v4 = v_cache.permute(0, 2, 1, 3) # (B, H, max_seq, D) + + # Causal mask: keep positions ≤ current `position`, drop the rest. + positions_range = torch.arange(max_seq, dtype=x.dtype, device=x.device) + causal_mask = (positions_range <= position).to(x.dtype).view(1, 1, 1, max_seq) + + attn = torch.matmul(q4, k4.to(x.dtype).transpose(-2, -1)) * self.scale + attn = attn + (1.0 - causal_mask) * MASK_NEG + attn = F.softmax(attn, dim=-1) + out = torch.matmul(attn, v4.to(x.dtype)) # (B, H, 1, D) + + out = out.transpose(1, 2).reshape(B, 1, -1) + out = self.o_proj(out) + return out + + +class StatefulCrossAttention(nn.Module): + """Cross-attention to encoder output (non-causal, stateless).""" + + def __init__(self, d_model, n_heads, d_memory, d_head=None): + super().__init__() + self.d_head = d_head or d_model // n_heads + self.n_heads = n_heads + self.scale = self.d_head ** -0.5 + self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False) + self.kv_proj = nn.Linear(d_memory, 2 * n_heads * self.d_head, bias=False) + self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) + + def forward(self, x, memory, memory_mask=None): + B, T_q, _ = x.shape + T_m = memory.shape[1] + + q = self.q_proj(x).view(B, T_q, self.n_heads, self.d_head).transpose(1, 2) + kv = self.kv_proj(memory).view(B, T_m, 2, self.n_heads, self.d_head) + k, v = kv[:, :, 0].transpose(1, 2), kv[:, :, 1].transpose(1, 2) + + attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale + if memory_mask is not None: + mem_mask_f = memory_mask.to(x.dtype).unsqueeze(1).unsqueeze(2) + attn = attn + (1.0 - mem_mask_f) * MASK_NEG + + attn = F.softmax(attn, dim=-1) + out = torch.matmul(attn, v) + out = out.transpose(1, 2).reshape(B, T_q, -1) + return self.o_proj(out) + + +class StatefulFFN(nn.Module): + def __init__(self, d_model, d_ffn, kernel_size=1): + super().__init__() + self.conv1 = nn.Conv1d(d_model, d_ffn, kernel_size, padding=0, bias=False) + self.conv2 = nn.Conv1d(d_ffn, d_model, kernel_size, padding=0, bias=False) + self.act = nn.GELU(approximate="tanh") + + def forward(self, x): + x = x.transpose(1, 2) + x = self.act(self.conv1(x)) + x = self.conv2(x) + return x.transpose(1, 2) + + +class StatefulDecoderLayer(nn.Module): + """One decoder layer; owns its k_cache / v_cache as registered buffers.""" + + def __init__(self, layer_idx, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + max_seq_len, kernel_size=1, xa_d_head=None): + super().__init__() + self.layer_idx = layer_idx + self.d_head = d_model // sa_n_heads + self.n_heads = sa_n_heads + + self.norm_sa = nn.LayerNorm(d_model, bias=False) + self.self_attn = StatefulCausalSelfAttention(d_model, sa_n_heads) + + self.has_xattn = xa_n_heads is not None + if self.has_xattn: + self.norm_xa_query = nn.LayerNorm(d_model, bias=False) + self.norm_xa_memory = nn.LayerNorm(xa_d_memory, bias=False) + self.cross_attn = StatefulCrossAttention(d_model, xa_n_heads, xa_d_memory, xa_d_head) + + self.norm_ff = nn.LayerNorm(d_model, bias=False) + self.ffn = StatefulFFN(d_model, d_ffn, kernel_size) + + # Register cache buffers (fp16 to match converted-graph precision). + # Persistent=False so they don't appear in state_dict and won't trip + # weight-load checks. 
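+        # At conversion time these buffers are the tensors ``ct.StateType`` binds to.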
+ self.register_buffer( + "k_cache", + torch.zeros(1, max_seq_len, sa_n_heads, self.d_head, dtype=torch.float16), + persistent=False, + ) + self.register_buffer( + "v_cache", + torch.zeros(1, max_seq_len, sa_n_heads, self.d_head, dtype=torch.float16), + persistent=False, + ) + + def forward(self, x, position, encoder_output=None, encoder_mask=None): + # Self-attention (mutates self.k_cache / self.v_cache in place). + residual = x + x_norm = self.norm_sa(x) + sa_out = self.self_attn(x_norm, self.k_cache, self.v_cache, position) + x = residual + sa_out + + # Cross-attention. + if self.has_xattn and encoder_output is not None: + residual = x + q_norm = self.norm_xa_query(x) + m_norm = self.norm_xa_memory(encoder_output) + xa_out = self.cross_attn(q_norm, m_norm, encoder_mask) + x = residual + xa_out + + # FFN. + residual = x + x = self.norm_ff(x) + x = self.ffn(x) + x = residual + x + return x + + +class StatefulDecoderStep(nn.Module): + """Stateful single-step decoder. K/V caches live as buffers on each layer. + + Forward inputs (4): + audio_embed: (B, 1, d_model) + encoder_output: (B, T_enc, d_model) + encoder_mask: (B, T_enc) bool + position: (1,) scalar — write index for this step (shared across layers) + + Forward outputs (2): + logits: (B, 1, num_codebooks * tokens_per_codebook * frame_stack) + decoder_hidden: (B, 1, d_model) + + State (24 buffers; named ``k_cache_{i}``, ``v_cache_{i}`` for i in 0..n-1 + after ``flatten_state_buffers`` is called): + k_cache_{i}: (1, max_seq, H, D) fp16 + v_cache_{i}: (1, max_seq, H, D) fp16 + """ + + def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + kernel_size=1, xa_d_head=None, max_seq_len=512, + use_pos_emb=False, max_pos=2048, + num_codebooks=8, num_tokens_per_codebook=2024, frame_stacking_factor=1): + super().__init__() + self.n_layers = n_layers + self.d_model = d_model + self.max_seq_len = max_seq_len + self.use_pos_emb = use_pos_emb + self.num_codebooks = num_codebooks + self.num_tokens_per_codebook = num_tokens_per_codebook + self.frame_stacking_factor = frame_stacking_factor + self.d_head = d_model // sa_n_heads + self.sa_n_heads = sa_n_heads + + if use_pos_emb: + self.position_embeddings = nn.Embedding(max_pos, d_model) + + self.layers = nn.ModuleList([ + StatefulDecoderLayer( + i, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + max_seq_len, kernel_size, xa_d_head, + ) + for i in range(n_layers) + ]) + + self.norm_out = nn.Identity() + self.final_proj = nn.Linear( + d_model, num_codebooks * num_tokens_per_codebook * frame_stacking_factor) + + # Promote per-layer buffers to top-level names so coremltools can pick + # them up via ``ct.StateType(name="k_cache_{i}")``. + self.flatten_state_buffers() + + def flatten_state_buffers(self): + """Re-register each layer's k_cache / v_cache as top-level buffers. + + coremltools' ``ct.StateType(name=...)`` matches the buffer name on the + traced module. Layer-nested buffers come through as + ``layers.{i}.k_cache``; we mirror them at the top level under the + flatter ``k_cache_{i}`` / ``v_cache_{i}`` names that downstream code + (and other mobius converters) expect. 
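+
+        Illustrative conversion-side registration (a sketch only; ``n_layers``,
+        ``n_heads`` and ``d_head`` come from the decoder config, not this file):
+
+            states = []
+            for i in range(n_layers):
+                for kind in ("k", "v"):
+                    states.append(ct.StateType(
+                        wrapped_type=ct.TensorType(shape=(1, 512, n_heads, d_head)),
+                        name=f"{kind}_cache_{i}"))
+            # then passed as ct.convert(..., states=states)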
+ """ + for i, layer in enumerate(self.layers): + self.register_buffer(f"k_cache_{i}", layer.k_cache, persistent=False) + self.register_buffer(f"v_cache_{i}", layer.v_cache, persistent=False) + + def reset_state(self): + """Zero all KV caches in place (host side, before make_state).""" + for layer in self.layers: + layer.k_cache.zero_() + layer.v_cache.zero_() + + def forward(self, audio_embed, encoder_output, encoder_mask, position): + x = audio_embed + if self.use_pos_emb: + pos_idx = position.to(torch.long) + x = x + self.position_embeddings(pos_idx).unsqueeze(0) + + for layer in self.layers: + x = layer( + x, position, + encoder_output=encoder_output, + encoder_mask=encoder_mask, + ) + + decoder_hidden = self.norm_out(x) + logits = self.final_proj(decoder_hidden) + return logits, decoder_hidden + + @classmethod + def from_magpie(cls, model): + """Create from a loaded MagpieTTSModel and copy over weights.""" + cfg = model.cfg + dec_cfg = dict(cfg.decoder) + + wrapper = cls( + n_layers=dec_cfg["n_layers"], + d_model=dec_cfg["d_model"], + d_ffn=dec_cfg["d_ffn"], + sa_n_heads=dec_cfg["sa_n_heads"], + xa_n_heads=dec_cfg.get("xa_n_heads"), + xa_d_memory=dec_cfg.get("xa_d_memory"), + kernel_size=dec_cfg.get("kernel_size", 1), + xa_d_head=dec_cfg.get("xa_d_head"), + max_seq_len=512, + use_pos_emb=dec_cfg.get("use_learnable_pos_emb", False), + max_pos=dec_cfg.get("max_length_causal_mask", 2048), + num_codebooks=model.num_audio_codebooks, + num_tokens_per_codebook=model.num_all_tokens_per_codebook, + frame_stacking_factor=model.frame_stacking_factor, + ) + + if wrapper.use_pos_emb and model.decoder.position_embeddings is not None: + wrapper.position_embeddings.weight.data.copy_( + model.decoder.position_embeddings.weight.data) + + for src_layer, dst_layer in zip(model.decoder.layers, wrapper.layers): + # Self-attention. + dst_layer.self_attn.qkv_proj.weight.data.copy_(src_layer.self_attention.qkv_net.weight.data) + dst_layer.self_attn.o_proj.weight.data.copy_(src_layer.self_attention.o_net.weight.data) + dst_layer.norm_sa.weight.data.copy_(src_layer.norm_self.weight.data) + + # Cross-attention. + if dst_layer.has_xattn and hasattr(src_layer, "cross_attention"): + dst_layer.cross_attn.q_proj.weight.data.copy_(src_layer.cross_attention.q_net.weight.data) + dst_layer.cross_attn.kv_proj.weight.data.copy_(src_layer.cross_attention.kv_net.weight.data) + dst_layer.cross_attn.o_proj.weight.data.copy_(src_layer.cross_attention.o_net.weight.data) + dst_layer.norm_xa_query.weight.data.copy_(src_layer.norm_xattn_query.weight.data) + dst_layer.norm_xa_memory.weight.data.copy_(src_layer.norm_xattn_memory.weight.data) + + # FFN. + dst_layer.norm_ff.weight.data.copy_(src_layer.norm_pos_ff.weight.data) + dst_layer.ffn.conv1.weight.data.copy_(src_layer.pos_ff.proj.conv.weight.data) + dst_layer.ffn.conv2.weight.data.copy_(src_layer.pos_ff.o_net.conv.weight.data) + + # Optional output norm. + if hasattr(model.decoder, "norm_out") and isinstance(model.decoder.norm_out, nn.LayerNorm): + wrapper.norm_out = nn.LayerNorm(dec_cfg["d_model"], bias=False) + wrapper.norm_out.weight.data.copy_(model.decoder.norm_out.weight.data) + + # Final projection. + wrapper.final_proj.weight.data.copy_(model.final_proj.weight.data) + wrapper.final_proj.bias.data.copy_(model.final_proj.bias.data) + + # Re-flatten buffers in case eager copies replaced them. 
+ wrapper.flatten_state_buffers() + return wrapper diff --git a/tools/coreml-cli/src/coreml_cli/cli.py b/tools/coreml-cli/src/coreml_cli/cli.py index 00dd97e..99dc12c 100644 --- a/tools/coreml-cli/src/coreml_cli/cli.py +++ b/tools/coreml-cli/src/coreml_cli/cli.py @@ -13,7 +13,7 @@ import typer -from .compute_plan import COMPUTE_UNITS, get_compute_plan +from .compute_plan import COMPUTE_UNITS, DEFAULT_LOAD_TIMEOUT_S, get_compute_plan from .fallback import analyze_fallback from .latency import measure_cold_compile, measure_latency from .metadata import get_model_metadata @@ -85,6 +85,15 @@ def bench( False, "--json", help="Output JSON instead of table" ), iterations: int = typer.Option(10, "--iterations", "-n", help="Number of timed iterations"), + plan_timeout: float = typer.Option( + DEFAULT_LOAD_TIMEOUT_S, + "--plan-timeout", + help=( + "Max seconds to wait for MLComputePlan to load per compute_units " + "config. Increase for graphs with >1500 ops if you see 'Failed to " + "load compute plan: timeout' errors." + ), + ), debug: bool = typer.Option(False, "--debug", help="Print progress to stderr"), ) -> None: """Profile CoreML model compute device assignments and latency.""" @@ -108,7 +117,7 @@ def bench( all_fb = [] for model in models: _log(f"Analyzing fallback for {model.name}...") - fb = analyze_fallback(model, cu) + fb = analyze_fallback(model, cu, load_timeout_s=plan_timeout) all_fb.append({ "model_path": str(model), "model_name": model.stem, @@ -142,7 +151,7 @@ def bench( for unit_config in unit_configs: _log(f" compute_units={unit_config}") - result = get_compute_plan(model, unit_config) + result = get_compute_plan(model, unit_config, load_timeout_s=plan_timeout) if detailed: detail = get_detailed_profile(model, unit_config) diff --git a/tools/coreml-cli/src/coreml_cli/compute_plan.py b/tools/coreml-cli/src/coreml_cli/compute_plan.py index e5b1afe..d651ee9 100644 --- a/tools/coreml-cli/src/coreml_cli/compute_plan.py +++ b/tools/coreml-cli/src/coreml_cli/compute_plan.py @@ -42,9 +42,25 @@ def _walk_operations(block: Any) -> list[Any]: return ops -def get_compute_plan(model_path: Path, compute_units: str) -> dict: +DEFAULT_LOAD_TIMEOUT_S = 120.0 + + +def get_compute_plan( + model_path: Path, + compute_units: str, + load_timeout_s: float = DEFAULT_LOAD_TIMEOUT_S, +) -> dict: """Load compute plan for a model with given compute units. + Args: + model_path: Path to the compiled .mlmodelc. + compute_units: One of the keys in ``COMPUTE_UNITS``. + load_timeout_s: Max seconds to wait for ``MLComputePlan.loadContentsOfURL`` + to invoke its completion handler. Large graphs (≳1500 ops) can take + tens of seconds to analyze; the previous hard-coded 30s would + silently false-fail with "unknown error" when the load merely needed + more time. + Returns dict with summary and per-operation breakdown. 
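+
+    Example (illustrative path; relax the timeout for a very large graph):
+
+        plan = get_compute_plan(Path("decoder_step.mlmodelc"),
+                                "cpu_and_neural_engine", load_timeout_s=300.0)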
""" url = NSURL.fileURLWithPath_(str(model_path)) @@ -64,10 +80,15 @@ def completion(loaded_plan: Any, load_error: Any) -> None: CoreML.MLComputePlan.loadContentsOfURL_configuration_completionHandler_( url, config, completion ) - event.wait(timeout=30) + completed = event.wait(timeout=load_timeout_s) plan = result_holder.get("plan") error = result_holder.get("error") + if not completed: + raise RuntimeError( + f"Failed to load compute plan: timeout after {load_timeout_s:.0f}s " + f"(graph may be too large; pass --plan-timeout to extend)" + ) if error is not None or plan is None: err_msg = str(error) if error else "unknown error" raise RuntimeError(f"Failed to load compute plan: {err_msg}") diff --git a/tools/coreml-cli/src/coreml_cli/fallback.py b/tools/coreml-cli/src/coreml_cli/fallback.py index dfae387..c82f7ba 100644 --- a/tools/coreml-cli/src/coreml_cli/fallback.py +++ b/tools/coreml-cli/src/coreml_cli/fallback.py @@ -10,13 +10,20 @@ from .private_profiler import get_detailed_profile -def analyze_fallback(model_path: Path, compute_units: str = "cpu_and_neural_engine") -> dict: +def analyze_fallback( + model_path: Path, + compute_units: str = "cpu_and_neural_engine", + load_timeout_s: float | None = None, +) -> dict: """Analyze which ops fall back to CPU and why. Returns a fallback_summary dict with grouped reasons. """ # Get public compute plan for device assignments - plan = get_compute_plan(model_path, compute_units) + plan_kwargs: dict[str, Any] = {} + if load_timeout_s is not None: + plan_kwargs["load_timeout_s"] = load_timeout_s + plan = get_compute_plan(model_path, compute_units, **plan_kwargs) # Get private profiler data for validation messages detail = get_detailed_profile(model_path, compute_units)