diff --git a/models/tts/magpie/SWIFT_PORT_FINDINGS.md b/models/tts/magpie/SWIFT_PORT_FINDINGS.md new file mode 100644 index 0000000..61494ba --- /dev/null +++ b/models/tts/magpie/SWIFT_PORT_FINDINGS.md @@ -0,0 +1,195 @@ +# Magpie Swift Port — Findings & Platform Quirks + +Notes from porting Magpie TTS Multilingual 357M from `generate_coreml.py` (Python +reference) to the FluidAudio Swift package. Documents what doesn't transfer +cleanly, ANE compile behavior, and the perf optimizations that materially +moved the needle. Future agents (Swift or Python side) — read before changing +the conversion scripts. + +## TL;DR + +1. **`decoder_step.mlmodelc` cannot use ANE in production.** ANE compile + reliably fails at runtime (`MILCompilerForANE: ANECCompile() FAILED`) + under real synth, even though it succeeds with zeroed dummy inputs. + Pin Swift consumers to `.cpuAndGPU` for that model. +2. **`decoder_prefill.mlmodelc` gives a 3.8× end-to-end speedup over the + 110-step prefill fallback.** Worth keeping in the canonical HF artifact set. +3. **fp16 host non-determinism is real and compounds with depth.** Stage-by- + stage Swift ↔ Python parity: text_encoder ~50 dB SNR, prefill L0=64 dB → + L11=44 dB, AR replay ~40 dB. Below ~50 dB you get audible single-word drift. +4. **`MLComputePlan.load(...)` (macOS 14.4+) SIGBUSes on every Magpie + `.mlmodelc`** — can't introspect device assignment via the public API. + +## Per-model compute placement (verified end-to-end) + +Measured on M-series, real synth (8-word EN sentence, warm), Swift consumer: + +| Model | `.cpuOnly` | `.cpuAndGPU` | `.cpuAndNeuralEngine` | Recommendation | +| ------------------ | ---------- | ------------ | ---------------------- | -------------- | +| `text_encoder` | 42 ms | 43 ms | **12 ms** | ANE | +| `decoder_prefill` | 56 ms | 23 ms | **18 ms** | ANE | +| `decoder_step` | 31 ms\* | **22 ms** | 10 ms\*\* | **GPU** (see below) | +| `nanocodec_decoder`| — | — | runs on ANE | ANE | + +\* dummy-input single-call benchmark; real synth is 96 s warm. +\*\* dummy-input speedup is misleading — see decoder_step section. + +### decoder_step ANE failure mode (the trap) + +`coreml-cli` and the dummy-input benchmark both report ANE works on +`decoder_step`. **In real synth it does not.** Stack: + +- Single-call benchmark with `position = 0`: ANE compile succeeds, runs in + 10 ms → looks 3× faster than CPU. +- Real synth (incrementing `position` 0…N, real KV cache state): ANE + recompile is triggered per call and **fails** with `MILCompilerForANE: + ANECCompile() FAILED` (visible in stderr), then falls back to CPU at + hundreds-of-ms per call. End-to-end this is 7% slower than `.cpuAndGPU` + (103 s vs 96 s warm) and 34% slower cold. + +The likely cause is the rank-4 split-K/V scatter pattern (`cache_k_i` / +`cache_v_i` are `[1, 512, H, D]` fp16 with `position` advancing per step). +ANEF can compile the topology against a static input but bails when actual +gather indices vary at runtime. + +**Action items if you revisit `convert_decoder_step.py`:** + +- Try a single rank-3 K/V layout (`[512, H*D]` fp16) instead of split rank-4. +- Try `position` as `int32` instead of `float16`. +- Try eliminating the scatter by writing the new K/V row as a separate output + and letting the host concatenate (already what Swift's `MagpieKvCache` does + conceptually). +- Verify with **real incrementing positions**, not zeros. + +If ANE remains broken, the current `.cpuAndGPU` pin is correct. 
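+
+A minimal Python harness for the "real incrementing positions" check could look
+like this (sketch — it assumes the rank-4 converter's input/output names from
+`convert_decoder_step.py` / `generate_coreml.py` and a local
+`build/decoder_step.mlpackage`; watch stderr for `ANECCompile() FAILED` as the
+positions advance):
+
+```python
+import numpy as np
+import coremltools as ct
+
+from generate_coreml import DECODER_CACHE_K_OUT_KEYS, DECODER_CACHE_V_OUT_KEYS
+
+N_LAYERS, MAX_SEQ, H, D, D_MODEL, T_ENC = 12, 512, 12, 64, 768, 256
+model = ct.models.MLModel(
+    "build/decoder_step.mlpackage",
+    compute_units=ct.ComputeUnit.CPU_AND_NE,  # the unit under test
+)
+
+inputs = {
+    "audio_embed": np.random.randn(1, 1, D_MODEL).astype(np.float32),
+    "encoder_output": np.random.randn(1, T_ENC, D_MODEL).astype(np.float32),
+    "encoder_mask": np.ones((1, T_ENC), dtype=np.float32),
+}
+for i in range(N_LAYERS):
+    inputs[f"cache_k{i}"] = np.zeros((1, MAX_SEQ, H, D), dtype=np.float32)
+    inputs[f"cache_v{i}"] = np.zeros((1, MAX_SEQ, H, D), dtype=np.float32)
+
+for step in range(32):  # real incrementing positions, carried cache state
+    for i in range(N_LAYERS):
+        inputs[f"position{i}"] = np.array([float(step)], dtype=np.float32)
+    out = model.predict(inputs)
+    for i in range(N_LAYERS):
+        inputs[f"cache_k{i}"] = out[DECODER_CACHE_K_OUT_KEYS[i]]
+        inputs[f"cache_v{i}"] = out[DECODER_CACHE_V_OUT_KEYS[i]]
+```
+
+If the loop finishes without the compile error and per-call latency stays flat,
+the layout change is worth re-benchmarking end-to-end.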
Document +the failure in `coreml/convert_decoder_step.py` so the next person doesn't +chase the dummy-input ghost. + +## decoder_prefill is essential, not optional + +The repo README marks `decoder_prefill.mlmodelc` as "optional" with the +fallback being 110 sequential `decoder_step` calls. Reality: + +| Path | Wall (warm) | +| -------------------------------- | ----------- | +| 110-step prefill fallback | ~420 s | +| `decoder_prefill.mlmodelc` (1×) | ~110 s | + +That's a **3.8× speedup** from a single batched call. Without it the Swift +port is unshippable. Ensure `convert_decoder_prefill.py` runs in CI / the +canonical HF asset upload. + +### Prefill output naming + +`decoder_prefill.mlmodelc` emits 12 outputs as anonymous CoreML var IDs: + +``` +var_208, var_374, var_540, var_706, var_872, var_1038, +var_1204, var_1370, var_1536, var_1702, var_1868, var_1958 +``` + +Each is `[2, 1, 512, H, D]` fp16 with axis 0 = `[K_stacked, V_stacked]`. +Swift slices them with two `memcpy`s into the per-layer K/V cache. **Don't +rename these without bumping the Swift port.** Or — better — explicitly name +them `prefill_kv_layer_{0..11}` in `convert_decoder_prefill.py` so the Swift +binding is robust to recompiles. + +## fp16 host non-determinism: what we measured + +Swift CoreML and Python+coremltools CoreML produce **bit-different** fp16 +outputs on the same inputs, same `.mlmodelc`, same M-series host. This is +documented Apple behavior (CoreML doesn't guarantee bit-exact reproducibility +across processes / load configurations). Magnitude: + +- **text_encoder**: SNR(Swift, Python) ≈ 50.6 dB +- **decoder_prefill** per layer: L0 SNR 64 dB → L11 SNR 44 dB + (compounds geometrically through the 12-layer cache) +- **decoder_step AR loop**: post-12-layer cache → ~40 dB after 100 steps + +40 dB SNR is below the threshold where the top-k=80 sampler trajectory +diverges, which manifests as a single trailing word in the audio (e.g. +"…seashore, **and**" instead of "…seashore."). 4/5 speakers are unaffected +because their sampler trajectories are more stable; speaker 0 in our test +set consistently hits the drift. + +**This is not a Swift bug.** Verified by: + +1. Python+CoreML matches NeMo PyTorch. +2. Swift+CoreML's text_encoder output already differs from Python+CoreML at + 50 dB (no Swift-side math involved — it's just `MLModel.prediction`). +3. Swift's LocalTransformer (Accelerate+BNNS) matches a fp64 NumPy reference + to >120 dB in isolation — so the post-CoreML Swift path is clean. + +**If you want bit-identical Swift↔Python parity**, the only paths are: +- Force fp32 weights in the `.mlpackage` (size + perf cost — probably not + worth it for ~1 word at the end of an utterance). +- Accept perceptual parity (current state). + +## MLComputePlan crashes on Magpie `.mlmodelc` + +Tried `MLComputePlan.load(contentsOf: url, configuration: cfg)` on macOS 14.5 +across all four Magpie models, all three compute units. Every call SIGBUSes +(exit 138). Cannot be used for device-assignment introspection. The Swift CLI +falls back to a timing-based probe: + +``` +swift run fluidaudiocli magpie compute-plan +``` + +— loads each model under `.cpuOnly` / `.cpuAndGPU` / `.cpuAndNeuralEngine`, +runs 1 warmup + 3 timed iters, infers ANE usage from the speedup ratio +(>1.3× cpuOnly → ANE active). Hacky but works. + +**`coreml-cli` from `tools/coreml-cli/` may have the same issue.** Test +before relying on its `--fallback` analysis for Magpie models. 
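+
+As a stopgap, the same timing heuristic is easy to reproduce directly with
+`coremltools` (sketch — the model path, input shapes, and iteration counts are
+illustrative; remember the decoder_step caveat above, dummy-input timings can
+lie for that model):
+
+```python
+import time
+
+import coremltools as ct
+import numpy as np
+
+
+def probe(path: str, inputs: dict) -> None:
+    """Time one model under each compute unit; a large speedup vs cpuOnly implies ANE."""
+    timings = {}
+    for name, unit in [
+        ("cpuOnly", ct.ComputeUnit.CPU_ONLY),
+        ("cpuAndGPU", ct.ComputeUnit.CPU_AND_GPU),
+        ("cpuAndNeuralEngine", ct.ComputeUnit.CPU_AND_NE),
+    ]:
+        model = ct.models.MLModel(path, compute_units=unit)
+        model.predict(inputs)  # 1 warmup
+        t0 = time.perf_counter()
+        for _ in range(3):  # 3 timed iterations
+            model.predict(inputs)
+        timings[name] = (time.perf_counter() - t0) / 3
+    ratio = timings["cpuOnly"] / timings["cpuAndNeuralEngine"]
+    verdict = "ANE likely active" if ratio > 1.3 else "ANE likely inactive"
+    print(timings, verdict)
+
+
+# probe("build/text_encoder.mlpackage",
+#       {"text_tokens": np.zeros((1, 256), dtype=np.int32),
+#        "text_mask": np.ones((1, 256), dtype=np.float32)})
+```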
If it +crashes, file a follow-up to check whether it's an Apple issue or a +property of the conversion (e.g. `coremltools` version, MIL ops used). + +## Suggested mobius-side follow-ups + +1. **`convert_decoder_step.py`**: experiment with rank-3 K/V layout + + int32 position, validate ANE compile under real position values. +2. **`convert_decoder_prefill.py`**: name the 12 outputs explicitly. +3. **`prepare_hf_upload.py`**: ensure `decoder_prefill.mlmodelc` is always + included (not gated on availability — Swift port treats it as required). +4. **`generate_coreml.py`**: add a `--dump-intermediates` flag that writes + the per-stage tensors (`encoder_output`, the 12 prefill K/V outputs, + per-step decoder hidden, sampled codes, audio) to an `.npz`. Used by the + Swift `MagpieParityCommand` and `MagpieProbeCommand` for stage-by-stage + parity. Currently a manual modification each time. +5. **Documentation**: add a "Known Limitations" section to the main README + noting decoder_step ANE failure and the fp16 host drift. + +## Verified Swift performance budget (post-optimization) + +Reference: 8-word English sentence, M-series, warm process. + +``` +text_encoder 12 ms (ANE) — 1× +decoder_prefill 18 ms (ANE) — 1× +decoder_step ~22 ms (GPU/MPS) — ~80–200× per utterance (this is the AR loop) +nanocodec_decoder ~50 ms (ANE) — 1× +LocalTransformer ~3-5 ms/step (CPU/Accelerate+BNNS) +───────────────────────────────────── +Wall (warm) ~96 s for ~3 s of audio at 22 kHz +RTFx ~0.04× (sub-realtime) +``` + +The bottleneck is the AR loop (`decoder_step` × ~120 + LocalTransformer +sample × 8 codebooks per step). To beat realtime we need either: + +- ANE on `decoder_step` (blocked by the compile failure documented above). +- A drastically faster LocalTransformer (MLX backend candidate). +- Speculative decoding / parallel sampling (architectural change). + +## File-level cross-reference (Swift side) + +| Concern | Swift file | +| -------------------------------- | --------------------------------------------------------------------------------------- | +| Per-model compute units | `Sources/FluidAudio/TTS/Magpie/Assets/MagpieModelStore.swift` | +| Prefill batched call + KV seed | `Sources/FluidAudio/TTS/Magpie/Pipeline/Synthesize/MagpiePrefill.swift`, `MagpieKvCache.swift` | +| Stage-by-stage parity probe | `Sources/FluidAudioCLI/Commands/MagpieProbeCommand.swift` | +| Compute-device probe | `Sources/FluidAudioCLI/Commands/MagpieComputePlanCommand.swift` | +| LocalTransformer (Accelerate) | `Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformer.swift` | +| fp64 LocalTransformer reference | `Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformerDouble.swift` | +| Documentation | `Documentation/TTS/Magpie.md` | diff --git a/models/tts/magpie/coreml/build_manifest.py b/models/tts/magpie/coreml/build_manifest.py new file mode 100644 index 0000000..f2b80ca --- /dev/null +++ b/models/tts/magpie/coreml/build_manifest.py @@ -0,0 +1,380 @@ +"""Build manifest.json for the Magpie TTS hf-upload directory. + +The manifest is a machine-readable index of every artifact in the upload +(models in both .mlmodelc + .mlpackage form, constants, per-language +tokenizer files), along with shapes, sizes, and SHA-256 digests. The Swift +port's MagpieResourceDownloader consumes it to know what to fetch and how +to verify integrity. 
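+
+Shape of the emitted file, trimmed to one entry per section (byte counts,
+digests, dtypes, and shapes below are illustrative placeholders, not real
+values):
+
+    {
+      "schema_version": "1.0",
+      "repo_id": "FluidInference/magpie-tts-multilingual-357m-coreml",
+      "models": {"text_encoder": {"compiled": {"path": "text_encoder.mlmodelc", "bytes": 0}, ...}},
+      "constants": {"npy": [{"path": "constants/speaker_0.npy", "bytes": 0, "sha256": "...", "dtype": "...", "shape": [...]}]},
+      "languages": {"english": {"tokenizer_kind": "phoneme", "files": [...], "bytes": 0}},
+      "totals": {"bytes": 0, "human": "0.00 GB"}
+    }
+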
+""" + +from __future__ import annotations + +import hashlib +import json +import struct +from datetime import datetime, timezone +from pathlib import Path +from typing import Any + +ROOT = Path(__file__).resolve().parent / "hf-upload" +SCHEMA_VERSION = "1.0" +REPO_ID = "FluidInference/magpie-tts-multilingual-357m-coreml" + + +def sha256_file(path: Path) -> str: + h = hashlib.sha256() + with path.open("rb") as f: + for chunk in iter(lambda: f.read(1 << 20), b""): + h.update(chunk) + return h.hexdigest() + + +def dir_size(path: Path) -> int: + return sum(p.stat().st_size for p in path.rglob("*") if p.is_file()) + + +def file_count(path: Path) -> int: + return sum(1 for p in path.rglob("*") if p.is_file()) + + +def parse_npy_header(path: Path) -> dict[str, Any]: + """Read the v1/v2 .npy header and return shape + dtype.""" + with path.open("rb") as f: + magic = f.read(6) + if magic != b"\x93NUMPY": + raise ValueError(f"not an npy file: {path}") + major = f.read(1)[0] + f.read(1) # minor + if major == 1: + (header_len,) = struct.unpack(" dict[str, Any]: + p = ROOT / rel + info = parse_npy_header(p) + return { + "path": rel, + "bytes": p.stat().st_size, + "sha256": sha256_file(p), + "dtype": info["dtype"], + "shape": info["shape"], + } + + +def json_entry(rel: str) -> dict[str, Any]: + p = ROOT / rel + return { + "path": rel, + "bytes": p.stat().st_size, + "sha256": sha256_file(p), + } + + +def model_pair_entry(name: str, io: dict[str, Any]) -> dict[str, Any]: + mlmodelc = ROOT / f"{name}.mlmodelc" + mlpackage = ROOT / f"{name}.mlpackage" + return { + "name": name, + "compiled": { + "path": f"{name}.mlmodelc", + "bytes": dir_size(mlmodelc), + "files": file_count(mlmodelc), + }, + "package": { + "path": f"{name}.mlpackage", + "bytes": dir_size(mlpackage), + "files": file_count(mlpackage), + }, + "io": io, + } + + +# ---------- model io specs ---------------------------------------------------- + +# These specs were captured by inspecting the converted .mlpackage descriptions +# during convert_*.py runs (see generate_coreml.py for runtime keys). 
+ +MODEL_IO: dict[str, dict[str, Any]] = { + "text_encoder": { + "inputs": [ + {"name": "text_tokens", "dtype": "int32", "shape": [1, 256]}, + {"name": "text_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + "outputs": [ + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + }, + "decoder_prefill": { + "inputs": [ + {"name": "input", "dtype": "fp16", "shape": [1, 110, 768]}, + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + ], + "outputs": [ + {"name": "hidden_states", "dtype": "fp16", "shape": [1, 110, 768]}, + { + "name": "cache_*", + "dtype": "fp16", + "shape": [2, 1, 512, 12, 64], + "count": 12, + "note": "12 KV-cache outputs for the 12 decoder layers", + }, + { + "name": "position_*", + "dtype": "int32", + "shape": [], + "count": 12, + "note": "scalar position counter per layer", + }, + ], + }, + "decoder_step": { + "inputs": [ + {"name": "input", "dtype": "fp16", "shape": [1, 1, 768]}, + {"name": "encoder_output", "dtype": "fp16", "shape": [1, 256, 768]}, + {"name": "encoder_mask", "dtype": "fp16", "shape": [1, 256]}, + { + "name": "cache_*", + "dtype": "fp16", + "shape": [2, 1, 512, 12, 64], + "count": 12, + }, + {"name": "position_*", "dtype": "int32", "shape": [], "count": 12}, + ], + "outputs": [ + { + "name": "var_2201", + "dtype": "fp16", + "shape": [1, 1, 16192], + "note": "logits, reshape to (1, 1, 8, 2024) for 8 codebooks", + }, + {"name": "new_cache_*", "dtype": "fp16", "shape": [2, 1, 512, 12, 64], "count": 12}, + {"name": "var_*", "dtype": "int32", "shape": [], "count": 12, "note": "advanced positions"}, + ], + }, + "nanocodec_decoder": { + "inputs": [ + {"name": "tokens", "dtype": "int32", "shape": [1, 8, 256]}, + ], + "outputs": [ + {"name": "audio", "dtype": "fp32", "shape": [1, 262144], "note": "256 frames * 1024 samples = 11.89s @ 22050 Hz"}, + ], + "limits": {"max_frames": 256, "max_audio_seconds": 11.89}, + }, +} + + +# ---------- constants files --------------------------------------------------- + +CONSTANTS_NPY = [ + "constants/audio_embedding_0.npy", + "constants/audio_embedding_1.npy", + "constants/audio_embedding_2.npy", + "constants/audio_embedding_3.npy", + "constants/audio_embedding_4.npy", + "constants/audio_embedding_5.npy", + "constants/audio_embedding_6.npy", + "constants/audio_embedding_7.npy", + "constants/speaker_0.npy", + "constants/speaker_1.npy", + "constants/speaker_2.npy", + "constants/speaker_3.npy", + "constants/speaker_4.npy", + "constants/speaker_embeddings_raw.npy", + "constants/text_embedding.npy", +] + +CONSTANTS_JSON = [ + "constants/constants.json", + "constants/speaker_info.json", + "constants/tokenizer_info.json", + "constants/tokenizer_metadata.json", + "constants/tokenizer_references.json", +] + +LOCAL_TRANSFORMER_NPY = [ + "constants/local_transformer/in_proj_weight.npy", + "constants/local_transformer/in_proj_bias.npy", + "constants/local_transformer/pos_emb.npy", + "constants/local_transformer/norm1_weight.npy", + "constants/local_transformer/norm2_weight.npy", + "constants/local_transformer/sa_qkv_weight.npy", + "constants/local_transformer/sa_o_weight.npy", + "constants/local_transformer/ffn_conv1_weight.npy", + "constants/local_transformer/ffn_conv2_weight.npy", +] + [ + f"constants/local_transformer/out_proj_{i}_{kind}.npy" + for i in range(8) + for kind in ("weight", "bias") +] + + +# ---------- per-language tokenizer files 
-------------------------------------- + +# Mirrors MagpieLanguage in the Swift port. Languages with no entries use +# ByT5 byte-level tokenization (algorithmic, no lookup files). + +LANGUAGE_FILES: dict[str, dict[str, Any]] = { + "english": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/english_phoneme_token2id.json", + "tokenizer/english_phoneme_phoneme_dict.json", + "tokenizer/english_phoneme_heteronyms.json", + ], + }, + "spanish": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/spanish_phoneme_token2id.json", + "tokenizer/spanish_phoneme_phoneme_dict.json", + ], + }, + "german": { + "tokenizer_kind": "phoneme", + "files": [ + "tokenizer/german_phoneme_token2id.json", + "tokenizer/german_phoneme_phoneme_dict.json", + "tokenizer/german_phoneme_heteronyms.json", + ], + }, + "hindi": { + "tokenizer_kind": "char", + "files": [ + "tokenizer/hindi_chartokenizer_token2id.json", + ], + }, + "mandarin": { + "tokenizer_kind": "phoneme+jieba+pypinyin", + "files": [ + "tokenizer/mandarin_phoneme_token2id.json", + "tokenizer/mandarin_phoneme_phoneme_dict.json", + "tokenizer/mandarin_phoneme_pinyin_dict.json", + "tokenizer/mandarin_phoneme_tone_dict.json", + "tokenizer/mandarin_phoneme_ascii_letter_dict.json", + "tokenizer/mandarin_pypinyin_char_dict.json", + "tokenizer/mandarin_pypinyin_phrase_dict.json", + "tokenizer/mandarin_jieba_dict.json", + ], + }, + "french": {"tokenizer_kind": "byt5", "files": []}, + "italian": {"tokenizer_kind": "byt5", "files": []}, + "vietnamese": {"tokenizer_kind": "byt5", "files": []}, +} + + +def build_manifest() -> dict[str, Any]: + models = { + "text_encoder": model_pair_entry("text_encoder", MODEL_IO["text_encoder"]), + "decoder_prefill": model_pair_entry("decoder_prefill", MODEL_IO["decoder_prefill"]), + "decoder_step": model_pair_entry("decoder_step", MODEL_IO["decoder_step"]), + "nanocodec_decoder": model_pair_entry("nanocodec_decoder", MODEL_IO["nanocodec_decoder"]), + } + + constants = { + "json": [json_entry(p) for p in CONSTANTS_JSON], + "npy": [npy_entry(p) for p in CONSTANTS_NPY], + "local_transformer": [npy_entry(p) for p in LOCAL_TRANSFORMER_NPY], + } + + languages = {} + for lang, spec in LANGUAGE_FILES.items(): + entries = [json_entry(p) for p in spec["files"]] + languages[lang] = { + "tokenizer_kind": spec["tokenizer_kind"], + "files": entries, + "bytes": sum(e["bytes"] for e in entries), + } + + # Top-level summary + total_bytes = ( + sum(m["compiled"]["bytes"] + m["package"]["bytes"] for m in models.values()) + + sum(e["bytes"] for e in constants["json"]) + + sum(e["bytes"] for e in constants["npy"]) + + sum(e["bytes"] for e in constants["local_transformer"]) + + sum(lang["bytes"] for lang in languages.values()) + ) + + manifest = { + "schema_version": SCHEMA_VERSION, + "generated_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), + "repo_id": REPO_ID, + "model": { + "name": "Magpie TTS Multilingual", + "params_million": 357, + "sample_rate": 22050, + "codec_samples_per_frame": 1024, + "frames_per_second": 22050.0 / 1024.0, + "max_decoder_steps": 500, + "max_decoder_seconds": 500 * 1024 / 22050.0, + "max_nanocodec_frames": 256, + "max_nanocodec_seconds": 256 * 1024 / 22050.0, + "embedding_dim": 768, + "num_audio_codebooks": 8, + "codebook_size": 2024, + "audio_bos_id": 2016, + "audio_eos_id": 2017, + "forbidden_token_ids": [2016, 2018, 2019, 2020, 2021, 2022, 2023], + "num_speakers": 5, + "speaker_names": ["John", "Sofia", "Aria", "Jason", "Leo"], + "speaker_context_length": 110, + "max_text_tokens": 256, + 
"supported_languages": list(LANGUAGE_FILES.keys()), + "supported_features": [ + "ipa_override", + "deterministic_g2p", + "classifier_free_guidance", + ], + "japanese": { + "supported": False, + "note": "Japanese deferred — needs OpenJTalk + MeCab dict (separate follow-up).", + }, + "streaming_nanocodec": { + "supported": False, + "note": ( + "NanoCodec is exported as a fixed-window batch decoder (max_frames=256). " + "True streaming requires MLState conv-cache integration; tested overlap " + "warmup yields <15 dB SNR and is unviable as a fallback." + ), + }, + }, + "models": models, + "constants": constants, + "languages": languages, + "totals": { + "bytes": total_bytes, + "human": f"{total_bytes / 1_000_000_000:.2f} GB", + }, + "notes": [ + "Both .mlmodelc (compiled, ready-to-run) and .mlpackage (portable source) are shipped.", + "Swift consumers should prefer .mlmodelc; .mlpackage is provided for inspection / re-targeting.", + "Per-language tokenizer files under tokenizer/ are lazy: download only the languages you need.", + "constants/local_transformer/*.npy are loaded once into a Swift fp32 cache — see MagpieLocalTransformerWeights.swift.", + ], + } + return manifest + + +def main() -> None: + manifest = build_manifest() + out = ROOT / "manifest.json" + out.write_text(json.dumps(manifest, indent=2) + "\n") + print(f"wrote {out} ({out.stat().st_size:,} bytes)") + print(f"total assets: {manifest['totals']['human']}") + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/convert_decoder_step.py b/models/tts/magpie/coreml/convert_decoder_step.py index 5b136fe..e32f2f8 100644 --- a/models/tts/magpie/coreml/convert_decoder_step.py +++ b/models/tts/magpie/coreml/convert_decoder_step.py @@ -48,20 +48,25 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, encoder_output = torch.randn(B, T_enc, d_model) encoder_mask = torch.ones(B, T_enc, dtype=torch.bool) - # Flat cache and position args - caches = [] + # Flat split-K/V cache + position args (rank-4 — ANE-friendly). + cache_ks = [] + cache_vs = [] positions = [] for i in range(n_layers): - cache = torch.zeros(2, B, max_seq_len, H, D) + ck = torch.zeros(B, max_seq_len, H, D) + cv = torch.zeros(B, max_seq_len, H, D) # Simulate some prefilled context - cache[:, :, :10, :, :] = torch.randn(2, B, 10, H, D) * 0.1 - caches.append(cache) + ck[:, :10, :, :] = torch.randn(B, 10, H, D) * 0.1 + cv[:, :10, :, :] = torch.randn(B, 10, H, D) * 0.1 + cache_ks.append(ck) + cache_vs.append(cv) positions.append(torch.tensor([10.0])) - # Build flat argument tuple + # Build flat argument tuple: (audio_embed, encoder_output, encoder_mask, + # ck0, cv0, p0, ck1, cv1, p1, ...). 
example_inputs = (audio_embed, encoder_output, encoder_mask) for i in range(n_layers): - example_inputs = example_inputs + (caches[i], positions[i]) + example_inputs = example_inputs + (cache_ks[i], cache_vs[i], positions[i]) # Trace print("Tracing model...") @@ -76,7 +81,8 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, ct.TensorType(name="encoder_mask", shape=(1, T_enc), dtype=np.bool_), ] for i in range(n_layers): - inputs.append(ct.TensorType(name=f"cache{i}", shape=(2, 1, max_seq_len, H, D))) + inputs.append(ct.TensorType(name=f"cache_k{i}", shape=(1, max_seq_len, H, D))) + inputs.append(ct.TensorType(name=f"cache_v{i}", shape=(1, max_seq_len, H, D))) inputs.append(ct.TensorType(name=f"position{i}", shape=(1,))) mlmodel = ct.convert( @@ -109,7 +115,8 @@ def convert_decoder_step(nemo_path=None, max_seq_len=512, max_text_len=256, "encoder_mask": np.ones((1, T_enc), dtype=np.float32), } for i in range(n_layers): - test_inputs[f"cache{i}"] = np.zeros((2, 1, max_seq_len, H, D), dtype=np.float32) + test_inputs[f"cache_k{i}"] = np.zeros((1, max_seq_len, H, D), dtype=np.float32) + test_inputs[f"cache_v{i}"] = np.zeros((1, max_seq_len, H, D), dtype=np.float32) test_inputs[f"position{i}"] = np.array([0.0], dtype=np.float32) out = coreml_model.predict(test_inputs) diff --git a/models/tts/magpie/coreml/convert_decoder_step_stateful.py b/models/tts/magpie/coreml/convert_decoder_step_stateful.py new file mode 100644 index 0000000..6f5aad9 --- /dev/null +++ b/models/tts/magpie/coreml/convert_decoder_step_stateful.py @@ -0,0 +1,158 @@ +"""EXPERIMENTAL — DO NOT USE IN PRODUCTION. + +Convert decoder step model to CoreML — STATEFUL variant (MLState). + +Kept as a documented dead-end. Benchmark on Apple M2 / macOS 26.5 / 146-step +real loop showed this variant runs at ~212 ms/step vs ~96 ms/step for the +production rank-4 split-K/V graph (2.2× regression). See sibling file +``traceable/traceable_decoder_step_stateful.py`` for full rationale. + +KV caches are managed as on-device state buffers via ``ct.StateType`` instead of +being passed in/out of the graph as 36 input/output tensors per step. The model +exposes a tiny IO surface (4 inputs, 2 outputs). + +Caveat: stateful CoreML graphs do not target ANE. We force CPU+GPU at runtime, +which is exactly why this variant loses for Magpie (rank-4 production already +gets 97.3% on ANE). + +Usage: + python convert_decoder_step_stateful.py [--nemo-path /path/to/model.nemo] +""" +import argparse +import os +import sys + +import coremltools as ct +import numpy as np +import torch + +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) +from traceable.traceable_decoder_step_stateful import StatefulDecoderStep + + +def convert_decoder_step_stateful(nemo_path=None, max_seq_len=512, max_text_len=256, + output_path="build/decoder_step_stateful.mlpackage"): + print("Loading MagpieTTS model...") + from nemo.collections.tts.models import MagpieTTSModel + if nemo_path: + model = MagpieTTSModel.restore_from(nemo_path, map_location="cpu") + else: + model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m") + model.eval() + + cfg = model.cfg + dec_cfg = dict(cfg.decoder) + d_model = dec_cfg["d_model"] + n_layers = dec_cfg["n_layers"] + sa_n_heads = dec_cfg["sa_n_heads"] + d_head = d_model // sa_n_heads + + print("Creating stateful traceable decoder step...") + decoder = StatefulDecoderStep.from_magpie(model) + decoder.eval() + decoder.reset_state() + + # Example inputs. Position is a 1-elem int32 scalar. 
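+    # (The production rank-4 graph feeds one float position per layer,
+    # position0..position11; the stateful variant keeps the caches in MLState
+    # and only needs this single shared int32 counter.)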
+ B = 1 + T_enc = max_text_len + + audio_embed = torch.randn(B, 1, d_model) + encoder_output = torch.randn(B, T_enc, d_model) + encoder_mask = torch.ones(B, T_enc, dtype=torch.bool) + position = torch.tensor([0], dtype=torch.int32) + + example_inputs = (audio_embed, encoder_output, encoder_mask, position) + + print("Tracing model...") + with torch.no_grad(): + traced = torch.jit.trace(decoder, example_inputs, strict=False) + + print("Converting to CoreML (stateful)...") + inputs = [ + ct.TensorType(name="audio_embed", shape=(1, 1, d_model)), + ct.TensorType(name="encoder_output", shape=(1, T_enc, d_model)), + ct.TensorType(name="encoder_mask", shape=(1, T_enc), dtype=np.bool_), + ct.TensorType(name="position", shape=(1,), dtype=np.int32), + ] + + states = [] + for i in range(n_layers): + states.append(ct.StateType( + wrapped_type=ct.TensorType( + shape=(1, max_seq_len, sa_n_heads, d_head), + dtype=np.float16, + ), + name=f"k_cache_{i}", + )) + states.append(ct.StateType( + wrapped_type=ct.TensorType( + shape=(1, max_seq_len, sa_n_heads, d_head), + dtype=np.float16, + ), + name=f"v_cache_{i}", + )) + + mlmodel = ct.convert( + traced, + inputs=inputs, + states=states, + convert_to="mlprogram", + compute_precision=ct.precision.FLOAT16, + compute_units=ct.ComputeUnit.CPU_AND_GPU, + minimum_deployment_target=ct.target.macOS15, + ) + + os.makedirs(os.path.dirname(output_path), exist_ok=True) + mlmodel.save(output_path) + print(f"Saved to {output_path}") + + spec = mlmodel.get_spec() + print("\n=== INPUTS ===") + for inp in spec.description.input: + if inp.type.HasField("multiArrayType"): + shape = list(inp.type.multiArrayType.shape) + print(f" {inp.name}: {shape}") + print("\n=== OUTPUTS ===") + for out in spec.description.output: + if out.type.HasField("multiArrayType"): + shape = list(out.type.multiArrayType.shape) + print(f" {out.name}: {shape}") + print("\n=== STATES ===") + if hasattr(spec.description, "state"): + for s in spec.description.state: + # State features use the ``stateType`` oneof, which wraps an + # ``arrayType`` (multiArrayType-equivalent) on the inside. 
+ try: + shape = list(s.type.stateType.arrayType.shape) + print(f" {s.name}: {shape}") + except Exception as exc: # pragma: no cover - inspection only + print(f" {s.name}: ") + + print("\nTesting CoreML model with state...") + coreml_model = ct.models.MLModel(output_path, compute_units=ct.ComputeUnit.CPU_AND_GPU) + state = coreml_model.make_state() + + test_inputs = { + "audio_embed": np.random.randn(1, 1, d_model).astype(np.float32), + "encoder_output": np.random.randn(1, T_enc, d_model).astype(np.float32), + "encoder_mask": np.ones((1, T_enc), dtype=np.float32), + "position": np.array([0], dtype=np.int32), + } + + out = coreml_model.predict(test_inputs, state=state) + print(f"Output keys: {len(out)}") + for k, v in sorted(out.items()): + if isinstance(v, np.ndarray): + print(f" {k}: shape={v.shape}") + print("Done!") + return output_path + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--nemo-path", type=str, default=None) + parser.add_argument("--max-seq-len", type=int, default=512) + parser.add_argument("--max-text-len", type=int, default=256) + parser.add_argument("--output", type=str, default="build/decoder_step_stateful.mlpackage") + args = parser.parse_args() + convert_decoder_step_stateful(args.nemo_path, args.max_seq_len, args.max_text_len, args.output) diff --git a/models/tts/magpie/coreml/emit_parity_fixture.py b/models/tts/magpie/coreml/emit_parity_fixture.py new file mode 100644 index 0000000..ab0aa58 --- /dev/null +++ b/models/tts/magpie/coreml/emit_parity_fixture.py @@ -0,0 +1,332 @@ +"""Emit intermediate-tensor fixtures for cross-implementation parity testing. + +Runs the Magpie CoreML pipeline for a fixed (text, speaker, language, seed) +and dumps intermediate tensors so the Swift port (or any other +implementation) can replay each stage and diff against this ground truth. + +Two output modes: + +- ``--mode full`` (default): runs the full pipeline and saves an ``.npz`` with + text tokens, encoder output, post-prefill KV caches, per-step decoder + hidden states, per-step sampled codes, the final ``(8, N)`` codes matrix, + and the decoded PCM. +- ``--mode tokenizer``: tokenizes only and saves a ``.json`` mapping + ``{text, speaker, language, token_ids}`` — cheap to diff against the Swift + ``MagpieTokenizer`` output without requiring CoreML at all. + +Example: + + python emit_parity_fixture.py "Hello world." \\ + --speaker 0 --language en --seed 42 \\ + --output fixture_en_s0.npz + + python emit_parity_fixture.py "Hello world." \\ + --speaker 0 --language en --mode tokenizer \\ + --output fixture_en_s0_tokens.json +""" +from __future__ import annotations + +import argparse +import json +import math +import os +import time +from typing import Any + +import coremltools as ct +import numpy as np +import soundfile as sf + +# Re-use everything from the main script so we never drift from the reference. 
+from generate_coreml import ( # noqa: E402 + BUILD_DIR, + DECODER_CACHE_OUT_KEYS, + DECODER_HIDDEN_KEY, + DECODER_POSITION_KEYS, + _tokenize_text, + embed_audio_codes, + load_audio_embeddings, + load_constants, + load_local_transformer, + load_speaker_embedding, + local_transformer_sample, +) + + +def _make_caches(n_layers: int, max_seq_len: int, n_heads: int, d_head: int): + c, p = {}, {} + for i in range(n_layers): + c[f"cache{i}"] = np.zeros( + (2, 1, max_seq_len, n_heads, d_head), dtype=np.float32 + ) + p[f"position{i}"] = np.array([0.0], dtype=np.float32) + return c, p + + +def emit_tokenizer_fixture( + text: str, + speaker: int, + language: str, + output_path: str, +) -> None: + constants = load_constants() + token_ids = _tokenize_text(text, language, constants).tolist() + fixture = { + "text": text, + "speakerIndex": speaker, + "languageCode": language, + "expectedTokenIds": token_ids, + } + with open(output_path, "w") as f: + json.dump(fixture, f, indent=2, ensure_ascii=False) + print(f"Wrote tokenizer fixture → {output_path} ({len(token_ids)} tokens)") + + +def emit_full_fixture( + text: str, + speaker: int, + language: str, + output_path: str, + temperature: float, + topk: int, + max_steps: int, + seed: int, + use_cfg: bool, + cfg_scale: float, +) -> None: + np.random.seed(seed) + constants = load_constants() + + num_codebooks = constants["num_audio_codebooks"] + audio_bos_id = constants["special_tokens"]["audio_bos_id"] + audio_eos_id = constants["special_tokens"]["audio_eos_id"] + sample_rate = constants["output_sample_rate"] + d_model = constants["decoder"]["d_model"] + n_layers = constants["decoder"]["n_layers"] + sa_n_heads = constants["decoder"]["sa_n_heads"] + d_head = d_model // sa_n_heads + max_text_len = 256 + max_seq_len = 512 + min_frames = constants["inference"].get("min_generated_frames", 4) + + # --- 1. Tokenize --- + text_tokens = _tokenize_text(text, language, constants) + T_text = int(len(text_tokens)) + text_tokens_padded = np.zeros(max_text_len, dtype=np.int32) + text_tokens_padded[:T_text] = text_tokens + text_mask = np.zeros(max_text_len, dtype=np.float32) + text_mask[:T_text] = 1.0 + + # --- 2. Load models --- + text_encoder = ct.models.MLModel( + os.path.join(BUILD_DIR, "text_encoder.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + nanocodec = ct.models.MLModel( + os.path.join(BUILD_DIR, "nanocodec_decoder.mlpackage"), + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + + # --- 3. Encode text --- + enc_out = text_encoder.predict({ + "text_tokens": text_tokens_padded[np.newaxis, :], + "text_mask": text_mask[np.newaxis, :], + }) + encoder_output = np.asarray(enc_out["encoder_output"], dtype=np.float32) + + if use_cfg: + uncond_encoder_output = np.zeros_like(encoder_output) + uncond_text_mask = np.zeros_like(text_mask) + uncond_text_mask[0] = 1.0 + + # --- 4. 
Load embeddings + LT weights --- + speaker_emb = load_speaker_embedding(speaker) + T_ctx = int(speaker_emb.shape[0]) + audio_emb_tables = load_audio_embeddings(constants) + lt_weights = load_local_transformer() + + caches, positions = _make_caches(n_layers, max_seq_len, sa_n_heads, d_head) + if use_cfg: + u_caches, u_positions = _make_caches(n_layers, max_seq_len, sa_n_heads, d_head) + + def _run_step(audio_embed, enc_out_np, mask_np, cache_dict, pos_dict): + inputs: dict[str, Any] = { + "audio_embed": audio_embed.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + } + inputs.update(cache_dict) + inputs.update(pos_dict) + out = decoder_step.predict(inputs) + for i in range(n_layers): + cache_dict[f"cache{i}"] = out[DECODER_CACHE_OUT_KEYS[i]] + pos_dict[f"position{i}"] = out[DECODER_POSITION_KEYS[i]] + return np.asarray(out[DECODER_HIDDEN_KEY], dtype=np.float32) + + # --- 5. Prefill --- + uncond_ctx = np.zeros((1, 1, d_model), dtype=np.float32) + for t in range(T_ctx): + ctx = speaker_emb[np.newaxis, np.newaxis, t, :] + _run_step(ctx, encoder_output, text_mask, caches, positions) + if use_cfg: + _run_step(uncond_ctx, uncond_encoder_output, uncond_text_mask, + u_caches, u_positions) + + # Snapshot KV caches after prefill (deep-copied so later rotation doesn't + # mutate the fixture). + prefill_caches = {k: v.copy() for k, v in caches.items()} + prefill_positions = {k: v.copy() for k, v in positions.items()} + + # --- 6. AR loop --- + current_codes = np.full(num_codebooks, audio_bos_id, dtype=np.int32) + per_step_hidden: list[np.ndarray] = [] + per_step_codes: list[np.ndarray] = [] + + gen_start = time.time() + for step in range(max_steps): + audio_embed = embed_audio_codes(current_codes, audio_emb_tables, num_codebooks) + cond_hidden = _run_step(audio_embed, encoder_output, text_mask, caches, positions) + + if use_cfg: + uncond_hidden = _run_step( + audio_embed, uncond_encoder_output, uncond_text_mask, + u_caches, u_positions, + ) + uncond_dec_hidden = uncond_hidden[0, 0] + else: + uncond_dec_hidden = None + + decoder_hidden = cond_hidden[0, 0] + per_step_hidden.append(decoder_hidden.copy()) + + forbid_eos = step < min_frames + next_codes = local_transformer_sample( + decoder_hidden, lt_weights, audio_emb_tables, + num_codebooks, temperature, topk, forbid_eos, + uncond_decoder_hidden=uncond_dec_hidden, + cfg_scale=cfg_scale if use_cfg else 1.0, + ) + + is_eos = bool(np.any(next_codes == audio_eos_id)) + if is_eos and step >= min_frames: + per_step_codes.append(next_codes.copy()) + break + per_step_codes.append(next_codes.copy()) + current_codes = next_codes + + gen_time = time.time() - gen_start + + predicted_codes_full = np.stack(per_step_codes, axis=1) # (8, N) + + # --- 7. NanoCodec decode --- + max_frames = 256 + T_total = min(predicted_codes_full.shape[1], max_frames) + padded = np.zeros((num_codebooks, max_frames), dtype=np.int32) + padded[:, :T_total] = predicted_codes_full[:, :T_total] + codec_out = nanocodec.predict({ + "tokens": padded[np.newaxis, :, :].astype(np.int32), + }) + audio = np.asarray(codec_out["audio"], dtype=np.float32) + if audio.ndim > 1: + audio = audio.flatten() + expected_samples = T_total * constants["codec_samples_per_frame"] + audio = audio[:expected_samples] + peak = float(np.abs(audio).max()) + if peak > 0: + audio = audio / peak * 0.9 + + # --- 8. 
Pack fixture --- + fixture: dict[str, Any] = { + # Config + "text": np.array(text), + "speakerIndex": np.int32(speaker), + "languageCode": np.array(language), + "seed": np.int32(seed), + "useCfg": np.bool_(use_cfg), + "cfgScale": np.float32(cfg_scale), + "temperature": np.float32(temperature), + "topk": np.int32(topk), + "sampleRate": np.int32(sample_rate), + "minFrames": np.int32(min_frames), + # Stage 1: tokenizer + "textTokens": text_tokens.astype(np.int32), + "textTokensPadded": text_tokens_padded.astype(np.int32), + "textMask": text_mask.astype(np.float32), + # Stage 2: text encoder + "encoderOutput": encoder_output.astype(np.float32), + # Stage 3: post-prefill caches + **{f"prefillCache{i}": prefill_caches[f"cache{i}"].astype(np.float32) + for i in range(n_layers)}, + **{f"prefillPosition{i}": prefill_positions[f"position{i}"].astype(np.float32) + for i in range(n_layers)}, + # Stage 4: per-step AR trace + "perStepDecoderHidden": np.stack(per_step_hidden, axis=0).astype(np.float32), + "perStepCodes": np.stack(per_step_codes, axis=0).astype(np.int32), + "predictedCodes": predicted_codes_full.astype(np.int32), + # Stage 5: audio + "audioPcm": audio.astype(np.float32), + "audioSamples": np.int32(len(audio)), + "genTimeSeconds": np.float32(gen_time), + } + + np.savez_compressed(output_path, **fixture) + + duration = len(audio) / sample_rate if sample_rate > 0 else 0.0 + rtf = gen_time / duration if duration > 0 else math.inf + print(f"Wrote full fixture → {output_path}") + print(f" tokens={T_text} frames={predicted_codes_full.shape[1]} " + f"duration={duration:.2f}s rtf={rtf:.2f}x") + + wav_path = os.path.splitext(output_path)[0] + ".wav" + sf.write(wav_path, audio, sample_rate) + print(f" reference audio → {wav_path}") + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Emit Magpie TTS parity fixtures for cross-impl testing.", + ) + parser.add_argument("text", type=str, help="Text to synthesize") + parser.add_argument("--mode", choices=["full", "tokenizer"], default="full", + help="'full' dumps .npz of all intermediates; " + "'tokenizer' dumps a small .json of token ids") + parser.add_argument("--speaker", type=int, default=0) + parser.add_argument("--language", type=str, default="en") + parser.add_argument("--output", type=str, required=True, + help="Output path (.npz for full, .json for tokenizer)") + parser.add_argument("--seed", type=int, default=42) + parser.add_argument("--temperature", type=float, default=0.6) + parser.add_argument("--topk", type=int, default=80) + parser.add_argument("--max-steps", type=int, default=500) + parser.add_argument("--no-cfg", action="store_true") + parser.add_argument("--cfg-scale", type=float, default=2.5) + args = parser.parse_args() + + if args.mode == "tokenizer": + emit_tokenizer_fixture( + text=args.text, + speaker=args.speaker, + language=args.language, + output_path=args.output, + ) + else: + emit_full_fixture( + text=args.text, + speaker=args.speaker, + language=args.language, + output_path=args.output, + temperature=args.temperature, + topk=args.topk, + max_steps=args.max_steps, + seed=args.seed, + use_cfg=not args.no_cfg, + cfg_scale=args.cfg_scale, + ) + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/generate_coreml.py b/models/tts/magpie/coreml/generate_coreml.py index e786ca8..8bbae3c 100644 --- a/models/tts/magpie/coreml/generate_coreml.py +++ b/models/tts/magpie/coreml/generate_coreml.py @@ -25,18 +25,37 @@ CONST_DIR = os.path.join(SCRIPT_DIR, "constants") BUILD_DIR = 
os.path.join(SCRIPT_DIR, "build") -# Decoder step output key names (from CoreML model spec) -DECODER_LOGITS_KEY = "var_2201" +# EXPERIMENTAL: Stateful (MLState) decoder path. Off by default. +# +# Enable by setting MAGPIE_STATEFUL=1. Kept for reference only — benchmarks +# (Apple M2, 146-step real loop) showed ~212 ms/step vs ~96 ms/step for the +# rank-4 production path: 2.2× regression. Stateful graphs run on CPU+GPU +# only (ANE rejects them); the IO-marshaling savings from collapsing 36 cache +# tensors don't compensate for losing ANE acceleration. See +# ``traceable/traceable_decoder_step_stateful.py`` for full rationale. +STATEFUL = bool(os.environ.get("MAGPIE_STATEFUL", "")) + +# Decoder step output key names (from CoreML model spec — rank-4 split-K/V) +DECODER_LOGITS_KEY = "var_2129" DECODER_HIDDEN_KEY = "input" -# Output cache keys (input keys are cache0..cache11) -DECODER_CACHE_OUT_KEYS = [ - "new_cache_1", "new_cache_3", "new_cache_5", "new_cache_7", - "new_cache_9", "new_cache_11", "new_cache_13", "new_cache_15", - "new_cache_17", "new_cache_19", "new_cache_21", "new_cache", + +# Stateful model uses a different logits key (re-traced graph reorders ops). +DECODER_LOGITS_KEY_STATEFUL = "var_2124" +# Per-layer K and V output keys (12 layers each). +# Input keys are cache_k0..cache_k11 / cache_v0..cache_v11 / position0..position11. +DECODER_CACHE_K_OUT_KEYS = [ + "new_k_1", "new_k_3", "new_k_5", "new_k_7", + "new_k_9", "new_k_11", "new_k_13", "new_k_15", + "new_k_17", "new_k_19", "new_k_21", "new_k", +] +DECODER_CACHE_V_OUT_KEYS = [ + "new_v_1", "new_v_3", "new_v_5", "new_v_7", + "new_v_9", "new_v_11", "new_v_13", "new_v_15", + "new_v_17", "new_v_19", "new_v_21", "new_v", ] DECODER_POSITION_KEYS = [ - "var_169", "var_346", "var_523", "var_700", "var_877", "var_1054", - "var_1231", "var_1408", "var_1585", "var_1762", "var_1939", "var_2116", + "var_169", "var_339", "var_509", "var_679", "var_849", "var_1019", + "var_1189", "var_1359", "var_1529", "var_1699", "var_1869", "var_2039", ] # Forbidden token IDs (special tokens that should never be sampled) @@ -291,10 +310,17 @@ def generate( os.path.join(BUILD_DIR, "text_encoder.mlpackage"), compute_units=ct.ComputeUnit.CPU_AND_GPU, ) - decoder_step = ct.models.MLModel( - os.path.join(BUILD_DIR, "decoder_step.mlpackage"), - compute_units=ct.ComputeUnit.CPU_AND_GPU, - ) + if STATEFUL: + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step_stateful.mlpackage"), + # Stateful graphs are not ANE-compatible; CPU+GPU only. + compute_units=ct.ComputeUnit.CPU_AND_GPU, + ) + else: + decoder_step = ct.models.MLModel( + os.path.join(BUILD_DIR, "decoder_step.mlpackage"), + compute_units=ct.ComputeUnit.ALL, # rank-4 split-K/V — ANE compiles for some ops + ) nanocodec = ct.models.MLModel( os.path.join(BUILD_DIR, "nanocodec_decoder.mlpackage"), compute_units=ct.ComputeUnit.CPU_AND_GPU, @@ -322,32 +348,59 @@ def generate( audio_emb_tables = load_audio_embeddings(constants) lt_weights = load_local_transformer() - # 5. 
Initialize KV caches (conditional) - def make_caches(): - c, p = {}, {} - for i in range(n_layers): - c[f"cache{i}"] = np.zeros((2, 1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) - p[f"position{i}"] = np.array([0.0], dtype=np.float32) - return c, p - - caches, positions = make_caches() - if use_cfg: - uncond_caches, uncond_positions = make_caches() - - def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): - step_inputs = { - "audio_embed": audio_embed_np.astype(np.float32), - "encoder_output": enc_out_np.astype(np.float32), - "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), - } - step_inputs.update(cache_dict) - step_inputs.update(pos_dict) - step_out = decoder_step.predict(step_inputs) - for i in range(n_layers): - # Output cache keys differ from input keys after scatter-based cache rewrite - cache_dict[f"cache{i}"] = step_out[DECODER_CACHE_OUT_KEYS[i]] - pos_dict[f"position{i}"] = step_out[DECODER_POSITION_KEYS[i]] - return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) — decoder hidden + # 5. Initialize KV caches. + # Stateful path: caches live on the model's MLState; we just track a + # per-stream scalar position. Non-stateful path: explicit numpy buffers. + if STATEFUL: + # State buffers are owned by the CoreML runtime — we just need a + # position counter per stream. Use a 1-element list so the closure + # can mutate it. Alias to ``caches``/``positions`` so the prefill + + # generation call sites work unchanged for both code paths. + caches = decoder_step.make_state() + positions = [0] + if use_cfg: + uncond_caches = decoder_step.make_state() + uncond_positions = [0] + else: + def make_caches(): + c, p = {}, {} + for i in range(n_layers): + c[f"cache_k{i}"] = np.zeros((1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) + c[f"cache_v{i}"] = np.zeros((1, max_seq_len, sa_n_heads, d_head), dtype=np.float32) + p[f"position{i}"] = np.array([0.0], dtype=np.float32) + return c, p + + caches, positions = make_caches() + if use_cfg: + uncond_caches, uncond_positions = make_caches() + + if STATEFUL: + def run_decoder_step(audio_embed_np, enc_out_np, mask_np, state, pos_box): + step_inputs = { + "audio_embed": audio_embed_np.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + "position": np.array([pos_box[0]], dtype=np.int32), + } + step_out = decoder_step.predict(step_inputs, state=state) + pos_box[0] += 1 + return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) + else: + def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): + step_inputs = { + "audio_embed": audio_embed_np.astype(np.float32), + "encoder_output": enc_out_np.astype(np.float32), + "encoder_mask": mask_np[np.newaxis, :].astype(np.float32), + } + step_inputs.update(cache_dict) + step_inputs.update(pos_dict) + step_out = decoder_step.predict(step_inputs) + for i in range(n_layers): + # Output cache keys differ from input keys due to torch trace renaming. + cache_dict[f"cache_k{i}"] = step_out[DECODER_CACHE_K_OUT_KEYS[i]] + cache_dict[f"cache_v{i}"] = step_out[DECODER_CACHE_V_OUT_KEYS[i]] + pos_dict[f"position{i}"] = step_out[DECODER_POSITION_KEYS[i]] + return step_out[DECODER_HIDDEN_KEY] # (1, 1, d_model) — decoder hidden # 6. 
Prefill context # Conditional path: real speaker context + real encoder output @@ -361,7 +414,8 @@ def run_decoder_step(audio_embed_np, enc_out_np, mask_np, cache_dict, pos_dict): run_decoder_step(uncond_ctx_token, uncond_encoder_output, uncond_text_mask, uncond_caches, uncond_positions) if (t + 1) % 50 == 0: print(f" Prefilled {t + 1}/{T_ctx}") - print(f" Prefill done. Position: {positions['position0'][0]:.0f}") + final_pos = positions[0] if STATEFUL else float(positions['position0'][0]) + print(f" Prefill done. Position: {final_pos:.0f}") # 7. Autoregressive generation with local transformer print(f"\nGenerating (max {max_steps} steps)...") diff --git a/models/tts/magpie/coreml/prepare_hf_upload.py b/models/tts/magpie/coreml/prepare_hf_upload.py new file mode 100644 index 0000000..829ae2b --- /dev/null +++ b/models/tts/magpie/coreml/prepare_hf_upload.py @@ -0,0 +1,493 @@ +"""Stage a HuggingFace-ready directory for Magpie TTS Multilingual 357M. + +The mobius exporters and converters write into two local directories: + +- ``build/`` — compiled ``.mlpackage`` bundles (and, after + ``compile_mlmodelc.py``, matching ``.mlmodelc`` bundles). +- ``constants/`` — ``.npy`` tensors, ``*.json`` config, the + ``local_transformer/`` subtree, **and** the per-language tokenizer + JSONs. + +The FluidAudio Swift port and the target HF repo +(``FluidInference/magpie-tts-multilingual-357m-coreml``) expect a slightly +different layout: CoreML models at the root, tokenizer JSONs in a +dedicated ``tokenizer/`` folder, everything else in ``constants/``. This +script assembles that layout into ``hf-upload/`` (configurable), writes a +model card + ``.gitattributes``, validates that nothing required is +missing, and prints the exact ``huggingface-cli upload`` commands for +the user to run. + +It does **not** upload anything. Per project policy, HF uploads are +performed manually by the maintainers. + +Usage: + + # After running the converter + compiler + constants exporters + python prepare_hf_upload.py + + # Custom paths / output + python prepare_hf_upload.py \\ + --build-dir build \\ + --constants-dir constants \\ + --output-dir hf-upload \\ + --repo-id FluidInference/magpie-tts-multilingual-357m-coreml +""" +from __future__ import annotations + +import argparse +import json +import os +import shutil +import sys +from dataclasses import dataclass, field + +SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) + +# Core models expected at the repo root. +REQUIRED_MODELS = [ + "text_encoder.mlmodelc", + "decoder_step.mlmodelc", + "nanocodec_decoder.mlmodelc", +] +OPTIONAL_MODELS = [ + "decoder_prefill.mlmodelc", +] + +# Keys that MUST survive in constants/. Anything not in this allow-list that +# also isn't a per-language tokenizer file will be flagged as unknown. +CONSTANTS_KEEP_FILES = { + "constants.json", + "speaker_info.json", + "tokenizer_info.json", + "tokenizer_metadata.json", + "tokenizer_references.json", + "text_embedding.npy", + "speaker_embeddings_raw.npy", +} +CONSTANTS_KEEP_PREFIXES = ( + "speaker_", # speaker_0.npy .. speaker_N.npy + "audio_embedding_", # audio_embedding_0.npy .. audio_embedding_7.npy +) +CONSTANTS_KEEP_DIRS = {"local_transformer"} + +# Mirror of MagpieTokenizerFiles.files(for:) in the Swift port. 
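+# Keep this table in sync with the Swift enum and with LANGUAGE_FILES in
+# build_manifest.py; any file listed here that is missing from constants/
+# shows up as MISS in the prep report.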
+PER_LANGUAGE_TOKENIZER_FILES = { + "english": [ + "english_phoneme_token2id.json", + "english_phoneme_phoneme_dict.json", + ], + "spanish": [ + "spanish_phoneme_token2id.json", + "spanish_phoneme_phoneme_dict.json", + ], + "italian": [ + "italian_phoneme_token2id.json", + "italian_phoneme_phoneme_dict.json", + ], + "vietnamese": [ + "vietnamese_phoneme_token2id.json", + "vietnamese_phoneme_phoneme_dict.json", + ], + "german": [ + "german_phoneme_token2id.json", + "german_phoneme_phoneme_dict.json", + "german_phoneme_heteronyms.json", + ], + "french": [ + "french_chartokenizer_token2id.json", + ], + "hindi": [ + "hindi_chartokenizer_token2id.json", + ], + "mandarin": [ + "mandarin_phoneme_token2id.json", + "mandarin_phoneme_pinyin_dict.json", + "mandarin_phoneme_tone_dict.json", + "mandarin_phoneme_ascii_letter_dict.json", + "mandarin_pypinyin_char_dict.json", + "mandarin_pypinyin_phrase_dict.json", + "mandarin_jieba_dict.json", + ], +} + +ALL_TOKENIZER_FILES = { + fname for files in PER_LANGUAGE_TOKENIZER_FILES.values() for fname in files +} + + +GITATTRIBUTES = """\ +*.mlmodelc filter=lfs diff=lfs merge=lfs -text +*.mlmodel filter=lfs diff=lfs merge=lfs -text +*.mlpackage filter=lfs diff=lfs merge=lfs -text +*.npy filter=lfs diff=lfs merge=lfs -text +*.bin filter=lfs diff=lfs merge=lfs -text +*.safetensors filter=lfs diff=lfs merge=lfs -text +*.onnx filter=lfs diff=lfs merge=lfs -text +""" + + +README_TEMPLATE = """\ +--- +license: cc-by-4.0 +language: + - en + - es + - de + - fr + - it + - vi + - zh + - hi +tags: + - text-to-speech + - coreml + - apple-silicon + - magpie +library_name: coreml +base_model: nvidia/magpie_tts_multilingual_357m +--- + +# Magpie TTS Multilingual 357M (CoreML) + +CoreML export of NVIDIA's [Magpie TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m), optimized for on-device inference on Apple Silicon. Ships as `.mlmodelc` bundles compiled for macOS 14+ / iOS 17+. + +Converted with [FluidInference/mobius](https://github.com/FluidInference/mobius). Consumed by the Swift port in [FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio) (see `Sources/FluidAudio/TTS/Magpie/`). + +## Languages + +English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi. Japanese is not yet included. + +## Contents + +``` +├── text_encoder.mlmodelc/ # Text → (B, 256, 768) encoder output +├── decoder_step.mlmodelc/ # 12-layer AR decoder (stateful KV cache) +├── decoder_prefill.mlmodelc/ # (optional) batched prefill fast path +├── nanocodec_decoder.mlmodelc/ # 8-codebook → PCM vocoder (22050 Hz) +├── constants/ +│ ├── constants.json # d_model, n_layers, EOS ids, ... +│ ├── speaker_info.json # speaker names + context shape +│ ├── tokenizer_metadata.json # tokenizer-agnostic EOS + special tokens +│ ├── speaker_0.npy .. speaker_4.npy +│ ├── audio_embedding_0.npy .. 
audio_embedding_7.npy +│ └── local_transformer/ # 1-layer transformer weights (Swift reads .npy) +└── tokenizer/ + ├── english_phoneme_*.json + ├── spanish_phoneme_*.json + ├── german_phoneme_*.json + ├── french_chartokenizer_*.json + ├── italian_phoneme_*.json + ├── vietnamese_phoneme_*.json + ├── mandarin_*.json + └── hindi_chartokenizer_*.json +``` + +## Usage (Swift) + +```swift +import FluidAudio + +let manager = try await MagpieTtsManager.downloadAndCreate( + languages: [.english, .spanish] +) +let result = try await manager.synthesize( + text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.", + speaker: .john, + language: .english +) +let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate) +try wav.write(to: URL(fileURLWithPath: "hello.wav")) +``` + +The manager lazy-downloads everything in this repo on first use. + +## Inline IPA override + +Text enclosed in `|...|` is passed straight to the tokenizer as whitespace-separated IPA tokens: + +``` +"Hello | ˈ n ɛ m o ʊ | world" +``` + +## License + +- CoreML export: CC-BY-4.0 (inherits from the upstream NeMo model). +- Upstream weights: see [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m). +""" + + +@dataclass +class PrepReport: + copied_models: list[str] = field(default_factory=list) + missing_required_models: list[str] = field(default_factory=list) + missing_optional_models: list[str] = field(default_factory=list) + copied_constants: list[str] = field(default_factory=list) + missing_constants: list[str] = field(default_factory=list) + copied_tokenizer_files: dict[str, list[str]] = field(default_factory=dict) + missing_tokenizer_files: dict[str, list[str]] = field(default_factory=dict) + unknown_files: list[str] = field(default_factory=list) + + def has_errors(self) -> bool: + return bool(self.missing_required_models or self.missing_constants) + + +def _copy_tree(src: str, dst: str) -> None: + if os.path.isdir(src): + if os.path.exists(dst): + shutil.rmtree(dst) + shutil.copytree(src, dst) + else: + os.makedirs(os.path.dirname(dst), exist_ok=True) + shutil.copy2(src, dst) + + +def _copy_models(build_dir: str, output_dir: str, report: PrepReport) -> None: + for model in REQUIRED_MODELS: + src = os.path.join(build_dir, model) + if not os.path.exists(src): + report.missing_required_models.append(model) + continue + dst = os.path.join(output_dir, model) + _copy_tree(src, dst) + report.copied_models.append(model) + + for model in OPTIONAL_MODELS: + src = os.path.join(build_dir, model) + if not os.path.exists(src): + report.missing_optional_models.append(model) + continue + dst = os.path.join(output_dir, model) + _copy_tree(src, dst) + report.copied_models.append(model) + + +def _copy_constants(constants_dir: str, output_dir: str, report: PrepReport) -> None: + dst_constants = os.path.join(output_dir, "constants") + os.makedirs(dst_constants, exist_ok=True) + + required = {"constants.json", "speaker_info.json", "tokenizer_metadata.json"} + required |= {f"audio_embedding_{i}.npy" for i in range(8)} + required |= {f"speaker_{i}.npy" for i in range(5)} + # local_transformer/ is a dir — enumerate expected files separately. 
+ local_transformer_files = { + "in_proj_weight.npy", + "in_proj_bias.npy", + "pos_emb.npy", + "norm1_weight.npy", + "sa_qkv_weight.npy", + "sa_o_weight.npy", + "norm2_weight.npy", + "ffn_conv1_weight.npy", + "ffn_conv2_weight.npy", + } + for i in range(8): + local_transformer_files.add(f"out_proj_{i}_weight.npy") + local_transformer_files.add(f"out_proj_{i}_bias.npy") + + for entry in sorted(os.listdir(constants_dir)): + src = os.path.join(constants_dir, entry) + + # Tokenizer files are moved out to tokenizer/ — skip here. + if entry in ALL_TOKENIZER_FILES: + continue + + # Known constants files or dirs. + is_keep_file = entry in CONSTANTS_KEEP_FILES + is_keep_prefix = any(entry.startswith(p) for p in CONSTANTS_KEEP_PREFIXES) + is_keep_dir = entry in CONSTANTS_KEEP_DIRS and os.path.isdir(src) + + if is_keep_file or is_keep_prefix or is_keep_dir: + dst = os.path.join(dst_constants, entry) + _copy_tree(src, dst) + report.copied_constants.append(entry) + else: + report.unknown_files.append(os.path.relpath(src, constants_dir)) + + copied_set = set(report.copied_constants) + for req in sorted(required): + if req not in copied_set: + report.missing_constants.append(f"constants/{req}") + + lt_src = os.path.join(constants_dir, "local_transformer") + if os.path.isdir(lt_src): + present = set(os.listdir(lt_src)) + for req in sorted(local_transformer_files): + if req not in present: + report.missing_constants.append(f"constants/local_transformer/{req}") + else: + report.missing_constants.append("constants/local_transformer/") + + +def _copy_tokenizer(constants_dir: str, output_dir: str, report: PrepReport) -> None: + dst_tokenizer = os.path.join(output_dir, "tokenizer") + os.makedirs(dst_tokenizer, exist_ok=True) + + for language, files in PER_LANGUAGE_TOKENIZER_FILES.items(): + copied: list[str] = [] + missing: list[str] = [] + for fname in files: + src = os.path.join(constants_dir, fname) + if not os.path.exists(src): + missing.append(fname) + continue + dst = os.path.join(dst_tokenizer, fname) + shutil.copy2(src, dst) + copied.append(fname) + if copied: + report.copied_tokenizer_files[language] = copied + if missing: + report.missing_tokenizer_files[language] = missing + + +def _write_metadata(output_dir: str, report: PrepReport, repo_id: str) -> None: + with open(os.path.join(output_dir, ".gitattributes"), "w") as f: + f.write(GITATTRIBUTES) + + with open(os.path.join(output_dir, "README.md"), "w") as f: + f.write(README_TEMPLATE) + + # Machine-readable prep report for auditability. 
+ summary = { + "repoId": repo_id, + "copiedModels": report.copied_models, + "missingRequiredModels": report.missing_required_models, + "missingOptionalModels": report.missing_optional_models, + "copiedConstants": sorted(report.copied_constants), + "missingConstants": report.missing_constants, + "copiedTokenizerFiles": report.copied_tokenizer_files, + "missingTokenizerFiles": report.missing_tokenizer_files, + "unknownFiles": report.unknown_files, + } + with open(os.path.join(output_dir, "_prep_report.json"), "w") as f: + json.dump(summary, f, indent=2) + + +def _print_report(report: PrepReport, output_dir: str, repo_id: str) -> int: + print("") + print("=" * 72) + print(f"HF upload staging → {output_dir}") + print(f"Target repo: {repo_id}") + print("=" * 72) + + print("\nCoreML models:") + for m in report.copied_models: + print(f" OK {m}") + for m in report.missing_required_models: + print(f" MISS {m} (REQUIRED — re-run convert_*.py + compile_mlmodelc.py)") + for m in report.missing_optional_models: + print(f" skip {m} (optional)") + + print("\nconstants/:") + for c in sorted(report.copied_constants): + print(f" OK {c}") + for c in report.missing_constants: + print(f" MISS {c}") + + print("\ntokenizer/:") + for lang, files in sorted(report.copied_tokenizer_files.items()): + print(f" [{lang}] {len(files)} file(s) copied") + for lang, files in sorted(report.missing_tokenizer_files.items()): + for fname in files: + print(f" MISS tokenizer/{fname} ({lang})") + + if report.unknown_files: + print("\nUnknown files under constants/ (not copied — review):") + for u in report.unknown_files: + print(f" ?? {u}") + + print("") + if report.has_errors(): + print("Staging completed WITH ERRORS — see MISS entries above.") + print("Re-run the relevant exporter/converter and re-run this script.") + return 1 + + print("Staging OK. Upload with one of:") + print("") + print(f" huggingface-cli upload {repo_id} {output_dir} . \\") + print(" --repo-type model --commit-message 'upload Magpie TTS CoreML export'") + print("") + print("Or, if the repo does not exist yet:") + print("") + print(f" huggingface-cli repo create {repo_id} --type model") + print(f" huggingface-cli upload {repo_id} {output_dir} . 
--repo-type model") + print("") + print("Verify from Swift:") + print(" swift run fluidaudiocli magpie download --languages en") + print("") + return 0 + + +def prepare( + build_dir: str, + constants_dir: str, + output_dir: str, + repo_id: str, + clean: bool, +) -> int: + build_dir = os.path.abspath(build_dir) + constants_dir = os.path.abspath(constants_dir) + output_dir = os.path.abspath(output_dir) + + if not os.path.isdir(build_dir): + print(f"error: build dir not found: {build_dir}", file=sys.stderr) + return 2 + if not os.path.isdir(constants_dir): + print(f"error: constants dir not found: {constants_dir}", file=sys.stderr) + return 2 + + if clean and os.path.exists(output_dir): + shutil.rmtree(output_dir) + os.makedirs(output_dir, exist_ok=True) + + report = PrepReport() + _copy_models(build_dir, output_dir, report) + _copy_constants(constants_dir, output_dir, report) + _copy_tokenizer(constants_dir, output_dir, report) + _write_metadata(output_dir, report, repo_id) + + return _print_report(report, output_dir, repo_id) + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Stage a HuggingFace-ready directory for Magpie TTS CoreML.", + ) + parser.add_argument( + "--build-dir", + default=os.path.join(SCRIPT_DIR, "build"), + help="Directory with compiled .mlmodelc bundles (default: ./build)", + ) + parser.add_argument( + "--constants-dir", + default=os.path.join(SCRIPT_DIR, "constants"), + help="Directory with exported constants + tokenizer files (default: ./constants)", + ) + parser.add_argument( + "--output-dir", + default=os.path.join(SCRIPT_DIR, "hf-upload"), + help="Staging directory to populate (default: ./hf-upload)", + ) + parser.add_argument( + "--repo-id", + default="FluidInference/magpie-tts-multilingual-357m-coreml", + help="Target HF repo id (only used in the printed upload command)", + ) + parser.add_argument( + "--clean", + action="store_true", + help="Remove the output dir before staging (fresh build).", + ) + args = parser.parse_args() + + rc = prepare( + build_dir=args.build_dir, + constants_dir=args.constants_dir, + output_dir=args.output_dir, + repo_id=args.repo_id, + clean=args.clean, + ) + sys.exit(rc) + + +if __name__ == "__main__": + main() diff --git a/models/tts/magpie/coreml/traceable/traceable_decoder_step.py b/models/tts/magpie/coreml/traceable/traceable_decoder_step.py index 3cfe896..d814ec4 100644 --- a/models/tts/magpie/coreml/traceable/traceable_decoder_step.py +++ b/models/tts/magpie/coreml/traceable/traceable_decoder_step.py @@ -1,23 +1,36 @@ -"""Traceable decoder step wrapper for CoreML conversion. +"""Traceable decoder step wrapper for CoreML conversion (rank-4 ANE-friendly). -The decoder is a causal transformer with cross-attention to the encoder output. -For CoreML, we implement it as a single-step model with explicit KV cache I/O, -following the PocketTTS pattern. +Each layer's KV cache is split into separate rank-4 K and V tensors so the ANE +backend can compile the model. The previous rank-5 single-tensor cache +``(2, B, max_seq, H, D)`` was rejected by ``ANECompile`` and forced the model +onto the GPU at ~64 ms/step. + +Key changes vs. the original: + * Per-layer state is ``(cache_k, cache_v, position)`` — three rank-4/scalar + tensors instead of one rank-5 plus a scalar. + * Causal mask ``-1e9`` -> ``-3e4`` (fp16 max is ±65504; ``-1e9`` overflows + to ``-inf`` and the ANE compiler tends to reject out-of-range constants). 
+ * Cross-attention's memory mask is added (instead of ``masked_fill``) using + the same fp16-safe constant so the cross-attn step is also ANE-friendly. Each step: -1. Takes one audio embedding token + encoder output -2. Runs through all decoder layers with causal self-attention + cross-attention -3. Returns logits for next token + updated KV caches +1. Embed one audio token + receive encoder output. +2. Run 12 decoder layers (causal self-attn + cross-attn + FFN). +3. Return logits for next token + updated K/V/positions per layer. """ import torch import torch.nn as nn import torch.nn.functional as F -import math -from typing import Tuple, List + + +# fp16 max is ±65504; use a safely-representable negative value for masked +# softmax positions. -3e4 stays well within fp16 range and gives ~exp(-30000) +# ≈ 0 after softmax so behaviour is numerically identical to -1e9. +MASK_NEG = -3.0e4 class TraceableCausalSelfAttention(nn.Module): - """Single-step causal self-attention with KV cache.""" + """Single-step causal self-attention with rank-4 split K/V cache.""" def __init__(self, d_model, n_heads, d_head=None): super().__init__() @@ -27,60 +40,49 @@ def __init__(self, d_model, n_heads, d_head=None): self.qkv_proj = nn.Linear(d_model, 3 * n_heads * self.d_head, bias=False) self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) - def forward(self, x, kv_cache, position): + def forward(self, x, kv_k, kv_v, position): """ - Args: - x: (B, 1, d_model) - single token embedding - kv_cache: (2, B, max_seq, H, D) - [key, value] cache - position: (1,) - current write position in cache - Returns: - output: (B, 1, d_model) - new_kv_cache: (2, B, max_seq, H, D) - updated cache - new_position: (1,) - incremented position + x: (B, 1, d_model) + kv_k: (B, max_seq, H, D) + kv_v: (B, max_seq, H, D) + position: (1,) """ - B, T, _ = x.shape # T=1 for single step - max_seq = kv_cache.shape[2] + B, T, _ = x.shape # T = 1 (single step) + max_seq = kv_k.shape[1] qkv = self.qkv_proj(x) qkv = qkv.view(B, T, 3, self.n_heads, self.d_head) - q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] - - # Write new k,v to cache using scatter (CoreML-compatible) - # Build one-hot mask for position to avoid advanced indexing - pos_idx = position.to(torch.long) - one_hot = torch.zeros(max_seq, dtype=x.dtype, device=x.device) - one_hot[pos_idx] = 1.0 - # one_hot: (max_seq,) → broadcast to (1, B, max_seq, H, D) - mask = one_hot.view(1, 1, max_seq, 1, 1) - - k_new = k.squeeze(1).unsqueeze(0).unsqueeze(2) # (1, B, 1, H, D) → broadcast to (1, B, max_seq, H, D) - v_new = v.squeeze(1).unsqueeze(0).unsqueeze(2) - - # new_cache = (1-mask)*old_cache + mask*new_kv - new_cache_k = kv_cache[0:1] * (1.0 - mask) + k_new * mask # (1, B, max_seq, H, D) - new_cache_v = kv_cache[1:2] * (1.0 - mask) + v_new * mask - new_cache = torch.cat([new_cache_k, new_cache_v], dim=0) # (2, B, max_seq, H, D) - - # Attend to all positions with a causal mask (positions > pos_idx are masked out) - # Build mask: 1 for positions <= pos_idx, 0 for positions > pos_idx + q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] # each (B, 1, H, D) + + # One-hot write mask along ``max_seq`` (rank-4 broadcast). + # Use arange/equal compare instead of ``one_hot[pos_idx] = 1.0`` so the + # graph lowers to elementwise ops (no ``scatter_nd`` — ANE rejects it). 
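+        # Example: position = 3 → mask is one-hot at index 3 with shape
+        # (1, max_seq, 1, 1), so the blend below rewrites exactly one cache row.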
positions_range = torch.arange(max_seq, dtype=x.dtype, device=x.device) - causal_mask = (positions_range <= position).float() # (max_seq,) - causal_mask = causal_mask.view(1, 1, 1, max_seq) # (1, 1, 1, max_seq) + mask = (positions_range == position).to(x.dtype).view(1, max_seq, 1, 1) + + # Broadcast new (B, 1, H, D) → (B, max_seq, H, D); then blend with old cache. + k_new = k.expand(B, max_seq, self.n_heads, self.d_head) + v_new = v.expand(B, max_seq, self.n_heads, self.d_head) + new_k = kv_k * (1.0 - mask) + k_new * mask + new_v = kv_v * (1.0 - mask) + v_new * mask - q = q.transpose(1, 2) # (B, H, 1, D) - k_full = new_cache[0].transpose(1, 2) # (B, H, max_seq, D) - v_full = new_cache[1].transpose(1, 2) + # Causal mask: keep positions ≤ current `position`, drop the rest. + causal_mask = (positions_range <= position).to(x.dtype).view(1, 1, 1, max_seq) - attn = torch.matmul(q, k_full.transpose(-2, -1)) * self.scale # (B, H, 1, max_seq) - attn = attn + (1.0 - causal_mask) * (-1e9) # mask future positions + q4 = q.transpose(1, 2) # (B, H, 1, D) + k4 = new_k.permute(0, 2, 1, 3) # (B, H, max_seq, D) + v4 = new_v.permute(0, 2, 1, 3) # (B, H, max_seq, D) + + attn = torch.matmul(q4, k4.transpose(-2, -1)) * self.scale + attn = attn + (1.0 - causal_mask) * MASK_NEG attn = F.softmax(attn, dim=-1) - out = torch.matmul(attn, v_full) + out = torch.matmul(attn, v4) # (B, H, 1, D) out = out.transpose(1, 2).reshape(B, 1, -1) out = self.o_proj(out) new_position = position + 1.0 - return out, new_cache, new_position + return out, new_k, new_v, new_position class TraceableCrossAttention(nn.Module): @@ -96,14 +98,6 @@ def __init__(self, d_model, n_heads, d_memory, d_head=None): self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) def forward(self, x, memory, memory_mask=None): - """ - Args: - x: (B, 1, d_model) - query - memory: (B, T_enc, d_memory) - encoder output - memory_mask: (B, T_enc) bool - True=keep - Returns: - output: (B, 1, d_model) - """ B, T_q, _ = x.shape T_m = memory.shape[1] @@ -114,8 +108,9 @@ def forward(self, x, memory, memory_mask=None): attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale if memory_mask is not None: - attn_mask = memory_mask.unsqueeze(1).unsqueeze(2) # (B, 1, 1, T_m) - attn = attn.masked_fill(~attn_mask, float("-inf")) + # Add fp16-safe penalty instead of `masked_fill(-inf)` for ANE. 
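+            # (1 - mask) is 0 for kept positions and 1 for padded ones, so padded
+            # logits pick up MASK_NEG and contribute ~zero weight after softmax.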
+ mem_mask_f = memory_mask.to(x.dtype).unsqueeze(1).unsqueeze(2) # (B, 1, 1, T_m) + attn = attn + (1.0 - mem_mask_f) * MASK_NEG attn = F.softmax(attn, dim=-1) out = torch.matmul(attn, v) @@ -124,7 +119,7 @@ def forward(self, x, memory, memory_mask=None): class TraceableFFN(nn.Module): - """Positionwise feed-forward for decoder.""" + """Position-wise feed-forward for decoder (kernel_size=1 ⇒ matmul + GELU).""" def __init__(self, d_model, d_ffn, kernel_size=1): super().__init__() @@ -142,7 +137,8 @@ def forward(self, x): class TraceableDecoderLayer(nn.Module): """Single decoder transformer layer with self-attn, cross-attn, and FFN.""" - def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, kernel_size=1, xa_d_head=None): + def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + kernel_size=1, xa_d_head=None): super().__init__() self.norm_sa = nn.LayerNorm(d_model, bias=False) self.self_attn = TraceableCausalSelfAttention(d_model, sa_n_heads) @@ -156,20 +152,14 @@ def __init__(self, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, kernel_s self.norm_ff = nn.LayerNorm(d_model, bias=False) self.ffn = TraceableFFN(d_model, d_ffn, kernel_size) - def forward(self, x, kv_cache, position, encoder_output=None, encoder_mask=None): - """ - Returns: - x: (B, 1, d_model) - new_kv_cache: updated cache - new_position: incremented position - """ - # Self-attention + def forward(self, x, kv_k, kv_v, position, encoder_output=None, encoder_mask=None): + # Self-attention. residual = x x_norm = self.norm_sa(x) - sa_out, new_kv_cache, new_position = self.self_attn(x_norm, kv_cache, position) + sa_out, new_k, new_v, new_position = self.self_attn(x_norm, kv_k, kv_v, position) x = residual + sa_out - # Cross-attention + # Cross-attention. if self.has_xattn and encoder_output is not None: residual = x q_norm = self.norm_xa_query(x) @@ -177,22 +167,20 @@ def forward(self, x, kv_cache, position, encoder_output=None, encoder_mask=None) xa_out = self.cross_attn(q_norm, m_norm, encoder_mask) x = residual + xa_out - # FFN + # FFN. residual = x x = self.norm_ff(x) x = self.ffn(x) x = residual + x - return x, new_kv_cache, new_position + return x, new_k, new_v, new_position class TraceableDecoderStep(nn.Module): - """Complete single-step decoder for CoreML. + """Complete single-step decoder with rank-4 split K/V caches. - Takes one audio token embedding, runs through all decoder layers with - KV cache, and outputs logits for next codebook tokens. - - The KV caches are passed as flat arguments (not lists) for torch.jit.trace. + For each of ``n_layers`` decoder layers the model takes THREE state tensors + (``cache_k``, ``cache_v``, ``position``) and returns three updated outputs. 
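+    With the default 12 layers that is 3 + 36 = 39 flat inputs and 2 + 36 = 38
+    outputs, i.e. the flattened signature ``torch.jit.trace`` requires (no list args).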
""" def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, @@ -219,80 +207,59 @@ def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory for _ in range(n_layers) ]) - self.norm_out = nn.Identity() # May be replaced if model uses apply_norm_out - - # Final projection: decoder hidden → codebook logits + self.norm_out = nn.Identity() # may be replaced by a LayerNorm in `from_magpie` self.final_proj = nn.Linear( - d_model, - num_codebooks * num_tokens_per_codebook * frame_stacking_factor, - ) + d_model, num_codebooks * num_tokens_per_codebook * frame_stacking_factor) def forward(self, audio_embed, encoder_output, encoder_mask, - # Flat KV cache args (one pair per layer) - cache0, pos0, cache1, pos1, cache2, pos2, - cache3, pos3, cache4, pos4, cache5, pos5, - cache6, pos6, cache7, pos7, cache8, pos8, - cache9, pos9, cache10, pos10, cache11, pos11): + # 12 layers × (cache_k, cache_v, position) = 36 flat state args. + ck0, cv0, p0, ck1, cv1, p1, ck2, cv2, p2, + ck3, cv3, p3, ck4, cv4, p4, ck5, cv5, p5, + ck6, cv6, p6, ck7, cv7, p7, ck8, cv8, p8, + ck9, cv9, p9, ck10, cv10, p10, ck11, cv11, p11): """ Args: - audio_embed: (B, 1, d_model) - embedded audio token(s) - encoder_output: (B, T_enc, d_model) - text encoder output - encoder_mask: (B, T_enc) - text mask - cache{i}: (2, B, max_seq, H, D) - KV cache per layer - pos{i}: (1,) - current position per layer - - Returns: - logits: (B, 1, num_codebooks * num_tokens * frame_stacking) - decoder_hidden: (B, 1, d_model) - for local transformer - new_cache{i}: updated caches - new_pos{i}: updated positions + audio_embed: (B, 1, d_model) + encoder_output: (B, T_enc, d_model) + encoder_mask: (B, T_enc) bool + ck{i}, cv{i}: (B, max_seq, H, D) per layer + p{i}: (1,) scalar position per layer + + Returns flat tuple: + logits, decoder_hidden, + new_ck0, new_cv0, new_p0, …, new_ck11, new_cv11, new_p11 """ - caches = [cache0, cache1, cache2, cache3, cache4, cache5, - cache6, cache7, cache8, cache9, cache10, cache11] - positions = [pos0, pos1, pos2, pos3, pos4, pos5, - pos6, pos7, pos8, pos9, pos10, pos11] + cks = [ck0, ck1, ck2, ck3, ck4, ck5, ck6, ck7, ck8, ck9, ck10, ck11] + cvs = [cv0, cv1, cv2, cv3, cv4, cv5, cv6, cv7, cv8, cv9, cv10, cv11] + ps = [p0, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11] x = audio_embed - - # Add positional embedding if self.use_pos_emb: - pos_idx = positions[0].to(torch.long) + pos_idx = ps[0].to(torch.long) x = x + self.position_embeddings(pos_idx).unsqueeze(0) - new_caches = [] - new_positions = [] - + new_ks, new_vs, new_ps = [], [], [] for i, layer in enumerate(self.layers): - x, new_cache, new_pos = layer( - x, caches[i], positions[i], + x, nk, nv, np_ = layer( + x, cks[i], cvs[i], ps[i], encoder_output=encoder_output, encoder_mask=encoder_mask, ) - new_caches.append(new_cache) - new_positions.append(new_pos) - - x = self.norm_out(x) - decoder_hidden = x - - logits = self.final_proj(x) - - return (logits, decoder_hidden, - new_caches[0], new_positions[0], - new_caches[1], new_positions[1], - new_caches[2], new_positions[2], - new_caches[3], new_positions[3], - new_caches[4], new_positions[4], - new_caches[5], new_positions[5], - new_caches[6], new_positions[6], - new_caches[7], new_positions[7], - new_caches[8], new_positions[8], - new_caches[9], new_positions[9], - new_caches[10], new_positions[10], - new_caches[11], new_positions[11]) + new_ks.append(nk) + new_vs.append(nv) + new_ps.append(np_) + + decoder_hidden = self.norm_out(x) + logits = self.final_proj(decoder_hidden) 
+ + outs = [logits, decoder_hidden] + for i in range(self.n_layers): + outs += [new_ks[i], new_vs[i], new_ps[i]] + return tuple(outs) @classmethod def from_magpie(cls, model): - """Create from a loaded MagpieTTSModel.""" + """Create from a loaded MagpieTTSModel and copy over weights.""" cfg = model.cfg dec_cfg = dict(cfg.decoder) @@ -313,46 +280,35 @@ def from_magpie(cls, model): frame_stacking_factor=model.frame_stacking_factor, ) - # Copy positional embeddings if wrapper.use_pos_emb and model.decoder.position_embeddings is not None: - wrapper.position_embeddings.weight.data.copy_(model.decoder.position_embeddings.weight.data) - - # Copy decoder layers - # NeMo TransformerLayer attr names: - # self_attention (SelfAttention) with qkv_net, o_net - # cross_attention (CrossAttention) with q_net, kv_net, o_net - # norm_self, norm_xattn_query, norm_xattn_memory, norm_pos_ff (LayerNorm, bias=False) - # pos_ff (PositionwiseConvFF) with proj.conv, o_net.conv (Conv1d) - for i, (src_layer, dst_layer) in enumerate(zip(model.decoder.layers, wrapper.layers)): - # Self-attention + wrapper.position_embeddings.weight.data.copy_( + model.decoder.position_embeddings.weight.data) + + for src_layer, dst_layer in zip(model.decoder.layers, wrapper.layers): + # Self-attention. dst_layer.self_attn.qkv_proj.weight.data.copy_(src_layer.self_attention.qkv_net.weight.data) dst_layer.self_attn.o_proj.weight.data.copy_(src_layer.self_attention.o_net.weight.data) - - # Self-attn norm (bias=False in NeMo) dst_layer.norm_sa.weight.data.copy_(src_layer.norm_self.weight.data) - # Cross-attention (if present) + # Cross-attention. if dst_layer.has_xattn and hasattr(src_layer, "cross_attention"): dst_layer.cross_attn.q_proj.weight.data.copy_(src_layer.cross_attention.q_net.weight.data) dst_layer.cross_attn.kv_proj.weight.data.copy_(src_layer.cross_attention.kv_net.weight.data) dst_layer.cross_attn.o_proj.weight.data.copy_(src_layer.cross_attention.o_net.weight.data) - dst_layer.norm_xa_query.weight.data.copy_(src_layer.norm_xattn_query.weight.data) dst_layer.norm_xa_memory.weight.data.copy_(src_layer.norm_xattn_memory.weight.data) - # FFN norm (bias=False in NeMo) + # FFN. dst_layer.norm_ff.weight.data.copy_(src_layer.norm_pos_ff.weight.data) - - # FFN (Conv1d via PositionwiseConvFF, bias=False) dst_layer.ffn.conv1.weight.data.copy_(src_layer.pos_ff.proj.conv.weight.data) dst_layer.ffn.conv2.weight.data.copy_(src_layer.pos_ff.o_net.conv.weight.data) - # Output norm + # Optional output norm. if hasattr(model.decoder, "norm_out") and isinstance(model.decoder.norm_out, nn.LayerNorm): wrapper.norm_out = nn.LayerNorm(dec_cfg["d_model"], bias=False) wrapper.norm_out.weight.data.copy_(model.decoder.norm_out.weight.data) - # Final projection + # Final projection. wrapper.final_proj.weight.data.copy_(model.final_proj.weight.data) wrapper.final_proj.bias.data.copy_(model.final_proj.bias.data) diff --git a/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py b/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py new file mode 100644 index 0000000..dd9eba6 --- /dev/null +++ b/models/tts/magpie/coreml/traceable/traceable_decoder_step_stateful.py @@ -0,0 +1,353 @@ +"""EXPERIMENTAL — DO NOT USE IN PRODUCTION. + +Stateful (MLState) variant of ``traceable_decoder_step.py``. Kept as a +documented dead-end so future agents don't repeat the experiment. 
+ +Benchmark result (Apple M2, macOS 26.5, 146-step real loop): + rank-4 production (this file's non-stateful sibling): ~96 ms/step (97.3% ANE) + this stateful variant (CPU_AND_GPU only): ~212 ms/step + → 2.2× regression. Rejected. + +Why it loses for Magpie (vs CosyVoice3 where MLState gave ~3× speedup): + Magpie's rank-4 decoder_step already lands 97.3% of cost on ANE. MLState + graphs are ANE-incompatible, so they force CPU_AND_GPU. The IO-marshaling + savings from collapsing 39 inputs / 38 outputs to 4 / 2 are dwarfed by the + loss of ANE acceleration. + +Variant of ``traceable_decoder_step.py`` that uses CoreML ``MLState`` (stateful +buffers) instead of passing 36 KV+position tensors through the model interface +on every step. + +Differences vs. ``traceable_decoder_step.TraceableDecoderStep``: + * Per-layer K and V caches are ``register_buffer``-ed (24 buffers total) and + mutated in place via slice assignment. + * Forward signature shrinks to 4 inputs: (audio_embed, encoder_output, + encoder_mask, position). Position is a single shared scalar — all layers + advance in lockstep so we don't statefy 12 copies of it. + * Outputs shrink to 2: (logits, decoder_hidden). Cache updates are side + effects on the state buffers. + * Cross-attention path and fp16-safe ``MASK_NEG`` constant are unchanged. +""" +import torch +import torch.nn as nn +import torch.nn.functional as F + + +# fp16 max is ±65504; -3e4 is safely representable and gives ~exp(-30000) ≈ 0 +# after softmax. Identical numerical behaviour to -1e9 without the overflow. +MASK_NEG = -3.0e4 + + +class StatefulCausalSelfAttention(nn.Module): + """Single-step causal self-attention with state-buffer K/V caches. + + The K and V caches are owned by the parent ``StatefulDecoderLayer`` (so all + buffers live on a single module for clean ``ct.StateType`` registration). + This module receives the buffers by reference and mutates them in place. + """ + + def __init__(self, d_model, n_heads, d_head=None): + super().__init__() + self.d_head = d_head or d_model // n_heads + self.n_heads = n_heads + self.scale = self.d_head ** -0.5 + self.qkv_proj = nn.Linear(d_model, 3 * n_heads * self.d_head, bias=False) + self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) + + def forward(self, x, k_cache, v_cache, position): + """ + x: (B, 1, d_model) + k_cache: (B, max_seq, H, D) — mutated in place + v_cache: (B, max_seq, H, D) — mutated in place + position: (1,) scalar — current write index (also used for causal mask) + """ + B, T, _ = x.shape # T = 1 + max_seq = k_cache.shape[1] + + qkv = self.qkv_proj(x) + qkv = qkv.view(B, T, 3, self.n_heads, self.d_head) + q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2] # each (B, 1, H, D) + + # In-place slice write — pure indexed_update (no scatter_nd). + # Cast position to int via clamp for use as a slice bound. + pos_int = position.to(torch.int32) + # Slice bounds need to be Python ints during tracing; we materialize via + # ``.item()``-equivalent through a 1-element tensor. CoreML's tracer will + # capture the dynamic write index as a runtime variable. + start = pos_int[0] + end = start + 1 + + # Cast new K/V to match buffer dtype (fp16 for the converted graph). + k_cache[:, start:end, :, :] = k.to(k_cache.dtype) + v_cache[:, start:end, :, :] = v.to(v_cache.dtype) + + # Reshape for batched matmul. 
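+        # A single query row attends over the full cache: scores are (B, H, 1, max_seq).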
+ q4 = q.transpose(1, 2) # (B, H, 1, D) + k4 = k_cache.permute(0, 2, 1, 3) # (B, H, max_seq, D) + v4 = v_cache.permute(0, 2, 1, 3) # (B, H, max_seq, D) + + # Causal mask: keep positions ≤ current `position`, drop the rest. + positions_range = torch.arange(max_seq, dtype=x.dtype, device=x.device) + causal_mask = (positions_range <= position).to(x.dtype).view(1, 1, 1, max_seq) + + attn = torch.matmul(q4, k4.to(x.dtype).transpose(-2, -1)) * self.scale + attn = attn + (1.0 - causal_mask) * MASK_NEG + attn = F.softmax(attn, dim=-1) + out = torch.matmul(attn, v4.to(x.dtype)) # (B, H, 1, D) + + out = out.transpose(1, 2).reshape(B, 1, -1) + out = self.o_proj(out) + return out + + +class StatefulCrossAttention(nn.Module): + """Cross-attention to encoder output (non-causal, stateless).""" + + def __init__(self, d_model, n_heads, d_memory, d_head=None): + super().__init__() + self.d_head = d_head or d_model // n_heads + self.n_heads = n_heads + self.scale = self.d_head ** -0.5 + self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False) + self.kv_proj = nn.Linear(d_memory, 2 * n_heads * self.d_head, bias=False) + self.o_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False) + + def forward(self, x, memory, memory_mask=None): + B, T_q, _ = x.shape + T_m = memory.shape[1] + + q = self.q_proj(x).view(B, T_q, self.n_heads, self.d_head).transpose(1, 2) + kv = self.kv_proj(memory).view(B, T_m, 2, self.n_heads, self.d_head) + k, v = kv[:, :, 0].transpose(1, 2), kv[:, :, 1].transpose(1, 2) + + attn = torch.matmul(q, k.transpose(-2, -1)) * self.scale + if memory_mask is not None: + mem_mask_f = memory_mask.to(x.dtype).unsqueeze(1).unsqueeze(2) + attn = attn + (1.0 - mem_mask_f) * MASK_NEG + + attn = F.softmax(attn, dim=-1) + out = torch.matmul(attn, v) + out = out.transpose(1, 2).reshape(B, T_q, -1) + return self.o_proj(out) + + +class StatefulFFN(nn.Module): + def __init__(self, d_model, d_ffn, kernel_size=1): + super().__init__() + self.conv1 = nn.Conv1d(d_model, d_ffn, kernel_size, padding=0, bias=False) + self.conv2 = nn.Conv1d(d_ffn, d_model, kernel_size, padding=0, bias=False) + self.act = nn.GELU(approximate="tanh") + + def forward(self, x): + x = x.transpose(1, 2) + x = self.act(self.conv1(x)) + x = self.conv2(x) + return x.transpose(1, 2) + + +class StatefulDecoderLayer(nn.Module): + """One decoder layer; owns its k_cache / v_cache as registered buffers.""" + + def __init__(self, layer_idx, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + max_seq_len, kernel_size=1, xa_d_head=None): + super().__init__() + self.layer_idx = layer_idx + self.d_head = d_model // sa_n_heads + self.n_heads = sa_n_heads + + self.norm_sa = nn.LayerNorm(d_model, bias=False) + self.self_attn = StatefulCausalSelfAttention(d_model, sa_n_heads) + + self.has_xattn = xa_n_heads is not None + if self.has_xattn: + self.norm_xa_query = nn.LayerNorm(d_model, bias=False) + self.norm_xa_memory = nn.LayerNorm(xa_d_memory, bias=False) + self.cross_attn = StatefulCrossAttention(d_model, xa_n_heads, xa_d_memory, xa_d_head) + + self.norm_ff = nn.LayerNorm(d_model, bias=False) + self.ffn = StatefulFFN(d_model, d_ffn, kernel_size) + + # Register cache buffers (fp16 to match converted-graph precision). + # Persistent=False so they don't appear in state_dict and won't trip + # weight-load checks. 
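+        # At conversion time these buffers are the tensors ``ct.StateType`` binds to.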
+ self.register_buffer( + "k_cache", + torch.zeros(1, max_seq_len, sa_n_heads, self.d_head, dtype=torch.float16), + persistent=False, + ) + self.register_buffer( + "v_cache", + torch.zeros(1, max_seq_len, sa_n_heads, self.d_head, dtype=torch.float16), + persistent=False, + ) + + def forward(self, x, position, encoder_output=None, encoder_mask=None): + # Self-attention (mutates self.k_cache / self.v_cache in place). + residual = x + x_norm = self.norm_sa(x) + sa_out = self.self_attn(x_norm, self.k_cache, self.v_cache, position) + x = residual + sa_out + + # Cross-attention. + if self.has_xattn and encoder_output is not None: + residual = x + q_norm = self.norm_xa_query(x) + m_norm = self.norm_xa_memory(encoder_output) + xa_out = self.cross_attn(q_norm, m_norm, encoder_mask) + x = residual + xa_out + + # FFN. + residual = x + x = self.norm_ff(x) + x = self.ffn(x) + x = residual + x + return x + + +class StatefulDecoderStep(nn.Module): + """Stateful single-step decoder. K/V caches live as buffers on each layer. + + Forward inputs (4): + audio_embed: (B, 1, d_model) + encoder_output: (B, T_enc, d_model) + encoder_mask: (B, T_enc) bool + position: (1,) scalar — write index for this step (shared across layers) + + Forward outputs (2): + logits: (B, 1, num_codebooks * tokens_per_codebook * frame_stack) + decoder_hidden: (B, 1, d_model) + + State (24 buffers; named ``k_cache_{i}``, ``v_cache_{i}`` for i in 0..n-1 + after ``flatten_state_buffers`` is called): + k_cache_{i}: (1, max_seq, H, D) fp16 + v_cache_{i}: (1, max_seq, H, D) fp16 + """ + + def __init__(self, n_layers, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + kernel_size=1, xa_d_head=None, max_seq_len=512, + use_pos_emb=False, max_pos=2048, + num_codebooks=8, num_tokens_per_codebook=2024, frame_stacking_factor=1): + super().__init__() + self.n_layers = n_layers + self.d_model = d_model + self.max_seq_len = max_seq_len + self.use_pos_emb = use_pos_emb + self.num_codebooks = num_codebooks + self.num_tokens_per_codebook = num_tokens_per_codebook + self.frame_stacking_factor = frame_stacking_factor + self.d_head = d_model // sa_n_heads + self.sa_n_heads = sa_n_heads + + if use_pos_emb: + self.position_embeddings = nn.Embedding(max_pos, d_model) + + self.layers = nn.ModuleList([ + StatefulDecoderLayer( + i, d_model, d_ffn, sa_n_heads, xa_n_heads, xa_d_memory, + max_seq_len, kernel_size, xa_d_head, + ) + for i in range(n_layers) + ]) + + self.norm_out = nn.Identity() + self.final_proj = nn.Linear( + d_model, num_codebooks * num_tokens_per_codebook * frame_stacking_factor) + + # Promote per-layer buffers to top-level names so coremltools can pick + # them up via ``ct.StateType(name="k_cache_{i}")``. + self.flatten_state_buffers() + + def flatten_state_buffers(self): + """Re-register each layer's k_cache / v_cache as top-level buffers. + + coremltools' ``ct.StateType(name=...)`` matches the buffer name on the + traced module. Layer-nested buffers come through as + ``layers.{i}.k_cache``; we mirror them at the top level under the + flatter ``k_cache_{i}`` / ``v_cache_{i}`` names that downstream code + (and other mobius converters) expect. 
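+
+        Illustrative conversion-side registration (a sketch only; ``n_layers``,
+        ``n_heads`` and ``d_head`` come from the decoder config, not this file):
+
+            states = []
+            for i in range(n_layers):
+                for kind in ("k", "v"):
+                    states.append(ct.StateType(
+                        wrapped_type=ct.TensorType(shape=(1, 512, n_heads, d_head)),
+                        name=f"{kind}_cache_{i}"))
+            # then passed as ct.convert(..., states=states)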
+ """ + for i, layer in enumerate(self.layers): + self.register_buffer(f"k_cache_{i}", layer.k_cache, persistent=False) + self.register_buffer(f"v_cache_{i}", layer.v_cache, persistent=False) + + def reset_state(self): + """Zero all KV caches in place (host side, before make_state).""" + for layer in self.layers: + layer.k_cache.zero_() + layer.v_cache.zero_() + + def forward(self, audio_embed, encoder_output, encoder_mask, position): + x = audio_embed + if self.use_pos_emb: + pos_idx = position.to(torch.long) + x = x + self.position_embeddings(pos_idx).unsqueeze(0) + + for layer in self.layers: + x = layer( + x, position, + encoder_output=encoder_output, + encoder_mask=encoder_mask, + ) + + decoder_hidden = self.norm_out(x) + logits = self.final_proj(decoder_hidden) + return logits, decoder_hidden + + @classmethod + def from_magpie(cls, model): + """Create from a loaded MagpieTTSModel and copy over weights.""" + cfg = model.cfg + dec_cfg = dict(cfg.decoder) + + wrapper = cls( + n_layers=dec_cfg["n_layers"], + d_model=dec_cfg["d_model"], + d_ffn=dec_cfg["d_ffn"], + sa_n_heads=dec_cfg["sa_n_heads"], + xa_n_heads=dec_cfg.get("xa_n_heads"), + xa_d_memory=dec_cfg.get("xa_d_memory"), + kernel_size=dec_cfg.get("kernel_size", 1), + xa_d_head=dec_cfg.get("xa_d_head"), + max_seq_len=512, + use_pos_emb=dec_cfg.get("use_learnable_pos_emb", False), + max_pos=dec_cfg.get("max_length_causal_mask", 2048), + num_codebooks=model.num_audio_codebooks, + num_tokens_per_codebook=model.num_all_tokens_per_codebook, + frame_stacking_factor=model.frame_stacking_factor, + ) + + if wrapper.use_pos_emb and model.decoder.position_embeddings is not None: + wrapper.position_embeddings.weight.data.copy_( + model.decoder.position_embeddings.weight.data) + + for src_layer, dst_layer in zip(model.decoder.layers, wrapper.layers): + # Self-attention. + dst_layer.self_attn.qkv_proj.weight.data.copy_(src_layer.self_attention.qkv_net.weight.data) + dst_layer.self_attn.o_proj.weight.data.copy_(src_layer.self_attention.o_net.weight.data) + dst_layer.norm_sa.weight.data.copy_(src_layer.norm_self.weight.data) + + # Cross-attention. + if dst_layer.has_xattn and hasattr(src_layer, "cross_attention"): + dst_layer.cross_attn.q_proj.weight.data.copy_(src_layer.cross_attention.q_net.weight.data) + dst_layer.cross_attn.kv_proj.weight.data.copy_(src_layer.cross_attention.kv_net.weight.data) + dst_layer.cross_attn.o_proj.weight.data.copy_(src_layer.cross_attention.o_net.weight.data) + dst_layer.norm_xa_query.weight.data.copy_(src_layer.norm_xattn_query.weight.data) + dst_layer.norm_xa_memory.weight.data.copy_(src_layer.norm_xattn_memory.weight.data) + + # FFN. + dst_layer.norm_ff.weight.data.copy_(src_layer.norm_pos_ff.weight.data) + dst_layer.ffn.conv1.weight.data.copy_(src_layer.pos_ff.proj.conv.weight.data) + dst_layer.ffn.conv2.weight.data.copy_(src_layer.pos_ff.o_net.conv.weight.data) + + # Optional output norm. + if hasattr(model.decoder, "norm_out") and isinstance(model.decoder.norm_out, nn.LayerNorm): + wrapper.norm_out = nn.LayerNorm(dec_cfg["d_model"], bias=False) + wrapper.norm_out.weight.data.copy_(model.decoder.norm_out.weight.data) + + # Final projection. + wrapper.final_proj.weight.data.copy_(model.final_proj.weight.data) + wrapper.final_proj.bias.data.copy_(model.final_proj.bias.data) + + # Re-flatten buffers in case eager copies replaced them. 
+ wrapper.flatten_state_buffers() + return wrapper diff --git a/tools/coreml-cli/src/coreml_cli/cli.py b/tools/coreml-cli/src/coreml_cli/cli.py index 00dd97e..99dc12c 100644 --- a/tools/coreml-cli/src/coreml_cli/cli.py +++ b/tools/coreml-cli/src/coreml_cli/cli.py @@ -13,7 +13,7 @@ import typer -from .compute_plan import COMPUTE_UNITS, get_compute_plan +from .compute_plan import COMPUTE_UNITS, DEFAULT_LOAD_TIMEOUT_S, get_compute_plan from .fallback import analyze_fallback from .latency import measure_cold_compile, measure_latency from .metadata import get_model_metadata @@ -85,6 +85,15 @@ def bench( False, "--json", help="Output JSON instead of table" ), iterations: int = typer.Option(10, "--iterations", "-n", help="Number of timed iterations"), + plan_timeout: float = typer.Option( + DEFAULT_LOAD_TIMEOUT_S, + "--plan-timeout", + help=( + "Max seconds to wait for MLComputePlan to load per compute_units " + "config. Increase for graphs with >1500 ops if you see 'Failed to " + "load compute plan: timeout' errors." + ), + ), debug: bool = typer.Option(False, "--debug", help="Print progress to stderr"), ) -> None: """Profile CoreML model compute device assignments and latency.""" @@ -108,7 +117,7 @@ def bench( all_fb = [] for model in models: _log(f"Analyzing fallback for {model.name}...") - fb = analyze_fallback(model, cu) + fb = analyze_fallback(model, cu, load_timeout_s=plan_timeout) all_fb.append({ "model_path": str(model), "model_name": model.stem, @@ -142,7 +151,7 @@ def bench( for unit_config in unit_configs: _log(f" compute_units={unit_config}") - result = get_compute_plan(model, unit_config) + result = get_compute_plan(model, unit_config, load_timeout_s=plan_timeout) if detailed: detail = get_detailed_profile(model, unit_config) diff --git a/tools/coreml-cli/src/coreml_cli/compute_plan.py b/tools/coreml-cli/src/coreml_cli/compute_plan.py index e5b1afe..d651ee9 100644 --- a/tools/coreml-cli/src/coreml_cli/compute_plan.py +++ b/tools/coreml-cli/src/coreml_cli/compute_plan.py @@ -42,9 +42,25 @@ def _walk_operations(block: Any) -> list[Any]: return ops -def get_compute_plan(model_path: Path, compute_units: str) -> dict: +DEFAULT_LOAD_TIMEOUT_S = 120.0 + + +def get_compute_plan( + model_path: Path, + compute_units: str, + load_timeout_s: float = DEFAULT_LOAD_TIMEOUT_S, +) -> dict: """Load compute plan for a model with given compute units. + Args: + model_path: Path to the compiled .mlmodelc. + compute_units: One of the keys in ``COMPUTE_UNITS``. + load_timeout_s: Max seconds to wait for ``MLComputePlan.loadContentsOfURL`` + to invoke its completion handler. Large graphs (≳1500 ops) can take + tens of seconds to analyze; the previous hard-coded 30s would + silently false-fail with "unknown error" when the load merely needed + more time. + Returns dict with summary and per-operation breakdown. 
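+
+    Example (illustrative path; relax the timeout for a very large graph):
+
+        plan = get_compute_plan(Path("decoder_step.mlmodelc"),
+                                "cpu_and_neural_engine", load_timeout_s=300.0)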
""" url = NSURL.fileURLWithPath_(str(model_path)) @@ -64,10 +80,15 @@ def completion(loaded_plan: Any, load_error: Any) -> None: CoreML.MLComputePlan.loadContentsOfURL_configuration_completionHandler_( url, config, completion ) - event.wait(timeout=30) + completed = event.wait(timeout=load_timeout_s) plan = result_holder.get("plan") error = result_holder.get("error") + if not completed: + raise RuntimeError( + f"Failed to load compute plan: timeout after {load_timeout_s:.0f}s " + f"(graph may be too large; pass --plan-timeout to extend)" + ) if error is not None or plan is None: err_msg = str(error) if error else "unknown error" raise RuntimeError(f"Failed to load compute plan: {err_msg}") diff --git a/tools/coreml-cli/src/coreml_cli/fallback.py b/tools/coreml-cli/src/coreml_cli/fallback.py index dfae387..c82f7ba 100644 --- a/tools/coreml-cli/src/coreml_cli/fallback.py +++ b/tools/coreml-cli/src/coreml_cli/fallback.py @@ -10,13 +10,20 @@ from .private_profiler import get_detailed_profile -def analyze_fallback(model_path: Path, compute_units: str = "cpu_and_neural_engine") -> dict: +def analyze_fallback( + model_path: Path, + compute_units: str = "cpu_and_neural_engine", + load_timeout_s: float | None = None, +) -> dict: """Analyze which ops fall back to CPU and why. Returns a fallback_summary dict with grouped reasons. """ # Get public compute plan for device assignments - plan = get_compute_plan(model_path, compute_units) + plan_kwargs: dict[str, Any] = {} + if load_timeout_s is not None: + plan_kwargs["load_timeout_s"] = load_timeout_s + plan = get_compute_plan(model_path, compute_units, **plan_kwargs) # Get private profiler data for validation messages detail = get_detailed_profile(model_path, compute_units)