18 commits
- `27a6203` feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port (Alex-Wengg, Apr 25, 2026)
- `ff95086` perf(tts/magpie): wire decoder_prefill fast path + pin decoder_step t… (Alex-Wengg, Apr 26, 2026)
- `b3bb985` refactor(tts/magpie): drop fp64 LocalTransformer + move NpzReader to CLI (Alex-Wengg, Apr 26, 2026)
- `7890d4` fix(tts/magpie): unblock iOS build — work around Swift 6 isolation ch… (Alex-Wengg, Apr 26, 2026)
- `332ac2d` docs(tts/magpie): sync file map with post-cleanup layout (Alex-Wengg, Apr 26, 2026)
- `7348c18` docs(models): add Magpie TTS Multilingual entry + HF repo (Alex-Wengg, Apr 26, 2026)
- `206c177` refactor(tts/magpie): drop unused speaker_info plumbing + dead helpers (Alex-Wengg, Apr 26, 2026)
- `66ec141` refactor(tts/magpie): drop unused D binding + redundant CFG guard (Alex-Wengg, Apr 26, 2026)
- `e0a7a80` refactor(tts/magpie): purge 4 dead-code orphans (Alex-Wengg, Apr 26, 2026)
- `c7ed31f` refactor(tts/magpie): purge wasted manager alloc + tighten NPZ parse … (Alex-Wengg, Apr 26, 2026)
- `4599fcf` refactor(tts/magpie): scope CLI to download + text; move parity to mo… (Alex-Wengg, Apr 26, 2026)
- `aa65232` refactor(tts/magpie): drop NumPy bit-parity tests; move to mobius (Alex-Wengg, Apr 26, 2026)
- `624dcb5` perf(tts/magpie): CFG default off + sampler heap + vDSP embed (Alex-Wengg, Apr 26, 2026)
- `1fb54db` perf(tts/magpie): kill LT alloc churn in QKV split + LN + GELU (Alex-Wengg, Apr 26, 2026)
- `7783775` perf(tts/magpie): nanocodec to cpuOnly + per-stage timing breakdown (Alex-Wengg, Apr 26, 2026)
- `2ecffad` feat(cli/magpie): add in-process bench subcommand for stable RTFx med… (Alex-Wengg, Apr 26, 2026)
- `13b34ec` perf(tts/magpie): bind decoder_step outputs via outputBackings + doub… (Alex-Wengg, Apr 26, 2026)
- `e2c73a8` feat(tts/magpie): chunk-level streaming + ANE-warmup at init (Alex-Wengg, Apr 27, 2026)
2 changes: 2 additions & 0 deletions Documentation/Models.md
@@ -51,6 +51,7 @@ TDT models process audio in chunks (~15s with overlap) as batch operations.
|-------|-------------|---------|
| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. Generates all frames at once via flow matching over mel spectrograms + Vocos vocoder. Uses CoreML G2P model for phonemization. | First TTS backend added + support custom pronounces |
| **PocketTTS** | Second TTS backend (~155M params). Autoregressive frame-by-frame generation with dynamic audio chunking. No phoneme stage, works directly on text tokens. | Supports streaming, minimal RAM usage, excellent quality |
| **Magpie TTS Multilingual** | NVIDIA NeMo Magpie TTS Multilingual 357M, 8 languages (en/es/de/fr/it/vi/zh/hi), 5 built-in speakers. 4-model CoreML pipeline: text_encoder + decoder_prefill + decoder_step + nanocodec_decoder. Custom IPA override via `\|...\|` segments. Local Transformer (8-codebook sampler) implemented in pure Swift via Accelerate + BNNS. | Third TTS backend. Japanese deferred (needs OpenJTalk + MeCab dict). |

## Evaluated Models (Not Supported)

@@ -81,4 +82,5 @@ Models we converted and tested but are not supported: too large for on-device de
| Sortformer | [FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml) |
| Kokoro TTS | [FluidInference/kokoro-82m-coreml](https://huggingface.co/FluidInference/kokoro-82m-coreml) |
| PocketTTS | [FluidInference/pocket-tts-coreml](https://huggingface.co/FluidInference/pocket-tts-coreml) |
| Magpie TTS Multilingual | [FluidInference/magpie-tts-multilingual-357m-coreml](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) |
| Nemotron Streaming | [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml) |
145 changes: 145 additions & 0 deletions Documentation/TTS/Magpie.md
@@ -0,0 +1,145 @@
# Magpie TTS Multilingual (Swift Port)

Swift port of NVIDIA NeMo Magpie TTS Multilingual 357M, exported to CoreML.
Lives under `Sources/FluidAudio/TTS/Magpie/`.

## Status

Functional. Audio quality is perceptually clean across all 5 speakers. The first
synthesis on a fresh process is dominated by CoreML model load plus first-call
ANE compile (~30 s); warm synths run at ~96 s wall for an 8-word English
sentence on M-series (RTFx ~0.04). Output is ASR-clean on 4 of 5 speakers;
spk0 has a single trailing-word artifact ("…and") attributable to fp16
sampler-trajectory drift, not a structural bug.

Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework + MeCab
dict), CFG performance optimization, MLX-backed LocalTransformer.

## Architecture

```
text → MagpieTokenizer (per-language) → text_encoder.mlmodelc
speaker_N.npy (110×768) → decoder_prefill.mlmodelc (1 batched call)
        └──→ KV cache (12 layers × [2,1,512,12,64] fp16)
AR loop (decoder_step.mlmodelc, ≤500 steps):
├─ LocalTransformer (Swift, Accelerate+BNNS)
├─ Sampler (top-k=80, temp=0.6, forbidden mask)
├─ embed sampled (8) codes → next decoder_step input
└─ stop on audio_eos_id (2017) or maxSteps
nanocodec_decoder.mlmodelc → 22 050 Hz Float32 PCM
```
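The AR loop above can be sketched in plain Swift. This is an illustrative skeleton, not the real orchestration (which lives in `MagpieSynthesizer.swift`); `stepModel` stands in for a `decoder_step.mlmodelc` call and `sample` for the top-k sampler, and both names are hypothetical.

```swift
// Sketch of the autoregressive decode loop. Each step produces logits for
// 8 codebooks; sampling them yields the next 8-code frame, which feeds the
// next step. Decoding stops on audio_eos_id or maxSteps.
let audioEosId = 2017
let maxSteps = 500

func decodeLoop(
    stepModel: (_ codes: [Int], _ position: Int) -> [[Float]],  // 8 × vocab logits
    sample: (_ logits: [Float]) -> Int                          // top-k + temperature
) -> [[Int]] {
    var frames: [[Int]] = []
    var codes = [Int](repeating: 0, count: 8)  // placeholder BOS frame
    for position in 0..<maxSteps {
        let logitsPerCodebook = stepModel(codes, position)
        codes = logitsPerCodebook.map(sample)
        // A frame whose first codebook hits audio_eos_id ends the utterance.
        if codes[0] == audioEosId { break }
        frames.append(codes)
    }
    return frames
}
```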

## Compute placement (verified end-to-end)

| Model | Compute units | Reasoning |
| ------------------ | ------------------------ | ------------------------------------------------------------------------------------------------------------ |
| `text_encoder` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.5× vs CPU. |
| `decoder_prefill` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.2× vs CPU. One batched call replaces 110 sequential `decoder_step` calls. |
| `decoder_step` | **`.cpuAndGPU`** | Pinned. ANE compile fails (`MILCompilerForANE: ANECCompile() FAILED`) due to rank-4 split-K/V scatter; on `.cpuAndNeuralEngine` it falls back to CPU at ~hundreds-of-ms cost per call. GPU (Metal MPS) is fastest. Verified: 96 s warm vs 103 s warm on `.cpuAndNeuralEngine`. |
| `nanocodec_decoder`| `.cpuAndNeuralEngine` | Runs on ANE. |

The pin is implemented in `MagpieModelStore.swift:60` — caller-supplied
`computeUnits` is honored for all models *except* `decoder_step`, which is
forced to `.cpuAndGPU` (or `.cpuOnly` if the caller asked for `.cpuOnly`).
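The pin reduces to a small pure function. The sketch below is illustrative only, using a local enum in place of CoreML's `MLComputeUnits` so it stays self-contained; the real code configures an `MLModelConfiguration` per model in `MagpieModelStore.swift`.

```swift
// Illustrative sketch of the decoder_step compute-unit pin. Mirrors the
// documented behavior: honor the caller's choice for every model except
// decoder_step, which is forced to GPU unless the caller asked for cpuOnly.
enum ComputeUnits { case cpuOnly, cpuAndGPU, cpuAndNeuralEngine, all }

func resolvedUnits(model: String, requested: ComputeUnits) -> ComputeUnits {
    // decoder_step cannot compile for ANE (rank-4 split-K/V scatter),
    // so it is pinned to .cpuAndGPU.
    guard model == "decoder_step" else { return requested }
    return requested == .cpuOnly ? .cpuOnly : .cpuAndGPU
}
```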

## Performance journey

Two optimizations landed during the port; numbers are warm-average wall time on
M-series for an 8-word English sentence.

| Stage | Wall (warm) | Speedup |
| ------------------------------------------------------- | ----------- | ------- |
| Baseline: 110-step prefill loop, ANE on decoder_step | ~420 s | 1.0× |
| **Wire `decoder_prefill.mlmodelc` (1 batched call)** | ~110 s | 3.8× |
| **Pin decoder_step to `.cpuAndGPU`** | ~96 s | 4.4× |

The `decoder_prefill` asset was already on HF (`FluidInference/magpie-tts-multilingual-357m-coreml`)
and downloaded by `MagpieResourceDownloader`; it was simply never wired in. `prefillFast`
(`MagpiePrefill.swift:23`) replaces 110 sequential `decoder_step` calls with
one `decoder_prefill` call whose 12 stacked-K/V outputs (`var_208`, `var_374`,
… `var_1958`, each `[2, 1, 512, 12, 64]` fp16) are sliced via two `memcpy`s
per layer into the KV cache (`MagpieKvCache.seedFromPrefillOutputs`).
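The per-layer slicing reduces to two contiguous copies because each stacked output lays K ahead of V. A minimal sketch, with `UInt16` standing in for fp16 storage and hypothetical names (the real code is `MagpieKvCache.seedFromPrefillOutputs`):

```swift
import Foundation

// Seed one layer of the KV cache from a stacked prefill output of shape
// [2, 1, 512, 12, 64]: plane 0 is K, plane 1 is V. Because the planes are
// contiguous, each needs exactly one memcpy.
let planeCount = 1 * 512 * 12 * 64  // elements per K or V plane

func seedLayer(stacked: [UInt16], kCache: inout [UInt16], vCache: inout [UInt16]) {
    precondition(stacked.count == 2 * planeCount)
    stacked.withUnsafeBufferPointer { src in
        kCache.withUnsafeMutableBufferPointer { dst in
            // First plane → K cache.
            memcpy(dst.baseAddress!, src.baseAddress!,
                   planeCount * MemoryLayout<UInt16>.size)
        }
        vCache.withUnsafeMutableBufferPointer { dst in
            // Second plane → V cache, starting at element offset planeCount.
            memcpy(dst.baseAddress!, src.baseAddress! + planeCount,
                   planeCount * MemoryLayout<UInt16>.size)
        }
    }
}
```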

## Public API

```swift
let manager = try await MagpieTtsManager.downloadAndCreate(
languages: [.english],
cacheDirectory: nil,
computeUnits: .cpuAndNeuralEngine, // decoder_step pinned to GPU internally
progressHandler: nil
)

let result = try await manager.synthesize(
text: "Hello world.",
speaker: .john,
language: .english,
options: .default
)
// result.samples : [Float] (22 050 Hz)
// result.codeCount : Int
// result.durationSeconds : Double
```

## CLI

```bash
# Download all assets eagerly
swift run fluidaudiocli magpie download

# Synth
swift run fluidaudiocli magpie text "Hello world." --speaker 0 --output hello.wav
```

Parity, probe, and compute-plan tooling live upstream in `mobius` (Python) —
they exercise the export pipeline and are out of scope for the Swift runtime.

## Known issues

1. **spk0 trailing-word drift.** ASR shows a stray word at the end (e.g.
"…seashore, and"). Stage-by-stage parity probe (in `mobius`) localizes it
to fp16 sampler-trajectory non-determinism between Python+CoreML reference
and Swift+CoreML host: prefill SNR degrades L0=64 dB → L11=44 dB through
the 12-layer cache, then compounds in the AR loop. CoreML itself is
consistent between languages; the drift is host-floating-point + RNG/sampler
ordering. Not user-perceptible on speakers 1–4.

2. **`decoder_step` ANE compile failure is real.** Earlier benchmark with
zeroed `position` scalars showed a 3× ANE speedup; that was misleading —
with real incrementing positions the ANEF compile fails at runtime per
call. Keep the `.cpuAndGPU` pin.

## File map

```
Sources/FluidAudio/TTS/Magpie/
├── MagpieTtsManager.swift # public actor
├── MagpieConstants.swift # shapes, ids, file names, HF repo id
├── MagpieError.swift
├── MagpieTypes.swift
├── Assets/
│ ├── MagpieModelStore.swift # actor; loads 4 mlmodelcs, per-model compute units
│ ├── MagpieResourceDownloader.swift # HF download via DownloadUtils
│ ├── MagpieConstantsStore.swift
│ └── MagpieLocalTransformerWeights.swift
├── LocalTransformer/
│ ├── MagpieLocalTransformer.swift # 1-layer transformer (attention + FFN) via Accelerate (cblas_sgemm) + BNNS (GELU)
│ └── MagpieSampler.swift # top-k + temp + forbidden mask + CFG merge
├── Pipeline/
│ ├── Preprocess/ # per-language tokenizers + IPA override
│ └── Synthesize/
│ ├── MagpieSynthesizer.swift # orchestrates encode → prefill → AR → nanocodec
│ ├── MagpieKvCache.swift # 12 layers × (cache, position); seedFromPrefillOutputs
│ ├── MagpiePrefill.swift # prefillFast (batched) + prefill (110-step fallback)
│ └── MagpieNanocodec.swift
└── Shared/
├── NpyReader.swift # .npy v1 (fp32/fp16/int)
└── MagpieMT19937.swift # deterministic RNG matching Python reference

Sources/FluidAudioCLI/Commands/
└── MagpieCommand.swift # dispatch (download / text)
```
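The `.npy` v1 parsing that `NpyReader.swift` handles can be sketched as follows. This is a simplified illustration of the header format only (the real reader also decodes fp32/fp16/int payloads), and the type and function names are hypothetical.

```swift
import Foundation

// Minimal .npy v1.0 header parser: magic "\x93NUMPY", version bytes,
// little-endian UInt16 header length, then a Python dict literal such as
// {'descr': '<f4', 'fortran_order': False, 'shape': (110, 768), }
struct NpyHeader { let descr: String; let fortranOrder: Bool; let shape: [Int] }

func parseNpyHeader(_ data: Data) -> NpyHeader? {
    guard data.count > 10,
          data.prefix(6) == Data([0x93] + Array("NUMPY".utf8)) else { return nil }
    let headerLen = Int(data[8]) | (Int(data[9]) << 8)
    guard data.count >= 10 + headerLen,
          let dict = String(data: data[10..<(10 + headerLen)], encoding: .ascii)
    else { return nil }
    func capture(_ pattern: String) -> String? {
        guard let r = dict.range(of: pattern, options: .regularExpression) else { return nil }
        return String(dict[r])
    }
    guard let descr = capture("(?<='descr': ')[^']+") else { return nil }
    let fortran = dict.contains("'fortran_order': True")
    let shapeBody = capture("(?<='shape': \\()[^)]*") ?? ""
    let shape = shapeBody.split(separator: ",").compactMap {
        Int($0.trimmingCharacters(in: .whitespaces))
    }
    return NpyHeader(descr: descr, fortranOrder: fortran, shape: shape)
}
```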
4 changes: 2 additions & 2 deletions Package.swift
@@ -43,7 +43,7 @@ let package = Package(
.executableTarget(
name: "FluidAudioCLI",
dependencies: [
"FluidAudio",
"FluidAudio"
],
path: "Sources/FluidAudioCLI",
exclude: ["README.md"],
@@ -54,7 +54,7 @@
.testTarget(
name: "FluidAudioTests",
dependencies: [
"FluidAudio",
"FluidAudio"
]
),
],
42 changes: 41 additions & 1 deletion README.md
@@ -37,7 +37,7 @@ Want to convert your own model? Check [möbius](https://github.com/FluidInferenc

- **Automatic Speech Recognition (ASR)**: [Parakeet TDT v3](Documentation/Models.md#batch-transcription-near-real-time) (0.6b) and other TDT/CTC models for batch transcription supporting 25 European languages, Japanese, and Chinese; [Parakeet EOU](Documentation/Models.md#streaming-transcription-true-real-time) (120m) for streaming ASR with end-of-utterance detection (English only). See all [ASR models](Documentation/Models.md#asr-models).
- **Inverse Text Normalization (ITN)**: Post-process ASR output to convert spoken-form to written-form ("two hundred" → "200"). See [text-processing-rs](https://github.com/FluidInference/text-processing-rs)
- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (English only)
- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (English only); Magpie (357m) for autoregressive multilingual TTS with 5 speakers, `|…|` IPA override, and 8-language coverage (EN, ES, DE, FR, IT, VI, ZH, HI)
- **Speaker Diarization (Online + Offline)**: Speaker separation and identification across audio streams. Streaming pipeline for real-time processing and offline batch pipeline with advanced clustering.
- **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering, you can use this for speaker identification
- **Voice Activity Detection (VAD)**: Voice activity detection with Silero models
@@ -596,6 +596,46 @@ swift run fluidaudiocli tts "Hello from FluidAudio." --auto-download --output ou

Dictionary and model assets are cached under `~/.cache/fluidaudio/Models/kokoro`.

### Magpie (Multilingual)

Magpie TTS Multilingual (357M) is NVIDIA's autoregressive encoder-decoder TTS with 8-codebook NanoCodec vocoder output at 22.05 kHz. It exposes 5 built-in speakers and supports 8 languages (English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi) with a `|…|` IPA override that routes inline phoneme sequences directly to the tokenizer. Japanese is deferred pending OpenJTalk integration.

```swift
import FluidAudio

Task {
let manager = try await MagpieTtsManager.downloadAndCreate(
languages: [.english, .spanish]
)
let result = try await manager.synthesize(
text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
speaker: .john,
language: .english
)
let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)
try wav.write(to: URL(fileURLWithPath: "hello.wav"))
}
```

```bash
# Pre-download assets for selected languages
swift run fluidaudiocli magpie download --languages en,es

# Synthesize with IPA override enabled (default)
swift run fluidaudiocli magpie text --text "Hello | ˈ n ɛ m o ʊ |." \
--speaker 0 --language en --output hello.wav

# Classifier-free guidance and sampling controls
swift run fluidaudiocli magpie text --text "Bonjour." --language fr \
--cfg 2.5 --temperature 0.6 --topk 80 --seed 42 --output bonjour.wav
```
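The `|…|` IPA override amounts to splitting the input into alternating plain-text and phoneme segments before tokenization. A minimal sketch, with hypothetical names (the real per-language tokenizers live under `Pipeline/Preprocess/`):

```swift
import Foundation

// Split input on `|` delimiters: pieces at even indices are plain text,
// pieces at odd indices are inline IPA phoneme sequences.
enum Segment: Equatable {
    case text(String)
    case ipa(String)
}

func splitIpaOverrides(_ input: String) -> [Segment] {
    var segments: [Segment] = []
    for (i, piece) in input.split(separator: "|", omittingEmptySubsequences: false).enumerated() {
        let trimmed = piece.trimmingCharacters(in: .whitespaces)
        guard !trimmed.isEmpty else { continue }
        segments.append(i % 2 == 0 ? .text(trimmed) : .ipa(trimmed))
    }
    return segments
}
```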

Parity / probe / compute-plan tooling lives upstream in `mobius` (Python).

Assets (4 CoreML models + `constants/` + per-language tokenizer files) are fetched from [`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) on first use. The 1-layer local transformer (256d, top-k + temperature sampling, forbidden-token mask) runs on CPU via Accelerate/BNNS; the 12-layer decoder KV cache is rolled stateful across steps.

When `--seed N` is supplied, sampling is driven by a NumPy-compatible MT19937 RNG so the Swift output is bit-reproducible against the Python reference seeded with `np.random.seed(N)`.
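The generator core is the standard MT19937 algorithm. The sketch below shows the classic `init_genrand` seeding and tempering; note this is the textbook variant, and NumPy's `np.random.seed(N)` actually seeds via `init_by_array`, which a faithful port such as `MagpieMT19937.swift` would also need to mirror.

```swift
// Standard MT19937 (Mersenne Twister) 32-bit generator: 624-word state,
// "twist" regeneration every 624 draws, then tempering of each output.
struct MT19937 {
    private var mt = [UInt32](repeating: 0, count: 624)
    private var index = 624

    init(seed: UInt32) {
        mt[0] = seed
        for i in 1..<624 {
            mt[i] = 1_812_433_253 &* (mt[i - 1] ^ (mt[i - 1] >> 30)) &+ UInt32(i)
        }
    }

    mutating func next32() -> UInt32 {
        if index >= 624 {  // regenerate the whole state block
            for i in 0..<624 {
                let y = (mt[i] & 0x8000_0000) | (mt[(i + 1) % 624] & 0x7FFF_FFFF)
                mt[i] = mt[(i + 397) % 624] ^ (y >> 1) ^ (y & 1 == 1 ? 0x9908_B0DF : 0)
            }
            index = 0
        }
        var y = mt[index]
        index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C_5680
        y ^= (y << 15) & 0xEFC6_0000
        return y ^ (y >> 18)
    }
}
```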

## Continuous Integration

- `tests.yml`: Default build matrix covering SwiftPM tests and an iOS archive smoke test.
36 changes: 36 additions & 0 deletions Sources/FluidAudio/ModelNames.swift
@@ -29,6 +29,7 @@ public enum Repo: String, CaseIterable, Sendable {
case multilingualG2p = "FluidInference/charsiu-g2p-byt5-coreml"
case parakeetTdtCtc110m = "FluidInference/parakeet-tdt-ctc-110m-coreml"
case cohereTranscribeCoreml = "FluidInference/cohere-transcribe-03-2026-coreml/q8"
case magpieTts = "FluidInference/magpie-tts-multilingual-357m-coreml"

/// Repository slug (without owner)
public var name: String {
@@ -81,6 +82,8 @@ public enum Repo: String, CaseIterable, Sendable {
return "parakeet-tdt-ctc-110m-coreml"
case .cohereTranscribeCoreml:
return "cohere-transcribe-03-2026-coreml/q8"
case .magpieTts:
return "magpie-tts-multilingual-357m-coreml"
}
}

@@ -171,6 +174,8 @@ public enum Repo: String, CaseIterable, Sendable {
return "parakeet-tdt-ctc-110m"
case .cohereTranscribeCoreml:
return "cohere-transcribe/q8"
case .magpieTts:
return "magpie-tts"
default:
return name.replacingOccurrences(of: "-coreml", with: "")
}
@@ -591,6 +596,35 @@ public enum ModelNames {
]
}

/// Magpie TTS Multilingual 357M model names.
///
/// Four CoreML models + a `constants/` directory + a `tokenizer/` directory of
/// per-language lookup data. The `decoder_prefill` model is optional; when
/// absent the prefill runs step-by-step through `decoder_step`.
public enum Magpie {
public static let textEncoder = "text_encoder"
public static let decoderPrefill = "decoder_prefill"
public static let decoderStep = "decoder_step"
public static let nanocodecDecoder = "nanocodec_decoder"

public static let textEncoderFile = textEncoder + ".mlmodelc"
public static let decoderPrefillFile = decoderPrefill + ".mlmodelc"
public static let decoderStepFile = decoderStep + ".mlmodelc"
public static let nanocodecDecoderFile = nanocodecDecoder + ".mlmodelc"

public static let constantsDir = "constants"
public static let tokenizerDir = "tokenizer"

/// Files required for English synthesis. Other languages append their own
/// lookup files on top (see `MagpieResourceDownloader`).
public static let requiredModels: Set<String> = [
textEncoderFile,
decoderStepFile,
nanocodecDecoderFile,
constantsDir,
]
}

/// Multilingual G2P (CharsiuG2P ByT5) model names
public enum MultilingualG2P {
public static let encoder = "MultilingualG2PEncoder"
@@ -760,6 +794,8 @@ public enum ModelNames {
return ModelNames.MultilingualG2P.requiredModels
case .cohereTranscribeCoreml:
return ModelNames.CohereTranscribe.requiredModels
case .magpieTts:
return ModelNames.Magpie.requiredModels
}
}
}