18 commits
- `27a6203` feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port (Alex-Wengg, Apr 25, 2026)
- `ff95086` perf(tts/magpie): wire decoder_prefill fast path + pin decoder_step t… (Alex-Wengg, Apr 26, 2026)
- `b3bb985` refactor(tts/magpie): drop fp64 LocalTransformer + move NpzReader to CLI (Alex-Wengg, Apr 26, 2026)
- `7890d4` fix(tts/magpie): unblock iOS build — work around Swift 6 isolation ch… (Alex-Wengg, Apr 26, 2026)
- `332ac2d` docs(tts/magpie): sync file map with post-cleanup layout (Alex-Wengg, Apr 26, 2026)
- `7348c18` docs(models): add Magpie TTS Multilingual entry + HF repo (Alex-Wengg, Apr 26, 2026)
- `206c177` refactor(tts/magpie): drop unused speaker_info plumbing + dead helpers (Alex-Wengg, Apr 26, 2026)
- `66ec141` refactor(tts/magpie): drop unused D binding + redundant CFG guard (Alex-Wengg, Apr 26, 2026)
- `e0a7a80` refactor(tts/magpie): purge 4 dead-code orphans (Alex-Wengg, Apr 26, 2026)
- `c7ed31f` refactor(tts/magpie): purge wasted manager alloc + tighten NPZ parse … (Alex-Wengg, Apr 26, 2026)
- `4599fcf` refactor(tts/magpie): scope CLI to download + text; move parity to mo… (Alex-Wengg, Apr 26, 2026)
- `aa65232` refactor(tts/magpie): drop NumPy bit-parity tests; move to mobius (Alex-Wengg, Apr 26, 2026)
- `624dcb5` perf(tts/magpie): CFG default off + sampler heap + vDSP embed (Alex-Wengg, Apr 26, 2026)
- `1fb54db` perf(tts/magpie): kill LT alloc churn in QKV split + LN + GELU (Alex-Wengg, Apr 26, 2026)
- `7783775` perf(tts/magpie): nanocodec to cpuOnly + per-stage timing breakdown (Alex-Wengg, Apr 26, 2026)
- `2ecffad` feat(cli/magpie): add in-process bench subcommand for stable RTFx med… (Alex-Wengg, Apr 26, 2026)
- `13b34ec` perf(tts/magpie): bind decoder_step outputs via outputBackings + doub… (Alex-Wengg, Apr 26, 2026)
- `e2c73a8` feat(tts/magpie): chunk-level streaming + ANE-warmup at init (Alex-Wengg, Apr 27, 2026)
2 changes: 2 additions & 0 deletions Documentation/Models.md
@@ -51,6 +51,7 @@ TDT models process audio in chunks (~15s with overlap) as batch operations.
|-------|-------------|---------|
| **Kokoro TTS** | Text-to-speech synthesis (82M params), 48 voices, minimal RAM usage on iOS. Generates all frames at once via flow matching over mel spectrograms + Vocos vocoder. Uses CoreML G2P model for phonemization. | First TTS backend added + support custom pronounces |
| **PocketTTS** | Second TTS backend (~155M params). Autoregressive frame-by-frame generation with dynamic audio chunking. No phoneme stage, works directly on text tokens. | Supports streaming, minimal RAM usage, excellent quality |
| **Magpie TTS Multilingual** | NVIDIA NeMo Magpie TTS Multilingual 357M, 8 languages (en/es/de/fr/it/vi/zh/hi), 5 built-in speakers. 4-model CoreML pipeline: text_encoder + decoder_prefill + decoder_step + nanocodec_decoder. Custom IPA override via `\|...\|` segments. Local Transformer (8-codebook sampler) implemented in pure Swift via Accelerate + BNNS. | Third TTS backend. Japanese deferred (needs OpenJTalk + MeCab dict). |

## Evaluated Models (Not Supported)

@@ -81,4 +82,5 @@ Models we converted and tested but are not supported: too large for on-device de
| Sortformer | [FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml) |
| Kokoro TTS | [FluidInference/kokoro-82m-coreml](https://huggingface.co/FluidInference/kokoro-82m-coreml) |
| PocketTTS | [FluidInference/pocket-tts-coreml](https://huggingface.co/FluidInference/pocket-tts-coreml) |
| Magpie TTS Multilingual | [FluidInference/magpie-tts-multilingual-357m-coreml](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) |
| Nemotron Streaming | [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml) |
145 changes: 145 additions & 0 deletions Documentation/TTS/Magpie.md
@@ -0,0 +1,145 @@
# Magpie TTS Multilingual (Swift Port)

Swift port of NVIDIA NeMo Magpie TTS Multilingual 357M, exported to CoreML.
Lives under `Sources/FluidAudio/TTS/Magpie/`.

## Status

Functional. Audio quality is perceptually clean across all 5 speakers. The first
synthesis on a fresh process is dominated by CoreML model load plus first-call
ANE compile (~30 s); warm synths run at ~96 s wall for an 8-word English
sentence on M-series (RTFx ~0.04). Output is ASR-clean on 4 of 5 speakers;
spk0 has a single trailing-word artifact ("…and") attributable to fp16
sampler-trajectory drift, not a structural bug.

Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework + MeCab
dict), CFG performance optimization, MLX-backed LocalTransformer.

## Architecture

```
text → MagpieTokenizer (per-language) → text_encoder.mlmodelc
speaker_N.npy (110×768) → decoder_prefill.mlmodelc (1 batched call)
        └──→ KV cache (12 layers × [2,1,512,12,64] fp16)
AR loop (decoder_step.mlmodelc, ≤500 steps):
├─ LocalTransformer (Swift, Accelerate+BNNS)
├─ Sampler (top-k=80, temp=0.6, forbidden mask)
├─ embed sampled (8) codes → next decoder_step input
└─ stop on audio_eos_id (2017) or maxSteps
nanocodec_decoder.mlmodelc → 22 050 Hz Float32 PCM
```
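The AR loop above can be sketched in plain Swift. This is an illustrative skeleton, not the real orchestration (which lives in `MagpieSynthesizer.swift`); `stepModel` stands in for a `decoder_step.mlmodelc` call and `sample` for the top-k sampler, and both names are hypothetical.

```swift
// Sketch of the autoregressive decode loop. Each step produces logits for
// 8 codebooks; sampling them yields the next 8-code frame, which feeds the
// next step. Decoding stops on audio_eos_id or maxSteps.
let audioEosId = 2017
let maxSteps = 500

func decodeLoop(
    stepModel: (_ codes: [Int], _ position: Int) -> [[Float]],  // 8 × vocab logits
    sample: (_ logits: [Float]) -> Int                          // top-k + temperature
) -> [[Int]] {
    var frames: [[Int]] = []
    var codes = [Int](repeating: 0, count: 8)  // placeholder BOS frame
    for position in 0..<maxSteps {
        let logitsPerCodebook = stepModel(codes, position)
        codes = logitsPerCodebook.map(sample)
        // A frame whose first codebook hits audio_eos_id ends the utterance.
        if codes[0] == audioEosId { break }
        frames.append(codes)
    }
    return frames
}
```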

## Compute placement (verified end-to-end)

| Model | Compute units | Reasoning |
| ------------------ | ------------------------ | ------------------------------------------------------------------------------------------------------------ |
| `text_encoder` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.5× vs CPU. |
| `decoder_prefill` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.2× vs CPU. One batched call replaces 110 sequential `decoder_step` calls. |
| `decoder_step` | **`.cpuAndGPU`** | Pinned. ANE compile fails (`MILCompilerForANE: ANECCompile() FAILED`) due to rank-4 split-K/V scatter; on `.cpuAndNeuralEngine` it falls back to CPU at ~hundreds-of-ms cost per call. GPU (Metal MPS) is fastest. Verified: 96 s warm vs 103 s warm on `.cpuAndNeuralEngine`. |
| `nanocodec_decoder`| `.cpuAndNeuralEngine` | Runs on ANE. |

The pin is implemented in `MagpieModelStore.swift:60` — caller-supplied
`computeUnits` is honored for all models *except* `decoder_step`, which is
forced to `.cpuAndGPU` (or `.cpuOnly` if the caller asked for `.cpuOnly`).
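The pin reduces to a small pure function. The sketch below is illustrative only, using a local enum in place of CoreML's `MLComputeUnits` so it stays self-contained; the real code configures an `MLModelConfiguration` per model in `MagpieModelStore.swift`.

```swift
// Illustrative sketch of the decoder_step compute-unit pin. Mirrors the
// documented behavior: honor the caller's choice for every model except
// decoder_step, which is forced to GPU unless the caller asked for cpuOnly.
enum ComputeUnits { case cpuOnly, cpuAndGPU, cpuAndNeuralEngine, all }

func resolvedUnits(model: String, requested: ComputeUnits) -> ComputeUnits {
    // decoder_step cannot compile for ANE (rank-4 split-K/V scatter),
    // so it is pinned to .cpuAndGPU.
    guard model == "decoder_step" else { return requested }
    return requested == .cpuOnly ? .cpuOnly : .cpuAndGPU
}
```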

## Performance journey

Two optimizations landed during the port; numbers are warm-average wall time on
M-series for an 8-word English sentence.

| Stage | Wall (warm) | Speedup |
| ------------------------------------------------------- | ----------- | ------- |
| Baseline: 110-step prefill loop, ANE on decoder_step | ~420 s | 1.0× |
| **Wire `decoder_prefill.mlmodelc` (1 batched call)** | ~110 s | 3.8× |
| **Pin decoder_step to `.cpuAndGPU`** | ~96 s | 4.4× |

The `decoder_prefill` asset was already on HF (`FluidInference/magpie-tts-multilingual-357m-coreml`)
and downloaded by `MagpieResourceDownloader`; it was simply never wired in. `prefillFast`
(`MagpiePrefill.swift:23`) replaces 110 sequential `decoder_step` calls with
one `decoder_prefill` call whose 12 stacked-K/V outputs (`var_208`, `var_374`,
… `var_1958`, each `[2, 1, 512, 12, 64]` fp16) are sliced via two `memcpy`s
per layer into the KV cache (`MagpieKvCache.seedFromPrefillOutputs`).
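The per-layer slicing reduces to two contiguous copies because each stacked output lays K ahead of V. A minimal sketch, with `UInt16` standing in for fp16 storage and hypothetical names (the real code is `MagpieKvCache.seedFromPrefillOutputs`):

```swift
import Foundation

// Seed one layer of the KV cache from a stacked prefill output of shape
// [2, 1, 512, 12, 64]: plane 0 is K, plane 1 is V. Because the planes are
// contiguous, each needs exactly one memcpy.
let planeCount = 1 * 512 * 12 * 64  // elements per K or V plane

func seedLayer(stacked: [UInt16], kCache: inout [UInt16], vCache: inout [UInt16]) {
    precondition(stacked.count == 2 * planeCount)
    stacked.withUnsafeBufferPointer { src in
        kCache.withUnsafeMutableBufferPointer { dst in
            // First plane → K cache.
            memcpy(dst.baseAddress!, src.baseAddress!,
                   planeCount * MemoryLayout<UInt16>.size)
        }
        vCache.withUnsafeMutableBufferPointer { dst in
            // Second plane → V cache, starting at element offset planeCount.
            memcpy(dst.baseAddress!, src.baseAddress! + planeCount,
                   planeCount * MemoryLayout<UInt16>.size)
        }
    }
}
```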

## Public API

```swift
let manager = try await MagpieTtsManager.downloadAndCreate(
languages: [.english],
cacheDirectory: nil,
computeUnits: .cpuAndNeuralEngine, // decoder_step pinned to GPU internally
progressHandler: nil
)

let result = try await manager.synthesize(
text: "Hello world.",
speaker: .john,
language: .english,
options: .default
)
// result.samples : [Float] (22 050 Hz)
// result.codeCount : Int
// result.durationSeconds : Double
```

## CLI

```bash
# Download all assets eagerly
swift run fluidaudiocli magpie download

# Synth
swift run fluidaudiocli magpie text "Hello world." --speaker 0 --output hello.wav
```

Parity, probe, and compute-plan tooling live upstream in `mobius` (Python) —
they exercise the export pipeline and are out of scope for the Swift runtime.

## Known issues

1. **spk0 trailing-word drift.** ASR shows a stray word at the end (e.g.
"…seashore, and"). Stage-by-stage parity probe (in `mobius`) localizes it
to fp16 sampler-trajectory non-determinism between Python+CoreML reference
and Swift+CoreML host: prefill SNR degrades L0=64 dB → L11=44 dB through
the 12-layer cache, then compounds in the AR loop. CoreML itself is
consistent between languages; the drift is host-floating-point + RNG/sampler
ordering. Not user-perceptible on speakers 1–4.

2. **`decoder_step` ANE compile failure is real.** Earlier benchmark with
zeroed `position` scalars showed a 3× ANE speedup; that was misleading —
with real incrementing positions the ANEF compile fails at runtime per
call. Keep the `.cpuAndGPU` pin.

## File map

```
Sources/FluidAudio/TTS/Magpie/
├── MagpieTtsManager.swift # public actor
├── MagpieConstants.swift # shapes, ids, file names, HF repo id
├── MagpieError.swift
├── MagpieTypes.swift
├── Assets/
│ ├── MagpieModelStore.swift # actor; loads 4 mlmodelcs, per-model compute units
│ ├── MagpieResourceDownloader.swift # HF download via DownloadUtils
│ ├── MagpieConstantsStore.swift
│ └── MagpieLocalTransformerWeights.swift
├── LocalTransformer/
│ ├── MagpieLocalTransformer.swift # 1-layer transformer (attention + FFN) via Accelerate (cblas_sgemm) + BNNS (GELU)
│ └── MagpieSampler.swift # top-k + temp + forbidden mask + CFG merge
├── Pipeline/
│ ├── Preprocess/ # per-language tokenizers + IPA override
│ └── Synthesize/
│ ├── MagpieSynthesizer.swift # orchestrates encode → prefill → AR → nanocodec
│ ├── MagpieKvCache.swift # 12 layers × (cache, position); seedFromPrefillOutputs
│ ├── MagpiePrefill.swift # prefillFast (batched) + prefill (110-step fallback)
│ └── MagpieNanocodec.swift
└── Shared/
├── NpyReader.swift # .npy v1 (fp32/fp16/int)
└── MagpieMT19937.swift # deterministic RNG matching Python reference

Sources/FluidAudioCLI/Commands/
└── MagpieCommand.swift # dispatch (download / text)
```
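The `.npy` v1 parsing that `NpyReader.swift` handles can be sketched as follows. This is a simplified illustration of the header format only (the real reader also decodes fp32/fp16/int payloads), and the type and function names are hypothetical.

```swift
import Foundation

// Minimal .npy v1.0 header parser: magic "\x93NUMPY", version bytes,
// little-endian UInt16 header length, then a Python dict literal such as
// {'descr': '<f4', 'fortran_order': False, 'shape': (110, 768), }
struct NpyHeader { let descr: String; let fortranOrder: Bool; let shape: [Int] }

func parseNpyHeader(_ data: Data) -> NpyHeader? {
    guard data.count > 10,
          data.prefix(6) == Data([0x93] + Array("NUMPY".utf8)) else { return nil }
    let headerLen = Int(data[8]) | (Int(data[9]) << 8)
    guard data.count >= 10 + headerLen,
          let dict = String(data: data[10..<(10 + headerLen)], encoding: .ascii)
    else { return nil }
    func capture(_ pattern: String) -> String? {
        guard let r = dict.range(of: pattern, options: .regularExpression) else { return nil }
        return String(dict[r])
    }
    guard let descr = capture("(?<='descr': ')[^']+") else { return nil }
    let fortran = dict.contains("'fortran_order': True")
    let shapeBody = capture("(?<='shape': \\()[^)]*") ?? ""
    let shape = shapeBody.split(separator: ",").compactMap {
        Int($0.trimmingCharacters(in: .whitespaces))
    }
    return NpyHeader(descr: descr, fortranOrder: fortran, shape: shape)
}
```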
4 changes: 2 additions & 2 deletions Package.swift
@@ -43,7 +43,7 @@ let package = Package(
.executableTarget(
name: "FluidAudioCLI",
dependencies: [
"FluidAudio",
"FluidAudio"
],
path: "Sources/FluidAudioCLI",
exclude: ["README.md"],
@@ -54,7 +54,7 @@
.testTarget(
name: "FluidAudioTests",
dependencies: [
"FluidAudio",
"FluidAudio"
]
),
],
42 changes: 41 additions & 1 deletion README.md
@@ -37,7 +37,7 @@ Want to convert your own model? Check [möbius](https://github.com/FluidInferenc

- **Automatic Speech Recognition (ASR)**: [Parakeet TDT v3](Documentation/Models.md#batch-transcription-near-real-time) (0.6b) and other TDT/CTC models for batch transcription supporting 25 European languages, Japanese, and Chinese; [Parakeet EOU](Documentation/Models.md#streaming-transcription-true-real-time) (120m) for streaming ASR with end-of-utterance detection (English only). See all [ASR models](Documentation/Models.md#asr-models).
- **Inverse Text Normalization (ITN)**: Post-process ASR output to convert spoken-form to written-form ("two hundred" → "200"). See [text-processing-rs](https://github.com/FluidInference/text-processing-rs)
- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (English only)
- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (English only); Magpie (357m) for autoregressive multilingual TTS with 5 speakers, `|…|` IPA override, and 8-language coverage (EN, ES, DE, FR, IT, VI, ZH, HI)
- **Speaker Diarization (Online + Offline)**: Speaker separation and identification across audio streams. Streaming pipeline for real-time processing and offline batch pipeline with advanced clustering.
- **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering, you can use this for speaker identification
- **Voice Activity Detection (VAD)**: Voice activity detection with Silero models
@@ -596,6 +596,46 @@ swift run fluidaudiocli tts "Hello from FluidAudio." --auto-download --output ou

Dictionary and model assets are cached under `~/.cache/fluidaudio/Models/kokoro`.

### Magpie (Multilingual)

Magpie TTS Multilingual (357M) is NVIDIA's autoregressive encoder-decoder TTS with 8-codebook NanoCodec vocoder output at 22.05 kHz. It exposes 5 built-in speakers and supports 8 languages (English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi) with a `|…|` IPA override that routes inline phoneme sequences directly to the tokenizer. Japanese is deferred pending OpenJTalk integration.

```swift
import FluidAudio

Task {
let manager = try await MagpieTtsManager.downloadAndCreate(
languages: [.english, .spanish]
)
let result = try await manager.synthesize(
text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
speaker: .john,
language: .english
)
let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)
try wav.write(to: URL(fileURLWithPath: "hello.wav"))
}
```

```bash
# Pre-download assets for selected languages
swift run fluidaudiocli magpie download --languages en,es

# Synthesize with IPA override enabled (default)
swift run fluidaudiocli magpie text --text "Hello | ˈ n ɛ m o ʊ |." \
--speaker 0 --language en --output hello.wav

# Classifier-free guidance and sampling controls
swift run fluidaudiocli magpie text --text "Bonjour." --language fr \
--cfg 2.5 --temperature 0.6 --topk 80 --seed 42 --output bonjour.wav
```
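The `|…|` IPA override amounts to splitting the input into alternating plain-text and phoneme segments before tokenization. A minimal sketch, with hypothetical names (the real per-language tokenizers live under `Pipeline/Preprocess/`):

```swift
import Foundation

// Split input on `|` delimiters: pieces at even indices are plain text,
// pieces at odd indices are inline IPA phoneme sequences.
enum Segment: Equatable {
    case text(String)
    case ipa(String)
}

func splitIpaOverrides(_ input: String) -> [Segment] {
    var segments: [Segment] = []
    for (i, piece) in input.split(separator: "|", omittingEmptySubsequences: false).enumerated() {
        let trimmed = piece.trimmingCharacters(in: .whitespaces)
        guard !trimmed.isEmpty else { continue }
        segments.append(i % 2 == 0 ? .text(trimmed) : .ipa(trimmed))
    }
    return segments
}
```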

Parity / probe / compute-plan tooling lives upstream in `mobius` (Python).

Assets (4 CoreML models + `constants/` + per-language tokenizer files) are fetched from [`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) on first use. The 1-layer local transformer (256d, top-k + temperature sampling, forbidden-token mask) runs on CPU via Accelerate/BNNS; the 12-layer decoder KV cache is rolled stateful across steps.

When `--seed N` is supplied, sampling is driven by a NumPy-compatible MT19937 RNG so the Swift output is bit-reproducible against the Python reference seeded with `np.random.seed(N)`.
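The generator core is the standard MT19937 algorithm. The sketch below shows the classic `init_genrand` seeding and tempering; note this is the textbook variant, and NumPy's `np.random.seed(N)` actually seeds via `init_by_array`, which a faithful port such as `MagpieMT19937.swift` would also need to mirror.

```swift
// Standard MT19937 (Mersenne Twister) 32-bit generator: 624-word state,
// "twist" regeneration every 624 draws, then tempering of each output.
struct MT19937 {
    private var mt = [UInt32](repeating: 0, count: 624)
    private var index = 624

    init(seed: UInt32) {
        mt[0] = seed
        for i in 1..<624 {
            mt[i] = 1_812_433_253 &* (mt[i - 1] ^ (mt[i - 1] >> 30)) &+ UInt32(i)
        }
    }

    mutating func next32() -> UInt32 {
        if index >= 624 {  // regenerate the whole state block
            for i in 0..<624 {
                let y = (mt[i] & 0x8000_0000) | (mt[(i + 1) % 624] & 0x7FFF_FFFF)
                mt[i] = mt[(i + 397) % 624] ^ (y >> 1) ^ (y & 1 == 1 ? 0x9908_B0DF : 0)
            }
            index = 0
        }
        var y = mt[index]
        index += 1
        y ^= y >> 11
        y ^= (y << 7) & 0x9D2C_5680
        y ^= (y << 15) & 0xEFC6_0000
        return y ^ (y >> 18)
    }
}
```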

## Continuous Integration

- `tests.yml`: Default build matrix covering SwiftPM tests and an iOS archive smoke test.
36 changes: 36 additions & 0 deletions Sources/FluidAudio/ModelNames.swift
@@ -29,6 +29,7 @@ public enum Repo: String, CaseIterable, Sendable {
case multilingualG2p = "FluidInference/charsiu-g2p-byt5-coreml"
case parakeetTdtCtc110m = "FluidInference/parakeet-tdt-ctc-110m-coreml"
case cohereTranscribeCoreml = "FluidInference/cohere-transcribe-03-2026-coreml/q8"
case magpieTts = "FluidInference/magpie-tts-multilingual-357m-coreml"

/// Repository slug (without owner)
public var name: String {
@@ -81,6 +82,8 @@ public enum Repo: String, CaseIterable, Sendable {
return "parakeet-tdt-ctc-110m-coreml"
case .cohereTranscribeCoreml:
return "cohere-transcribe-03-2026-coreml/q8"
case .magpieTts:
return "magpie-tts-multilingual-357m-coreml"
}
}

@@ -171,6 +174,8 @@ public enum Repo: String, CaseIterable, Sendable {
return "parakeet-tdt-ctc-110m"
case .cohereTranscribeCoreml:
return "cohere-transcribe/q8"
case .magpieTts:
return "magpie-tts"
default:
return name.replacingOccurrences(of: "-coreml", with: "")
}
@@ -591,6 +596,35 @@ public enum ModelNames {
]
}

/// Magpie TTS Multilingual 357M model names.
///
/// Four CoreML models + a `constants/` directory + a `tokenizer/` directory of
/// per-language lookup data. The `decoder_prefill` model is optional; when
/// absent the prefill runs step-by-step through `decoder_step`.
public enum Magpie {
public static let textEncoder = "text_encoder"
public static let decoderPrefill = "decoder_prefill"
public static let decoderStep = "decoder_step"
public static let nanocodecDecoder = "nanocodec_decoder"

public static let textEncoderFile = textEncoder + ".mlmodelc"
public static let decoderPrefillFile = decoderPrefill + ".mlmodelc"
public static let decoderStepFile = decoderStep + ".mlmodelc"
public static let nanocodecDecoderFile = nanocodecDecoder + ".mlmodelc"

public static let constantsDir = "constants"
public static let tokenizerDir = "tokenizer"

/// Files required for English synthesis. Other languages append their own
/// lookup files on top (see `MagpieResourceDownloader`).
public static let requiredModels: Set<String> = [
textEncoderFile,
decoderStepFile,
nanocodecDecoderFile,
constantsDir,
]
}

/// Multilingual G2P (CharsiuG2P ByT5) model names
public enum MultilingualG2P {
public static let encoder = "MultilingualG2PEncoder"
@@ -760,6 +794,8 @@ public enum ModelNames {
return ModelNames.MultilingualG2P.requiredModels
case .cohereTranscribeCoreml:
return ModelNames.CohereTranscribe.requiredModels
case .magpieTts:
return ModelNames.Magpie.requiredModels
}
}
}