feat(tts/magpie): add NVIDIA Magpie TTS Multilingual 357M Swift port #541

Alex-Wengg wants to merge 1 commit into main
Conversation
Ports the Magpie TTS Multilingual 357M autoregressive TTS from Python (mobius PR #24) to Swift, covering 8 languages (EN, ES, DE, FR, IT, VI, ZH, HI). Japanese is deferred pending OpenJTalk integration.

Highlights:
- Encoder-decoder transformer + NanoCodec vocoder, 22 kHz output.
- 5 built-in speakers; `|...|` inline-IPA override routes phoneme tokens directly to the tokenizer for fine-grained pronunciation control.
- 1-layer local transformer (256d) runs on CPU via Accelerate/BNNS with top-k + temperature sampling and audio-EOS / forbidden-token masking.
- 12-layer decoder KV cache rolled statefully across `decoder_step` calls; optional `decoder_prefill` fast path for the speaker context.
- Assets (4 CoreML models + constants/ + tokenizer/) auto-fetch from `FluidInference/magpie-tts-multilingual-357m-coreml` on first use.
- New CLI: `fluidaudiocli magpie {download,text,parity,tokenizer-parity}`.
- Public API: `MagpieTtsManager.downloadAndCreate(languages:)` actor.
- Unit tests: IPA override segmentation, KV-cache shape, NeMo tokenizer parity, and NPY v1 fp16/fp32 reader (17 tests, all passing).
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics

Streaming Metrics

Test runtime: 0m59s • 04/24/2026, 11:01 PM EST • RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Kokoro TTS Smoke Test ✅

Runtime: 0m37s
Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. The CI VM lacks a physical ANE — performance may differ from Apple Silicon.
```swift
case "--no-ipa-override":
    allowIpa = false
default:
    if text == nil { text = arg }
}
```
🔴 CLI parser missing --text flag causes README-documented syntax to synthesize wrong text
The README documents the Magpie `text` subcommand with `--text` as a named flag (e.g. `swift run fluidaudiocli magpie text --text "Hello | ˈ n ɛ m o ʊ |." --speaker 0`), but the `MagpieCommand.runText` parser at `Sources/FluidAudioCLI/Commands/MagpieCommand.swift:84-124` has no `case "--text":` handler. The `default:` branch at line 123 captures `"--text"` as the text content, and the actual text argument is silently dropped. Any user following the README examples at README.md:625 or README.md:629 will synthesize the literal string `"--text"` as speech instead of the intended text.
Before:

```swift
case "--no-ipa-override":
    allowIpa = false
default:
    if text == nil { text = arg }
}
```

Suggested:

```swift
case "--no-ipa-override":
    allowIpa = false
case "--text":
    if i + 1 < arguments.count {
        text = arguments[i + 1]
        i += 1
    }
default:
    if text == nil { text = arg }
```
```swift
// Find kth-largest threshold via partial sort.
var indexed = truncated.enumerated().map { ($0.offset, $0.element) }
indexed.sort { $0.1 > $1.1 }
let threshold = indexed[topK - 1].1
for i in 0..<truncated.count {
    if truncated[i] < threshold {
        truncated[i] = -.infinity
    }
}
```
🟡 Top-K sampling keeps more than K tokens when logit values are tied at the threshold
In `sampleTopK`, the threshold is set to the K-th largest logit and values strictly below it are masked to `-inf`. When multiple logits share the same value as the threshold, all of them survive, potentially keeping significantly more than K candidates. This diverges from the Python reference, which uses `torch.topk` to select exactly K values. For example, if many logits cluster around the same value near the K boundary, the effective sampling set grows, diluting the probability distribution. In extreme cases (e.g., all logits equal), no tokens would be masked at all despite `topK=80` and vocab=2024.
Prompt for agents
The `sampleTopK` function in `MagpieSampler.swift` uses a threshold-based approach to top-K filtering that keeps all values >= the K-th largest value. This means ties at the threshold boundary cause more than K tokens to survive. The Python reference uses `torch.topk`, which returns exactly K values (arbitrary tie-breaking). To match the reference behavior, after sorting the indexed array (lines 123-124), only keep the first `topK` entries by index. For example, collect the indices of the top-K entries from the sorted `indexed` array into a `Set`, then mask all indices NOT in that set to `-.infinity`. This ensures exactly K tokens survive regardless of ties.
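A sketch of the exact-K masking described above, assuming a `truncated: [Float]` logit buffer and a `topK` count as in the snippet under review (the function name is illustrative, not the PR's code):

```swift
// Exact top-K: keep exactly topK indices (arbitrary tie-breaking, matching
// torch.topk semantics) and mask everything else to -inf.
// `truncated`/`topK` mirror the reviewed snippet; this is a sketch only.
func maskToTopK(_ truncated: inout [Float], topK: Int) {
    guard topK < truncated.count else { return }
    let indexed = truncated.enumerated().sorted { $0.element > $1.element }
    let keep = Set(indexed.prefix(topK).map { $0.offset })
    for i in truncated.indices where !keep.contains(i) {
        truncated[i] = -.infinity
    }
}
```

With all logits equal and `topK = 2`, exactly two entries survive, whereas the threshold approach would keep every entry.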
Qwen3-ASR int8 Smoke Test ✅

Performance Metrics

Runtime: 3m12s
Note: The CI VM lacks a physical GPU — the CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 3m 12s • 2026-04-25T03:04:58.885Z
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Parakeet v2 (English-optimized)

Streaming (v3)

Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 6m16s • 04/24/2026, 11:07 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows the HuggingFace Open ASR Leaderboard.
PocketTTS Smoke Test ✅

Runtime: 0m49s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU — audio quality and performance may differ from Apple Silicon.
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-score above 70%
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode) — optimal clustering with Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown — time spent in each stage of batch diarization

Speaker Diarization Research Comparison — offline VBx achieves competitive accuracy with batch processing

Pipeline Details:

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 111.8s processing • Test runtime: 1m 55s • 04/24/2026, 11:11 PM EST
Speaker Diarization Benchmark Results

Speaker Diarization Performance — evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown — time spent in each stage of speaker diarization

Speaker Diarization Research Comparison — research baselines typically achieve 18-30% DER on standard datasets

Note: RTFx shown above is from the GitHub Actions runner. On Apple Silicon with ANE:

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 40.6s diarization time • Test runtime: 2m 22s • 04/24/2026, 11:14 PM EST
Summary
Ports the NVIDIA Magpie TTS Multilingual 357M autoregressive TTS from Python (mobius #24) to Swift. Closes #49.
- 5 built-in speakers (`.john`, `.sofia`, `.aria`, `.jason`, `.leo`) with 110-token (768d fp16) context embeddings.
- Inline IPA override (`"Hello | ˈ n ɛ m o ʊ | world"`) routes `|…|` segments directly to the tokenizer for pronunciation control — first-class feature as requested.

HF assets — live
`FluidInference/magpie-tts-multilingual-357m-coreml` is uploaded and ready (1.4 GB). Ships:

- `text_encoder.{mlmodelc,mlpackage}` — both compiled and portable
- `decoder_step.{mlmodelc,mlpackage}` — stateful 12-layer KV cache
- `decoder_prefill.{mlmodelc,mlpackage}` — fast prefill path (110-token batched)
- `nanocodec_decoder.{mlmodelc,mlpackage}` — 8-codebook → 22 kHz PCM
- `constants/` — `constants.json`, `speaker_info.json`, 8 audio-codebook embeddings, 5 speaker contexts, local-transformer weights
- `tokenizer/` — per-language phoneme/jieba/pypinyin lookups (lazy-downloaded)
- `manifest.json` — machine-readable index (sha256, file sizes, npy shapes, model IO specs) consumed by `MagpieResourceDownloader`

Architecture
- `text_encoder.mlmodelc` (CoreML, cpuAndNeuralEngine)
- `decoder_prefill.mlmodelc` fast path, else 110 × `decoder_step`
- `decoder_step.mlmodelc` with stateful 12-layer KV cache `[2, 1, 512, 12, 64]`
- 1-layer local transformer on CPU: Accelerate (`cblas_sgemm`) + BNNS (GELU)
- Top-k + temperature sampling with `minFrames` and forbidden-token mask `[2016, 2018-2023]`
- `nanocodec_decoder.mlmodelc` — 8×N codes → float PCM → peak-normalize

Assets fetched lazily via `DownloadUtils`; only the languages requested in `downloadAndCreate(languages:)` are materialized.

Public API
CLI
Inline IPA — verified working
The `|…|` passthrough is native NeMo `IpaG2p` behavior (not added by us): segments inside pipes are looked up directly in `token2id.json` as whitespace-separated phonemes, bypassing G2P.

Validated end-to-end with the live HF assets (Python reference): 30 tokens → 43 frames → 2.00 s @ 3.97x RTF.
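A minimal sketch of the `|…|` segmentation idea described above. The function name and the `(isIpa, text)` tuple shape are illustrative assumptions; the PR's actual `MagpieTokenizer` API may differ.

```swift
// Split input text into plain-text and inline-IPA segments on `|` delimiters.
// After splitting, pieces alternate: outside-pipes, inside-pipes, outside, …
// so odd-indexed pieces sat between a pair of pipes and are IPA phonemes
// (looked up directly in token2id.json), while even-indexed pieces go through
// normal G2P. Names and shapes here are illustrative, not the PR's API.
func segmentIpaOverrides(_ input: String) -> [(isIpa: Bool, text: String)] {
    var segments: [(isIpa: Bool, text: String)] = []
    let parts = input.components(separatedBy: "|")
    for (i, part) in parts.enumerated() {
        let trimmed = part.trimmingCharacters(in: .whitespaces)
        guard !trimmed.isEmpty else { continue }
        segments.append((isIpa: i % 2 == 1, text: trimmed))
    }
    return segments
}
```

For example, `segmentIpaOverrides("Hello | ˈ n ɛ m o ʊ | world")` yields a plain "Hello" segment, an IPA "ˈ n ɛ m o ʊ" segment, and a plain "world" segment.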
Guardrails followed
- No `@unchecked Sendable`; `MagpieTtsManager`, `MagpieModelStore`, `MagpieTokenizer`, `MagpieSynthesizer` are all `actor`s.
- `AppLogger(category: "Magpie*")` throughout, no `print()`.
- `MagpieError: Error, LocalizedError` for all error paths.

Test plan
- `swift build` — clean on macOS 14 / Swift 6 (only pre-existing `cblas_sgemm` deprecation warnings from Accelerate).
- `swift test --filter "Magpie|NpyReader"` — 17 / 17 pass:
  - `MagpieConstantsTests` (4) — forbidden-token mask, shape relations, NeMo tokenizer-name parity, per-language file coverage
  - `MagpieIpaOverrideTests` (7) — `|…|` segmentation edge cases
  - `MagpieKvCacheTests` (3) — cache shape, `addInputs` key count, static output keys
  - `NpyReaderTests` (3) — fp32 parse, fp16→fp32 upcast, bad-magic rejection
- `magpie download` → `magpie text --text "Hello world." --speaker 0 --language en` produces audible 22 kHz WAV.
- Parity (`magpie parity --fixture …`) against fixtures emitted by FluidInference/mobius#44; target: MAE < 1e-3 on encoder output, SNR > 40 dB on audio.

Companion PR
Conversion pipeline + parity-fixture emitter + manifest generator: FluidInference/mobius#44.
Out of scope (follow-ups)
- `magpie-benchmark.yml`.