diff --git a/Documentation/Models.md b/Documentation/Models.md index cd48a43d4..8826fc597 100644 --- a/Documentation/Models.md +++ b/Documentation/Models.md @@ -53,6 +53,14 @@ TDT models process audio in chunks (~15s with overlap) as batch operations. | **Kokoro ANE (7-stage)** | Same Kokoro 82M weights split into 7 CoreML stages so the ANE-friendly layers (Albert / PostAlbert / Alignment / Vocoder) stay resident on the Neural Engine while Prosody / Noise / Tail run on CPU+GPU. 3-11× RTFx vs. the single-graph Kokoro. Single voice (`af_heart`), ≤510 IPA phonemes per call, no chunker / SSML / custom lexicon. | ANE-optimized variant derived from [laishere/kokoro-coreml](https://github.com/laishere/kokoro-coreml) | | **PocketTTS** | Second TTS backend (~155M params). Autoregressive frame-by-frame generation with dynamic audio chunking. No phoneme stage, works directly on text tokens. | Supports streaming, minimal RAM usage, excellent quality |
+## Not Production Ready
+
+Models that are functionally complete and shipped but **not yet recommended for production use** because of RTFx or WER limitations that still need work. PRs, issue reports, and perf investigations are welcome.
+
+| Model | Status |
+|-------|--------|
+| **Magpie TTS Multilingual** ([FluidAudio#541](https://github.com/FluidInference/FluidAudio/pull/541), [mobius#44](https://github.com/FluidInference/mobius/pull/44), [HF](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml)) | NVIDIA NeMo Magpie TTS Multilingual 357M, 8 languages (en/es/de/fr/it/vi/zh/hi), 5 built-in speakers. 4-model CoreML pipeline (text_encoder + decoder_prefill + decoder_step + nanocodec_decoder) + pure-Swift Local Transformer (Accelerate + BNNS). Custom IPA override via `\|...\|` segments.
**Quite slow on Apple Silicon — RTFx ≈ 0.04 (~25× slower than realtime), ~30 s of cold-start model load + ANE compile before the first synth, ~96 s warm for an 8-word English sentence on M-series.** Audio is ASR-clean on 4/5 speakers; spk0 has a single trailing-word artifact attributable to fp16 sampler-trajectory drift. Throughput investigation, MLX-backed LocalTransformer, CFG perf, and Japanese support (OpenJTalk + MeCab) are pending. For real-time TTS use Kokoro or PocketTTS instead. |
+ ## Evaluated Models (Not Supported) Models we converted and tested but are not supported: too large for on-device deployment, limitations or superseded by better approaches. @@ -83,4 +91,5 @@ Models we converted and tested but are not supported: too large for on-device de | Kokoro TTS | [FluidInference/kokoro-82m-coreml](https://huggingface.co/FluidInference/kokoro-82m-coreml) | | Kokoro ANE (7-stage) | [FluidInference/kokoro-82m-coreml/tree/main/ANE](https://huggingface.co/FluidInference/kokoro-82m-coreml/tree/main/ANE) | | PocketTTS | [FluidInference/pocket-tts-coreml](https://huggingface.co/FluidInference/pocket-tts-coreml) |
+| Magpie TTS Multilingual | [FluidInference/magpie-tts-multilingual-357m-coreml](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) |
| Nemotron Streaming | [FluidInference/nemotron-speech-streaming-en-0.6b-coreml](https://huggingface.co/FluidInference/nemotron-speech-streaming-en-0.6b-coreml) | diff --git a/Documentation/TTS/Magpie.md b/Documentation/TTS/Magpie.md new file mode 100644 index 000000000..9124b59e0 --- /dev/null +++ b/Documentation/TTS/Magpie.md @@ -0,0 +1,154 @@
+# Magpie TTS Multilingual (Swift Port)
+
+Swift port of NVIDIA NeMo Magpie TTS Multilingual 357M, exported to CoreML.
+Lives under `Sources/FluidAudio/TTS/Magpie/`.
+
+## Status
+
+Functional but **quite slow — needs significant perf work, not for real-time
+or latency-sensitive use.** First synth on a fresh process is dominated by
+CoreML model load + first-call ANE compile (~30 s); warm synths run at
+~96 s wall for an 8-word English sentence on M-series, i.e. RTFx ≈ **0.04**
+(~25× slower than realtime). Whether the throughput ceiling is a model
+characteristic, a CoreML conversion limitation, or both is still under
+investigation; throughput is expected to improve in subsequent iterations. For
+real-time use prefer Kokoro (~20× RTFx) or PocketTTS (~1.5–2× RTFx);
+Magpie's value prop is multilingual coverage and the 5 built-in speaker
+contexts, not throughput.
+
+Audio quality is perceptually clean across all 5 speakers and ASR-clean on
+4/5; speaker 0 has a single trailing-word artifact ("…and") attributable
+to fp16 sampler-trajectory drift, not a structural bug.
+
+Not yet covered: Japanese (deferred — needs OpenJTalk XCFramework + MeCab
+dict), CFG performance optimization, MLX-backed LocalTransformer,
+throughput investigation (the headline gap).
+
+## Architecture
+
+```
+text → MagpieTokenizer (per-language) → text_encoder.mlmodelc
+ ↓
+speaker_N.npy (110×768) → decoder_prefill.mlmodelc (1 batched call) ──┐
+ ↓
+ ┌──── KV cache (12 layers × [2,1,512,12,64] fp16)
+ ↓
+ AR loop (decoder_step.mlmodelc, ≤500 steps):
+ ├─ LocalTransformer (Swift, Accelerate+BNNS)
+ ├─ Sampler (top-k=80, temp=0.6, forbidden mask)
+ ├─ embed sampled (8) codes → next decoder_step input
+ └─ stop on audio_eos_id (2017) or maxSteps
+ ↓
+ nanocodec_decoder.mlmodelc → 22 050 Hz Float32 PCM
+```
+
+## Compute placement (verified end-to-end)
+
+| Model | Compute units | Reasoning |
+| ------------------ | ------------------------ | ------------------------------------------------------------------------------------------------------------ |
+| `text_encoder` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.5× vs CPU.
| `decoder_prefill` | `.cpuAndNeuralEngine` | Runs on ANE; ~3.2× vs CPU. One batched call replaces 110 sequential `decoder_step` calls. |
+| `decoder_step` | **`.cpuAndGPU`** | Pinned. ANE compile fails (`MILCompilerForANE: ANECCompile() FAILED`) due to rank-4 split-K/V scatter; on `.cpuAndNeuralEngine` it falls back to CPU at ~hundreds-of-ms cost per call. GPU (Metal MPS) is fastest. Verified: 96 s warm vs 103 s warm on `.cpuAndNeuralEngine`. |
+| `nanocodec_decoder`| **`.cpuOnly`** | Pinned. ANE compile fails on its conv stack, and GPU contends with `decoder_step` for the Metal queue, so CPU is fastest (see `MagpieModelStore.swift`). |
+
+The pins are implemented in `MagpieModelStore.swift:60` — caller-supplied
+`computeUnits` is honored for all models *except* `decoder_step` (forced to
+`.cpuAndGPU`, or `.cpuOnly` if the caller asked for `.cpuOnly`) and `nanocodec_decoder` (forced to `.cpuOnly`).
+
+## Performance journey
+
+Three optimizations landed during the port; numbers are warm-avg wall time on
+M-series for an 8-word English sentence.
+
+| Stage | Wall (warm) | Speedup |
+| ------------------------------------------------------- | ----------- | ------- |
+| Baseline: 110-step prefill loop, ANE on decoder_step | ~420 s | 1.0× |
+| **Wire `decoder_prefill.mlmodelc` (1 batched call)** | ~110 s | 3.8× |
+| **Pin decoder_step to `.cpuAndGPU`** | ~96 s | 4.4× |
+
+The `decoder_prefill` asset was already on HF (`FluidInference/magpie-tts-multilingual-357m-coreml`)
+and downloaded by `MagpieResourceDownloader`, just unused. `prefillFast`
+(`MagpiePrefill.swift:23`) replaces 110 sequential `decoder_step` calls with
+one `decoder_prefill` call whose 12 stacked-K/V outputs (`var_208`, `var_374`,
+… `var_1958`, each `[2, 1, 512, 12, 64]` fp16) are sliced via two `memcpy`s
+per layer into the KV cache (`MagpieKvCache.seedFromPrefillOutputs`).
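The prefill-to-cache handoff is simple enough to sketch outside Swift. The NumPy snippet below is illustrative only (`seed_cache` is a hypothetical name, not the FluidAudio API): each stacked fp16 output carries K and V along its leading axis, so seeding the cache is two contiguous copies per layer, mirroring the two `memcpy`s described above.

```python
import numpy as np

# Stacked-K/V prefill output shape: [2 (K/V), 1, 512 (cache len), 12 (heads), 64 (head dim)]
SHAPE = (2, 1, 512, 12, 64)

def seed_cache(prefill_outputs):
    """Split each stacked-K/V tensor into separate K and V cache planes,
    one pair of contiguous copies per layer (illustrative sketch)."""
    k_cache, v_cache = [], []
    for layer_out in prefill_outputs:  # 12 layers
        assert layer_out.shape == SHAPE
        k_cache.append(np.ascontiguousarray(layer_out[0]))  # first copy: K half
        v_cache.append(np.ascontiguousarray(layer_out[1]))  # second copy: V half
    return k_cache, v_cache

outs = [np.zeros(SHAPE, dtype=np.float16) for _ in range(12)]
k, v = seed_cache(outs)
assert len(k) == 12 and k[0].shape == SHAPE[1:]
```

Because K and V are adjacent along the leading axis, each half is already contiguous in memory, which is what makes the per-layer copy cheap.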
+ +## Public API + +```swift +let manager = try await MagpieTtsManager.downloadAndCreate( + languages: [.english], + cacheDirectory: nil, + computeUnits: .cpuAndNeuralEngine, // decoder_step pinned to GPU internally + progressHandler: nil +) + +let result = try await manager.synthesize( + text: "Hello world.", + speaker: .john, + language: .english, + options: .default +) +// result.samples : [Float] (22 050 Hz) +// result.codeCount : Int +// result.durationSeconds : Double +``` + +## CLI + +```bash +# Download all assets eagerly +swift run fluidaudiocli magpie download + +# Synth +swift run fluidaudiocli magpie text "Hello world." --speaker 0 --output hello.wav +``` + +Parity, probe, and compute-plan tooling live upstream in `mobius` (Python) — +they exercise the export pipeline and are out of scope for the Swift runtime. + +## Known issues + +1. **spk0 trailing-word drift.** ASR shows a stray word at the end (e.g. + "…seashore, and"). Stage-by-stage parity probe (in `mobius`) localizes it + to fp16 sampler-trajectory non-determinism between Python+CoreML reference + and Swift+CoreML host: prefill SNR degrades L0=64 dB → L11=44 dB through + the 12-layer cache, then compounds in the AR loop. CoreML itself is + consistent between languages; the drift is host-floating-point + RNG/sampler + ordering. Not user-perceptible on speakers 1–4. + +2. **`decoder_step` ANE compile failure is real.** Earlier benchmark with + zeroed `position` scalars showed a 3× ANE speedup; that was misleading — + with real incrementing positions the ANEF compile fails at runtime per + call. Keep the `.cpuAndGPU` pin. 
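The sampler in the AR loop above (top-k = 80, temperature = 0.6, forbidden-token mask) can be sketched in NumPy; `sample_token` below is a hypothetical name, not the Swift `MagpieSampler` API. The sketch also illustrates why the fp16 drift in issue 1 compounds: any perturbation that reorders two logits near the top-k cutoff changes the draw, and every later step inherits that change.

```python
import numpy as np

def sample_token(logits, forbidden, top_k=80, temperature=0.6, rng=None):
    """Top-k + temperature sampling with a forbidden-token mask
    (illustrative sketch of the per-codebook sampling scheme)."""
    if rng is None:
        rng = np.random.default_rng(0)
    masked = logits.astype(np.float64)
    masked[forbidden] = -np.inf                     # forbidden ids can never win
    top = np.argpartition(masked, -top_k)[-top_k:]  # k largest logits (unordered)
    scaled = masked[top] / temperature              # temperature < 1 sharpens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(top[rng.choice(top_k, p=probs)])

logits = np.random.default_rng(42).normal(size=2018).astype(np.float32)
tok = sample_token(logits, forbidden=np.array([2017]))  # e.g. mask a control id
assert 0 <= tok < 2018 and tok != 2017
```

With a fixed RNG the draw is deterministic, which is the property the MT19937-seeded parity runs rely on.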
+ +## File map + +``` +Sources/FluidAudio/TTS/Magpie/ +├── MagpieTtsManager.swift # public actor +├── MagpieConstants.swift # shapes, ids, file names, HF repo id +├── MagpieError.swift +├── MagpieTypes.swift +├── Assets/ +│ ├── MagpieModelStore.swift # actor; loads 4 mlmodelcs, per-model compute units +│ ├── MagpieResourceDownloader.swift # HF download via DownloadUtils +│ ├── MagpieConstantsStore.swift +│ └── MagpieLocalTransformerWeights.swift +├── LocalTransformer/ +│ ├── MagpieLocalTransformer.swift # 1-layer transformer (attention + FFN) via Accelerate (cblas_sgemm) + BNNS (GELU) +│ └── MagpieSampler.swift # top-k + temp + forbidden mask + CFG merge +├── Pipeline/ +│ ├── Preprocess/ # per-language tokenizers + IPA override +│ └── Synthesize/ +│ ├── MagpieSynthesizer.swift # orchestrates encode → prefill → AR → nanocodec +│ ├── MagpieKvCache.swift # 12 layers × (cache, position); seedFromPrefillOutputs +│ ├── MagpiePrefill.swift # prefillFast (batched) + prefill (110-step fallback) +│ └── MagpieNanocodec.swift +└── Shared/ + ├── NpyReader.swift # .npy v1 (fp32/fp16/int) + └── MagpieMT19937.swift # deterministic RNG matching Python reference + +Sources/FluidAudioCLI/Commands/ +└── MagpieCommand.swift # dispatch (download / text) +``` diff --git a/README.md b/README.md index 002c046c5..4e293854e 100644 --- a/README.md +++ b/README.md @@ -37,7 +37,7 @@ Want to convert your own model? Check [möbius](https://github.com/FluidInferenc - **Automatic Speech Recognition (ASR)**: [Parakeet TDT v3](Documentation/Models.md#batch-transcription-near-real-time) (0.6b) and other TDT/CTC models for batch transcription supporting 25 European languages, Japanese, and Chinese; [Parakeet EOU](Documentation/Models.md#streaming-transcription-true-real-time) (120m) for streaming ASR with end-of-utterance detection (English only). See all [ASR models](Documentation/Models.md#asr-models). 
- **Inverse Text Normalization (ITN)**: Post-process ASR output to convert spoken-form to written-form ("two hundred" → "200"). See [text-processing-rs](https://github.com/FluidInference/text-processing-rs)
-- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (EN, DE, ES, FR, IT, PT — 6L and 24L variants)
+- **Text-to-Speech (TTS)**: Kokoro (82m) for parallel synthesis with SSML and pronunciation control across 9 languages (EN, ES, FR, HI, IT, JA, PT, ZH); PocketTTS for streaming TTS with voice cloning support (EN, DE, ES, FR, IT, PT — 6L and 24L variants); **Magpie (357m, experimental)** for autoregressive multilingual TTS with 5 speakers, `|…|` IPA override, and 8-language coverage (EN, ES, DE, FR, IT, VI, ZH, HI) — note: quite slow (~0.04 RTFx on Apple Silicon, ~25× slower than realtime) and needs further perf work; see [Magpie docs](Documentation/TTS/Magpie.md) before adopting
- **Speaker Diarization (Online + Offline)**: Speaker separation and identification across audio streams. Streaming pipeline for real-time processing and offline batch pipeline with advanced clustering. - **Speaker Embedding Extraction**: Generate speaker embeddings for voice comparison and clustering, you can use this for speaker identification - **Voice Activity Detection (VAD)**: Voice activity detection with Silero models @@ -607,6 +607,60 @@ swift run fluidaudiocli tts "Hello from FluidAudio." --auto-download --output ou Dictionary and model assets are cached under `~/.cache/fluidaudio/Models/kokoro`.
+### Magpie (Multilingual) — experimental
+
+> ⚠️ **Quite slow on Apple Silicon — needs significant perf work; not for
+> real-time / latency-sensitive use.** First synth on a fresh process is
+> dominated by CoreML model load + first-call ANE compile (~30 s).
Warm
+> synths run at **~96 s wall for an 8-word English sentence** on M-series
+> (RTFx ≈ **0.04**, i.e. ~25× slower than realtime). Output is
+> perceptually clean across all 5 speakers and ASR-clean on 4 of 5;
+> speaker 0 has a single trailing-word artifact attributable to fp16
+> sampler-trajectory drift (not a structural bug). Whether the throughput
+> ceiling is a model characteristic, a CoreML conversion limitation, or
+> both is still under investigation; throughput is expected to improve in
+> subsequent iterations. **For real-time use, prefer Kokoro (~20× RTFx)
+> or PocketTTS (~1.5–2× RTFx).** Magpie ships for multilingual
+> coverage and the 5 speaker contexts, not throughput.
+
+Magpie TTS Multilingual (357M) is NVIDIA's autoregressive encoder-decoder TTS with 8-codebook NanoCodec vocoder output at 22.05 kHz. It exposes 5 built-in speakers and supports 8 languages (English, Spanish, German, French, Italian, Vietnamese, Mandarin, Hindi) with a `|…|` IPA override that routes inline phoneme sequences directly to the tokenizer. Japanese is deferred pending OpenJTalk integration.
+
+```swift
+import FluidAudio
+
+Task {
+    let manager = try await MagpieTtsManager.downloadAndCreate(
+        languages: [.english, .spanish]
+    )
+    let result = try await manager.synthesize(
+        text: "Hello | ˈ n ɛ m o ʊ | from FluidAudio.",
+        speaker: .john,
+        language: .english
+    )
+    let wav = AudioWAV.data(from: result.samples, sampleRate: result.sampleRate)
+    try wav.write(to: URL(fileURLWithPath: "hello.wav"))
+}
+```
+
+```bash
+# Pre-download assets for selected languages
+swift run fluidaudiocli magpie download --languages en,es
+
+# Synthesize with IPA override enabled (default)
+swift run fluidaudiocli magpie text --text "Hello | ˈ n ɛ m o ʊ |." \
+  --speaker 0 --language en --output hello.wav
+
+# Classifier-free guidance and sampling controls
+swift run fluidaudiocli magpie text --text "Bonjour." \
--language fr \ + --cfg 2.5 --temperature 0.6 --topk 80 --seed 42 --output bonjour.wav +``` + +Parity / probe / compute-plan tooling lives upstream in `mobius` (Python). + +Assets (4 CoreML models + `constants/` + per-language tokenizer files) are fetched from [`FluidInference/magpie-tts-multilingual-357m-coreml`](https://huggingface.co/FluidInference/magpie-tts-multilingual-357m-coreml) on first use. The 1-layer local transformer (256d, top-k + temperature sampling, forbidden-token mask) runs on CPU via Accelerate/BNNS; the 12-layer decoder KV cache is rolled stateful across steps. + +When `--seed N` is supplied, sampling is driven by a NumPy-compatible MT19937 RNG so the Swift output is bit-reproducible against the Python reference seeded with `np.random.seed(N)`. + ## Continuous Integration - `tests.yml`: Default build matrix covering SwiftPM tests and an iOS archive smoke test. diff --git a/Sources/FluidAudio/ModelNames.swift b/Sources/FluidAudio/ModelNames.swift index 437b0422f..f40860abe 100644 --- a/Sources/FluidAudio/ModelNames.swift +++ b/Sources/FluidAudio/ModelNames.swift @@ -31,6 +31,7 @@ public enum Repo: String, CaseIterable, Sendable { case parakeetTdtCtc110m = "FluidInference/parakeet-tdt-ctc-110m-coreml" case cosyvoice3 = "FluidInference/CosyVoice3-0.5B-coreml" case cohereTranscribeCoreml = "FluidInference/cohere-transcribe-03-2026-coreml/q8" + case magpieTts = "FluidInference/magpie-tts-multilingual-357m-coreml" /// Repository slug (without owner) public var name: String { @@ -87,6 +88,8 @@ public enum Repo: String, CaseIterable, Sendable { return "CosyVoice3-0.5B-coreml" case .cohereTranscribeCoreml: return "cohere-transcribe-03-2026-coreml/q8" + case .magpieTts: + return "magpie-tts-multilingual-357m-coreml" } } @@ -185,6 +188,8 @@ public enum Repo: String, CaseIterable, Sendable { return "cosyvoice3" case .cohereTranscribeCoreml: return "cohere-transcribe/q8" + case .magpieTts: + return "magpie-tts" default: return name.replacingOccurrences(of: 
"-coreml", with: "") } } @@ -642,6 +647,35 @@ public enum ModelNames { } }
+    /// Magpie TTS Multilingual 357M model names.
+    ///
+    /// Four CoreML models + a `constants/` directory + a `tokenizer/` directory of
+    /// per-language lookup data. The `decoder_prefill` model is optional; when
+    /// absent the prefill runs step-by-step through `decoder_step`.
+    public enum Magpie {
+        public static let textEncoder = "text_encoder"
+        public static let decoderPrefill = "decoder_prefill"
+        public static let decoderStep = "decoder_step"
+        public static let nanocodecDecoder = "nanocodec_decoder"
+
+        public static let textEncoderFile = textEncoder + ".mlmodelc"
+        public static let decoderPrefillFile = decoderPrefill + ".mlmodelc"
+        public static let decoderStepFile = decoderStep + ".mlmodelc"
+        public static let nanocodecDecoderFile = nanocodecDecoder + ".mlmodelc"
+
+        public static let constantsDir = "constants"
+        public static let tokenizerDir = "tokenizer"
+
+        /// Files required for English synthesis. Other languages append their own
+        /// lookup files on top (see `MagpieResourceDownloader`).
+        public static let requiredModels: Set<String> = [
+            textEncoderFile,
+            decoderStepFile,
+            nanocodecDecoderFile,
+            constantsDir,
+        ]
+    }
+
/// Multilingual G2P (CharsiuG2P ByT5) model names public enum MultilingualG2P { public static let encoder = "MultilingualG2PEncoder" @@ -848,6 +882,8 @@ public enum ModelNames { return ModelNames.CosyVoice3.requiredModels case .cohereTranscribeCoreml: return ModelNames.CohereTranscribe.requiredModels
+        case .magpieTts:
+            return ModelNames.Magpie.requiredModels
} } } diff --git a/Sources/FluidAudio/TTS/Magpie/Assets/MagpieConstantsStore.swift b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieConstantsStore.swift new file mode 100644 index 000000000..1d53c7b65 --- /dev/null +++ b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieConstantsStore.swift @@ -0,0 +1,178 @@
+import Foundation
+
+/// Decoded shape / hyperparameter metadata from `constants/constants.json`.
+/// +/// The field names mirror the Python exporter +/// (`mobius/.../export_constants.py`). Unknown keys are ignored so the exporter +/// can add fields without breaking Swift. All fields have safe defaults matching +/// the published 357M checkpoint so the Swift port remains usable if a key is +/// dropped in a future rebuild. +public struct MagpieModelConfig: Sendable, Decodable { + public let dModel: Int + public let numDecoderLayers: Int + public let numHeads: Int + public let headDim: Int + public let numCodebooks: Int + public let numCodesPerCodebook: Int + public let maxCacheLength: Int + public let maxTextLength: Int + public let audioBosId: Int32 + public let audioEosId: Int32 + public let speakerContextLength: Int + + enum CodingKeys: String, CodingKey { + case dModel = "d_model" + case numDecoderLayers = "num_decoder_layers" + case numHeads = "num_heads" + case headDim = "head_dim" + case numCodebooks = "num_codebooks" + case numCodesPerCodebook = "num_codes_per_codebook" + case maxCacheLength = "max_cache_length" + case maxTextLength = "max_text_length" + case audioBosId = "audio_bos_id" + case audioEosId = "audio_eos_id" + case speakerContextLength = "speaker_context_length" + } + + public init(from decoder: Decoder) throws { + let c = try decoder.container(keyedBy: CodingKeys.self) + dModel = (try? c.decode(Int.self, forKey: .dModel)) ?? MagpieConstants.dModel + numDecoderLayers = + (try? c.decode(Int.self, forKey: .numDecoderLayers)) ?? MagpieConstants.numDecoderLayers + numHeads = (try? c.decode(Int.self, forKey: .numHeads)) ?? MagpieConstants.numHeads + headDim = (try? c.decode(Int.self, forKey: .headDim)) ?? MagpieConstants.headDim + numCodebooks = + (try? c.decode(Int.self, forKey: .numCodebooks)) ?? MagpieConstants.numCodebooks + numCodesPerCodebook = + (try? c.decode(Int.self, forKey: .numCodesPerCodebook)) + ?? MagpieConstants.numCodesPerCodebook + maxCacheLength = + (try? c.decode(Int.self, forKey: .maxCacheLength)) ?? 
MagpieConstants.maxCacheLength + maxTextLength = + (try? c.decode(Int.self, forKey: .maxTextLength)) ?? MagpieConstants.maxTextLength + audioBosId = (try? c.decode(Int32.self, forKey: .audioBosId)) ?? MagpieConstants.audioBosId + audioEosId = (try? c.decode(Int32.self, forKey: .audioEosId)) ?? MagpieConstants.audioEosId + speakerContextLength = + (try? c.decode(Int.self, forKey: .speakerContextLength)) + ?? MagpieConstants.speakerContextLength + } + + public init( + dModel: Int = MagpieConstants.dModel, + numDecoderLayers: Int = MagpieConstants.numDecoderLayers, + numHeads: Int = MagpieConstants.numHeads, + headDim: Int = MagpieConstants.headDim, + numCodebooks: Int = MagpieConstants.numCodebooks, + numCodesPerCodebook: Int = MagpieConstants.numCodesPerCodebook, + maxCacheLength: Int = MagpieConstants.maxCacheLength, + maxTextLength: Int = MagpieConstants.maxTextLength, + audioBosId: Int32 = MagpieConstants.audioBosId, + audioEosId: Int32 = MagpieConstants.audioEosId, + speakerContextLength: Int = MagpieConstants.speakerContextLength + ) { + self.dModel = dModel + self.numDecoderLayers = numDecoderLayers + self.numHeads = numHeads + self.headDim = headDim + self.numCodebooks = numCodebooks + self.numCodesPerCodebook = numCodesPerCodebook + self.maxCacheLength = maxCacheLength + self.maxTextLength = maxTextLength + self.audioBosId = audioBosId + self.audioEosId = audioEosId + self.speakerContextLength = speakerContextLength + } +} + +/// Loaded constants: config, per-speaker embeddings (fp32), per-codebook +/// audio embeddings (fp32). All arrays are stored row-major. +public struct MagpieConstantsBundle: Sendable { + public let config: MagpieModelConfig + /// Shape: [numSpeakers][contextLength × dModel]. Row-major. + public let speakerEmbeddings: [[Float]] + /// Shape: [numCodebooks][numCodesPerCodebook × dModel]. Row-major. + public let audioEmbeddings: [[Float]] + /// Text tokenizer EOS id (from `tokenizer_metadata.json`; 0 if absent). 
+ public let textEosId: Int32 +} + +/// Loads Magpie constants from a directory (typically `/constants/`). +public enum MagpieConstantsLoader { + + private static let logger = AppLogger(category: "MagpieConstantsLoader") + + public static func load(from constantsDir: URL) throws -> MagpieConstantsBundle { + let config = try loadConfig(from: constantsDir) + + var speakerEmbeddings: [[Float]] = [] + speakerEmbeddings.reserveCapacity(MagpieConstants.numSpeakers) + for idx in 0.. Int32 { + let url = dir.appendingPathComponent(MagpieConstants.Files.tokenizerMetadataJson) + guard FileManager.default.fileExists(atPath: url.path), + let data = try? Data(contentsOf: url), + let json = try? JSONSerialization.jsonObject(with: data) as? [String: Any] + else { + return 0 + } + if let eos = json["eos_token_id"] as? Int { + return Int32(eos) + } + if let eos = json["text_eos_id"] as? Int { + return Int32(eos) + } + return 0 + } + + private static func loadConfig(from dir: URL) throws -> MagpieModelConfig { + let url = dir.appendingPathComponent(MagpieConstants.Files.constantsJson) + guard FileManager.default.fileExists(atPath: url.path) else { + logger.warning("constants.json missing; falling back to built-in defaults") + return MagpieModelConfig() + } + do { + let data = try Data(contentsOf: url) + return try JSONDecoder().decode(MagpieModelConfig.self, from: data) + } catch { + throw MagpieError.invalidConstants("constants.json: \(error)") + } + } + +} diff --git a/Sources/FluidAudio/TTS/Magpie/Assets/MagpieLocalTransformerWeights.swift b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieLocalTransformerWeights.swift new file mode 100644 index 000000000..f5cc371a5 --- /dev/null +++ b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieLocalTransformerWeights.swift @@ -0,0 +1,162 @@ +import Foundation + +/// Weights for the Swift-side 1-layer Local Transformer that samples the 8 +/// codebook tokens per frame. 
+/// +/// Shapes match the NumPy reference in `mobius/models/tts/magpie/coreml/generate_coreml.py` +/// (fn `local_transformer_forward`). All arrays are kept row-major fp32 so the +/// Accelerate + BNNS forward pass can consume them directly. +public struct MagpieLocalTransformerWeights: Sendable { + // Input projection: (localDim, dModel) weight + (localDim,) bias. + public let inProjWeight: [Float] + public let inProjBias: [Float] + /// Positional embedding slots: (maxPositions, localDim). + public let posEmbedding: [Float] + /// RMSNorm / LayerNorm weights: (localDim,) each. + public let norm1Weight: [Float] + public let norm2Weight: [Float] + /// Self-attention QKV weight: (3*localDim, localDim). + public let saQkvWeight: [Float] + /// Self-attention output weight: (localDim, localDim). + public let saOWeight: [Float] + /// FFN conv kernel=1: (ffnDim, localDim) then (localDim, ffnDim). + public let ffnConv1Weight: [Float] + public let ffnConv2Weight: [Float] + /// Per-codebook output heads: 8× (numCodesPerCodebook, localDim) + (numCodesPerCodebook,). + public let outProjWeights: [[Float]] + public let outProjBiases: [[Float]] + + // Cached dimensions for convenience. + public let localDim: Int + public let dModel: Int + public let ffnDim: Int + public let maxPositions: Int + public let numCodebooks: Int + public let numCodesPerCodebook: Int +} + +public enum MagpieLocalTransformerLoader { + + private static let logger = AppLogger(category: "MagpieLocalTransformerLoader") + + /// Loads all `local_transformer/*.npy` files from `constantsDir`. 
+ public static func load( + from constantsDir: URL, + config: MagpieModelConfig + ) throws -> MagpieLocalTransformerWeights { + let ltDir = constantsDir.appendingPathComponent(MagpieConstants.Files.localTransformerDir) + guard FileManager.default.fileExists(atPath: ltDir.path) else { + throw MagpieError.modelFileNotFound(MagpieConstants.Files.localTransformerDir) + } + + let localDim = MagpieConstants.localTransformerDim + let ffnDim = MagpieConstants.localTransformerFfnDim + let maxPositions = MagpieConstants.localTransformerMaxPositions + let dModel = config.dModel + let numCodebooks = config.numCodebooks + let numCodesPerCodebook = config.numCodesPerCodebook + + func loadNpy(_ name: String, expecting shape: [Int]) throws -> [Float] { + let url = ltDir.appendingPathComponent(name) + guard FileManager.default.fileExists(atPath: url.path) else { + throw MagpieError.modelFileNotFound("\(MagpieConstants.Files.localTransformerDir)/\(name)") + } + let array = try NpyReader.read(from: url) + try array.assertShape(shape, label: name) + return array.data + } + + let inProjWeight = try loadNpy( + MagpieConstants.Files.LocalTransformer.inProjWeight, + expecting: [localDim, dModel]) + let inProjBias = try loadNpy( + MagpieConstants.Files.LocalTransformer.inProjBias, + expecting: [localDim]) + let posEmbedding = try loadNpy( + MagpieConstants.Files.LocalTransformer.posEmb, + expecting: [maxPositions, localDim]) + let norm1Weight = try loadNpy( + MagpieConstants.Files.LocalTransformer.norm1Weight, + expecting: [localDim]) + let norm2Weight = try loadNpy( + MagpieConstants.Files.LocalTransformer.norm2Weight, + expecting: [localDim]) + let saQkvWeight = try loadNpy( + MagpieConstants.Files.LocalTransformer.saQkvWeight, + expecting: [3 * localDim, localDim]) + let saOWeight = try loadNpy( + MagpieConstants.Files.LocalTransformer.saOWeight, + expecting: [localDim, localDim]) + // Conv1d kernel=1 is effectively (out, in) matmul; the exporter keeps + // the trailing kernel dim so 
we accept either [out, in] or [out, in, 1]. + let ffnConv1Weight = try loadFlexible( + name: MagpieConstants.Files.LocalTransformer.ffnConv1Weight, + directory: ltDir, + primary: [ffnDim, localDim], + alternate: [ffnDim, localDim, 1]) + let ffnConv2Weight = try loadFlexible( + name: MagpieConstants.Files.LocalTransformer.ffnConv2Weight, + directory: ltDir, + primary: [localDim, ffnDim], + alternate: [localDim, ffnDim, 1]) + + var outProjWeights: [[Float]] = [] + var outProjBiases: [[Float]] = [] + outProjWeights.reserveCapacity(numCodebooks) + outProjBiases.reserveCapacity(numCodebooks) + for cb in 0.. [Float] { + let url = directory.appendingPathComponent(name) + guard FileManager.default.fileExists(atPath: url.path) else { + throw MagpieError.modelFileNotFound( + "\(MagpieConstants.Files.localTransformerDir)/\(name)") + } + let array = try NpyReader.read(from: url) + if array.shape == primary || array.shape == alternate { + return array.data + } + throw MagpieError.invalidNpyFile( + path: name, + reason: "expected shape \(primary) or \(alternate), got \(array.shape)") + } +} diff --git a/Sources/FluidAudio/TTS/Magpie/Assets/MagpieModelStore.swift b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieModelStore.swift new file mode 100644 index 000000000..d92e7c719 --- /dev/null +++ b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieModelStore.swift @@ -0,0 +1,216 @@ +@preconcurrency import CoreML +import Foundation + +/// Actor-based store for Magpie CoreML models + constants + LocalTransformer weights. +/// +/// Manages loading of 3 required models (text_encoder, decoder_step, nanocodec_decoder) +/// and 1 optional model (decoder_prefill). Also holds the pre-loaded +/// `MagpieConstantsBundle` and `MagpieLocalTransformerWeights` so the synthesizer +/// can hit all assets from a single entry point. +public actor MagpieModelStore { + + private let logger = AppLogger(category: "MagpieModelStore") + + private var textEncoderModel: MLModel? 
+ private var decoderPrefillModel: MLModel? // optional fast path + private var decoderStepModel: MLModel? + private var nanocodecDecoderModel: MLModel? + + private var constantsBundle: MagpieConstantsBundle? + private var localTransformerWeights: MagpieLocalTransformerWeights? + + private var repoDirectory: URL? + + private let directory: URL? + private let computeUnits: MLComputeUnits + private let preferredLanguages: Set + + /// - Parameters: + /// - directory: Optional override for the base cache directory. + /// - computeUnits: CoreML compute preference for all models. + /// - preferredLanguages: Set of languages whose tokenizer data should be fetched. + public init( + directory: URL? = nil, + computeUnits: MLComputeUnits = .cpuAndNeuralEngine, + preferredLanguages: Set = [.english] + ) { + self.directory = directory + self.computeUnits = computeUnits + self.preferredLanguages = preferredLanguages + } + + /// Download (if missing) and load all Magpie CoreML models + constants. + public func loadIfNeeded() async throws { + if textEncoderModel != nil { + return + } + + let repoDir = try await MagpieResourceDownloader.ensureAssets( + languages: preferredLanguages, + directory: directory, + includePrefill: true + ) + self.repoDirectory = repoDir + + logger.info("Loading Magpie CoreML models from \(repoDir.path)…") + + let config = MLModelConfiguration() + config.computeUnits = computeUnits + + // `decoder_step.mlmodelc` reliably fails ANE compilation + // (`MILCompilerForANE error: ANECCompile() FAILED`) due to its rank-4 + // split-K/V scatter layout, then falls back to CPU at the cost of one + // failed ANE compile attempt per call (~hundreds of ms each). Pin it + // to `.cpuAndGPU` so CoreML skips the ANE attempt entirely and runs + // on Metal MPS — verified end-to-end as the fastest path + // (96s warm vs 103s warm on `.cpuAndNeuralEngine`). + let gpuConfig = MLModelConfiguration() + gpuConfig.computeUnits = + computeUnits == .cpuOnly ? 
.cpuOnly : .cpuAndGPU + + // `nanocodec_decoder.mlmodelc` is fastest on **CPU only**. The model's + // upsample stack (5 transposed convs + 96 sin/pow per-frame embedding + // ops + 86 LeakyReLU) doesn't map well onto Metal MPS, and ANE compile + // fails on its conv stack. Empirically (M-series, single fwd of 256 + // frames): + // .cpuOnly ~2.87 s + // .cpuAndGPU ~3.86 s + // .cpuAndNeuralEngine ~10.12 s (ANE compile fail → CPU fallback dance) + // .all ~2.95 s + // Putting it on `.cpuAndGPU` also makes `decoder_step` ~40 ms/step + // because both contend for the same Metal queue. Pinning nanocodec to + // CPU keeps Metal exclusive for decoder_step (25 ms/step) and saves a + // full second on the nanocodec call → ~1.03x RTFx vs ~0.91x before. + let cpuConfig = MLModelConfiguration() + cpuConfig.computeUnits = .cpuOnly + + let loadStart = Date() + + textEncoderModel = try loadModel( + repoDir: repoDir, + fileName: ModelNames.Magpie.textEncoderFile, + config: config, + required: true) + + decoderStepModel = try loadModel( + repoDir: repoDir, + fileName: ModelNames.Magpie.decoderStepFile, + config: gpuConfig, + required: true) + + nanocodecDecoderModel = try loadModel( + repoDir: repoDir, + fileName: ModelNames.Magpie.nanocodecDecoderFile, + config: cpuConfig, + required: true) + + decoderPrefillModel = try loadModel( + repoDir: repoDir, + fileName: ModelNames.Magpie.decoderPrefillFile, + config: config, + required: false) + + let elapsed = Date().timeIntervalSince(loadStart) + logger.info( + "Magpie models loaded in \(String(format: "%.2f", elapsed))s (prefill \(decoderPrefillModel == nil ? "absent" : "present"))" + ) + + // Load constants + local transformer weights. 
+ let constantsDir = MagpieResourceDownloader.constantsDirectory(in: repoDir) + let bundle = try MagpieConstantsLoader.load(from: constantsDir) + constantsBundle = bundle + localTransformerWeights = try MagpieLocalTransformerLoader.load( + from: constantsDir, config: bundle.config) + } + + public func textEncoder() throws -> MLModel { + guard let model = textEncoderModel else { + throw MagpieError.notInitialized + } + return model + } + + public func decoderStep() throws -> MLModel { + guard let model = decoderStepModel else { + throw MagpieError.notInitialized + } + return model + } + + public func nanocodecDecoder() throws -> MLModel { + guard let model = nanocodecDecoderModel else { + throw MagpieError.notInitialized + } + return model + } + + public func decoderPrefill() throws -> MLModel { + guard let model = decoderPrefillModel else { + throw MagpieError.notInitialized + } + return model + } + + public func hasDecoderPrefill() -> Bool { + decoderPrefillModel != nil + } + + public func constants() throws -> MagpieConstantsBundle { + guard let bundle = constantsBundle else { + throw MagpieError.notInitialized + } + return bundle + } + + public func localTransformer() throws -> MagpieLocalTransformerWeights { + guard let weights = localTransformerWeights else { + throw MagpieError.notInitialized + } + return weights + } + + public func repoDir() throws -> URL { + guard let dir = repoDirectory else { + throw MagpieError.notInitialized + } + return dir + } + + /// Release all loaded models + constants. Resource downloads on disk are kept. + public func unload() { + textEncoderModel = nil + decoderPrefillModel = nil + decoderStepModel = nil + nanocodecDecoderModel = nil + constantsBundle = nil + localTransformerWeights = nil + } + + // MARK: - Helpers + + private func loadModel( + repoDir: URL, fileName: String, config: MLModelConfiguration, required: Bool + ) throws -> MLModel? 
{
+        let modelURL = repoDir.appendingPathComponent(fileName)
+        guard FileManager.default.fileExists(atPath: modelURL.path) else {
+            if required {
+                throw MagpieError.modelFileNotFound(fileName)
+            } else {
+                logger.notice("Optional model \(fileName) not present; skipping")
+                return nil
+            }
+        }
+        do {
+            let model = try MLModel(contentsOf: modelURL, configuration: config)
+            logger.info("Loaded \(fileName)")
+            return model
+        } catch {
+            if required {
+                throw MagpieError.corruptedModel(fileName, underlying: "\(error)")
+            } else {
+                logger.warning("Failed to load optional \(fileName): \(error)")
+                return nil
+            }
+        }
+    }
+}
diff --git a/Sources/FluidAudio/TTS/Magpie/Assets/MagpieResourceDownloader.swift b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieResourceDownloader.swift
new file mode 100644
index 000000000..c331353b1
--- /dev/null
+++ b/Sources/FluidAudio/TTS/Magpie/Assets/MagpieResourceDownloader.swift
@@ -0,0 +1,188 @@
+import Foundation
+
+/// Downloads Magpie TTS models, constants, and per-language tokenizer data from HuggingFace.
+///
+/// The HF repo (`FluidInference/magpie-tts-multilingual-357m-coreml`) ships:
+/// - 3 required CoreML models + 1 optional prefill model at the repo root
+/// - `constants/` with model config, speaker embeddings, audio codebook tables, and
+///   the local-transformer weights (downloaded as one subtree)
+/// - `tokenizer/` with per-language lookup data (lazy per language)
+public enum MagpieResourceDownloader {
+
+    private static let logger = AppLogger(category: "MagpieResourceDownloader")
+
+    /// Ensure the CoreML models + `constants/` directory are present locally, and
+    /// ensure tokenizer data for each requested language is present. Returns the
+    /// resolved repo directory (i.e. the root containing the `.mlmodelc` files).
+    public static func ensureAssets(
+        languages: Set<MagpieLanguage> = [.english],
+        directory: URL? = nil,
+        includePrefill: Bool = true,
+        progressHandler: DownloadUtils.ProgressHandler?
= nil
+    ) async throws -> URL {
+        let modelsRoot = try directory ?? defaultCacheRoot()
+        let repoDir = modelsRoot.appendingPathComponent(Repo.magpieTts.folderName)
+
+        let rootModelsPresent = ModelNames.Magpie.requiredModels.allSatisfy { entry in
+            FileManager.default.fileExists(atPath: repoDir.appendingPathComponent(entry).path)
+        }
+
+        if !rootModelsPresent {
+            logger.info("Downloading Magpie TTS models from HuggingFace…")
+            try await DownloadUtils.downloadRepo(
+                .magpieTts, to: modelsRoot, progressHandler: progressHandler)
+        } else {
+            logger.info("Magpie TTS models found in cache")
+        }
+
+        if includePrefill {
+            let prefillURL = repoDir.appendingPathComponent(ModelNames.Magpie.decoderPrefillFile)
+            if !FileManager.default.fileExists(atPath: prefillURL.path) {
+                logger.info("Fetching optional decoder_prefill model")
+                do {
+                    try await DownloadUtils.downloadSubdirectory(
+                        .magpieTts,
+                        subdirectory: ModelNames.Magpie.decoderPrefillFile,
+                        to: repoDir
+                    )
+                } catch {
+                    logger.warning(
+                        "decoder_prefill unavailable; falling back to step-by-step prefill: \(error)"
+                    )
+                }
+            }
+        }
+
+        for language in languages {
+            try await ensureTokenizer(for: language, repoDirectory: repoDir)
+        }
+
+        return repoDir
+    }
+
+    /// Ensure tokenizer data for `language` exists locally, downloading any files
+    /// that are missing. Returns immediately for languages whose tokenizer file
+    /// list (per `MagpieTokenizerFiles.files(for:)`) is empty.
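+    ///
+    /// Typical usage, sketched from the signatures in this file (`repoDir` is the
+    /// directory returned by `ensureAssets`; language choices are illustrative):
+    ///
+    /// ```swift
+    /// let repoDir = try await MagpieResourceDownloader.ensureAssets(languages: [.english])
+    /// // Later, lazily fetch German tokenizer data into the same repo directory:
+    /// try await MagpieResourceDownloader.ensureTokenizer(for: .german, repoDirectory: repoDir)
+    /// ```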
+ public static func ensureTokenizer( + for language: MagpieLanguage, repoDirectory: URL + ) async throws { + let files = MagpieTokenizerFiles.files(for: language) + if files.isEmpty { return } + + let tokenizerDir = repoDirectory.appendingPathComponent(ModelNames.Magpie.tokenizerDir) + if !FileManager.default.fileExists(atPath: tokenizerDir.path) { + try FileManager.default.createDirectory( + at: tokenizerDir, withIntermediateDirectories: true) + } + + for file in files { + let localURL = tokenizerDir.appendingPathComponent(file) + if FileManager.default.fileExists(atPath: localURL.path) { continue } + + let remotePath = "\(ModelNames.Magpie.tokenizerDir)/\(file)" + logger.info("Downloading Magpie tokenizer file: \(remotePath)") + let remoteURL: URL + do { + remoteURL = try ModelRegistry.resolveModel(Repo.magpieTts.remotePath, remotePath) + } catch { + throw MagpieError.downloadFailed( + "failed to resolve HF URL for \(remotePath): \(error)") + } + + do { + let data = try await AssetDownloader.fetchData( + from: remoteURL, + description: "magpie tokenizer \(file)", + logger: logger + ) + try data.write(to: localURL, options: [.atomic]) + } catch { + throw MagpieError.tokenizerDataMissing( + language: language.rawValue, file: file) + } + } + } + + /// Return the directory that holds constants (JSON + npy + local_transformer/). + public static func constantsDirectory(in repoDirectory: URL) -> URL { + repoDirectory.appendingPathComponent(ModelNames.Magpie.constantsDir) + } + + /// Return the directory that holds per-language tokenizer lookups. 
+ public static func tokenizerDirectory(in repoDirectory: URL) -> URL { + repoDirectory.appendingPathComponent(ModelNames.Magpie.tokenizerDir) + } + + private static func defaultCacheRoot() throws -> URL { + let base: URL + #if os(macOS) + base = FileManager.default.homeDirectoryForCurrentUser + .appendingPathComponent(".cache") + #else + guard + let first = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask).first + else { + throw MagpieError.downloadFailed("failed to locate caches directory") + } + base = first + #endif + let root = base.appendingPathComponent("fluidaudio").appendingPathComponent("Models") + if !FileManager.default.fileExists(atPath: root.path) { + try FileManager.default.createDirectory(at: root, withIntermediateDirectories: true) + } + return root + } +} + +/// Authoritative list of per-language tokenizer files. The emitters in +/// `mobius/models/tts/magpie/export_tokenizers.py` produce these names; the Swift +/// tokenizers consume them. +public enum MagpieTokenizerFiles { + /// Tokenizer filenames emitted by + /// `mobius/models/tts/magpie/coreml/export_tokenizers.py`. The naming convention + /// is `{tokenizer_name}_{suffix}.json` where `tokenizer_name` follows the NeMo + /// AggregatedTTSTokenizer names (e.g. `english_phoneme`, `french_chartokenizer`). + public static func files(for language: MagpieLanguage) -> [String] { + let base = tokenizerName(for: language) + switch language { + case .english, .spanish, .italian, .vietnamese: + // IPA G2P: token2id + phoneme_dict. + return ["\(base)_token2id.json", "\(base)_phoneme_dict.json"] + case .german: + // IPA G2P with heteronym fallback. + return [ + "\(base)_token2id.json", + "\(base)_phoneme_dict.json", + "\(base)_heteronyms.json", + ] + case .french, .hindi: + // Char-based tokenizers: only token2id lookup. + return ["\(base)_token2id.json"] + case .mandarin: + // pypinyin (phrase + char) + tone / letter / token2id maps. 
+ return [ + "\(base)_token2id.json", + "\(base)_pinyin_dict.json", + "\(base)_tone_dict.json", + "\(base)_ascii_letter_dict.json", + "mandarin_pypinyin_char_dict.json", + "mandarin_pypinyin_phrase_dict.json", + "mandarin_jieba_dict.json", + ] + } + } + + /// NeMo tokenizer name for the given language (matches the Python map in + /// `generate_coreml._tokenize_text`). + public static func tokenizerName(for language: MagpieLanguage) -> String { + switch language { + case .english: return "english_phoneme" + case .spanish: return "spanish_phoneme" + case .german: return "german_phoneme" + case .italian: return "italian_phoneme" + case .vietnamese: return "vietnamese_phoneme" + case .mandarin: return "mandarin_phoneme" + case .french: return "french_chartokenizer" + case .hindi: return "hindi_chartokenizer" + } + } +} diff --git a/Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformer.swift b/Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformer.swift new file mode 100644 index 000000000..a9c9ae6cd --- /dev/null +++ b/Sources/FluidAudio/TTS/Magpie/LocalTransformer/MagpieLocalTransformer.swift @@ -0,0 +1,349 @@ +import Accelerate +import Foundation + +/// Swift-side 1-layer Local Transformer forward pass. +/// +/// Mirrors `local_transformer_forward` in +/// `mobius/models/tts/magpie/coreml/generate_coreml.py` (lines 108–155): +/// pre-norm causal self-attention → pre-norm FFN with tanh-GELU. Single attention +/// head, localDim=256. Uses BLAS (`cblas_sgemm`) for every matmul so the AR loop +/// stays cache-resident. +/// +/// The transformer is stateless across frames — each call to +/// `MagpieLocalTransformerSampler.sample(...)` rebuilds the sequence from the +/// current decoder hidden state and the 8 tokens sampled so far. 
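+///
+/// Usage sketch: `weights` comes from `MagpieLocalTransformerLoader.load`, and
+/// `seq` / `T` are hypothetical stand-ins for a row-major `[T * localDim]` fp32
+/// buffer (positional embeddings not yet added — `forward` adds them):
+///
+/// ```swift
+/// let lt = MagpieLocalTransformer(weights: weights)
+/// let out = lt.forward(sequence: seq, length: T)  // [T * localDim], row-major
+/// let lastHidden = Swift.Array(out.suffix(weights.localDim))
+/// let logits = lt.codebookLogits(lastHidden: lastHidden, codebook: 0)
+/// ```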
+public struct MagpieLocalTransformer: Sendable {
+
+    public let weights: MagpieLocalTransformerWeights
+
+    public init(weights: MagpieLocalTransformerWeights) {
+        self.weights = weights
+    }
+
+    /// Forward pass for a sequence of length `T` (T ≤ numCodebooks+2).
+    ///
+    /// - Parameter sequence: `[T * localDim]` row-major fp32 input sequence,
+    ///   without positional embeddings — this routine adds them.
+    ///   Caller must supply `T` explicitly to avoid ambiguity on partial buffers.
+    /// - Returns: `[T * localDim]` row-major output.
+    public func forward(sequence: [Float], length T: Int) -> [Float] {
+        precondition(sequence.count >= T * weights.localDim, "sequence buffer too small")
+        precondition(T <= weights.maxPositions, "sequence length exceeds maxPositions")
+
+        let D = weights.localDim
+        let ffnD = weights.ffnDim
+
+        // x = sequence[:T*D] + posEmbedding[:T*D]
+        var x = Swift.Array(sequence.prefix(T * D))
+        addPositional(into: &x, length: T)
+
+        // ── Pre-norm causal self-attention ──
+        var xNorm = layerNorm(x, length: T, weight: weights.norm1Weight)
+
+        // QKV = xNorm @ sa_qkv_weight.T  (T,D) × (3D,D)ᵀ → (T, 3D)
+        var qkv = Swift.Array<Float>(repeating: 0, count: T * 3 * D)
+        matmulTransB(
+            a: xNorm, aRows: T, aCols: D,
+            b: weights.saQkvWeight, bRows: 3 * D, bCols: D,
+            out: &qkv)
+
+        // Split QKV into Q, K, V (each T × D). Direct memcpy from packed (T, 3D)
+        // buffer; no intermediate Swift sub-array allocations per row.
+        var q = Swift.Array<Float>(repeating: 0, count: T * D)
+        var k = Swift.Array<Float>(repeating: 0, count: T * D)
+        var v = Swift.Array<Float>(repeating: 0, count: T * D)
+        let bytesPerRow = D * MemoryLayout<Float>.size
+        qkv.withUnsafeBufferPointer { srcPtr in
+            q.withUnsafeMutableBufferPointer { qPtr in
+                k.withUnsafeMutableBufferPointer { kPtr in
+                    v.withUnsafeMutableBufferPointer { vPtr in
+                        guard let src = srcPtr.baseAddress,
+                            let qb = qPtr.baseAddress,
+                            let kb = kPtr.baseAddress,
+                            let vb = vPtr.baseAddress
+                        else { return }
+                        for t in 0..<T {
+                            memcpy(qb + t * D, src + t * 3 * D, bytesPerRow)
+                            memcpy(kb + t * D, src + t * 3 * D + D, bytesPerRow)
+                            memcpy(vb + t * D, src + t * 3 * D + 2 * D, bytesPerRow)
+                        }
+                    }
+                }
+            }
+        }
+
+        // attn = Q @ Kᵀ  (T, D) × (T, D)ᵀ → (T, T)
+        var attn = Swift.Array<Float>(repeating: 0, count: T * T)
+        matmulTransB(
+            a: q, aRows: T, aCols: D,
+            b: k, bRows: T, bCols: D,
+            out: &attn)
+        let scale = Float(1.0 / sqrt(Double(D)))
+        var scaleVar = scale
+        vDSP_vsmul(attn, 1, &scaleVar, &attn, 1, vDSP_Length(T * T))
+
+        // Causal mask + softmax
+        for t in 0..<T {
+            // Zero out j > t (future). Then softmax over [0, t].
+            for j in (t + 1)..<T {
+                attn[t * T + j] = 0
+            }
+            var maxVal: Float = -.infinity
+            for j in 0...t {
+                if attn[t * T + j] > maxVal { maxVal = attn[t * T + j] }
+            }
+            var denom: Float = 0
+            for j in 0...t {
+                attn[t * T + j] = exp(attn[t * T + j] - maxVal)
+                denom += attn[t * T + j]
+            }
+            if denom > 0 {
+                let invDenom = 1.0 / denom
+                for j in 0...t {
+                    attn[t * T + j] *= invDenom
+                }
+            }
+        }
+
+        // saOut = attn @ V  (T × T) × (T × D) → (T × D)
+        var saOut = Swift.Array<Float>(repeating: 0, count: T * D)
+        matmul(
+            a: attn, aRows: T, aCols: T,
+            b: v, bRows: T, bCols: D,
+            out: &saOut)
+
+        // saOut = saOut @ sa_o_weight.T  (T, D) × (D, D)ᵀ → (T, D)
+        var saProj = Swift.Array<Float>(repeating: 0, count: T * D)
+        matmulTransB(
+            a: saOut, aRows: T, aCols: D,
+            b: weights.saOWeight, bRows: D, bCols: D,
+            out: &saProj)
+
+        // x += saProj
+        vDSP_vadd(x, 1, saProj, 1, &x, 1, vDSP_Length(T * D))
+
+        // ── Pre-norm FFN ──
+        xNorm = layerNorm(x, length: T, weight: weights.norm2Weight)
+
+        // h = gelu(xNorm @ ffn_conv1_weight.T) → (T, ffnD)
+        var h = Swift.Array<Float>(repeating: 0, count: T * ffnD)
+        matmulTransB(
+            a: xNorm, aRows: T, aCols: D,
+            b: weights.ffnConv1Weight, bRows: ffnD, bCols: D,
+            out: &h)
+        applyGeluTanh(into: &h)
+
+        // x += h @ ffn_conv2_weight.T → (T, D)
+        var ffnOut = Swift.Array<Float>(repeating: 0, count: T * D)
+        matmulTransB(
+            a: h, aRows: T, aCols: ffnD,
+            b: weights.ffnConv2Weight, bRows: D, bCols: ffnD,
+            out: &ffnOut)
+        vDSP_vadd(x, 1, ffnOut, 1, &x, 1, vDSP_Length(T * D))
+
+        return x
+    }
+
+    /// Project a (dModel,) decoder hidden state through the input projection
+    /// → (localDim,). Used by the sampler to seed the LT sequence.
+    public func projectInput(hidden: [Float]) -> [Float] {
+        precondition(hidden.count == weights.dModel)
+        var out = weights.inProjBias  // copy bias
+        // out += inProjWeight @ hidden  (localDim, dModel) × (dModel,) → (localDim,)
+        inProjWeightApply(hidden: hidden, accumulate: &out)
+        return out
+    }
+
+    /// Compute logits for codebook `codebook`: last-timestep out_proj head.
+    public func codebookLogits(lastHidden: [Float], codebook: Int) -> [Float] {
+        precondition(lastHidden.count == weights.localDim)
+        let numCodes = weights.numCodesPerCodebook
+        var logits = weights.outProjBiases[codebook]  // copy bias (numCodes,)
+        // logits += outProjWeights[codebook] @ lastHidden  (numCodes, localDim) × (localDim,)
+        let w = weights.outProjWeights[codebook]
+        w.withUnsafeBufferPointer { wPtr in
+            lastHidden.withUnsafeBufferPointer { hPtr in
+                logits.withUnsafeMutableBufferPointer { outPtr in
+                    cblas_sgemv(
+                        CblasRowMajor, CblasNoTrans,
+                        Int32(numCodes), Int32(weights.localDim),
+                        1.0,
+                        wPtr.baseAddress, Int32(weights.localDim),
+                        hPtr.baseAddress, 1,
+                        1.0,
+                        outPtr.baseAddress, 1)
+                }
+            }
+        }
+        return logits
+    }
+
+    // MARK: - Private helpers
+
+    private func addPositional(into buffer: inout [Float], length T: Int) {
+        let D = weights.localDim
+        let count = T * D
+        var tmp = buffer
+        weights.posEmbedding.withUnsafeBufferPointer { posPtr in
+            tmp.withUnsafeMutableBufferPointer { dstPtr in
+                // Only use first T rows of posEmbedding.
+                vDSP_vadd(
+                    dstPtr.baseAddress!, 1,
+                    posPtr.baseAddress!, 1,
+                    dstPtr.baseAddress!, 1,
+                    vDSP_Length(count))
+            }
+        }
+        buffer = tmp
+    }
+
+    private func layerNorm(_ x: [Float], length T: Int, weight: [Float]) -> [Float] {
+        let D = weights.localDim
+        var out = Swift.Array<Float>(repeating: 0, count: T * D)
+        let eps: Float = 1e-5
+        x.withUnsafeBufferPointer { xPtr in
+            weight.withUnsafeBufferPointer { wPtr in
+                out.withUnsafeMutableBufferPointer { outPtr in
+                    guard let xBase = xPtr.baseAddress,
+                        let wBase = wPtr.baseAddress,
+                        let outBase = outPtr.baseAddress
+                    else { return }
+                    for t in 0..<T {