feat(tts/pocket): multi-language support (EN + 9 new packs) #540
Alex-Wengg wants to merge 6 commits into main
Adds first-class support for the PocketTTS language packs that upstream kyutai/pocket-tts just published (issue #49):
- english (existing, 6-layer, unchanged HF paths)
- french_24l (24-layer only; upstream ships no 6-layer French)
- german, german_24l
- italian, italian_24l
- portuguese, portuguese_24l
- spanish, spanish_24l
English continues to live at the legacy HF repo root to keep existing caches valid; new packs live under `v2/<lang>/`. Only the requested subtree is downloaded, so English-only users pay zero extra bytes and non-English users skip the English pack entirely.
Library changes:
- New `PocketTtsLanguage` enum with 10 cases + `repoSubdirectory` and `transformerLayers` (6 vs 24)
- `ModelNames.PocketTTS.requiredModels(for:)` and `mimiDecoderFile(for:)` dispatch between legacy `mimi_decoder_v2` and new `mimi_decoder` filenames
- `PocketTtsLayerKeys` discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path
- `PocketTtsResourceDownloader.ensureModels(language:)` fetches only the matching `v2/<lang>/` subtree; `ensureMimiEncoder()` no longer pulls the English pack just to enable voice cloning
- `PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` / `PocketTtsSynthesizer` thread language through load, constants, KV cache sizing, and per-(language,voice) caching
- Voice cloning works across languages: the Mimi encoder is shared, so cloned `PocketTtsVoiceData` can be fed to any language's manager
CLI:
- `fluidaudiocli tts --backend pocket --language <id>` with the supported list printed on invalid input (default: english)
Tests:
- New `PocketTtsLanguageTests` (10 pure-logic cases) covers `repoSubdirectory`, layer counts, `requiredModels(for:)`, `mimiDecoderFile(for:)`, and the English back-compat alias
- Existing `PocketTtsSessionTests` updated for the new `emptyKVCacheState(layers:)` API (no behavior change)
Docs:
- `Documentation/TTS/PocketTTS.md` gains a Languages section (table of IDs + HF paths, Swift + CLI examples) and a Cloning-Across-Languages example
- README mentions EN/DE/ES/FR/IT/PT under PocketTTS
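The enum described in this commit message could look roughly like the sketch below. It is an illustration only, not the code from the PR: the Swift case names (`french24L` and so on) and the suffix-based layer detection are assumptions, while the raw values are the pack IDs listed above. It reflects the layout at this commit, where English maps to the legacy repo root.

```swift
/// Hypothetical sketch of the language enum; raw values are the pack IDs from the commit message.
enum PocketTtsLanguage: String, CaseIterable {
    case english
    case french24L = "french_24l"
    case german
    case german24L = "german_24l"
    case italian
    case italian24L = "italian_24l"
    case portuguese
    case portuguese24L = "portuguese_24l"
    case spanish
    case spanish24L = "spanish_24l"

    /// `nil` stands for the legacy repo root (English); every other pack lives under `v2/<lang>`.
    var repoSubdirectory: String? {
        self == .english ? nil : "v2/\(rawValue)"
    }

    /// Assumption: the 24-layer packs are exactly those whose ID ends in `_24l`.
    var transformerLayers: Int {
        rawValue.hasSuffix("_24l") ? 24 : 6
    }
}
```

Deriving both properties from the raw value keeps the ten cases from needing a hand-written switch, at the cost of tying behavior to the upstream naming scheme.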
PocketTTS Smoke Test ✅
Runtime: 0m34s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality and performance may differ from Apple Silicon.
Kokoro TTS Smoke Test ✅
Runtime: 0m39s
Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE; performance may differ from Apple Silicon.
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m52s • 04/25/2026, 07:12 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 35.1s diarization time • Test runtime: 1m 46s • 04/25/2026, 07:22 PM EST
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 5m23s
Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 1s • 2026-04-25T23:23:59.424Z
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 5m16s • 04/25/2026, 07:17 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 120.8s processing • Test runtime: 2m 0s • 04/25/2026, 07:19 PM EST
Two related improvements to the multi-language PocketTTS pipeline.
1. v2 voice safetensors loading (drops the English-stopgap fallback)
Upstream v2 language packs ship per-voice prebakes as
`<voice>.safetensors` containing pre-computed LM transformer KV cache
snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset).
Previously the runtime fell back to the English flat
`<voice>_audio_prompt.bin` `[1, 125, 1024]`, producing English voice
acoustic identity on non-English models — short, distorted output.
- `PocketTtsConstantsLoader.loadVoice` picks `.safetensors` first,
falls back to `.bin` for legacy English. New `loadVoiceSnapshot`
parses the safetensors header (8-byte LE u64 + JSON) and extracts
per-layer `transformer.layers.{N}.self_attn/{cache,offset}` tensors.
Layer count auto-detected from key indices (handles 6L and 24L).
- `PocketTtsVoiceData` gains optional `cacheSnapshot`. New
`PocketTtsVoiceCacheSnapshot` carries flat per-layer K/V + offset.
- `PocketTtsSynthesizer.kvCacheStateFromSnapshot` allocates the
`[2, 1, kvCacheMaxLen, 16, 64]` MLMultiArrays and copies the K
block (outer dim 0) and V block (outer dim 1) into the first
`seqLen` positions independently — they don't lie at adjacent
offsets in the destination because dest seq capacity is larger.
- `prefillKVCache` branches on snapshot presence: snapshot path
skips `cond_step` voice prefill entirely; flat path unchanged.
Text prefill runs identically in both cases.
- `PocketTtsResourceDownloader.ensureVoice` requests
`<voice>.safetensors` for non-English language packs and
`<voice>_audio_prompt.bin` for English; existing on-disk file in
either format short-circuits the download.
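The point about K and V not landing at adjacent offsets can be illustrated with flat arrays. The sketch below is not the PR's `kvCacheStateFromSnapshot` (which works on `MLMultiArray`); `expandSnapshot` and its parameters are hypothetical names, and plain `[Float]` buffers stand in for the CoreML state so the index arithmetic is visible.

```swift
/// Copies a flat `[2, 1, seqLen, heads, dim]` snapshot into a zeroed
/// `[2, 1, maxLen, heads, dim]` buffer. Hypothetical helper for illustration.
func expandSnapshot(
    _ snapshot: [Float], seqLen: Int, maxLen: Int, heads: Int = 16, dim: Int = 64
) -> [Float] {
    precondition(seqLen <= maxLen, "snapshot is longer than the cache capacity")
    let rowSize = heads * dim
    let srcBlock = seqLen * rowSize  // elements in one K (or V) block of the snapshot
    let dstBlock = maxLen * rowSize  // elements in one K (or V) block of the destination
    precondition(snapshot.count == 2 * srcBlock, "unexpected snapshot size")

    var cache = [Float](repeating: 0, count: 2 * dstBlock)
    for kv in 0..<2 {
        // K and V are back to back in the source, but in the destination the
        // V block starts at `dstBlock`, so each block needs its own copy.
        let src = kv * srcBlock
        let dst = kv * dstBlock
        cache.replaceSubrange(dst..<(dst + srcBlock), with: snapshot[src..<(src + srcBlock)])
    }
    return cache
}
```

A single contiguous copy of the whole snapshot would put the start of V into the unused tail of the K block whenever `maxLen > seqLen`, which is why the two blocks are handled separately.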
Verified end-to-end:
- english (legacy .bin) — 3.60s rms 5913 (no regression)
- spanish alba (v2 safetensors) — 1.60s rms 6191
- spanish charles (v2) — 3.92s rms 4377
- german alba (v2) — 2.88s rms 5715
- italian alba (v2) — 4.08s rms 5545
- portuguese alba (v2) — 3.28s rms 6822
24L variants share the same code path (layer-count agnostic) but
were not exercised here due to local disk constraints.
2. Dynamic Mimi decoder schema discovery
The Mimi audio codec ships in two upstream variants:
- Legacy English: `attn{0,1}_cache` `[2, 1, 8, 256, 64]` heads-first,
includes `attn{0,1}_end_offset` inputs and `new_end_offset*` outputs.
- v2 packs: `attn{0,1}_cache` `[2, 1, 256, 8, 64]` seq-first, no
end_offset I/O.
CoreML auto-generates `var_NNN` output names per conversion so they
differ between packs. The previous static `mimiStateMapping` only
matched the legacy English pack and crashed on v2 packs.
- New `PocketTtsMimiKeys` discovers the audio output (the only
`[1, 1, 1920]` tensor) and pairs each state input to its update
output via three rules: pass-through (input name == output name),
`*end_offset*` reservation (so legacy English's 4-output `[1]`
bucket pairs offsets vs end_offsets correctly), and shape-bucket
fallback ordered by trailing var-number.
- `PocketTtsModelStore` discovers + caches keys per language.
- `PocketTtsSynthesizer` (one-shot, streaming, session) and
`PocketTtsSession` consume `mimiKeys` instead of the dropped
static mapping.
- `PocketTtsSynthesizer+Mimi.runMimiDecoder` and
`loadMimiInitialState` use the discovered shape map for state I/O.
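One possible reading of the three pairing rules is sketched below. This is a guess at the logic, not the `PocketTtsMimiKeys` implementation: `TensorSpec`, `pairStates`, and `audioOutputName` are hypothetical names, and the assumption that ascending `var_NNN` order matches the declaration order of the state inputs is mine.

```swift
import Foundation

struct TensorSpec {
    let name: String
    let shape: [Int]
}

/// Trailing digits of a name such as `var_412`; `Int.max` when the name has none.
func trailingNumber(_ name: String) -> Int {
    var digits = ""
    for ch in name.reversed() {
        guard ("0"..."9").contains(ch) else { break }
        digits.insert(ch, at: digits.startIndex)
    }
    return Int(digits) ?? Int.max
}

/// The audio output is taken to be the only `[1, 1, 1920]` tensor.
func audioOutputName(in outputs: [TensorSpec]) -> String? {
    let matches = outputs.filter { $0.shape == [1, 1, 1920] }
    return matches.count == 1 ? matches[0].name : nil
}

/// Maps each state input name to the output that carries its updated value.
func pairStates(inputs: [TensorSpec], outputs: [TensorSpec]) -> [String: String] {
    var pairs: [String: String] = [:]
    var freeOutputs = outputs
    var pending: [TensorSpec] = []

    // Rule 1: pass-through, an output that reuses the input's own name.
    for input in inputs {
        if let i = freeOutputs.firstIndex(where: { $0.name == input.name }) {
            pairs[input.name] = freeOutputs.remove(at: i).name
        } else {
            pending.append(input)
        }
    }

    // Rule 2: reserve `*end_offset*` outputs for `*end_offset*` inputs so the
    // generic shape bucket cannot hand them to plain offset inputs.
    var unresolved: [TensorSpec] = []
    for input in pending {
        if input.name.contains("end_offset"),
           let i = freeOutputs.firstIndex(where: {
               $0.name.contains("end_offset") && $0.shape == input.shape
           }) {
            pairs[input.name] = freeOutputs.remove(at: i).name
        } else {
            unresolved.append(input)
        }
    }

    // Rule 3: shape buckets, consuming outputs in ascending trailing-number order.
    freeOutputs.sort { trailingNumber($0.name) < trailingNumber($1.name) }
    for input in unresolved {
        if let i = freeOutputs.firstIndex(where: { $0.shape == input.shape }) {
            pairs[input.name] = freeOutputs.remove(at: i).name
        }
    }
    return pairs
}
```

Running the name-based rules first means the shape fallback only has to disambiguate tensors that genuinely have nothing but their auto-generated `var_NNN` names to go on.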
Net result: one runtime path serves English (legacy) + 9 v2 language
packs (5 languages × 6L/24L pairs minus french-6L which upstream
didn't publish), with native voice acoustic identity throughout.
- `ModelNames.PocketTTS`: remove 3 backward-compat aliases (`mimiDecoder`, the `mimiDecoderFile` constant, the `requiredModels` static let). The new language-aware APIs (`mimiDecoderFile(for:)`, `requiredModels(for:)`) are the only callers. The `Repo.requiredModels` switch is updated to call `requiredModels(for: .english)` directly.
- `PocketTtsConstants.kvCacheLayers`: remove orphaned constant. The comment already deferred to `PocketTtsLanguage.transformerLayers`; nothing reads the static.
- `PocketTtsLanguageTests`: drop `testEveryLanguageHasValidLayerCount`, `testRequiredModelsAlwaysHasFiveEntries`, `testLegacyRequiredModelsMatchesEnglish`. The first two were tautologies; the third covered the now-removed alias.
7 `PocketTtsLanguageTests` still passing; full build clean.
Three multi-language correctness bugs flagged by Devin Review:
1. `embedTokens` used a hardcoded `vocabSize=4001` (English) for bounds-checking against `textEmbedTable`, but v2 packs ship per-language tables of varying row counts. Out-of-range token IDs would either crash (OOB read at `id*dim`) or silently clamp to wrong embeddings. Derive `vocabSize` from the actual loaded table: `textEmbedTable.count / dim`.
2. `makeSession()` unconditionally called `prefillKVCacheVoice`, whose loop `0..<voiceData.promptLength` is a no-op for v2 voice packs (`promptLength == 0`, voice arrives via `cacheSnapshot`). Result: every session-mode non-English synthesis produced unconditioned (zero-prefill) audio. Mirror `prefillKVCache`'s snapshot-vs-flat dispatch in `makeSession`.
3. `ensureModels(language:)` only forwarded `progressHandler` to `downloadRepo` (English path); the `downloadSubdirectory` call for non-English packs ignored it. Added `progressHandler` to `downloadSubdirectory`, which now emits `.listing` + per-file `.downloading` phases, then plumbed it through `PocketTtsResourceDownloader`.
Build clean, `PocketTtsLanguageTests` green (7/7).
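The first fix, deriving the vocabulary size from the loaded table, can be sketched as follows. The function signature and `EmbedError` are hypothetical, and throwing on an out-of-range ID is a choice made for this example; the commit message does not say how the real `embedTokens` reports the condition.

```swift
enum EmbedError: Error {
    case tokenOutOfRange(Int)
}

/// Looks up one `dim`-wide row per token ID in a flat, row-major embedding table.
func embedTokens(_ ids: [Int], table: [Float], dim: Int) throws -> [Float] {
    precondition(dim > 0 && table.count % dim == 0, "table is not a whole number of rows")
    // Row count comes from the table that was actually loaded, so each
    // language pack gets its own bound instead of a shared constant.
    let vocabSize = table.count / dim
    var out: [Float] = []
    out.reserveCapacity(ids.count * dim)
    for id in ids {
        guard id >= 0 && id < vocabSize else { throw EmbedError.tokenOutOfRange(id) }
        out.append(contentsOf: table[(id * dim)..<((id + 1) * dim)])
    }
    return out
}
```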
Every PocketTTS language now ships under `v2/<lang>/` on HuggingFace
(including English). The old root-level English pack — flat
`<voice>_audio_prompt.bin` voice files, `mimi_decoder_v2.mlmodelc`,
optional `repoSubdirectory == nil` branch — is removed wholesale.
- `PocketTtsLanguage.repoSubdirectory` becomes non-optional `String`
returning `v2/<rawValue>` for every case.
- `ModelNames.PocketTTS`: drop `mimiDecoderLegacy{,File}`, rename
`mimiDecoderV2` → canonical `mimiDecoder`, replace
`requiredModels(for:)` dispatch with a single `requiredModels` set.
- `PocketTtsResourceDownloader.ensureModels`: always download via
`downloadSubdirectory("v2/<lang>")`. The `downloadRepo` English fast
path is gone.
- `PocketTtsResourceDownloader.ensureVoice`: only fetches
`<voice>.safetensors`. The `<voice>_audio_prompt.bin` download
fallback is removed.
- `PocketTtsConstantsLoader.loadVoice`: only reads `.safetensors`.
- `PocketTtsModelStore.isMimiEncoderAvailable`: always walks two
components up from the language root to reach the repo root (encoder
is shared and lives at the repo top).
Voice cloning is unaffected: cloned voices still produce a runtime
`audioPrompt` and use the `prefillKVCacheVoice` path. Only on-disk
voice file format and download paths are simplified.
Tests updated: `PocketTtsLanguageTests` now asserts every language
follows the uniform `v2/<rawValue>` layout. 16/16 PocketTts tests
green.
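With every pack under `v2/<rawValue>`, reaching the shared encoder from a language directory is a fixed two-level walk. A minimal sketch, assuming a local cache laid out like the HF repo; the helper names and the `encoderFileName` parameter are hypothetical, since the commit message does not name the encoder file.

```swift
import Foundation

/// `<repo>/v2/<lang>` -> `<repo>`: drop the language folder, then `v2`.
func repoRoot(fromLanguageRoot languageRoot: URL) -> URL {
    languageRoot.deletingLastPathComponent().deletingLastPathComponent()
}

/// Checks for the shared encoder at the repo top. `encoderFileName` is supplied
/// by the caller; no particular file name is assumed here.
func isEncoderAvailable(languageRoot: URL, encoderFileName: String) -> Bool {
    let encoderURL = repoRoot(fromLanguageRoot: languageRoot)
        .appendingPathComponent(encoderFileName)
    return FileManager.default.fileExists(atPath: encoderURL.path)
}
```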
🟡 Missing progressHandler call for zero-sized files in downloadSubdirectory
When a file has size == 0, the code at DownloadUtils.swift:640-642 creates the file and continues without calling progressHandler. The PR adds progress handler calls for the "file already exists" path (lines 627-632) and the "file downloaded" path (lines 673-677), but omits the zero-sized file path. This means the progress fraction won't advance for zero-sized files, and if the last file in the batch is zero-sized, the reported progress will never reach 1.0 for that file. The log message at line 679 is also skipped.
(Refers to lines 640-642)
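One way to avoid this class of omission is to report progress from a single place that every branch falls through to. The sketch below is not the code in `DownloadUtils.swift`; `RemoteFile`, `fetchAll`, and the closure parameters are hypothetical stand-ins for the real listing, cache check, and download.

```swift
struct RemoteFile {
    let path: String
    let size: Int
}

/// Handles each file, then reports the completed fraction exactly once per file,
/// regardless of which branch dealt with it.
func fetchAll(
    _ files: [RemoteFile],
    alreadyOnDisk: (RemoteFile) -> Bool,
    download: (RemoteFile) throws -> Void,
    progress: (Double) -> Void
) rethrows {
    for (index, file) in files.enumerated() {
        if alreadyOnDisk(file) {
            // Cached: nothing to fetch.
        } else if file.size == 0 {
            // Zero-sized: the real code creates an empty file here instead of downloading.
        } else {
            try download(file)
        }
        // Single reporting point, reached by all three branches above.
        progress(Double(index + 1) / Double(files.count))
    }
}
```

Because no branch exits the loop body early, a zero-sized final file still drives the reported fraction to 1.0.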
        return UInt64(littleEndian: typed.pointee)
    }
    let headerStart = 8
    let headerEnd = headerStart + Int(headerLen)
🟡 Int(headerLen) traps on corrupt safetensors file with large header length
At PocketTtsConstantsLoader.swift:331, Int(headerLen) converts a UInt64 read directly from the file's first 8 bytes. If the safetensors file is corrupt or malicious and contains a header length value greater than Int.max (~9.2 exabytes), this Int(_:) initializer will trap with a fatal error before the subsequent guard headerEnd <= data.count bounds check at line 332 can catch it. Using Int(exactly: headerLen) with a guard would safely reject such files.
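The suggested guard could look like the sketch below. It is a standalone illustration rather than a patch to `loadVoiceSnapshot`: the function name and error type are hypothetical, and it assembles the length byte by byte instead of using the pointer read shown in the diff.

```swift
import Foundation

enum SafetensorsError: Error {
    case truncated
    case invalidHeaderLength
}

/// Returns the range of the JSON header, as offsets from the start of `data`.
func safetensorsHeaderRange(in data: Data) throws -> Range<Int> {
    let headerStart = 8
    guard data.count >= headerStart else { throw SafetensorsError.truncated }

    // Assemble the little-endian u64 length prefix one byte at a time.
    var headerLen: UInt64 = 0
    for (i, byte) in data.prefix(headerStart).enumerated() {
        headerLen |= UInt64(byte) << UInt64(8 * i)
    }

    // `Int(exactly:)` yields nil for values above `Int.max` instead of trapping,
    // and comparing against the remaining byte count avoids overflow in
    // `headerStart + len`.
    guard let len = Int(exactly: headerLen), len <= data.count - headerStart else {
        throw SafetensorsError.invalidHeaderLength
    }
    return headerStart..<(headerStart + len)
}
```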
This reverts commit 6256164.
Summary
Adds first-class support for the PocketTTS language packs that upstream `kyutai/pocket-tts` just published, tracking issue #49. Users pick a language at manager construction; English continues to use the legacy HF repo root (zero breaking change, zero extra download for EN-only users).
Supported languages
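| ID | HF path |
| --- | --- |
| `english` | (repo root) |
| `french_24l` | `v2/french_24l` |
| `german` | `v2/german` |
| `german_24l` | `v2/german_24l` |
| `italian` | `v2/italian` |
| `italian_24l` | `v2/italian_24l` |
| `portuguese` | `v2/portuguese` |
| `portuguese_24l` | `v2/portuguese_24l` |
| `spanish` | `v2/spanish` |
| `spanish_24l` | `v2/spanish_24l` |
French ships 24-layer only upstream; no 6-layer French pack exists.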
Changes
- `PocketTtsLanguage`: new enum (10 cases) with `repoSubdirectory` and `transformerLayers`.
- `ModelNames.PocketTTS`: `requiredModels(for:)` and `mimiDecoderFile(for:)` dispatch between the legacy `mimi_decoder_v2` filename (English root) and `mimi_decoder` (new packs). Back-compat alias retained.
- `PocketTtsLayerKeys`: discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path.
- `PocketTtsMimiKeys` (new): discovers the Mimi decoder's audio output + per-state input→output pairing dynamically. Handles legacy English's 4-output `[1]`-shape bucket (offsets vs end_offsets) via name-based reservation, then shape-bucket fallback ordered by trailing var-number.
- v2 voice prebakes: upstream v2 language packs ship per-voice prebakes as `<voice>.safetensors` containing pre-computed LM transformer KV cache snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset). `PocketTtsConstantsLoader.loadVoiceSnapshot` parses the safetensors header (8-byte LE u64 + JSON) and extracts per-layer cache + offset tensors. `PocketTtsSynthesizer.kvCacheStateFromSnapshot` copies K/V blocks into the runtime `[2, 1, kvCacheMaxLen, 16, 64]` state independently. Skips the per-token `cond_step` voice prefill. Legacy English `<voice>_audio_prompt.bin` flat path unchanged.
- `PocketTtsResourceDownloader`: `ensureModels(language:)` fetches only the requested `v2/<lang>/` subtree via `DownloadUtils.downloadSubdirectory`. `ensureVoice` requests `.safetensors` for v2 packs and `.bin` for legacy English. `ensureMimiEncoder()` no longer pulls the whole English pack just to enable voice cloning.
- `PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` / `PocketTtsSynthesizer`: language threaded through load + constants + KV-cache sizing. Voice data is cached per `(language, voice)`. Mimi keys discovered + cached per language.
- Voice cloning: cloned `PocketTtsVoiceData` from one language's manager can be fed to another (see docs).
- CLI: `fluidaudiocli tts --backend pocket --language <id>` (default `english`). Unknown values log the supported list and fall back to English.
- Docs: `Documentation/TTS/PocketTTS.md` gains a Languages section + cross-language cloning example. README mentions the new set.
Tests
- `PocketTtsLanguageTests`: 10 pure-logic cases covering `repoSubdirectory`, layer counts, `requiredModels(for:)`, `mimiDecoderFile(for:)`, and the English back-compat alias. No model download / no network.
- `PocketTtsSessionTests` updated for the new `emptyKVCacheState(layers:)` signature (no behavior change).
Test plan
- `swift build`: clean Release build
- `swift format lint --recursive --configuration .swift-format`: clean
- `swift test --filter PocketTts`: 23/23 pass
- `tts "Hello world" --backend pocket --output /tmp/en.wav`: 3.60s rms 5913 (no regression vs main)
- `tts "Hola, esto es una prueba en español." --backend pocket --language spanish --voice alba --output /tmp/es.wav`: 1.60s rms 6191
- `tts --backend pocket --language spanish --voice charles`: 3.92s rms 4377
- `tts --backend pocket --language german --voice alba`: 2.88s rms 5715
- `tts --backend pocket --language italian --voice alba`: 4.08s rms 5545
- `tts --backend pocket --language portuguese --voice alba`: 3.28s rms 6822
HF asset state (verified via HEAD sweep)
Nine of the 10 v2 language packs (`english`, `spanish`, `spanish_24l`, `german`, `german_24l`, `italian`, `italian_24l`, `portuguese_24l`, `french_24l`) carry the full set of 21 voice `.safetensors` files. `portuguese` (6L) is missing two voices upstream: `stuart_bell` and `vera` (19/21 present). Total: 208/210 voice files available.
Non-goals
`PocketTtsManager` (create a new manager instead).
Closes #49