
feat(tts/pocket): multi-language support (EN + 9 new packs) #540

Open

Alex-Wengg wants to merge 6 commits into main from feat/pocket-tts-languages

Conversation

Alex-Wengg (Member) commented Apr 25, 2026

Summary

Adds first-class support for the PocketTTS language packs that upstream kyutai/pocket-tts just published (tracking issue #49). Users pick a language at manager construction; English continues to use the legacy HF repo root (zero breaking change, zero extra download for EN-only users).

Supported languages

| ID | Layers | HF subtree |
| --- | --- | --- |
| english | 6 | (repo root, legacy) |
| french_24l | 24 | v2/french_24l |
| german | 6 | v2/german |
| german_24l | 24 | v2/german_24l |
| italian | 6 | v2/italian |
| italian_24l | 24 | v2/italian_24l |
| portuguese | 6 | v2/portuguese |
| portuguese_24l | 24 | v2/portuguese_24l |
| spanish | 6 | v2/spanish |
| spanish_24l | 24 | v2/spanish_24l |

French ships 24-layer only upstream; no 6-layer French pack exists.

Changes

  • PocketTtsLanguage: new enum (10 cases) with repoSubdirectory and transformerLayers.
  • ModelNames.PocketTTS: requiredModels(for:) and mimiDecoderFile(for:) dispatch between the legacy mimi_decoder_v2 filename (English root) and mimi_decoder (new packs). Back-compat alias retained.
  • PocketTtsLayerKeys: discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path.
  • PocketTtsMimiKeys (new): discovers the Mimi decoder's audio output + per-state input→output pairing dynamically. Handles legacy English's 4-output [1]-shape bucket (offsets vs end_offsets) via name-based reservation, then shape-bucket fallback ordered by trailing var-number.
  • v2 voice safetensors prebakes: non-English packs ship <voice>.safetensors containing pre-computed LM transformer KV cache snapshots (per-layer [2, 1, seqLen, 16, 64] F32 + I64 offset). PocketTtsConstantsLoader.loadVoiceSnapshot parses the safetensors header (8-byte LE u64 + JSON) and extracts per-layer cache + offset tensors. PocketTtsSynthesizer.kvCacheStateFromSnapshot copies K/V blocks into the runtime [2, 1, kvCacheMaxLen, 16, 64] state independently. Skips the per-token cond_step voice prefill. Legacy English <voice>_audio_prompt.bin flat path unchanged.
  • PocketTtsResourceDownloader: ensureModels(language:) fetches only the requested v2/<lang>/ subtree via DownloadUtils.downloadSubdirectory. ensureVoice requests .safetensors for v2 packs and .bin for legacy English. ensureMimiEncoder() no longer pulls the whole English pack just to enable voice cloning.
  • PocketTtsModelStore / PocketTtsManager / PocketTtsSession / PocketTtsSynthesizer: language threaded through load + constants + KV-cache sizing. Voice data is cached per (language, voice). Mimi keys discovered + cached per language.
  • Voice cloning across languages: Mimi encoder is shared; PocketTtsVoiceData from one language's manager can be fed to another (see docs).
  • CLI: fluidaudiocli tts --backend pocket --language <id> (default english). Unknown values log the supported list and fall back to English.
  • Docs: Documentation/TTS/PocketTTS.md gains a Languages section + cross-language cloning example. README mentions the new set.
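The enum surface described above might look like the following sketch. The case names come from the table in this PR; the property bodies are an assumption for illustration, not the actual implementation (English maps to `nil` here to model the legacy repo root):

```swift
// Hedged sketch of PocketTtsLanguage as described in this PR.
// Case list matches the supported-languages table; internals are illustrative.
enum PocketTtsLanguage: String, CaseIterable {
    case english
    case french_24l
    case german, german_24l
    case italian, italian_24l
    case portuguese, portuguese_24l
    case spanish, spanish_24l

    /// HF subtree for this pack; nil means the legacy repo root (English).
    var repoSubdirectory: String? {
        self == .english ? nil : "v2/\(rawValue)"
    }

    /// 24-layer packs carry a `_24l` suffix; everything else is 6-layer.
    var transformerLayers: Int {
        rawValue.hasSuffix("_24l") ? 24 : 6
    }
}
```

Deriving the layer count from the raw value keeps the 6L/24L pairs from drifting apart as packs are added.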

Tests

  • New PocketTtsLanguageTests — 10 pure-logic cases covering repoSubdirectory, layer counts, requiredModels(for:), mimiDecoderFile(for:), and the English back-compat alias. No model download / no network.
  • PocketTtsSessionTests updated for the new emptyKVCacheState(layers:) signature (no behavior change).
  • Full PocketTTS test suite: 23/23 passing.

Test plan

  • swift build — clean Release build
  • swift format lint --recursive --configuration .swift-format — clean
  • swift test --filter PocketTts — 23/23 pass
  • Manual: tts "Hello world" --backend pocket --output /tmp/en.wav — 3.60s rms 5913 (no regression vs main)
  • Manual: tts "Hola, esto es una prueba en español." --backend pocket --language spanish --voice alba --output /tmp/es.wav — 1.60s rms 6191
  • Manual: tts --backend pocket --language spanish --voice charles — 3.92s rms 4377
  • Manual: tts --backend pocket --language german --voice alba — 2.88s rms 5715
  • Manual: tts --backend pocket --language italian --voice alba — 4.08s rms 5545
  • Manual: tts --backend pocket --language portuguese --voice alba — 3.28s rms 6822
  • Manual: 24L variants (french_24l, *_24l) — code path is layer-count agnostic but not exercised locally due to disk constraints

HF asset state (verified via HEAD sweep)

Nine of the 10 language packs (english, spanish, spanish_24l, german, german_24l, italian, italian_24l, portuguese_24l, french_24l) carry the full set of 21 voice .safetensors files. portuguese (6L) is missing two voices upstream: stuart_bell and vera (19/21 present). Total: 208/210 voice files available.

Non-goals

  • Runtime language switching on a live PocketTtsManager (create a new manager instead).
  • Auto-inferring language from text.
  • French 6-layer (upstream did not ship it).

Closes #49

Adds first-class support for PocketTTS language packs that upstream
kyutai/pocket-tts just published (issue #49):

- english            (existing, 6-layer, unchanged HF paths)
- french_24l         (24-layer only; upstream ships no 6-layer French)
- german, german_24l
- italian, italian_24l
- portuguese, portuguese_24l
- spanish, spanish_24l

English continues to live at the legacy HF repo root to keep existing
caches valid; new packs live under `v2/<lang>/`. Only the requested
subtree is downloaded, so English-only users pay zero extra bytes and
non-English users skip the English pack entirely.

Library changes:
- New `PocketTtsLanguage` enum with 10 cases + `repoSubdirectory` and
  `transformerLayers` (6 vs 24)
- `ModelNames.PocketTTS.requiredModels(for:)` and
  `mimiDecoderFile(for:)` dispatch between legacy `mimi_decoder_v2` and
  new `mimi_decoder` filenames
- `PocketTtsLayerKeys` discovers KV-cache I/O names at runtime so 6L
  and 24L packs share the same inference path
- `PocketTtsResourceDownloader.ensureModels(language:)` fetches only
  the matching `v2/<lang>/` subtree; `ensureMimiEncoder()` no longer
  pulls the English pack just to enable voice cloning
- `PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` /
  `PocketTtsSynthesizer` thread language through load, constants, KV
  cache sizing, and per-(language,voice) caching
- Voice cloning works across languages: Mimi encoder is shared, cloned
  `PocketTtsVoiceData` can be fed to any language's manager

CLI:
- `fluidaudiocli tts --backend pocket --language <id>` with supported
  list printed on invalid input (default: english)

Tests:
- New `PocketTtsLanguageTests` (10 pure-logic cases) covers
  `repoSubdirectory`, layer counts, `requiredModels(for:)`,
  `mimiDecoderFile(for:)`, and English back-compat alias
- Existing `PocketTtsSessionTests` updated for the new
  `emptyKVCacheState(layers:)` API (no behavior change)

Docs:
- `Documentation/TTS/PocketTTS.md` gains a Languages section (table of
  IDs + HF paths, Swift + CLI examples) and a Cloning-Across-Languages
  example
- README mentions EN/DE/ES/FR/IT/PT under PocketTTS


github-actions Bot commented Apr 25, 2026

PocketTTS Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (187.5 KB) |

Runtime: 0m34s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.


github-actions Bot commented Apr 25, 2026

Kokoro TTS Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (634.8 KB) |

Runtime: 0m39s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.


github-actions Bot commented Apr 25, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 10.69x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 45.4s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.045s | Average chunk processing time |
| Max Chunk Time | 0.091s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m52s • 04/25/2026, 07:12 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O


github-actions Bot commented Apr 25, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Description |
| --- | --- | --- | --- |
| DER | 15.1% | <30% | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | Jaccard Error Rate |
| RTFx | 29.86x | >1.0x | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 8.857 | 25.2 | Fetching diarization models |
| Model Compile | 3.796 | 10.8 | CoreML compilation |
| Audio Load | 0.060 | 0.2 | Loading audio file |
| Segmentation | 10.541 | 30.0 | Detecting speech regions |
| Embedding | 17.568 | 50.0 | Extracting speaker voices |
| Clustering | 7.027 | 20.0 | Grouping same speakers |
| Total | 35.143 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 35.1s diarization time • Test runtime: 1m 46s • 04/25/2026, 07:22 PM EST


github-actions Bot commented Apr 25, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 677.6x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 767.6x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%


github-actions Bot commented Apr 25, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Transcription pipeline | |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
| --- | --- | --- |
| Median RTFx | 0.04x | ~2.5x |
| Overall RTFx | 0.04x | ~2.5x |

Runtime: 5m23s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.


github-actions Bot commented Apr 25, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target |
| --- | --- | --- |
| DER | 33.4% | <35% |
| Miss Rate | 24.4% | - |
| False Alarm | 0.2% | - |
| Speaker Error | 8.8% | - |
| RTFx | 8.6x | >1.0x |
| Speakers | 4/4 | - |

Sortformer High-Latency • ES2004a • Runtime: 3m 1s • 2026-04-25T23:23:59.424Z


github-actions Bot commented Apr 25, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx |
| --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 5.95x |
| test-other | 1.35% | 0.00% | 3.61x |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx |
| --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 5.49x |
| test-other | 1.00% | 0.00% | 3.52x |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.63x | Streaming real-time factor |
| Avg Chunk Time | 1.464s | Average time to process each chunk |
| Max Chunk Time | 1.562s | Maximum chunk processing time |
| First Token | 1.739s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.66x | Streaming real-time factor |
| Avg Chunk Time | 1.425s | Average time to process each chunk |
| Max Chunk Time | 1.813s | Maximum chunk processing time |
| First Token | 1.395s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m16s • 04/25/2026, 07:17 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard


github-actions Bot commented Apr 25, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Description |
| --- | --- | --- | --- |
| DER | 10.4% | <20% | Diarization Error Rate (lower is better) |
| RTFx | 11.12x | >1.0x | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 12.672 | 13.4 | Fetching diarization models |
| Model Compile | 5.431 | 5.8 | CoreML compilation |
| Audio Load | 0.057 | 0.1 | Loading audio file |
| Segmentation | 26.626 | 28.2 | VAD + speech detection |
| Embedding | 94.105 | 99.7 | Speaker embedding extraction |
| Clustering (VBx) | 0.104 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 94.366 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 120.8s processing • Test runtime: 2m 0s • 04/25/2026, 07:19 PM EST

Two related improvements to the multi-language PocketTTS pipeline.

1. v2 voice safetensors loading (drops the English-stopgap fallback)

   Upstream v2 language packs ship per-voice prebakes as
   `<voice>.safetensors` containing pre-computed LM transformer KV cache
   snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset).
   Previously the runtime fell back to the English flat
   `<voice>_audio_prompt.bin` `[1, 125, 1024]`, producing English voice
   acoustic identity on non-English models — short, distorted output.

   - `PocketTtsConstantsLoader.loadVoice` picks `.safetensors` first,
     falls back to `.bin` for legacy English. New `loadVoiceSnapshot`
     parses the safetensors header (8-byte LE u64 + JSON) and extracts
     per-layer `transformer.layers.{N}.self_attn/{cache,offset}` tensors.
     Layer count auto-detected from key indices (handles 6L and 24L).
   - `PocketTtsVoiceData` gains optional `cacheSnapshot`. New
     `PocketTtsVoiceCacheSnapshot` carries flat per-layer K/V + offset.
   - `PocketTtsSynthesizer.kvCacheStateFromSnapshot` allocates the
     `[2, 1, kvCacheMaxLen, 16, 64]` MLMultiArrays and copies the K
     block (outer dim 0) and V block (outer dim 1) into the first
     `seqLen` positions independently — they don't lie at adjacent
     offsets in the destination because dest seq capacity is larger.
   - `prefillKVCache` branches on snapshot presence: snapshot path
     skips `cond_step` voice prefill entirely; flat path unchanged.
     Text prefill runs identically in both cases.
   - `PocketTtsResourceDownloader.ensureVoice` requests
     `<voice>.safetensors` for non-English language packs and
     `<voice>_audio_prompt.bin` for English; existing on-disk file in
     either format short-circuits the download.
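As a rough sketch of the header parsing described above (illustrative only — not the actual `loadVoiceSnapshot` code), the 8-byte LE u64 + JSON layout can be read like this; the error type and function name are hypothetical:

```swift
import Foundation

// Minimal safetensors header reader: 8-byte little-endian u64 length
// followed by a JSON header object. Illustrative sketch, not library code.
enum SafetensorsError: Error { case truncated, badHeader }

func parseSafetensorsHeader(_ data: Data) throws -> [String: Any] {
    guard data.count >= 8 else { throw SafetensorsError.truncated }
    // Assemble the length byte-by-byte to avoid alignment assumptions.
    let rawLen = (0..<8).reduce(UInt64(0)) { acc, i in
        acc | (UInt64(data[data.startIndex + i]) << (8 * UInt64(i)))
    }
    // Int(exactly:) rejects absurd lengths from corrupt files instead of trapping.
    guard let len = Int(exactly: rawLen), len <= data.count - 8 else {
        throw SafetensorsError.badHeader
    }
    let json = data.subdata(in: (data.startIndex + 8)..<(data.startIndex + 8 + len))
    guard let obj = try JSONSerialization.jsonObject(with: json) as? [String: Any] else {
        throw SafetensorsError.badHeader
    }
    return obj
}
```

Each header entry then names a tensor (e.g. the per-layer `transformer.layers.{N}.self_attn` cache/offset keys mentioned above) with its dtype, shape, and data offsets into the remainder of the file.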

   Verified end-to-end:
   - english (legacy .bin)        — 3.60s rms 5913 (no regression)
   - spanish alba (v2 safetensors) — 1.60s rms 6191
   - spanish charles (v2)          — 3.92s rms 4377
   - german alba (v2)              — 2.88s rms 5715
   - italian alba (v2)             — 4.08s rms 5545
   - portuguese alba (v2)          — 3.28s rms 6822

   24L variants share the same code path (layer-count agnostic) but
   were not exercised here due to local disk constraints.
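The independent K/V copy described above comes down to one observation: once the destination's sequence capacity exceeds the snapshot's, the K and V blocks no longer sit at adjacent offsets, so each outer-dim block must be copied separately. A plain-array sketch (MLMultiArray swapped for `[Float]` so it is self-contained; not the actual `kvCacheStateFromSnapshot` code):

```swift
// Copy a snapshot's K block (outer index 0) and V block (outer index 1)
// from a [2, 1, srcSeq, 16, 64] layout into a larger [2, 1, destSeq, 16, 64]
// cache. Because destSeq > srcSeq, the two blocks land at different relative
// offsets in the destination and must be copied independently.
let heads = 16, headDim = 64

func copySnapshot(_ src: [Float], srcSeq: Int,
                  into dest: inout [Float], destSeq: Int) {
    let srcBlock = srcSeq * heads * headDim    // one K or V block in the source
    let destBlock = destSeq * heads * headDim  // one K or V block in the dest
    precondition(src.count == 2 * srcBlock && dest.count == 2 * destBlock)
    for kv in 0..<2 {  // 0 = K, 1 = V
        let s = kv * srcBlock
        let d = kv * destBlock
        dest.replaceSubrange(d..<(d + srcBlock), with: src[s..<(s + srcBlock)])
    }
}
```

The positions beyond `srcSeq` in each destination block stay zeroed, ready for the text prefill that follows.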

2. Dynamic Mimi decoder schema discovery

   The Mimi audio codec ships in two upstream variants:
    - Legacy English: `attn{0,1}_cache` `[2, 1, 8, 256, 64]` heads-first,
      includes `attn{0,1}_end_offset` inputs and `new_end_offset*` outputs.
    - v2 packs: `attn{0,1}_cache` `[2, 1, 256, 8, 64]` seq-first, no
      end_offset I/O.

   CoreML auto-generates `var_NNN` output names per conversion so they
   differ between packs. The previous static `mimiStateMapping` only
   matched the legacy English pack and crashed on v2 packs.

   - New `PocketTtsMimiKeys` discovers the audio output (the only
     `[1, 1, 1920]` tensor) and pairs each state input to its update
     output via three rules: pass-through (input name == output name),
     `*end_offset*` reservation (so legacy English's 4-output `[1]`
     bucket pairs offsets vs end_offsets correctly), and shape-bucket
     fallback ordered by trailing var-number.
   - `PocketTtsModelStore` discovers + caches keys per language.
   - `PocketTtsSynthesizer` (one-shot, streaming, session) and
     `PocketTtsSession` consume `mimiKeys` instead of the dropped
     static mapping.
   - `PocketTtsSynthesizer+Mimi.runMimiDecoder` and
     `loadMimiInitialState` use the discovered shape map for state I/O.

Net result: one runtime path serves English (legacy) + 9 v2 language
packs (5 languages × 6L/24L pairs minus french-6L which upstream
didn't publish), with native voice acoustic identity throughout.
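A toy model of the three pairing rules can make the discovery logic concrete. Everything below is illustrative — the `Port` type, function names, and sample shapes are not the real `PocketTtsMimiKeys` API:

```swift
// Toy sketch of state input→output pairing: pass-through, end_offset
// reservation, then shape-bucket fallback ordered by trailing var-number.
struct Port { let name: String; let shape: [Int] }

// CoreML-style trailing var-number, e.g. "var_123" -> 123; others sort last.
func trailingVarNumber(_ name: String) -> Int {
    name.split(separator: "_").last.flatMap { Int($0) } ?? Int.max
}

func pairStates(inputs: [Port], outputs: [Port]) -> [String: String] {
    var mapping: [String: String] = [:]
    var pool = outputs
    func claim(_ idx: Int, for input: Port) {
        mapping[input.name] = pool.remove(at: idx).name
    }
    // Rule 1: pass-through — an output with the same name as the input.
    for input in inputs where mapping[input.name] == nil {
        if let i = pool.firstIndex(where: { $0.name == input.name }) { claim(i, for: input) }
    }
    // Rule 2: reserve *end_offset* outputs for *end_offset* inputs, so the
    // legacy [1]-shape bucket pairs offsets vs end_offsets correctly.
    for input in inputs where mapping[input.name] == nil && input.name.contains("end_offset") {
        if let i = pool.firstIndex(where: { $0.name.contains("end_offset") && $0.shape == input.shape }) {
            claim(i, for: input)
        }
    }
    // Rule 3: shape-bucket fallback, ordered by trailing var-number.
    for input in inputs where mapping[input.name] == nil {
        let candidates = pool.enumerated()
            .filter { $0.element.shape == input.shape }
            .sorted { trailingVarNumber($0.element.name) < trailingVarNumber($1.element.name) }
        if let first = candidates.first { claim(first.offset, for: input) }
    }
    return mapping
}
```

Running rule 2 before the shape fallback is what keeps the four same-shaped `[1]` tensors of the legacy English decoder from being paired by var-number alone.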

- ModelNames.PocketTTS: remove 3 backward-compat aliases (mimiDecoder,
  mimiDecoderFile constant, requiredModels static let). The new language-
  aware APIs (mimiDecoderFile(for:), requiredModels(for:)) are the only
  callers. Repo.requiredModels switch updated to call requiredModels(for:
  .english) directly.
- PocketTtsConstants.kvCacheLayers: remove orphaned constant. Comment
  already deferred to PocketTtsLanguage.transformerLayers; nothing reads
  the static.
- PocketTtsLanguageTests: drop testEveryLanguageHasValidLayerCount,
  testRequiredModelsAlwaysHasFiveEntries, and
  testLegacyRequiredModelsMatchesEnglish. The first two were tautologies;
  the third covered the now-removed alias.

7 PocketTtsLanguageTests still passing; full build clean.
Three multi-language correctness bugs flagged by Devin Review:

1. embedTokens used hardcoded vocabSize=4001 (English) for bounds-checking
   against textEmbedTable, but v2 packs ship per-language tables of varying
   row counts. Out-of-range token IDs would either crash (OOB read at
   `id*dim`) or silently clamp to wrong embeddings. Derive vocabSize from
   the actual loaded table: `textEmbedTable.count / dim`.

2. makeSession() unconditionally called prefillKVCacheVoice, whose loop
   `0..<voiceData.promptLength` is a no-op for v2 voice packs (promptLength
   == 0, voice arrives via cacheSnapshot). Result: every session-mode
   non-English synthesis produced unconditioned (zero-prefill) audio.
   Mirror prefillKVCache's snapshot-vs-flat dispatch in makeSession.

3. ensureModels(language:) only forwarded progressHandler to downloadRepo
   (English path); the downloadSubdirectory call for non-English packs
   ignored it. Added progressHandler to downloadSubdirectory and emit
   .listing + per-file .downloading phases, then plumbed it through
   PocketTtsResourceDownloader.

Build clean, PocketTtsLanguageTests green (7/7).
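Bug 1's fix amounts to deriving the row count from the table that was actually loaded rather than assuming the English table's 4001 rows. A minimal sketch (`dim` and the table are illustrative stand-ins for the real loader state):

```swift
// Derive vocabSize from the loaded embedding table instead of a hardcoded
// English constant. Names and sizes are hypothetical.
let dim = 4                                                  // illustrative embedding width
let textEmbedTable = [Float](repeating: 0.5, count: 3 * 4)   // a 3-row table
let vocabSize = textEmbedTable.count / dim                   // derived, not assumed

func embedToken(_ id: Int) -> ArraySlice<Float>? {
    guard (0..<vocabSize).contains(id) else { return nil }   // no OOB read at id*dim
    return textEmbedTable[(id * dim)..<((id + 1) * dim)]
}
```

With a per-language table, an out-of-range ID now fails the bounds check instead of reading past the buffer or clamping to a wrong row.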
Every PocketTTS language now ships under `v2/<lang>/` on HuggingFace
(including English). The old root-level English pack — flat
`<voice>_audio_prompt.bin` voice files, `mimi_decoder_v2.mlmodelc`,
optional `repoSubdirectory == nil` branch — is removed wholesale.

- `PocketTtsLanguage.repoSubdirectory` becomes non-optional `String`
  returning `v2/<rawValue>` for every case.
- `ModelNames.PocketTTS`: drop `mimiDecoderLegacy{,File}`, rename
  `mimiDecoderV2` → canonical `mimiDecoder`, replace
  `requiredModels(for:)` dispatch with a single `requiredModels` set.
- `PocketTtsResourceDownloader.ensureModels`: always download via
  `downloadSubdirectory("v2/<lang>")`. The `downloadRepo` English fast
  path is gone.
- `PocketTtsResourceDownloader.ensureVoice`: only fetches
  `<voice>.safetensors`. The `<voice>_audio_prompt.bin` download
  fallback is removed.
- `PocketTtsConstantsLoader.loadVoice`: only reads `.safetensors`.
- `PocketTtsModelStore.isMimiEncoderAvailable`: always walks two
  components up from the language root to reach the repo root (encoder
  is shared and lives at the repo top).

Voice cloning is unaffected: cloned voices still produce a runtime
`audioPrompt` and use the `prefillKVCacheVoice` path. Only on-disk
voice file format and download paths are simplified.

Tests updated: `PocketTtsLanguageTests` now asserts every language
follows the uniform `v2/<rawValue>` layout. 16/16 PocketTts tests
green.
devin-ai-integration Bot (Contributor) left a comment

Devin Review found 2 new potential issues.

View 10 additional findings in Devin Review.


devin-ai-integration Bot (Contributor) commented:

🟡 Missing progressHandler call for zero-sized files in downloadSubdirectory

When a file has size == 0, the code at DownloadUtils.swift:640-642 creates the file and continues without calling progressHandler. The PR adds progress handler calls for the "file already exists" path (lines 627-632) and the "file downloaded" path (lines 673-677), but omits the zero-sized file path. This means the progress fraction won't advance for zero-sized files, and if the last file in the batch is zero-sized, the reported progress will never reach 1.0 for that file. The log message at line 679 is also skipped.

(Refers to lines 640-642)

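One way to guarantee the fraction advances on every path — including the zero-sized-file branch the review flags — is to report progress in a `defer` so no early `continue` can skip it. A hypothetical shape, not the actual DownloadUtils code:

```swift
// Report per-file progress regardless of which branch handles the file.
// The defer fires on every exit from the loop body, including `continue`.
func processFiles(_ sizes: [Int], progressHandler: ((Double) -> Void)?) {
    for (index, size) in sizes.enumerated() {
        defer { progressHandler?(Double(index + 1) / Double(sizes.count)) }
        if size == 0 {
            continue  // create the empty file and move on; defer still reports
        }
        // ... "already exists" and download branches ...
    }
}
```

With this shape, a zero-sized file as the final entry still drives the reported fraction to 1.0.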

```swift
    return UInt64(littleEndian: typed.pointee)
}
let headerStart = 8
let headerEnd = headerStart + Int(headerLen)
```
devin-ai-integration Bot commented Apr 25, 2026

🟡 Int(headerLen) traps on corrupt safetensors file with large header length

At PocketTtsConstantsLoader.swift:331, Int(headerLen) converts a UInt64 read directly from the file's first 8 bytes. If the safetensors file is corrupt or malicious and contains a header length value greater than Int.max (~9.2 exabytes), this Int(_:) initializer will trap with a fatal error before the subsequent guard headerEnd <= data.count bounds check at line 332 can catch it. Using Int(exactly: headerLen) with a guard would safely reject such files.



Development

Successfully merging this pull request may close these issues.

Model Support Requests
