feat(tts/pocket): multi-language support (EN + 9 new packs) by Alex-Wengg · Pull Request #540 · FluidInference/FluidAudio

Alex-Wengg · 2026-04-25T02:41:26Z

Summary

Adds first-class support for the PocketTTS language packs that upstream kyutai/pocket-tts just published (tracking issue #49). Users pick a language at manager construction; English continues to use the legacy HF repo root, so there is no breaking change and no extra download for English-only users.

Supported languages

ID	Layers	HF subtree
`english`	6	(repo root, legacy)
`french_24l`	24	`v2/french_24l`
`german`	6	`v2/german`
`german_24l`	24	`v2/german_24l`
`italian`	6	`v2/italian`
`italian_24l`	24	`v2/italian_24l`
`portuguese`	6	`v2/portuguese`
`portuguese_24l`	24	`v2/portuguese_24l`
`spanish`	6	`v2/spanish`
`spanish_24l`	24	`v2/spanish_24l`

French ships 24-layer only upstream; no 6-layer French pack exists.
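To make the table concrete, here is a minimal sketch of how the IDs could map to an HF subtree and a layer count. The type name `PocketLanguageSketch` and the case spellings are hypothetical; the actual `PocketTtsLanguage` declaration in this PR may differ.

```swift
import Foundation

// Illustrative only: mirrors the table above, not the PR's real enum.
enum PocketLanguageSketch: String, CaseIterable {
    case english
    case french24L = "french_24l"
    case german
    case german24L = "german_24l"
    case italian
    case italian24L = "italian_24l"
    case portuguese
    case portuguese24L = "portuguese_24l"
    case spanish
    case spanish24L = "spanish_24l"

    /// HF subtree of the pack; nil stands for the legacy repo root (English).
    var repoSubdirectory: String? {
        self == .english ? nil : "v2/\(rawValue)"
    }

    /// Transformer depth: 24 for the `_24l` packs, 6 otherwise.
    var transformerLayers: Int {
        rawValue.hasSuffix("_24l") ? 24 : 6
    }
}
```

Deriving both properties from the raw value keeps the ten cases and the table in one place; an explicit `switch` would work equally well.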

Changes

PocketTtsLanguage: new enum (10 cases) with repoSubdirectory and transformerLayers.
ModelNames.PocketTTS: requiredModels(for:) and mimiDecoderFile(for:) dispatch between the legacy mimi_decoder_v2 filename (English root) and mimi_decoder (new packs). Back-compat alias retained.
PocketTtsLayerKeys: discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path.
PocketTtsMimiKeys (new): discovers the Mimi decoder's audio output + per-state input→output pairing dynamically. Handles legacy English's 4-output [1]-shape bucket (offsets vs end_offsets) via name-based reservation, then shape-bucket fallback ordered by trailing var-number.
v2 voice safetensors prebakes: non-English packs ship <voice>.safetensors containing pre-computed LM transformer KV cache snapshots (per-layer [2, 1, seqLen, 16, 64] F32 + I64 offset). PocketTtsConstantsLoader.loadVoiceSnapshot parses the safetensors header (8-byte LE u64 + JSON) and extracts per-layer cache + offset tensors. PocketTtsSynthesizer.kvCacheStateFromSnapshot copies K/V blocks into the runtime [2, 1, kvCacheMaxLen, 16, 64] state independently. Skips the per-token cond_step voice prefill. Legacy English <voice>_audio_prompt.bin flat path unchanged.
PocketTtsResourceDownloader: ensureModels(language:) fetches only the requested v2/<lang>/ subtree via DownloadUtils.downloadSubdirectory. ensureVoice requests .safetensors for v2 packs and .bin for legacy English. ensureMimiEncoder() no longer pulls the whole English pack just to enable voice cloning.
PocketTtsModelStore / PocketTtsManager / PocketTtsSession / PocketTtsSynthesizer: language threaded through load + constants + KV-cache sizing. Voice data is cached per (language, voice). Mimi keys discovered + cached per language.
Voice cloning across languages: Mimi encoder is shared; PocketTtsVoiceData from one language's manager can be fed to another (see docs).
CLI: fluidaudiocli tts --backend pocket --language <id> (default english). Unknown values log the supported list and fall back to English.
Docs: Documentation/TTS/PocketTTS.md gains a Languages section + cross-language cloning example. README mentions the new set.
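The CLI fallback described above (unknown `--language` values log the supported list and fall back to English) could look roughly like the following. This is a sketch: `resolveLanguageID` and `supportedLanguageIDs` are hypothetical names, the log wording is invented, and exact case-sensitive matching is an assumption.

```swift
// Hypothetical helper; the IDs are copied from the "Supported languages" table.
let supportedLanguageIDs = [
    "english", "french_24l", "german", "german_24l", "italian",
    "italian_24l", "portuguese", "portuguese_24l", "spanish", "spanish_24l",
]

/// Returns the requested ID when it is supported, otherwise "english".
func resolveLanguageID(_ raw: String?) -> String {
    // No flag given: use the documented default.
    guard let raw = raw else { return "english" }
    if supportedLanguageIDs.contains(raw) { return raw }
    // Unknown value: report the supported list, then fall back.
    print("Unknown language '\(raw)'. Supported: \(supportedLanguageIDs.joined(separator: ", "))")
    return "english"
}
```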

Tests

New PocketTtsLanguageTests — 10 pure-logic cases covering repoSubdirectory, layer counts, requiredModels(for:), mimiDecoderFile(for:), and the English back-compat alias. No model download / no network.
PocketTtsSessionTests updated for the new emptyKVCacheState(layers:) signature (no behavior change).
Full PocketTTS test suite: 23/23 passing.

Test plan

HF asset state (verified via HEAD sweep)

Nine of the 10 language packs (english, spanish, spanish_24l, german, german_24l, italian, italian_24l, portuguese_24l, french_24l) carry the full set of 21 voice .safetensors files. portuguese (6L) is missing two voices upstream: stuart_bell and vera (19/21 present). Total: 208/210 voice files available.

Non-goals

Runtime language switching on a live PocketTtsManager (create a new manager instead).
Auto-inferring language from text.
French 6-layer (upstream did not ship it).

Closes #49

Adds first-class support for PocketTTS language packs that upstream kyutai/pocket-tts just published (issue #49):

- english (existing, 6-layer, unchanged HF paths)
- french_24l (24-layer only; upstream ships no 6-layer French)
- german, german_24l
- italian, italian_24l
- portuguese, portuguese_24l
- spanish, spanish_24l

English continues to live at the legacy HF repo root to keep existing caches valid; new packs live under `v2/<lang>/`. Only the requested subtree is downloaded, so English-only users pay zero extra bytes and non-English users skip the English pack entirely.

Library changes:

- New `PocketTtsLanguage` enum with 10 cases + `repoSubdirectory` and `transformerLayers` (6 vs 24)
- `ModelNames.PocketTTS.requiredModels(for:)` and `mimiDecoderFile(for:)` dispatch between legacy `mimi_decoder_v2` and new `mimi_decoder` filenames
- `PocketTtsLayerKeys` discovers KV-cache I/O names at runtime so 6L and 24L packs share the same inference path
- `PocketTtsResourceDownloader.ensureModels(language:)` fetches only the matching `v2/<lang>/` subtree; `ensureMimiEncoder()` no longer pulls the English pack just to enable voice cloning
- `PocketTtsModelStore` / `PocketTtsManager` / `PocketTtsSession` / `PocketTtsSynthesizer` thread language through load, constants, KV cache sizing, and per-(language,voice) caching
- Voice cloning works across languages: Mimi encoder is shared, cloned `PocketTtsVoiceData` can be fed to any language's manager

CLI:

- `fluidaudiocli tts --backend pocket --language <id>` with supported list printed on invalid input (default: english)

Tests:

- New `PocketTtsLanguageTests` (10 pure-logic cases) covers `repoSubdirectory`, layer counts, `requiredModels(for:)`, `mimiDecoderFile(for:)`, and English back-compat alias
- Existing `PocketTtsSessionTests` updated for the new `emptyKVCacheState(layers:)` API (no behavior change)

Docs:

- `Documentation/TTS/PocketTTS.md` gains a Languages section (table of IDs + HF paths, Swift + CLI examples) and a Cloning-Across-Languages example
- README mentions EN/DE/ES/FR/IT/PT under PocketTTS

github-actions · 2026-04-25T02:45:25Z

PocketTTS Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Synthesis pipeline	✅
Output WAV	✅ (187.5 KB)

Runtime: 0m34s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality and performance may differ from Apple Silicon.

github-actions · 2026-04-25T02:45:50Z

Kokoro TTS Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Synthesis pipeline	✅
Output WAV	✅ (634.8 KB)

Runtime: 0m39s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE; performance may differ from Apple Silicon.

github-actions · 2026-04-25T02:46:14Z

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric	Value	Description
WER (Avg)	7.03%	Average Word Error Rate
WER (Med)	4.17%	Median Word Error Rate
RTFx	10.69x	Real-time factor (higher = faster)
Total Audio	470.6s	Total audio duration processed
Total Time	45.4s	Total processing time

Streaming Metrics

Metric	Value	Description
Avg Chunk Time	0.045s	Average chunk processing time
Max Chunk Time	0.091s	Maximum chunk processing time
EOU Detections	0	Total End-of-Utterance detections

Test runtime: 0m52s • 04/25/2026, 07:12 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions · 2026-04-25T02:48:17Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	15.1%	<30%	✅	Diarization Error Rate (lower is better)
JER	24.9%	<25%	✅	Jaccard Error Rate
RTFx	29.86x	>1.0x	✅	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	8.857	25.2	Fetching diarization models
Model Compile	3.796	10.8	CoreML compilation
Audio Load	0.060	0.2	Loading audio file
Segmentation	10.541	30.0	Detecting speech regions
Embedding	17.568	50.0	Extracting speaker voices
Clustering	7.027	20.0	Grouping same speakers
Total	35.143	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	15.1%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): Runs at 150 RTFx real-time
Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 35.1s diarization time • Test runtime: 1m 46s • 04/25/2026, 07:22 PM EST

github-actions · 2026-04-25T02:50:43Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	677.6x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	767.6x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions · 2026-04-25T02:50:45Z

Qwen3-ASR int8 Smoke Test ✅

Check	Result
Build	✅
Model download	✅
Model load	✅
Transcription pipeline	✅
Decoder size	571 MB (vs 1.1 GB f32)

Performance Metrics

Metric	CI Value	Expected on Apple Silicon
Median RTFx	0.04x	~2.5x
Overall RTFx	0.04x	~2.5x

Runtime: 5m23s

Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

github-actions · 2026-04-25T02:52:56Z

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric	Value	Target	Status
DER	33.4%	<35%	✅
Miss Rate	24.4%	-	-
False Alarm	0.2%	-	-
Speaker Error	8.8%	-	-
RTFx	8.6x	>1.0x	✅
Speakers	4/4	-	-

Sortformer High-Latency • ES2004a • Runtime: 3m 1s • 2026-04-25T23:23:59.424Z

github-actions · 2026-04-25T02:55:16Z

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.57%	0.00%	5.95x	✅
test-other	1.35%	0.00%	3.61x	✅

Parakeet v2 (English-optimized)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.80%	0.00%	5.49x	✅
test-other	1.00%	0.00%	3.52x	✅

Streaming (v3)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.63x	Streaming real-time factor
Avg Chunk Time	1.464s	Average time to process each chunk
Max Chunk Time	1.562s	Maximum chunk processing time
First Token	1.739s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming (v2)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.66x	Streaming real-time factor
Avg Chunk Time	1.425s	Average time to process each chunk
Max Chunk Time	1.813s	Maximum chunk processing time
First Token	1.395s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m16s • 04/25/2026, 07:17 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
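As a small worked instance of the RTFx definition above (total audio duration divided by total processing time), the following sketch reproduces the 10 s / 5 s example. `realTimeFactor` is a hypothetical helper, not part of the benchmark tooling.

```swift
/// RTFx = audio duration / processing time; values above 1.0 are faster than real time.
/// Returns nil when the processing time is not positive, so the ratio is never undefined.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// 10 seconds of audio processed in 5 seconds gives an RTFx of 2.0.
let example = realTimeFactor(audioSeconds: 10, processingSeconds: 5)
```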

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions · 2026-04-25T02:56:44Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	10.4%	<20%	✅	Diarization Error Rate (lower is better)
RTFx	11.12x	>1.0x	✅	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	12.672	13.4	Fetching diarization models
Model Compile	5.431	5.8	CoreML compilation
Audio Load	0.057	0.1	Loading audio file
Segmentation	26.626	28.2	VAD + speech detection
Embedding	94.105	99.7	Speaker embedding extraction
Clustering (VBx)	0.104	0.1	Hungarian algorithm + VBx clustering
Total	94.366	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	10.4%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 120.8s processing • Test runtime: 2m 0s • 04/25/2026, 07:19 PM EST

Two related improvements to the multi-language PocketTTS pipeline.

1. v2 voice safetensors loading (drops the English-stopgap fallback)

Upstream v2 language packs ship per-voice prebakes as `<voice>.safetensors` containing pre-computed LM transformer KV cache snapshots (per-layer `[2, 1, seqLen, 16, 64]` F32 + I64 offset). Previously the runtime fell back to the English flat `<voice>_audio_prompt.bin` `[1, 125, 1024]`, producing English voice acoustic identity on non-English models: short, distorted output.

- `PocketTtsConstantsLoader.loadVoice` picks `.safetensors` first, falls back to `.bin` for legacy English. New `loadVoiceSnapshot` parses the safetensors header (8-byte LE u64 + JSON) and extracts per-layer `transformer.layers.{N}.self_attn/{cache,offset}` tensors. Layer count auto-detected from key indices (handles 6L and 24L).
- `PocketTtsVoiceData` gains optional `cacheSnapshot`. New `PocketTtsVoiceCacheSnapshot` carries flat per-layer K/V + offset.
- `PocketTtsSynthesizer.kvCacheStateFromSnapshot` allocates the `[2, 1, kvCacheMaxLen, 16, 64]` MLMultiArrays and copies the K block (outer dim 0) and V block (outer dim 1) into the first `seqLen` positions independently; they don't lie at adjacent offsets in the destination because dest seq capacity is larger.
- `prefillKVCache` branches on snapshot presence: snapshot path skips `cond_step` voice prefill entirely; flat path unchanged. Text prefill runs identically in both cases.
- `PocketTtsResourceDownloader.ensureVoice` requests `<voice>.safetensors` for non-English language packs and `<voice>_audio_prompt.bin` for English; existing on-disk file in either format short-circuits the download.

Verified end-to-end:

- english (legacy .bin): 3.60s rms 5913 (no regression)
- spanish alba (v2 safetensors): 1.60s rms 6191
- spanish charles (v2): 3.92s rms 4377
- german alba (v2): 2.88s rms 5715
- italian alba (v2): 4.08s rms 5545
- portuguese alba (v2): 3.28s rms 6822

24L variants share the same code path (layer-count agnostic) but were not exercised here due to local disk constraints.

2. Dynamic Mimi decoder schema discovery

The Mimi audio codec ships in two upstream variants:

- Legacy English: `attn{0,1}_cache` `[2, 1, 8, 256, 64]` heads-first, includes `attn{0,1}_end_offset` inputs and `new_end_offset*` outputs.
- v2 packs: `attn{0,1}_cache` `[2, 1, 256, 8, 64]` seq-first, no end_offset I/O.

CoreML auto-generates `var_NNN` output names per conversion so they differ between packs. The previous static `mimiStateMapping` only matched the legacy English pack and crashed on v2 packs.

- New `PocketTtsMimiKeys` discovers the audio output (the only `[1, 1, 1920]` tensor) and pairs each state input to its update output via three rules: pass-through (input name == output name), `*end_offset*` reservation (so legacy English's 4-output `[1]` bucket pairs offsets vs end_offsets correctly), and shape-bucket fallback ordered by trailing var-number.
- `PocketTtsModelStore` discovers + caches keys per language.
- `PocketTtsSynthesizer` (one-shot, streaming, session) and `PocketTtsSession` consume `mimiKeys` instead of the dropped static mapping.
- `PocketTtsSynthesizer+Mimi.runMimiDecoder` and `loadMimiInitialState` use the discovered shape map for state I/O.

Net result: one runtime path serves English (legacy) + 9 v2 language packs (5 languages × 6L/24L pairs minus french-6L which upstream didn't publish), with native voice acoustic identity throughout.
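The K/V copy described for `kvCacheStateFromSnapshot` can be illustrated with flat arrays. This is a sketch under stated assumptions: it uses `[Float]` in row-major order rather than the MLMultiArrays the commit mentions, and `copySnapshotIntoCache` is a hypothetical name. It shows why the two blocks must be copied separately: the V block starts at `seqLen * 16 * 64` in the snapshot but at `maxLen * 16 * 64` in the destination.

```swift
/// Copies a [2, 1, seqLen, heads, headDim] snapshot into a zero-filled
/// [2, 1, maxLen, heads, headDim] cache, both flattened row-major.
/// Returns nil when the sizes are inconsistent.
func copySnapshotIntoCache(
    snapshot: [Float], seqLen: Int, maxLen: Int, heads: Int = 16, headDim: Int = 64
) -> [Float]? {
    let rowSize = heads * headDim
    let srcBlock = seqLen * rowSize   // elements per K (or V) block in the snapshot
    let dstBlock = maxLen * rowSize   // elements per K (or V) block in the cache
    guard seqLen >= 0, seqLen <= maxLen, snapshot.count == 2 * srcBlock else { return nil }

    var cache = [Float](repeating: 0, count: 2 * dstBlock)
    for block in 0..<2 {              // 0 = K, 1 = V
        let src = block * srcBlock
        let dst = block * dstBlock
        // Same-length replacement, so the cache size is unchanged.
        cache.replaceSubrange(dst..<(dst + srcBlock), with: snapshot[src..<(src + srcBlock)])
    }
    return cache
}
```

With `seqLen: 1, maxLen: 2, heads: 1, headDim: 2`, the snapshot `[1, 2, 3, 4]` lands as `[1, 2, 0, 0, 3, 4, 0, 0]`: each block fills the first `seqLen` positions of its own half.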

- ModelNames.PocketTTS: remove 3 backward-compat aliases (mimiDecoder, mimiDecoderFile constant, requiredModels static let). The new language-aware APIs (mimiDecoderFile(for:), requiredModels(for:)) are the only callers. Repo.requiredModels switch updated to call requiredModels(for: .english) directly.
- PocketTtsConstants.kvCacheLayers: remove orphaned constant. Comment already deferred to PocketTtsLanguage.transformerLayers; nothing reads the static.
- PocketTtsLanguageTests: drop testEveryLanguageHasValidLayerCount, testRequiredModelsAlwaysHasFiveEntries, testLegacyRequiredModelsMatchesEnglish. The first two were tautologies; the third covered the now-removed alias.

7 PocketTtsLanguageTests still passing; full build clean.

Three multi-language correctness bugs flagged by Devin Review:

1. embedTokens used hardcoded vocabSize=4001 (English) for bounds-checking against textEmbedTable, but v2 packs ship per-language tables of varying row counts. Out-of-range token IDs would either crash (OOB read at `id*dim`) or silently clamp to wrong embeddings. Derive vocabSize from the actual loaded table: `textEmbedTable.count / dim`.
2. makeSession() unconditionally called prefillKVCacheVoice, whose loop `0..<voiceData.promptLength` is a no-op for v2 voice packs (promptLength == 0, voice arrives via cacheSnapshot). Result: every session-mode non-English synthesis produced unconditioned (zero-prefill) audio. Mirror prefillKVCache's snapshot-vs-flat dispatch in makeSession.
3. ensureModels(language:) only forwarded progressHandler to downloadRepo (English path); the downloadSubdirectory call for non-English packs ignored it. Added progressHandler to downloadSubdirectory and emit .listing + per-file .downloading phases, then plumbed it through PocketTtsResourceDownloader.

Build clean, PocketTtsLanguageTests green (7/7).
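The first fix (deriving the vocabulary size from the loaded table instead of hardcoding 4001) can be sketched as follows. `embeddingRow` is a hypothetical helper working on a flat `[Float]` table; returning nil for an out-of-range ID is an assumption, since the commit does not say how the real `embedTokens` reports it.

```swift
/// Looks up one embedding row in a flat [vocabSize * dim] table.
/// The vocabulary size comes from the table itself, so per-language
/// tables with different row counts are bounds-checked correctly.
func embeddingRow(for tokenID: Int, in table: [Float], dim: Int) -> ArraySlice<Float>? {
    guard dim > 0, table.count % dim == 0 else { return nil }
    let vocabSize = table.count / dim
    // Reject IDs outside the table rather than reading past its end.
    guard tokenID >= 0, tokenID < vocabSize else { return nil }
    let start = tokenID * dim
    return table[start..<(start + dim)]
}
```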

Every PocketTTS language now ships under `v2/<lang>/` on HuggingFace (including English). The old root-level English pack (flat `<voice>_audio_prompt.bin` voice files, `mimi_decoder_v2.mlmodelc`, optional `repoSubdirectory == nil` branch) is removed wholesale.

- `PocketTtsLanguage.repoSubdirectory` becomes non-optional `String` returning `v2/<rawValue>` for every case.
- `ModelNames.PocketTTS`: drop `mimiDecoderLegacy{,File}`, rename `mimiDecoderV2` → canonical `mimiDecoder`, replace `requiredModels(for:)` dispatch with a single `requiredModels` set.
- `PocketTtsResourceDownloader.ensureModels`: always download via `downloadSubdirectory("v2/<lang>")`. The `downloadRepo` English fast path is gone.
- `PocketTtsResourceDownloader.ensureVoice`: only fetches `<voice>.safetensors`. The `<voice>_audio_prompt.bin` download fallback is removed.
- `PocketTtsConstantsLoader.loadVoice`: only reads `.safetensors`.
- `PocketTtsModelStore.isMimiEncoderAvailable`: always walks two components up from the language root to reach the repo root (encoder is shared and lives at the repo top).

Voice cloning is unaffected: cloned voices still produce a runtime `audioPrompt` and use the `prefillKVCacheVoice` path. Only on-disk voice file format and download paths are simplified.

Tests updated: `PocketTtsLanguageTests` now asserts every language follows the uniform `v2/<rawValue>` layout. 16/16 PocketTts tests green.

devin-ai-integration

Devin Review found 2 new potential issues.

View 10 additional findings in Devin Review.

devin-ai-integration · 2026-04-25T22:40:49Z

🟡 Missing progressHandler call for zero-sized files in downloadSubdirectory

When a file has size == 0, the code at DownloadUtils.swift:640-642 creates the file and continues without calling progressHandler. The PR adds progress handler calls for the "file already exists" path (lines 627-632) and the "file downloaded" path (lines 673-677), but omits the zero-sized file path. This means the progress fraction won't advance for zero-sized files, and if the last file in the batch is zero-sized, the reported progress will never reach 1.0 for that file. The log message at line 679 is also skipped.

(Refers to lines 640-642)
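One way to guarantee that every per-file exit path reports progress is a `defer` at the top of the loop body, which also runs on `continue`. The sketch below is illustrative only: `SketchFile`, `reportProgressForEveryFile`, and the `(Double) -> Void` callback are hypothetical and do not reflect the actual `downloadSubdirectory` signature or its `.listing` / `.downloading` phases.

```swift
struct SketchFile {
    let path: String
    let size: Int
}

/// Walks the file list and reports a progress fraction once per file,
/// whether the file was skipped, created empty, or downloaded.
func reportProgressForEveryFile(
    _ files: [SketchFile],
    alreadyOnDisk: Set<String>,
    progress: (Double) -> Void
) {
    for (index, file) in files.enumerated() {
        // Runs on every exit from this iteration, including `continue`.
        defer { progress(Double(index + 1) / Double(files.count)) }

        if alreadyOnDisk.contains(file.path) { continue }  // already present
        if file.size == 0 { continue }                     // would create an empty file here
        // A real implementation would download the file here.
    }
}
```

Placing the callback in `defer` means a newly added early-exit branch cannot forget it, which is the failure mode this finding describes.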


devin-ai-integration · 2026-04-25T22:40:50Z

+            return UInt64(littleEndian: typed.pointee)
+        }
+        let headerStart = 8
+        let headerEnd = headerStart + Int(headerLen)


🟡 Int(headerLen) traps on corrupt safetensors file with large header length

At PocketTtsConstantsLoader.swift:331, Int(headerLen) converts a UInt64 read directly from the file's first 8 bytes. If the safetensors file is corrupt or malicious and contains a header length value greater than Int.max (~9.2 exabytes), this Int(_:) initializer will trap with a fatal error before the subsequent guard headerEnd <= data.count bounds check at line 332 can catch it. Using Int(exactly: headerLen) with a guard would safely reject such files.
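A guarded version of the header read, along the lines the finding suggests, might look like this. It is a sketch, not the PR's code: `parseSafetensorsHeader` and `SafetensorsHeaderError` are hypothetical names, and the length is assembled byte by byte instead of through a typed pointer.

```swift
import Foundation

enum SafetensorsHeaderError: Error {
    case truncated, headerTooLarge, invalidJSON
}

/// Parses the safetensors prefix: an 8-byte little-endian u64 header length,
/// followed by that many bytes of JSON.
func parseSafetensorsHeader(in data: Data) throws -> [String: Any] {
    let lengthFieldSize = 8
    guard data.count >= lengthFieldSize else { throw SafetensorsHeaderError.truncated }

    // Assemble the little-endian length without alignment assumptions.
    var headerLen: UInt64 = 0
    for (i, byte) in data.prefix(lengthFieldSize).enumerated() {
        headerLen |= UInt64(byte) << (8 * i)
    }

    // Int(exactly:) returns nil instead of trapping when the value exceeds Int.max;
    // comparing against the remaining byte count also avoids overflow in start + len.
    guard let len = Int(exactly: headerLen), len <= data.count - lengthFieldSize else {
        throw SafetensorsHeaderError.headerTooLarge
    }

    let start = data.startIndex + lengthFieldSize
    let json = Data(data[start..<(start + len)])
    guard let object = try? JSONSerialization.jsonObject(with: json),
          let header = object as? [String: Any] else {
        throw SafetensorsHeaderError.invalidJSON
    }
    return header
}
```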


This reverts commit 6256164.


Alex-Wengg added 3 commits April 25, 2026 17:31

devin-ai-integration Bot reviewed Apr 25, 2026


Revert "refactor(tts/pocket): drop v1 (legacy English root) HF layout"

e89cc51
