Skip to content

feat(tts): CosyVoice3 Mandarin zero-shot TTS port#536

Open
Alex-Wengg wants to merge 9 commits intomainfrom
tts/cosyvoice3-swift-port
Open

feat(tts): CosyVoice3 Mandarin zero-shot TTS port#536
Alex-Wengg wants to merge 9 commits intomainfrom
tts/cosyvoice3-swift-port

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

@Alex-Wengg Alex-Wengg commented Apr 21, 2026

Summary

Swift port of CosyVoice3 (Mandarin zero-shot TTS) wired through the
four validated CoreML mlpackages hosted at
FluidInference/CosyVoice3-0.5B-coreml.
Delivered in two layered phases matching the existing Kokoro manager shape:

  • Phase 1 (parity harness): full Swift pipeline that ingests a Python
    frontend fixture (.safetensors) and produces WAV within parity of the
    Python reference — validates all four CoreML bindings, 24-layer Qwen2
    KV-cache slicing, RAS sampler, and Flow / HiFT wiring.
  • Phase 2 (native frontend): pure-Swift Qwen2 BPE tokenizer + Qwen2
    text embeddings + minimal Mandarin text normalizer + 24 kHz log-mel
    DSP so callers can synthesize directly from String input without a
    Python dependency.

Conversion pipeline that produced the mlpackages lives at
FluidInference/mobius#42.

What's shipped

Public API (Sources/FluidAudio/TTS/CosyVoice3/)

public actor CosyVoice3TtsManager {
    public init(directory: URL? = nil, computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
    public static func downloadAndCreate(from repo: Repo = .cosyvoice3,
                                         computeUnits: MLComputeUnits = .cpuAndNeuralEngine)
                                         async throws -> CosyVoice3TtsManager
    public func initialize() async throws
    public func synthesize(text: String,
                           promptAssets: CosyVoice3PromptAssets,
                           options: CosyVoice3SynthesisOptions = .init(),
                           prenormalized: Bool = false) async throws -> CosyVoice3SynthesisResult
}

TtsBackend gains case cosyvoice3; ModelNames gets the
CosyVoice3 enum plus Repo.cosyvoice3 pointing at the HF repo.

Pipeline components

Layer File Notes
Model loader Assets/CosyVoice3ModelStore.swift Flat + nested layout probing, .mlmodelc compile cache
Downloader Assets/CosyVoice3ResourceDownloader.swift DownloadUtils wrapper for the 4 mlpackages + embeddings
Safetensors Shared/SafetensorsReader.swift ~170 LoC pure-Swift mmap + fp16/fp32/i32 accessors
Prefill/decode Pipeline/Synthesize/CosyVoice3Synthesizer.swift In-place [24,1,2,768,64] fp16 KV-cache passthrough
Sampler Pipeline/Synthesize/CosyVoice3RasSampler.swift top-p / top-k / repetition mask, seed-tokens bypass
Speech embed Pipeline/Synthesize/CosyVoice3SpeechEmbeddings.swift Lazy mmap of 6761×896 fp16 table (12 MB)
Frontend Pipeline/Preprocess/CosyVoice3TextFrontend.swift Special-token splitting + lm_input assembly
Tokenizer Pipeline/Preprocess/Qwen2BpeTokenizer.swift tiktoken-compatible byte-level BPE, 151 936 vocab
Text embed Pipeline/Preprocess/CosyVoice3TextEmbeddings.swift 151 936×896 fp16 mmap → row copy
TN Pipeline/Preprocess/CosyVoice3ChineseNormalizer.swift Minimal regex-free port of frontend_utils.py
Prompt mel Pipeline/Preprocess/CosyVoice3PromptMel.swift 24 kHz log-mel matching matcha audio.py

CLI (Sources/FluidAudioCLI/Commands/)

fluidaudio tts --backend cosyvoice3-parity --fixture … --models-dir … --output …
fluidaudio tts --backend cosyvoice3 --text "希望你以后能够做的比我还好用" \
               --prompt-assets … --models-dir … --output …
fluidaudio tts --backend cosyvoice3-tokenizer --fixture …     # BPE parity
fluidaudio tts --backend cosyvoice3-frontend --text …         # lm_input dump

Tests

  • CosyVoice3ChineseNormalizerTests — 8 cases covering contains_chinese,
    replace_blank, corner marks, brackets, digit spellout, trailing
    comma collapse, end-to-end, is_only_punctuation.
  • CosyVoice3PromptMelTests — 8 cases covering the matcha frame-count
    formula, zero-audio log floor clamp, 200 Hz sine peak in low mel bins,
    exact reflect-pad semantics, periodic Hann endpoints, mel-basis shape /
    non-zero integrals, token-ratio trimming (and the throws-if-too-short
    path).

Integration

  • ModelNames.swiftCosyVoice3 enum + Repo.cosyvoice3
  • TtsBackend.swiftcase cosyvoice3
  • TTSCommand.swift — subcommand wiring

Test plan

  • swift build (release)
  • Full swift test on this branch: 1 435 tests, 24 skipped, 0 failures (~13 min)
  • --filter CosyVoice3ChineseNormalizer — 8/8 pass
  • --filter CosyVoice3PromptMel — 8/8 pass
  • Phase 1 end-to-end parity vs build/wavs/e2e_shipping.wav (max|Δ| < 1e-3, SNR > 40 dB, CPU-only fp32 Flow)
  • Phase 2 end-to-end round-trip: Swift output → whisper.base → expected transcript

Non-goals / follow-ups

  • SpeechTokenizer and CAMPPlus remain Python-side for prompt asset
    preparation; both have CoreML mlpackages but the required DSPs aren't
    yet ported. Users pass pre-computed promptSpeechIds / spkEmbedding
    in CosyVoice3PromptAssets for now.
  • Full wetext.ZhNormalizer (year / currency / decimals / units) is not
    ported. Callers that need production-grade TN run wetext server-side
    and pass prenormalized: true.
  • Flow stays fp32 (1.2 GB) until CoreMLTools pins layer_norm fused fp16.

🤖 Generated with Claude Code


Open in Devin Review

Swift port of CosyVoice3 zero-shot Mandarin TTS targeting the four
validated CoreML mlpackages hosted at
FluidInference/CosyVoice3-0.5B-coreml. Mirrors the Kokoro manager API
shape (public actor, init, initialize, synthesize → Data).

Phase 1 — parity harness
- CosyVoice3ModelStore loads LLM-Prefill-T256-M768, LLM-Decode-M768,
  Flow-N250-fp32, HiFT-T500-fp16 from a local build dir or HF repo
- SafetensorsReader: pure-Swift mmap + typed accessors (fp16/fp32/i32)
- CosyVoice3RasSampler: top-p / top-k / repetition mask, with
  seedTokens() bypass for parity tests
- CosyVoice3Synthesizer: prefill → decode loop with in-place KV-cache
  passthrough [24,1,2,768,64] fp16 → Flow (N=250) → HiFT (T=500)
- Speech embedding lazy mmap (6761×896 fp16)
- Frontend fixture ingest for parity against Python reference WAV

Phase 2 — native Mandarin frontend
- Qwen2 byte-level BPE tokenizer (tiktoken-compatible), 151 936 vocab
- Qwen2 text embedding table lookup (151 936×896 fp16 mmap)
- CosyVoice3TextFrontend: special-token splitting, lm_input assembly
- CosyVoice3ChineseNormalizer: minimal regex-free TN port of
  frontend_utils.py (replace_blank, corner marks, brackets, digit
  spellout, trailing comma collapse). Callers can pass
  prenormalized: true to bypass.
- CosyVoice3PromptMel: 24 kHz log-mel matching matcha audio.py
  (n_fft=1920, hop=480, win=1920, num_mels=80, reflect-pad 720,
  center=False, Slaney norm, log floor 1e-5, magnitude eps 1e-9)

Public API
- CosyVoice3TtsManager: actor with init(directory:), initialize(),
  synthesize(text:promptAssets:options:prenormalized:), and
  downloadAndCreate(from repo:)
- CosyVoice3PromptAssets: prompt text + speech IDs + mel + speaker
  embedding bundle, loadable from safetensors

CLI (Sources/FluidAudioCLI/Commands/)
- cosyvoice3-parity: fixture → WAV, compares to reference
- cosyvoice3-text: text → audio via full frontend
- cosyvoice3-tokenizer: Qwen2 BPE parity harness
- cosyvoice3-frontend: dump assembled lm_input for debugging

Integration
- TtsBackend.swift: +case cosyvoice3
- ModelNames.swift: +CosyVoice3 enum + Repo.cosyvoice3

Tests (XCTest)
- CosyVoice3ChineseNormalizerTests (8 cases, end-to-end parity)
- CosyVoice3PromptMelTests (8 cases: frame count, zero clamp, sine
  argmax, reflect pad, Hann, mel basis, trim-to-token-ratio)

Full swift test: 1435 tests, 24 skipped, 0 failures.

Models on HF: https://huggingface.co/FluidInference/CosyVoice3-0.5B-coreml
Conversion pipeline: FluidInference/mobius PR #42

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric Value Description
WER (Avg) 7.03% Average Word Error Rate
WER (Med) 4.17% Median Word Error Rate
RTFx 6.07x Real-time factor (higher = faster)
Total Audio 470.6s Total audio duration processed
Total Time 79.1s Total processing time

Streaming Metrics

Metric Value Description
Avg Chunk Time 0.079s Average chunk processing time
Max Chunk Time 0.158s Maximum chunk processing time
EOU Detections 0 Total End-of-Utterance detections

Test runtime: 1m41s • 04/21/2026, 09:49 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric Value Target Status Description
DER 15.1% <30% Diarization Error Rate (lower is better)
JER 24.9% <25% Jaccard Error Rate
RTFx 30.60x >1.0x Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage Time (s) % Description
Model Download 8.752 25.5 Fetching diarization models
Model Compile 3.751 10.9 CoreML compilation
Audio Load 0.054 0.2 Loading audio file
Segmentation 10.287 30.0 Detecting speech regions
Embedding 17.145 50.0 Extracting speaker voices
Clustering 6.858 20.0 Grouping same speakers
Total 34.298 100 Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method DER Notes
FluidAudio 15.1% On-device CoreML
Research baseline 18-30% Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.3s diarization time • Test runtime: 2m 12s • 04/21/2026, 09:52 PM EST

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Performance Metrics

Metric CI Value Expected on Apple Silicon
Median RTFx 0.04x ~2.5x
Overall RTFx 0.04x ~2.5x

Runtime: 6m21s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 6 potential issues.

View 7 additional findings in Devin Review.

Open in Devin Review

/// pre-recorded Python token stream one id at a time. This is how the parity
/// harness bit-matches despite the `torch.multinomial` RNG mismatch between
/// PyTorch and Swift.
public final class CosyVoice3RasSampler: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3RasSampler with mutable state enables data races

CosyVoice3RasSampler has mutable fields (rng, seedQueue, seedIdx) that are modified during sample() and seedTokens(). Marking it @unchecked Sendable allows it to be shared across concurrency domains without synchronization, enabling data races on these fields. The repository rules in AGENTS.md, CLAUDE.md, and CONTRIBUTING.md explicitly state: "NEVER use @unchecked Sendable - implement proper thread safety with actors/MainActor". The rest of the codebase uses actors (e.g., ProgressEmitter, MLArrayCache) and has zero @unchecked Sendable usage. This class should be converted to an actor or wrapped in proper synchronization.

Prompt for agents
CosyVoice3RasSampler at Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3RasSampler.swift:19 is marked @unchecked Sendable but contains mutable state (rng: SeedableRng, seedQueue: [Int32], seedIdx: Int) that is mutated during sample() and seedTokens(). This violates the repository rule that forbids @unchecked Sendable. Options: (1) Convert to an actor. (2) Since it is only used locally within CosyVoice3Synthesizer.synthesize(), remove the Sendable conformance entirely — the sampler does not need to cross concurrency domains. (3) If Sendable is required, wrap mutable state behind a lock or use an actor.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// invariant before passing to Flow (matches the
/// `speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len`
/// clamp in the Python frontend).
public final class CosyVoice3PromptMel: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3PromptMel with mutable buffers — documented not thread-safe

CosyVoice3PromptMel contains mutable reusable buffers (frameBuf, realIn, imagIn, realOut, imagOut, magnitude, imagSq) that are modified during compute(). The class's own documentation at line 61 says "not thread-safe; wrap with a queue if shared", yet it is marked @unchecked Sendable, directly contradicting the stated thread-safety guarantee. This violates the mandatory repository rule: "NEVER use @unchecked Sendable".

Prompt for agents
CosyVoice3PromptMel at Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3PromptMel.swift:38 is marked @unchecked Sendable but has mutable instance buffers (frameBuf, realIn, imagIn, realOut, imagOut, magnitude, imagSq) modified during compute(). The doc comment itself says 'not thread-safe; wrap with a queue if shared'. Options: (1) Remove Sendable conformance — the mel extractor is used locally and does not need to cross actor boundaries. (2) Convert to an actor. (3) Allocate fresh buffers per compute() call instead of reusing instance vars, making the type truly immutable and safely Sendable.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

import Foundation

/// Four CoreML models for the CosyVoice3 inference pipeline.
public struct CosyVoice3Models: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3Models wrapping non-Sendable MLModel

CosyVoice3Models wraps four MLModel instances which are not Sendable in Swift. Marking the struct @unchecked Sendable bypasses the compiler's concurrency safety checks. This violates the mandatory repository rule in AGENTS.md/CLAUDE.md: "NEVER use @unchecked Sendable". The existing codebase has zero @unchecked Sendable usage.

Prompt for agents
CosyVoice3Models at CosyVoice3Models.swift:5 is a struct wrapping four MLModel instances (which are not Sendable) and is marked @unchecked Sendable. This violates the NEVER use @unchecked Sendable rule. Since the models are loaded and owned by the CosyVoice3ModelStore actor, the struct does not need to be Sendable — it can be kept internal to the actor's isolation domain. Alternatively, wrap the MLModel references in an actor that serializes prediction calls.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// Mirrors `verify/test_coreml_e2e_fp16.py::main()` in Python. Each stage is
/// implemented as a method on this type, keeping the state (KV cache, running
/// decoded list) local to a single synthesis call.
public final class CosyVoice3Synthesizer: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3Synthesizer violates mandatory repo rule

AGENTS.md and CLAUDE.md state: "NEVER use @unchecked Sendable". CosyVoice3Synthesizer is a mutable final class with let properties, but it is called from the CosyVoice3TtsManager actor. It should itself be an actor or be restructured to avoid @unchecked Sendable.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +55 to +59
if CosyVoice3Constants.stopRange.contains(topId) {
logger.info("First token \(topId) is a stop token; no speech generated")
} else {
decoded.append(topId)
}
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Decode loop not skipped when first prefill token is a stop token

When the first token sampled from prefill logits falls in stopRange (6561…6760), the code at line 70 logs "no speech generated" but does not break or return. Execution falls through to the decode loop (line 77), which feeds the stop-token embedding into the decode model and may accumulate non-stop tokens into decoded. This produces semantically incorrect audio: the LLM signaled EOS at step 0 but the pipeline continues generating. The fix is to either return early with an empty/error result, or guard the decode loop entry with a check like guard !CosyVoice3Constants.stopRange.contains(topId) else { throw ... }.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

// non-compact (e.g. [40960, 512, 1]) — use logical indexing.
let hiftFrames = CosyVoice3Constants.hiftMaxFrames
let melBins = CosyVoice3Constants.melBins
let validFrames = min(newMelFrames, hiftFrames)
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 HiFT mel slice can read out-of-bounds when newMelStart > 0

validFrames is capped at hiftFrames (500) but does not account for the newMelStart offset. The source array fullMel has shape [1, 80, 500], so valid third-axis indices are 0…499. The access at line 301 reads index newMelStart + f, where f goes up to validFrames - 1. When newMelStart > 0, newMelStart + validFrames can exceed 500, causing an out-of-bounds read. While the invariant newMelStart + newMelFrames <= 500 normally holds (since 2 * nTotal ≤ 500), newMelStart comes from the model's num_prompt_mel output which is not validated, so a slightly off model value triggers a crash.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset WER Avg WER Med RTFx Status
test-clean 0.57% 0.00% 6.11x
test-other 1.59% 0.00% 3.77x

Parakeet v2 (English-optimized)

Dataset WER Avg WER Med RTFx Status
test-clean 0.80% 0.00% 6.03x
test-other 1.22% 0.00% 3.90x

Streaming (v3)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.65x Streaming real-time factor
Avg Chunk Time 1.400s Average time to process each chunk
Max Chunk Time 1.570s Maximum chunk processing time
First Token 1.678s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming (v2)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.69x Streaming real-time factor
Avg Chunk Time 1.303s Average time to process each chunk
Max Chunk Time 1.460s Maximum chunk processing time
First Token 1.290s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m34s • 04/21/2026, 09:51 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Kokoro TTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (634.8 KB)

Runtime: 0m43s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

VAD Benchmark Results

Performance Comparison

Dataset Accuracy Precision Recall F1-Score RTFx Files
MUSAN 92.0% 86.2% 100.0% 92.6% 737.4x faster 50
VOiCES 92.0% 86.2% 100.0% 92.6% 710.2x faster 50

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric Value Target Status
DER 33.4% <35%
Miss Rate 24.4% - -
False Alarm 0.2% - -
Speaker Error 8.8% - -
RTFx 12.3x >1.0x
Speakers 4/4 - -

Sortformer High-Latency • ES2004a • Runtime: 2m 29s • 2026-04-22T01:53:17.718Z

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (176.3 KB)

Runtime: 0m27s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric Value Target Status Description
DER 10.4% <20% Diarization Error Rate (lower is better)
RTFx 9.51x >1.0x Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage Time (s) % Description
Model Download 14.553 13.2 Fetching diarization models
Model Compile 6.237 5.7 CoreML compilation
Audio Load 0.057 0.1 Loading audio file
Segmentation 31.376 28.4 VAD + speech detection
Embedding 109.948 99.7 Speaker embedding extraction
Clustering (VBx) 0.125 0.1 Hungarian algorithm + VBx clustering
Total 110.285 100 Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method DER Mode Description
FluidAudio (Offline) 10.4% VBx Batch On-device CoreML with optimal clustering
FluidAudio (Streaming) 17.7% Chunk-based First-occurrence speaker mapping
Research baseline 18-30% Various Standard dataset performance

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 141.4s processing • Test runtime: 2m 21s • 04/21/2026, 09:49 PM EST

Wraps CosyVoice3ResourceDownloader.{ensureCoreModels,ensureTextFrontendAssets,ensureVoice}
under --backend cosyvoice3-download. Pre-downloads all ~3.2 GB of HF assets
(4 mlmodelcs, speech+runtime embeddings, tokenizer, default voice bundle)
into ~/.cache/fluidaudio/Models/cosyvoice3/ so subsequent --backend
cosyvoice3-text runs are cache hits.

Verified fresh cold-start download from FluidInference/CosyVoice3-0.5B-coreml
in 194s: 17/17 model files + 4/4 tokenizer files + sidecar embeddings + voice
bundle all land correctly, peak mem 46 MB (streaming to disk).

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 5 new potential issues.

View 11 additional findings in Devin Review.

Open in Devin Review

/// Swift-side we mmap the exported safetensors and convert one row from fp16
/// to fp32 per decode step into a freshly allocated `[1, 1, 896]` fp32
/// MLMultiArray (the decode mlpackage declares fp32 at its I/O boundary).
public final class CosyVoice3SpeechEmbeddings: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3SpeechEmbeddings (rule violation)

Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3SpeechEmbeddings has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// The Phase 1 per-step decode embedding path still uses
/// `CosyVoice3SpeechEmbeddings` (fp16 table) to save memory during long
/// autoregressive loops; that code remains unchanged.
public final class CosyVoice3TextEmbeddings: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3TextEmbeddings (rule violation)

Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3TextEmbeddings has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// CAMPPlus speaker embedding and SpeechTokenizer prompt ids remain
/// Python-computed and shipped via `CosyVoice3PromptAssets` (see
/// `CosyVoice3TtsManager` Phase 2 API).
public final class CosyVoice3TextFrontend: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on CosyVoice3TextFrontend (rule violation)

Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3TextFrontend has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// Special tokens are passed in separately (from a JSON map exported alongside
/// the CosyVoice3 fixtures — the runtime add_special_tokens list in Python is
/// not encoded in the HF assets).
public final class Qwen2BpeTokenizer: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on Qwen2BpeTokenizer (rule violation)

Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." Qwen2BpeTokenizer has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

/// - raw tensor payload (referenced by offsets above)
///
/// Used for Phase 1 fixture + speech embedding table mmap.
public final class SafetensorsFile: @unchecked Sendable {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 @unchecked Sendable on SafetensorsFile (rule violation)

Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." SafetensorsFile has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Alex-Wengg and others added 5 commits April 21, 2026 21:39
Benchmarked Flow across all MLComputeUnits and found the prior
fp32/cpuOnly shipping config was both the slowest and heaviest:

  config                    p50      NaN
  fp32 cpuOnly           58,862 ms   0/3
  fp32 cpuAndGPU        113,564 ms   0/3   (prior default: 2× slower than cpuOnly)
  fp16 cpuOnly           16,203 ms   5/5   (LayerNorm overflow)
  fp16 cpuAndGPU         17,261 ms   0/10  (GPU uses fp32 accumulators)
  fp16 cpuAndNE/all      hang: MILCompilerForANE ANECCompile() FAILED

Ship Flow-N250-fp16 forced to .cpuAndGPU:
  - 3× faster end-to-end (full e2e: 39.8s vs ~125s before on a 4.6s utterance)
  - mlpackage shrinks 1.2 GB → 638 MB (disk + download cut ~600 MB)
  - Whisper ASR roundtrip on Swift output: 13/14 chars correct on
    "希望你以后能够做的比我还好用" (Python fp16 e2e was 14/14 in parallel validation)

ModelStore now ignores the user-supplied computeUnits for Flow and
always applies .cpuAndGPU (the only viable path — cpuOnly NaNs, ANE
refuses to compile).

Co-Authored-By: Claude <noreply@anthropic.com>
Swap the CosyVoice3 decode path to the new stateful mlpackage shipped
from the mobius repo. The 24-layer KV cache (48 per-layer buffers,
[1, 2, 768, 64] fp16 each) is now held inside a CoreML `MLState` and
mutated in place across decode steps via `withMultiArray(for:)`, so
the synthesizer no longer passes kv_k / kv_v MLMultiArrays in and out
every step.

- Package.swift: bump platforms to macOS 15 / iOS 18 (required for
  CoreML StateType).
- ModelNames.swift: rename llmDecode to `LLM-Decode-M768-fp16-stateful`.
- CosyVoice3Constants.swift: update filename + subdir, document the
  cpuAndGPU constraint (ANE refuses stateful graph compile, same as
  Flow).
- CosyVoice3ModelStore.swift: load decode with `.cpuAndGPU` explicitly.
- CosyVoice3Synthesizer.swift: seed state from prefill kv_k_out /
  kv_v_out (fp32 at CoreML I/O, cast to fp16 at the state boundary);
  reusable per-step `inputs_embeds` + `cur_len` MLMultiArrays; call
  `prediction(from:using:)` with the MLState.
- CosyVoice3SpeechEmbeddings.swift: add `copyEmbedding(tokenId:into:)`
  so the hot decode loop can reuse a single scratch MLMultiArray
  instead of allocating per step.

Parity: end-to-end WAV output identical in length (83520 samples) to
both the Python reference and the prior pass-through Swift output;
no regression in sample-level metrics.

Co-Authored-By: Claude <noreply@anthropic.com>
Wire PocketTTS up to the language packs Kyutai now publishes under
`languages/<id>/` (english, french_24l, german[_24l], italian[_24l],
portuguese[_24l], spanish[_24l]). English keeps the legacy root layout
for zero-breaking-change to existing users; new languages download only
the requested `languages/<id>/` subtree from the HF repo.

- PocketTtsLanguage enum + ModelNames.requiredModels(for:) thread the
  language root through the downloader, model store, and session.
- 6L vs 24L variants differ only in transformer layer count; layer keys
  are now discovered at runtime via PocketTtsLayerKeys instead of being
  hard-coded to 6.
- PocketTtsMimiSchema captures Mimi decoder I/O (legacy English uses
  mimi_decoder_v2.mlmodelc, other languages use mimi_decoder.mlmodelc).
- Constants loader scopes to a per-language `constants_bin/` so each
  pack carries its own tokenizer + text embed table + voice prompts.
- CLI: --language flag (validates against PocketTtsLanguage.allCases),
  default english.
- Cross-language voice cloning still works: mimi_encoder is shared,
  cloned embeddings can be paired with any language pack.

Out of scope for v1: runtime language switching on a live manager
(instantiate a new manager instead), French 6L (upstream only ships
24L), automatic language detection from text.
Capture the failed Flow ANE-port attempt in the Constants/ModelStore docstrings
so the rationale survives in-tree: the BC1S rewrite (Linear→Conv2d 1×1, axis-1
LayerNorm, manual SDPA, pre-baked rotary) compiled and ran ~3× faster but
collapsed mel dynamic range from [-12.5, +5.2] to [-10.1, -0.8] (MAE 2.58 vs
fp32; plan target was <1e-3), producing HiFT audio at ~40× lower peak amplitude.
Reverted to fp16 cpuAndGPU baseline.

Synthesizer + parity CLI now print STAGES (prefill/seed/decode/flow/hift) and
RTFX lines so CV3 perf can be tracked without re-instrumenting each run.

No behavior change to shipping pipeline.
Compare two audio files via DiarizerManager's 256-d speaker embedding
extractor + cosine similarity. Useful for sanity-checking voice cloning
output (does the synthesized voice match the reference?) and for
diarization debugging.

Usage:
  fluidaudio speaker-similarity <a.wav> <b.wav> [--threshold 0.65] [--json]
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 22 additional findings in Devin Review.

Open in Devin Review

Comment on lines +128 to +132
print(
String(
format:
"STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 print() used in production library code instead of AppLogger

CLAUDE.md states: "Use AppLogger(category:) from Shared/AppLogger.swift — not print() in production code." CosyVoice3Synthesizer.synthesize() uses print() to emit stage timings to stdout. The class already has a logger property (CosyVoice3Synthesizer.swift:16) — the print() should use logger.info(...) instead.

Suggested change
print(
String(
format:
"STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
logger.info(
String(
format:
"STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs",
prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec))
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant