feat(tts): CosyVoice3 Mandarin zero-shot TTS port#536
feat(tts): CosyVoice3 Mandarin zero-shot TTS port#536Alex-Wengg wants to merge 9 commits intomainfrom
Conversation
Swift port of CosyVoice3 zero-shot Mandarin TTS targeting the four validated CoreML mlpackages hosted at FluidInference/CosyVoice3-0.5B-coreml. Mirrors the Kokoro manager API shape (public actor, init, initialize, synthesize → Data). Phase 1 — parity harness - CosyVoice3ModelStore loads LLM-Prefill-T256-M768, LLM-Decode-M768, Flow-N250-fp32, HiFT-T500-fp16 from a local build dir or HF repo - SafetensorsReader: pure-Swift mmap + typed accessors (fp16/fp32/i32) - CosyVoice3RasSampler: top-p / top-k / repetition mask, with seedTokens() bypass for parity tests - CosyVoice3Synthesizer: prefill → decode loop with in-place KV-cache passthrough [24,1,2,768,64] fp16 → Flow (N=250) → HiFT (T=500) - Speech embedding lazy mmap (6761×896 fp16) - Frontend fixture ingest for parity against Python reference WAV Phase 2 — native Mandarin frontend - Qwen2 byte-level BPE tokenizer (tiktoken-compatible), 151 936 vocab - Qwen2 text embedding table lookup (151 936×896 fp16 mmap) - CosyVoice3TextFrontend: special-token splitting, lm_input assembly - CosyVoice3ChineseNormalizer: minimal regex-free TN port of frontend_utils.py (replace_blank, corner marks, brackets, digit spellout, trailing comma collapse). Callers can pass prenormalized: true to bypass. - CosyVoice3PromptMel: 24 kHz log-mel matching matcha audio.py (n_fft=1920, hop=480, win=1920, num_mels=80, reflect-pad 720, center=False, Slaney norm, log floor 1e-5, magnitude eps 1e-9) Public API - CosyVoice3TtsManager: actor with init(directory:), initialize(), synthesize(text:promptAssets:options:prenormalized:), and downloadAndCreate(from repo:) - CosyVoice3PromptAssets: prompt text + speech IDs + mel + speaker embedding bundle, loadable from safetensors CLI (Sources/FluidAudioCLI/Commands/) - cosyvoice3-parity: fixture → WAV, compares to reference - cosyvoice3-text: text → audio via full frontend - cosyvoice3-tokenizer: Qwen2 BPE parity harness - cosyvoice3-frontend: dump assembled lm_input for debugging Integration - TtsBackend.swift: +case cosyvoice3 - ModelNames.swift: +CosyVoice3 enum + Repo.cosyvoice3 Tests (XCTest) - CosyVoice3ChineseNormalizerTests (8 cases, end-to-end parity) - CosyVoice3PromptMelTests (8 cases: frame count, zero clamp, sine argmax, reflect pad, Hann, mel basis, trim-to-token-ratio) Full swift test: 1435 tests, 24 skipped, 0 failures. Models on HF: https://huggingface.co/FluidInference/CosyVoice3-0.5B-coreml Conversion pipeline: FluidInference/mobius PR #42 Co-Authored-By: Claude <noreply@anthropic.com>
Parakeet EOU Benchmark Results ✅Status: Benchmark passed Performance Metrics
Streaming Metrics
Test runtime: 1m41s • 04/21/2026, 09:49 PM EST RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O |
Speaker Diarization Benchmark ResultsSpeaker Diarization PerformanceEvaluating "who spoke when" detection accuracy
Diarization Pipeline Timing BreakdownTime spent in each stage of speaker diarization
Speaker Diarization Research ComparisonResearch baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.3s diarization time • Test runtime: 2m 12s • 04/21/2026, 09:52 PM EST |
Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 6m21s Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx. |
| /// pre-recorded Python token stream one id at a time. This is how the parity | ||
| /// harness bit-matches despite the `torch.multinomial` RNG mismatch between | ||
| /// PyTorch and Swift. | ||
| public final class CosyVoice3RasSampler: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3RasSampler with mutable state enables data races
CosyVoice3RasSampler has mutable fields (rng, seedQueue, seedIdx) that are modified during sample() and seedTokens(). Marking it @unchecked Sendable allows it to be shared across concurrency domains without synchronization, enabling data races on these fields. The repository rules in AGENTS.md, CLAUDE.md, and CONTRIBUTING.md explicitly state: "NEVER use @unchecked Sendable - implement proper thread safety with actors/MainActor". The rest of the codebase uses actors (e.g., ProgressEmitter, MLArrayCache) and has zero @unchecked Sendable usage. This class should be converted to an actor or wrapped in proper synchronization.
Prompt for agents
CosyVoice3RasSampler at Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Synthesize/CosyVoice3RasSampler.swift:19 is marked @unchecked Sendable but contains mutable state (rng: SeedableRng, seedQueue: [Int32], seedIdx: Int) that is mutated during sample() and seedTokens(). This violates the repository rule that forbids @unchecked Sendable. Options: (1) Convert to an actor. (2) Since it is only used locally within CosyVoice3Synthesizer.synthesize(), remove the Sendable conformance entirely — the sampler does not need to cross concurrency domains. (3) If Sendable is required, wrap mutable state behind a lock or use an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// invariant before passing to Flow (matches the | ||
| /// `speech_feat, speech_feat_len[:] = speech_feat[:, :2 * token_len], 2 * token_len` | ||
| /// clamp in the Python frontend). | ||
| public final class CosyVoice3PromptMel: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3PromptMel with mutable buffers — documented not thread-safe
CosyVoice3PromptMel contains mutable reusable buffers (frameBuf, realIn, imagIn, realOut, imagOut, magnitude, imagSq) that are modified during compute(). The class's own documentation at line 61 says "not thread-safe; wrap with a queue if shared", yet it is marked @unchecked Sendable, directly contradicting the stated thread-safety guarantee. This violates the mandatory repository rule: "NEVER use @unchecked Sendable".
Prompt for agents
CosyVoice3PromptMel at Sources/FluidAudio/TTS/CosyVoice3/Pipeline/Preprocess/CosyVoice3PromptMel.swift:38 is marked @unchecked Sendable but has mutable instance buffers (frameBuf, realIn, imagIn, realOut, imagOut, magnitude, imagSq) modified during compute(). The doc comment itself says 'not thread-safe; wrap with a queue if shared'. Options: (1) Remove Sendable conformance — the mel extractor is used locally and does not need to cross actor boundaries. (2) Convert to an actor. (3) Allocate fresh buffers per compute() call instead of reusing instance vars, making the type truly immutable and safely Sendable.
Was this helpful? React with 👍 or 👎 to provide feedback.
| import Foundation | ||
|
|
||
| /// Four CoreML models for the CosyVoice3 inference pipeline. | ||
| public struct CosyVoice3Models: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3Models wrapping non-Sendable MLModel
CosyVoice3Models wraps four MLModel instances which are not Sendable in Swift. Marking the struct @unchecked Sendable bypasses the compiler's concurrency safety checks. This violates the mandatory repository rule in AGENTS.md/CLAUDE.md: "NEVER use @unchecked Sendable". The existing codebase has zero @unchecked Sendable usage.
Prompt for agents
CosyVoice3Models at CosyVoice3Models.swift:5 is a struct wrapping four MLModel instances (which are not Sendable) and is marked @unchecked Sendable. This violates the NEVER use @unchecked Sendable rule. Since the models are loaded and owned by the CosyVoice3ModelStore actor, the struct does not need to be Sendable — it can be kept internal to the actor's isolation domain. Alternatively, wrap the MLModel references in an actor that serializes prediction calls.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// Mirrors `verify/test_coreml_e2e_fp16.py::main()` in Python. Each stage is | ||
| /// implemented as a method on this type, keeping the state (KV cache, running | ||
| /// decoded list) local to a single synthesis call. | ||
| public final class CosyVoice3Synthesizer: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3Synthesizer violates mandatory repo rule
AGENTS.md and CLAUDE.md state: "NEVER use @unchecked Sendable". CosyVoice3Synthesizer is a mutable final class with let properties, but it is called from the CosyVoice3TtsManager actor. It should itself be an actor or be restructured to avoid @unchecked Sendable.
Was this helpful? React with 👍 or 👎 to provide feedback.
| if CosyVoice3Constants.stopRange.contains(topId) { | ||
| logger.info("First token \(topId) is a stop token; no speech generated") | ||
| } else { | ||
| decoded.append(topId) | ||
| } |
There was a problem hiding this comment.
🔴 Decode loop not skipped when first prefill token is a stop token
When the first token sampled from prefill logits falls in stopRange (6561…6760), the code at line 70 logs "no speech generated" but does not break or return. Execution falls through to the decode loop (line 77), which feeds the stop-token embedding into the decode model and may accumulate non-stop tokens into decoded. This produces semantically incorrect audio: the LLM signaled EOS at step 0 but the pipeline continues generating. The fix is to either return early with an empty/error result, or guard the decode loop entry with a check like guard !CosyVoice3Constants.stopRange.contains(topId) else { throw ... }.
Was this helpful? React with 👍 or 👎 to provide feedback.
| // non-compact (e.g. [40960, 512, 1]) — use logical indexing. | ||
| let hiftFrames = CosyVoice3Constants.hiftMaxFrames | ||
| let melBins = CosyVoice3Constants.melBins | ||
| let validFrames = min(newMelFrames, hiftFrames) |
There was a problem hiding this comment.
🔴 HiFT mel slice can read out-of-bounds when newMelStart > 0
validFrames is capped at hiftFrames (500) but does not account for the newMelStart offset. The source array fullMel has shape [1, 80, 500], so valid third-axis indices are 0…499. The access at line 301 reads index newMelStart + f, where f goes up to validFrames - 1. When newMelStart > 0, newMelStart + validFrames can exceed 500, causing an out-of-bounds read. While the invariant newMelStart + newMelFrames <= 500 normally holds (since 2 * nTotal ≤ 500), newMelStart comes from the model's num_prompt_mel output which is not validated, so a slightly off model value triggers a crash.
Was this helpful? React with 👍 or 👎 to provide feedback.
ASR Benchmark Results ✅Status: All benchmarks passed Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming 25 files per dataset • Test runtime: 5m34s • 04/21/2026, 09:51 PM EST RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time Expected RTFx Performance on Physical M1 Hardware:• M1 Mac: ~28x (clean), ~25x (other) Testing methodology follows HuggingFace Open ASR Leaderboard |
Kokoro TTS Smoke Test ✅
Runtime: 0m43s Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon. |
VAD Benchmark ResultsPerformance Comparison
Dataset Details
✅: Average F1-Score above 70% |
Sortformer High-Latency Benchmark ResultsES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 2m 29s • 2026-04-22T01:53:17.718Z |
PocketTTS Smoke Test ✅
Runtime: 0m27s Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon. |
Offline VBx Pipeline ResultsSpeaker Diarization Performance (VBx Batch Mode)Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing BreakdownTime spent in each stage of batch diarization
Speaker Diarization Research ComparisonOffline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 141.4s processing • Test runtime: 2m 21s • 04/21/2026, 09:49 PM EST |
Wraps CosyVoice3ResourceDownloader.{ensureCoreModels,ensureTextFrontendAssets,ensureVoice}
under --backend cosyvoice3-download. Pre-downloads all ~3.2 GB of HF assets
(4 mlmodelcs, speech+runtime embeddings, tokenizer, default voice bundle)
into ~/.cache/fluidaudio/Models/cosyvoice3/ so subsequent --backend
cosyvoice3-text runs are cache hits.
Verified fresh cold-start download from FluidInference/CosyVoice3-0.5B-coreml
in 194s: 17/17 model files + 4/4 tokenizer files + sidecar embeddings + voice
bundle all land correctly, peak mem 46 MB (streaming to disk).
Co-Authored-By: Claude <noreply@anthropic.com>
| /// Swift-side we mmap the exported safetensors and convert one row from fp16 | ||
| /// to fp32 per decode step into a freshly allocated `[1, 1, 896]` fp32 | ||
| /// MLMultiArray (the decode mlpackage declares fp32 at its I/O boundary). | ||
| public final class CosyVoice3SpeechEmbeddings: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3SpeechEmbeddings (rule violation)
Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3SpeechEmbeddings has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// The Phase 1 per-step decode embedding path still uses | ||
| /// `CosyVoice3SpeechEmbeddings` (fp16 table) to save memory during long | ||
| /// autoregressive loops; that code remains unchanged. | ||
| public final class CosyVoice3TextEmbeddings: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3TextEmbeddings (rule violation)
Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3TextEmbeddings has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// CAMPPlus speaker embedding and SpeechTokenizer prompt ids remain | ||
| /// Python-computed and shipped via `CosyVoice3PromptAssets` (see | ||
| /// `CosyVoice3TtsManager` Phase 2 API). | ||
| public final class CosyVoice3TextFrontend: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on CosyVoice3TextFrontend (rule violation)
Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." CosyVoice3TextFrontend has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// Special tokens are passed in separately (from a JSON map exported alongside | ||
| /// the CosyVoice3 fixtures — the runtime add_special_tokens list in Python is | ||
| /// not encoded in the HF assets). | ||
| public final class Qwen2BpeTokenizer: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on Qwen2BpeTokenizer (rule violation)
Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." Qwen2BpeTokenizer has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
| /// - raw tensor payload (referenced by offsets above) | ||
| /// | ||
| /// Used for Phase 1 fixture + speech embedding table mmap. | ||
| public final class SafetensorsFile: @unchecked Sendable { |
There was a problem hiding this comment.
🔴 @unchecked Sendable on SafetensorsFile (rule violation)
Repository rules in AGENTS.md and CLAUDE.md explicitly state: "NEVER use @unchecked Sendable." SafetensorsFile has only let properties after init. The @unchecked Sendable bypass should be replaced with proper conformance or an actor.
Was this helpful? React with 👍 or 👎 to provide feedback.
Benchmarked Flow across all MLComputeUnits and found the prior
fp32/cpuOnly shipping config was both the slowest and heaviest:
config p50 NaN
fp32 cpuOnly 58,862 ms 0/3
fp32 cpuAndGPU 113,564 ms 0/3 (prior default: 2× slower than cpuOnly)
fp16 cpuOnly 16,203 ms 5/5 (LayerNorm overflow)
fp16 cpuAndGPU 17,261 ms 0/10 (GPU uses fp32 accumulators)
fp16 cpuAndNE/all hang: MILCompilerForANE ANECCompile() FAILED
Ship Flow-N250-fp16 forced to .cpuAndGPU:
- 3× faster end-to-end (full e2e: 39.8s vs ~125s before on a 4.6s utterance)
- mlpackage shrinks 1.2 GB → 638 MB (disk + download cut ~600 MB)
- Whisper ASR roundtrip on Swift output: 13/14 chars correct on
"希望你以后能够做的比我还好用" (Python fp16 e2e was 14/14 in parallel validation)
ModelStore now ignores the user-supplied computeUnits for Flow and
always applies .cpuAndGPU (the only viable path — cpuOnly NaNs, ANE
refuses to compile).
Co-Authored-By: Claude <noreply@anthropic.com>
Swap the CosyVoice3 decode path to the new stateful mlpackage shipped from the mobius repo. The 24-layer KV cache (48 per-layer buffers, [1, 2, 768, 64] fp16 each) is now held inside a CoreML `MLState` and mutated in place across decode steps via `withMultiArray(for:)`, so the synthesizer no longer passes kv_k / kv_v MLMultiArrays in and out every step. - Package.swift: bump platforms to macOS 15 / iOS 18 (required for CoreML StateType). - ModelNames.swift: rename llmDecode to `LLM-Decode-M768-fp16-stateful`. - CosyVoice3Constants.swift: update filename + subdir, document the cpuAndGPU constraint (ANE refuses stateful graph compile, same as Flow). - CosyVoice3ModelStore.swift: load decode with `.cpuAndGPU` explicitly. - CosyVoice3Synthesizer.swift: seed state from prefill kv_k_out / kv_v_out (fp32 at CoreML I/O, cast to fp16 at the state boundary); reusable per-step `inputs_embeds` + `cur_len` MLMultiArrays; call `prediction(from:using:)` with the MLState. - CosyVoice3SpeechEmbeddings.swift: add `copyEmbedding(tokenId:into:)` so the hot decode loop can reuse a single scratch MLMultiArray instead of allocating per step. Parity: end-to-end WAV output identical in length (83520 samples) to both the Python reference and the prior pass-through Swift output; no regression in sample-level metrics. Co-Authored-By: Claude <noreply@anthropic.com>
Wire PocketTTS up to the language packs Kyutai now publishes under `languages/<id>/` (english, french_24l, german[_24l], italian[_24l], portuguese[_24l], spanish[_24l]). English keeps the legacy root layout for zero-breaking-change to existing users; new languages download only the requested `languages/<id>/` subtree from the HF repo. - PocketTtsLanguage enum + ModelNames.requiredModels(for:) thread the language root through the downloader, model store, and session. - 6L vs 24L variants differ only in transformer layer count; layer keys are now discovered at runtime via PocketTtsLayerKeys instead of being hard-coded to 6. - PocketTtsMimiSchema captures Mimi decoder I/O (legacy English uses mimi_decoder_v2.mlmodelc, other languages use mimi_decoder.mlmodelc). - Constants loader scopes to a per-language `constants_bin/` so each pack carries its own tokenizer + text embed table + voice prompts. - CLI: --language flag (validates against PocketTtsLanguage.allCases), default english. - Cross-language voice cloning still works: mimi_encoder is shared, cloned embeddings can be paired with any language pack. Out of scope for v1: runtime language switching on a live manager (instantiate a new manager instead), French 6L (upstream only ships 24L), automatic language detection from text.
Capture the failed Flow ANE-port attempt in the Constants/ModelStore docstrings so the rationale survives in-tree: the BC1S rewrite (Linear→Conv2d 1×1, axis-1 LayerNorm, manual SDPA, pre-baked rotary) compiled and ran ~3× faster but collapsed mel dynamic range from [-12.5, +5.2] to [-10.1, -0.8] (MAE 2.58 vs fp32; plan target was <1e-3), producing HiFT audio at ~40× lower peak amplitude. Reverted to fp16 cpuAndGPU baseline. Synthesizer + parity CLI now print STAGES (prefill/seed/decode/flow/hift) and RTFX lines so CV3 perf can be tracked without re-instrumenting each run. No behavior change to shipping pipeline.
Compare two audio files via DiarizerManager's 256-d speaker embedding extractor + cosine similarity. Useful for sanity-checking voice cloning output (does the synthesized voice match the reference?) and for diarization debugging. Usage: fluidaudio speaker-similarity <a.wav> <b.wav> [--threshold 0.65] [--json]
| print( | ||
| String( | ||
| format: | ||
| "STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs", | ||
| prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec)) |
There was a problem hiding this comment.
🔴 print() used in production library code instead of AppLogger
CLAUDE.md states: "Use AppLogger(category:) from Shared/AppLogger.swift — not print() in production code." CosyVoice3Synthesizer.synthesize() uses print() to emit stage timings to stdout. The class already has a logger property (CosyVoice3Synthesizer.swift:16) — the print() should use logger.info(...) instead.
| print( | |
| String( | |
| format: | |
| "STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs", | |
| prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec)) | |
| logger.info( | |
| String( | |
| format: | |
| "STAGES prefill=%.3fs seed=%.3fs decode=%.3fs(%d steps, %.2f tok/s) flow=%.3fs hift=%.3fs", | |
| prefillSec, seedSec, decodeSec, decodeSteps, decodeTps, flowSec, hiftSec)) |
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Swift port of CosyVoice3 (Mandarin zero-shot TTS) wired through the
four validated CoreML mlpackages hosted at
FluidInference/CosyVoice3-0.5B-coreml.Delivered in two layered phases matching the existing Kokoro manager shape:
frontend fixture (
.safetensors) and produces WAV within parity of thePython reference — validates all four CoreML bindings, 24-layer Qwen2
KV-cache slicing, RAS sampler, and Flow / HiFT wiring.
text embeddings + minimal Mandarin text normalizer + 24 kHz log-mel
DSP so callers can synthesize directly from
Stringinput without aPython dependency.
Conversion pipeline that produced the mlpackages lives at
FluidInference/mobius#42.
What's shipped
Public API (
Sources/FluidAudio/TTS/CosyVoice3/)TtsBackendgainscase cosyvoice3;ModelNamesgets theCosyVoice3enum plusRepo.cosyvoice3pointing at the HF repo.Pipeline components
Assets/CosyVoice3ModelStore.swift.mlmodelccompile cacheAssets/CosyVoice3ResourceDownloader.swiftDownloadUtilswrapper for the 4 mlpackages + embeddingsShared/SafetensorsReader.swiftPipeline/Synthesize/CosyVoice3Synthesizer.swift[24,1,2,768,64]fp16 KV-cache passthroughPipeline/Synthesize/CosyVoice3RasSampler.swiftPipeline/Synthesize/CosyVoice3SpeechEmbeddings.swiftPipeline/Preprocess/CosyVoice3TextFrontend.swiftPipeline/Preprocess/Qwen2BpeTokenizer.swiftPipeline/Preprocess/CosyVoice3TextEmbeddings.swiftPipeline/Preprocess/CosyVoice3ChineseNormalizer.swiftfrontend_utils.pyPipeline/Preprocess/CosyVoice3PromptMel.swiftmatcha audio.pyCLI (
Sources/FluidAudioCLI/Commands/)Tests
CosyVoice3ChineseNormalizerTests— 8 cases coveringcontains_chinese,replace_blank, corner marks, brackets, digit spellout, trailingcomma collapse, end-to-end,
is_only_punctuation.CosyVoice3PromptMelTests— 8 cases covering the matcha frame-countformula, zero-audio log floor clamp, 200 Hz sine peak in low mel bins,
exact reflect-pad semantics, periodic Hann endpoints, mel-basis shape /
non-zero integrals, token-ratio trimming (and the throws-if-too-short
path).
Integration
ModelNames.swift—CosyVoice3enum +Repo.cosyvoice3TtsBackend.swift—case cosyvoice3TTSCommand.swift— subcommand wiringTest plan
swift build(release)swift teston this branch: 1 435 tests, 24 skipped, 0 failures (~13 min)--filter CosyVoice3ChineseNormalizer— 8/8 pass--filter CosyVoice3PromptMel— 8/8 passbuild/wavs/e2e_shipping.wav(max|Δ| < 1e-3, SNR > 40 dB, CPU-only fp32 Flow)Non-goals / follow-ups
preparation; both have CoreML mlpackages but the required DSPs aren't
yet ported. Users pass pre-computed
promptSpeechIds/spkEmbeddingin
CosyVoice3PromptAssetsfor now.wetext.ZhNormalizer(year / currency / decimals / units) is notported. Callers that need production-grade TN run wetext server-side
and pass
prenormalized: true.layer_normfused fp16.🤖 Generated with Claude Code