Fast speech recognition with NVIDIA's Parakeet models in pure C++.
Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU. FP16 support for ~2x memory reduction.
| Model | Class | Size | Type | Description |
|---|---|---|---|---|
| tdt-ctc-110m | `ParakeetTDTCTC` | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | `ParakeetTDT` | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | `ParakeetEOU` | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | `ParakeetNemotron` | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | `Sortformer` | 117M | Streaming | Speaker diarization (up to 4 speakers) |
| diarized | `DiarizedTranscriber` | 110M+117M | Offline | ASR + diarization → speaker-attributed words |
All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
```cpp
#include <parakeet/parakeet.hpp>

parakeet::Transcriber t("model.safetensors", "vocab.txt");
t.to_gpu();  // optional — Metal acceleration
t.to_half(); // optional — FP16 inference (~2x memory reduction)

auto result = t.transcribe("audio.wav");
std::cout << result.text << std::endl;
```

- Multiple decoders — CTC greedy, TDT greedy, CTC beam search, TDT beam search (switch at call site)
- Word timestamps — Per-word start/end times and confidence scores on all decoders
- Beam search + LM — CTC and TDT beam search with optional ARPA n-gram language model fusion
- Phrase boosting — Context biasing via token-level trie for domain-specific vocabulary
- Batch transcription — Multiple files in one batched encoder forward pass
- VAD preprocessing — Silero VAD strips silence before ASR; timestamps auto-remapped
- GPU acceleration — Metal via axiom's MPSGraph compiler (96x speedup on Apple Silicon)
- FP16 inference — Half-precision weights and compute (~2x memory reduction)
- Streaming — EOU and Nemotron models with chunked audio input
- Speaker diarization — Sortformer (up to 4 speakers), combinable with ASR for speaker-attributed words
- C API — Flat `extern "C"` FFI for Python, Swift, Go, Rust, and other languages
- Multi-format audio — WAV, FLAC, MP3, OGG with automatic resampling
See examples/ for code demonstrating each feature.
```sh
git clone --recursive https://github.com/frikallo/parakeet.cpp
cd parakeet.cpp
make build
make test
```

Requirements: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU.
```sh
# Download from HuggingFace
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .

# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```

The converter supports all model types: 110m-tdt-ctc (default), 600m-tdt, eou-120m, nemotron-600m, sortformer.

```sh
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt
```

Silero VAD weights:

```sh
python scripts/convert_silero_vad.py -o silero_vad_v5.safetensors
```

| Example | Description |
|---|---|
| basic | Simplest transcription (~20 lines) |
| timestamps | Word/token timestamps with confidence |
| beam-search | CTC/TDT beam search with optional ARPA LM |
| phrase-boost | Context biasing for domain vocabulary |
| batch | Batch transcription of multiple files |
| vad | Standalone VAD and ASR+VAD preprocessing |
| gpu | Metal GPU + FP16 with timing comparison |
| stream | EOU streaming transcription |
| nemotron | Nemotron streaming with latency modes |
| diarize | Sortformer speaker diarization |
| diarized-transcription | ASR + diarization combined |
| c-api | Pure C99 FFI usage |
| cli | Full CLI with all options |
After installing (`make install` or `cmake --install build`):

```cmake
find_package(Parakeet REQUIRED)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

Or as a vendored subdirectory:

```cmake
add_subdirectory(third_party/parakeet.cpp)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```

Or via pkg-config:

```sh
g++ -std=c++20 myapp.cpp $(pkg-config --cflags --libs parakeet) -o myapp
```

Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):
| Model | Class | Decoder | Use case |
|---|---|---|---|
| CTC | `ParakeetCTC` | Greedy argmax or beam search (+LM) | Fast, English-only |
| RNNT | `ParakeetRNNT` | Autoregressive LSTM | Streaming capable |
| TDT | `ParakeetTDT` | LSTM + duration prediction, greedy or beam search (+LM) | Better accuracy than RNNT |
| TDT-CTC | `ParakeetTDTCTC` | Both TDT and CTC heads | Switch decoder at inference |
Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:
| Model | Class | Decoder | Use case |
|---|---|---|---|
| EOU | `ParakeetEOU` | Streaming RNNT | End-of-utterance detection |
| Nemotron | `ParakeetNemotron` | Streaming TDT | Configurable latency streaming |
| Model | Class | Architecture | Use case |
|---|---|---|---|
| Sortformer | `Sortformer` | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |
Measured on an Apple M3 (16GB) with simulated audio input (`Tensor::randn`). Times are per encoder forward pass (for Sortformer, the full forward pass).
Encoder throughput — 10s audio:
| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|---|---|---|---|---|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |
110m GPU scaling across audio lengths:
| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|---|---|---|---|---|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.
```sh
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"
```

- Confidence scores — Per-token and per-word confidence from token log-probs
- Phrase boosting — Token-level trie context biasing during decode
- Beam search — CTC prefix beam search and TDT time-synchronous beam search
- N-gram LM fusion — ARPA language models scored at word boundaries
- Multi-format audio — WAV, FLAC, MP3, OGG via dr_libs + stb_vorbis
- Automatic resampling — Windowed sinc interpolation (Kaiser, 16-tap)
- Load from memory — `read_audio(bytes, len)`, float/int16 buffers
- Audio duration query — Header-only duration without full decode
- Progress callbacks — Stage reporting for long files
- Streaming from raw PCM — Direct microphone buffer feeding
- Diarized transcription — ASR + Sortformer → speaker-attributed words
- VAD — Silero VAD v5, standalone + ASR preprocessing
- Batch inference — Padded multi-file encoder forward pass
- Long-form chunking — Overlapping windows for audio >30s
- Neural LM rescoring — N-best reranking with Transformer LM
- C API — Flat C interface for FFI from any language
- FP16 inference — Half-precision weights and compute
- Model quantization — INT8/INT4 for mobile deployment
- Hotword detection — Trigger phrase detection
- Speaker embeddings — Speaker verification from Sortformer/TitaNet
- Audio: 16kHz mono (WAV, FLAC, MP3, OGG — auto-detected and resampled)
- Offline models have ~4-5 minute audio length limits; use streaming models for longer audio
- GPU acceleration requires Apple Silicon with Metal support
MIT