Parakeet.cpp
Ultra fast and portable Parakeet implementation for on-device inference in C++ using Axiom with MPS+Unified Memory
parakeet.cpp
Fast speech recognition with NVIDIA's Parakeet models in pure C++.
Built on axiom — a lightweight tensor library with automatic Metal GPU acceleration. No ONNX runtime, no Python runtime, no heavyweight dependencies. Just C++ and one tensor library that outruns PyTorch MPS.
~27ms encoder inference on Apple Silicon GPU for 10s audio (110M model) — 96x faster than CPU. FP16 support for ~2x memory reduction.
Supported Models
| Model | Class | Size | Type | Description |
|-------|-------|------|------|-------------|
| tdt-ctc-110m | ParakeetTDTCTC | 110M | Offline | English, dual CTC/TDT decoder heads |
| tdt-600m | ParakeetTDT | 600M | Offline | Multilingual, TDT decoder |
| eou-120m | ParakeetEOU | 120M | Streaming | English, RNNT with end-of-utterance detection |
| nemotron-600m | ParakeetNemotron | 600M | Streaming | Multilingual, configurable latency (80ms–1120ms) |
| sortformer | Sortformer | 117M | Streaming | Speaker diarization (up to 4 speakers) |
| diarized | DiarizedTranscriber | 110M+117M | Offline | ASR + diarization → speaker-attributed words |
All ASR models share the same audio pipeline: 16kHz mono WAV → 80-bin Mel spectrogram → FastConformer encoder.
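As a rough sizing aid, the sketch below estimates how many frames each stage of that pipeline produces. The 8x subsampling factor is taken from the Architecture section of this README. The 10 ms mel hop (160 samples at 16 kHz) and the centered-STFT frame count are assumptions based on common FastConformer front ends, and `estimate_frames` is a hypothetical helper, not part of the library.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only; not part of parakeet.cpp.
struct FrameEstimate {
    std::size_t mel_frames;      // frames produced by the mel spectrogram
    std::size_t encoder_frames;  // frames left after encoder subsampling
};

// ASSUMPTION: hop_samples = 160 (10 ms at 16 kHz). The README does not state the hop.
// subsampling = 8 matches the "Conv2d 8x subsampling" named under Architecture.
inline FrameEstimate estimate_frames(std::size_t num_samples,
                                     std::size_t hop_samples = 160,
                                     std::size_t subsampling = 8) {
    FrameEstimate e{};
    // ASSUMPTION: centered STFT convention, which yields one extra frame.
    e.mel_frames = num_samples / hop_samples + 1;
    // Ceiling division: a partial group of frames still yields one encoder frame.
    e.encoder_frames = (e.mel_frames + subsampling - 1) / subsampling;
    return e;
}
```

Under these assumptions, 10 s of 16 kHz audio (160,000 samples) gives 1,001 mel frames and 126 encoder frames, so each encoder frame covers roughly 80 ms of audio.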
Quick Start
```cpp
#include <iostream>
#include <parakeet/parakeet.hpp>

int main() {
    parakeet::Transcriber t("model.safetensors", "vocab.txt");
    t.to_gpu();   // optional: Metal acceleration
    t.to_half();  // optional: FP16 inference (~2x memory reduction)
    auto result = t.transcribe("audio.wav");
    std::cout << result.text << std::endl;
    return 0;
}
```
Features
- Multiple decoders — CTC greedy, TDT greedy, CTC beam search, TDT beam search (switch at call site)
- Word timestamps — Per-word start/end times and confidence scores on all decoders
- Beam search + LM — CTC and TDT beam search with optional ARPA n-gram language model fusion
- Phrase boosting — Context biasing via token-level trie for domain-specific vocabulary
- Batch transcription — Multiple files in one batched encoder forward pass
- VAD preprocessing — Silero VAD strips silence before ASR; timestamps auto-remapped
- GPU acceleration — Metal via axiom's MPSGraph compiler (96x speedup on Apple Silicon)
- FP16 inference — Half-precision weights and compute (~2x memory reduction)
- Streaming — EOU and Nemotron models with chunked audio input
- Speaker diarization — Sortformer (up to 4 speakers), combinable with ASR for speaker-attributed words
- C API — Flat `extern "C"` FFI for Python, Swift, Go, Rust, and other languages
- Multi-format audio — WAV, FLAC, MP3, OGG with automatic resampling
See examples/ for code demonstrating each feature.
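To make the simplest decoder in the list concrete, here is a minimal, generic sketch of CTC greedy decoding: take the argmax per frame, merge consecutive repeats, and drop blanks. It is the textbook algorithm, not the library's code. The function name, the `[frames][vocab]` layout, and passing `blank_id` as a parameter are all assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Generic CTC greedy decoding sketch (not parakeet.cpp's implementation).
// `logits` is laid out as [frames][vocab].
// ASSUMPTION: the caller supplies `blank_id`; its position in the vocabulary
// is model-specific and not stated in this README.
inline std::vector<int> ctc_greedy_decode(const std::vector<std::vector<float>>& logits,
                                          int blank_id) {
    std::vector<int> tokens;
    int prev = -1;  // argmax of the previous frame (-1 means "none yet")
    for (const auto& frame : logits) {
        if (frame.empty()) continue;
        std::size_t best = 0;
        for (std::size_t v = 1; v < frame.size(); ++v) {
            if (frame[v] > frame[best]) best = v;
        }
        const int id = static_cast<int>(best);
        // Emit only when the label changes and is not the blank symbol.
        if (id != prev && id != blank_id) tokens.push_back(id);
        prev = id;
    }
    return tokens;
}
```

Because a blank resets `prev`, the same token on both sides of a blank is emitted twice, which is how CTC represents genuinely repeated tokens.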
Build
```bash
git clone --recursive https://github.com/frikallo/parakeet.cpp
cd parakeet.cpp
make build
make test
```
Requirements: C++20 (Clang 14+ or GCC 12+), CMake 3.20+, macOS 13+ for Metal GPU.
Convert Weights
```bash
# Download from HuggingFace
huggingface-cli download nvidia/parakeet-tdt_ctc-110m --include "*.nemo" --local-dir .

# Convert to safetensors
pip install safetensors torch
python scripts/convert_nemo.py parakeet-tdt_ctc-110m.nemo -o model.safetensors
```
The converter supports all model types: 110m-tdt-ctc (default), 600m-tdt, eou-120m, nemotron-600m, sortformer.
```bash
python scripts/convert_nemo.py checkpoint.nemo -o model.safetensors --model 600m-tdt
```
Silero VAD weights:
```bash
python scripts/convert_silero_vad.py -o silero_vad_v5.safetensors
```
Examples
| Example | Description |
|---------|-------------|
| basic | Simplest transcription (~20 lines) |
| timestamps | Word/token timestamps with confidence |
| beam-search | CTC/TDT beam search with optional ARPA LM |
| phrase-boost | Context biasing for domain vocabulary |
| batch | Batch transcription of multiple files |
| vad | Standalone VAD and ASR+VAD preprocessing |
| gpu | Metal GPU + FP16 with timing comparison |
| stream | EOU streaming transcription |
| nemotron | Nemotron streaming with latency modes |
| diarize | Sortformer speaker diarization |
| diarized-transcription | ASR + diarization combined |
| c-api | Pure C99 FFI usage |
| cli | Full CLI with all options |
Using as a Library
CMake find_package
After installing (make install or cmake --install build):
```cmake
find_package(Parakeet REQUIRED)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```
CMake add_subdirectory
```cmake
add_subdirectory(third_party/parakeet.cpp)
target_link_libraries(myapp PRIVATE Parakeet::parakeet)
```
pkg-config
```bash
g++ -std=c++20 myapp.cpp $(pkg-config --cflags --libs parakeet) -o myapp
```
Architecture
Offline Models
Built on a shared FastConformer encoder (Conv2d 8x subsampling → N Conformer blocks with relative positional attention):
| Model | Class | Decoder | Use case |
|-------|-------|---------|----------|
| CTC | ParakeetCTC | Greedy argmax or beam search (+LM) | Fast, English-only |
| RNNT | ParakeetRNNT | Autoregressive LSTM | Streaming capable |
| TDT | ParakeetTDT | LSTM + duration prediction, greedy or beam search (+LM) | Better accuracy than RNNT |
| TDT-CTC | ParakeetTDTCTC | Both TDT and CTC heads | Switch decoder at inference |
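The "LSTM + duration prediction" entry for TDT can be illustrated with a generic greedy loop: at each step the joint network proposes a token and a duration, and the decoder jumps ahead by that many encoder frames instead of always advancing by one. This is a sketch of the general technique, not parakeet.cpp's decoder. `TdtStep`, `TdtJoint`, and `max_symbols_per_frame` are hypothetical names, and the callback stands in for the real predictor and joint networks.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// One greedy decision: the most likely token and how many frames to skip.
struct TdtStep {
    int token;
    int duration;
};

// HYPOTHETICAL callback standing in for the LSTM predictor + joint network:
// (encoder frame index, tokens emitted so far) -> greedy decision.
using TdtJoint = std::function<TdtStep(std::size_t, const std::vector<int>&)>;

inline std::vector<int> tdt_greedy_decode(std::size_t num_frames, int blank_id,
                                          const TdtJoint& joint,
                                          int max_symbols_per_frame = 10) {
    std::vector<int> tokens;
    std::size_t t = 0;
    int emitted_here = 0;  // tokens emitted since time last advanced
    while (t < num_frames) {
        const TdtStep step = joint(t, tokens);
        if (step.token != blank_id) {
            tokens.push_back(step.token);
            ++emitted_here;
        }
        std::size_t advance = step.duration > 0 ? static_cast<std::size_t>(step.duration) : 0;
        // Force progress when a blank predicts zero duration, or when too many
        // tokens pile up on one frame; otherwise the loop could never end.
        if (advance == 0 && (step.token == blank_id || emitted_here >= max_symbols_per_frame)) {
            advance = 1;
        }
        if (advance > 0) {
            t += advance;
            emitted_here = 0;
        }
    }
    return tokens;
}
```

A plain RNNT greedy loop is the special case where every blank advances by exactly one frame and every non-blank token has duration zero.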
Streaming Models
Built on a cache-aware streaming FastConformer encoder with causal convolutions and bounded-context attention:
| Model | Class | Decoder | Use case |
|-------|-------|---------|----------|
| EOU | ParakeetEOU | Streaming RNNT | End-of-utterance detection |
| Nemotron | ParakeetNemotron | Streaming TDT | Configurable latency streaming |
Diarization
| Model | Class | Architecture | Use case |
|-------|-------|-------------|----------|
| Sortformer | Sortformer | NEST encoder → Transformer → sigmoid | Speaker diarization (up to 4 speakers) |
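One plausible way to combine diarization output with ASR word timestamps is to give each word the speaker whose active segments overlap it the most. The README does not describe how `DiarizedTranscriber` does this, so the sketch below is an assumption about the general approach. The `Word` and `SpeakerSegment` types and the `assign_speaker` function are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// HYPOTHETICAL types for illustration; times are in seconds.
struct Word {
    double start;
    double end;
};
struct SpeakerSegment {
    int speaker;
    double start;
    double end;
};

// Returns the speaker index with the largest total overlap with `w`,
// or -1 when no segment overlaps the word at all.
inline int assign_speaker(const Word& w, const std::vector<SpeakerSegment>& segments,
                          int num_speakers = 4) {  // 4 matches Sortformer's stated limit
    std::vector<double> overlap(static_cast<std::size_t>(num_speakers), 0.0);
    for (const auto& s : segments) {
        if (s.speaker < 0 || s.speaker >= num_speakers) continue;
        const double o = std::min(w.end, s.end) - std::max(w.start, s.start);
        if (o > 0.0) overlap[static_cast<std::size_t>(s.speaker)] += o;
    }
    int best = -1;
    double best_overlap = 0.0;
    for (int k = 0; k < num_speakers; ++k) {
        if (overlap[static_cast<std::size_t>(k)] > best_overlap) {
            best_overlap = overlap[static_cast<std::size_t>(k)];
            best = k;
        }
    }
    return best;
}
```

Summing overlap per speaker, rather than taking the single best segment, keeps the result stable when a speaker's activity is split into several short segments around one word.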
Benchmarks
Measured on Apple M3 16GB with simulated audio input (Tensor::randn). Times are per-encoder-forward-pass (Sortformer: full forward pass).
Encoder throughput — 10s audio:
| Model | Params | CPU (ms) | GPU (ms) | GPU Speedup |
|-------|--------|----------|----------|-------------|
| 110m (TDT-CTC) | 110M | 2,581 | 27 | 96x |
| tdt-600m | 600M | 10,779 | 520 | 21x |
| rnnt-600m | 600M | 10,648 | 1,468 | 7x |
| sortformer | 117M | 3,195 | 479 | 7x |
110m GPU scaling across audio lengths:
| Audio | CPU (ms) | GPU (ms) | RTF | Throughput |
|-------|----------|----------|-----|------------|
| 1s | 262 | 24 | 0.024 | 41x |
| 5s | 1,222 | 26 | 0.005 | 190x |
| 10s | 2,581 | 27 | 0.003 | 370x |
| 30s | 10,061 | 32 | 0.001 | 935x |
| 60s | 26,559 | 72 | 0.001 | 833x |
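The RTF and Throughput columns are not defined in the text. The figures are consistent with the usual definitions, real-time factor = GPU time / audio duration and throughput = audio duration / GPU time, so the sketch below assumes those. The helper names are hypothetical.

```cpp
// ASSUMPTION: RTF = processing time / audio duration (lower is faster),
// and throughput is its reciprocal ("x times faster than real time").
inline double real_time_factor(double processing_ms, double audio_seconds) {
    return processing_ms / (audio_seconds * 1000.0);
}

inline double throughput_x(double processing_ms, double audio_seconds) {
    return (audio_seconds * 1000.0) / processing_ms;
}
```

For the 10 s row, 27 ms / 10,000 ms ≈ 0.0027 (shown as 0.003) and 10,000 ms / 27 ms ≈ 370x.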
GPU acceleration is powered by axiom's Metal graph compiler, which fuses the full encoder into optimized MPSGraph operations.
```bash
make bench ARGS="--110m=models/model.safetensors --tdt-600m=models/tdt.safetensors"
```
Roadmap
Tier 1 — High Impact
- [x] Confidence scores — Per-token and per-word confidence from token log-probs
- [x] Phrase boosting — Token-level trie context biasing during decode
- [x] Beam search — CTC prefix beam search and TDT time-synchronous beam search
- [x] N-gram LM fusion — ARPA language models scored at word boundaries
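Token-level trie biasing, as named in the phrase boosting item, can be sketched as follows: boosted phrases are stored as token-ID paths in a trie, and at each decode step the tokens that would extend a partially matched phrase get a score bonus. This is a generic illustration under assumed names (`PhraseTrie`, `advance`, `boost`) and an assumed additive bonus. It is not the library's implementation.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// HYPOTHETICAL trie over token IDs, for illustration only.
struct TrieNode {
    std::map<int, std::unique_ptr<TrieNode>> children;
};

class PhraseTrie {
public:
    // Store one boosted phrase as a path of token IDs.
    void insert(const std::vector<int>& token_ids) {
        TrieNode* node = &root_;
        for (int id : token_ids) {
            auto& child = node->children[id];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
    }

    const TrieNode* root() const { return &root_; }

    // Move to the next trie state after the decoder emits `token`.
    const TrieNode* advance(const TrieNode* node, int token) const {
        const TrieNode* next = child_of(node, token);
        if (!next) next = child_of(&root_, token);           // token may start a new phrase
        if (!next || next->children.empty()) return &root_;  // no match, or phrase completed
        return next;
    }

    // ASSUMPTION: biasing is an additive bonus on tokens that extend a phrase.
    void boost(const TrieNode* node, std::vector<float>& scores, float bonus) const {
        for (const auto& kv : node->children) {
            if (kv.first >= 0 && static_cast<std::size_t>(kv.first) < scores.size()) {
                scores[static_cast<std::size_t>(kv.first)] += bonus;
            }
        }
    }

private:
    static const TrieNode* child_of(const TrieNode* node, int token) {
        auto it = node->children.find(token);
        return it == node->children.end() ? nullptr : it->second.get();
    }
    TrieNode root_;
};
```

In a decode loop this would be used by calling `boost` on the per-step scores before picking a token and `advance` afterwards. A beam search would keep one trie state per hypothesis.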
Audio & I/O
- [x] Multi-format audio — WAV, FLAC, MP3, OGG via dr_libs + stb_vorbis
- [x] Automatic resampling — Windowed sinc interpolation (Kaiser, 16-tap)
- [x] Load from memory — `read_audio(bytes, len)`, float/int16 buffers
- [ ] Audio duration query — Header-only duration without full decode
- [ ] Progress callbacks — Stage reporting for long files
- [ ] Streaming from raw PCM — Direct microphone buffer feeding
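For readers unfamiliar with the resampling item, the following is a generic sketch of windowed sinc interpolation with a 16-tap Kaiser window. It is a textbook formulation, not the library's code. The function names, `beta = 8.0`, clamping at the signal edges, and normalizing the weights to unit sum are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Zeroth-order modified Bessel function I0(x), via its power series.
inline double bessel_i0(double x) {
    double sum = 1.0, term = 1.0;
    for (int k = 1; k < 64; ++k) {
        const double f = x / (2.0 * k);
        term *= f * f;
        sum += term;
        if (term < 1e-14 * sum) break;
    }
    return sum;
}

// Generic Kaiser-windowed sinc resampler (illustrative sketch).
// ASSUMPTIONS: beta = 8.0, edge samples are clamped, weights are normalized.
inline std::vector<float> resample_sinc(const std::vector<float>& in, int src_rate,
                                        int dst_rate, int taps = 16, double beta = 8.0) {
    if (in.empty() || src_rate <= 0 || dst_rate <= 0 || taps < 2) return {};
    const double kPi = 3.14159265358979323846;
    const int half = taps / 2;
    // When downsampling, lower the cutoff toward the new Nyquist to limit aliasing.
    // A production resampler would also widen the kernel; this sketch keeps 16 taps.
    const double cutoff = std::min(1.0, static_cast<double>(dst_rate) / src_rate);
    const std::size_t out_len = static_cast<std::size_t>(
        static_cast<unsigned long long>(in.size()) * dst_rate / src_rate);
    const long last = static_cast<long>(in.size()) - 1;
    const double i0_beta = bessel_i0(beta);
    std::vector<float> out(out_len);
    for (std::size_t n = 0; n < out_len; ++n) {
        const double pos = static_cast<double>(n) * src_rate / dst_rate;  // input position
        const long center = static_cast<long>(std::floor(pos));
        double acc = 0.0, wsum = 0.0;
        for (long k = center - half + 1; k <= center + half; ++k) {
            const double d = pos - static_cast<double>(k);  // distance in input samples
            const double r = d / half;                      // normalized to [-1, 1]
            const double win =
                bessel_i0(beta * std::sqrt(std::max(0.0, 1.0 - r * r))) / i0_beta;
            const double x = kPi * cutoff * d;
            const double sinc = std::abs(x) < 1e-12 ? 1.0 : std::sin(x) / x;
            const double w = cutoff * sinc * win;
            acc += w * in[static_cast<std::size_t>(std::clamp(k, 0L, last))];
            wsum += w;
        }
        out[n] = static_cast<float>(acc / wsum);  // unit DC gain
    }
    return out;
}
```

Calling `resample_sinc(samples, 48000, 16000)` would bring 48 kHz audio down to the 16 kHz rate the models expect.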
Tier 2 — Production Readiness
- [x] Diarized transcription — ASR + Sortformer → speaker-attributed words
- [x] VAD — Silero VAD v5, standalone + ASR preprocessing
- [x] Batch inference — Padded multi-file encoder forward pass
- [ ] Long-form chunking — Overlapping windows for audio >30s
- [ ] Neural LM rescoring — N-best reranking with Transformer LM
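When VAD strips silence before ASR, timestamps produced on the shortened audio have to be mapped back to the original timeline. A minimal sketch of that remapping follows, assuming the kept speech segments are concatenated in order. `SpeechSegment` and `remap_to_original` are hypothetical names, and the library's actual remapping logic is not described in this README.

```cpp
#include <vector>

// HYPOTHETICAL type: a speech region kept by VAD, in seconds of original audio.
struct SpeechSegment {
    double start;
    double end;
};

// Map a time measured in the silence-stripped audio back to the original
// timeline. ASSUMPTION: segments are sorted, non-overlapping, and were
// concatenated back to back to form the stripped audio.
inline double remap_to_original(double t_stripped,
                                const std::vector<SpeechSegment>& segments) {
    double consumed = 0.0;  // stripped-audio time covered by earlier segments
    for (const auto& seg : segments) {
        const double len = seg.end - seg.start;
        if (t_stripped <= consumed + len) {
            return seg.start + (t_stripped - consumed);
        }
        consumed += len;
    }
    // Past the end of the stripped audio: clamp to the end of the last segment.
    return segments.empty() ? t_stripped : segments.back().end;
}
```

With segments [1.0, 2.0] and [5.0, 7.0], a word at 1.5 s in the stripped audio falls 0.5 s into the second segment and maps back to 5.5 s.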
Tier 3 — Ecosystem
- [x] C API — Flat C interface for FFI from any language
- [x] FP16 inference — Half-precision weights and compute
- [ ] Model quantization — INT8/INT4 for mobile deployment
- [ ] Hotword detection — Trigger phrase detection
- [ ] Speaker embeddings — Speaker verification from Sortformer/TitaNet
Notes
- Audio: 16kHz mono (WAV, FLAC, MP3, OGG — auto-detected and resampled)
- Offline models have ~4-5 minute audio length limits; use streaming models for longer audio
- GPU acceleration requires Apple Silicon with Metal support
License
MIT
