Kokoros
🔥🔥 Kokoro in Rust. https://huggingface.co/hexgrad/Kokoro-82M Insanely fast, real-time TTS with high-quality output.
Zonos Rust Is On The Way?
Spark-TTS On The Way?
Orpheus-TTS On The Way?
ASMR
https://github.com/user-attachments/assets/1043dfd3-969f-4e10-8b56-daf8285e7420
(typo in video, ignore it)
Digital Human
https://github.com/user-attachments/assets/9f5e8fe9-d352-47a9-b4a1-418ec1769567
<p align="center"> <b>Give a star ⭐ if you like it!</b> </p>
Kokoro is a top-2 trending TTS model on Hugging Face.
This repo provides insanely fast Kokoro inference in Rust. You can build your own TTS engine powered by Kokoro and run inference with a single koko command.
kokoros is a Rust crate that provides easy-to-use TTS capabilities.
You can call koko directly in the terminal to synthesize audio.
kokoros uses a relatively small model (Kokoro-82M, roughly 82M parameters), yet it produces extremely high-quality voices.
Language support:
- [x] English;
- [x] Chinese (partly);
- [x] Japanese (partly);
- [x] German (partly);
🔥🔥🔥🔥🔥🔥🔥🔥🔥 The Rust version of Kokoros is getting a lot of attention. If you are also interested in insanely fast inference, embedded builds, WASM support, etc., please star this repo! We keep updating it.
New Discord community: https://discord.gg/E566zfDWqD. Please join us if you are interested in Rust Kokoro.
Updates
- 2025.07.12: 🔥🔥🔥 HTTP API streaming and parallel processing infrastructure. The OpenAI-compatible server supports streaming audio generation with "stream": true, achieving 1-2s time-to-first-audio; work-in-progress parallel TTS processing with --instances flag support; an improved logging system with Unix timestamps; and natural-sounding voice generation through advanced chunking;
- 2025.01.22: 🔥🔥🔥 CLI streaming mode supported. You can now use --stream to have fun with stream mode, kudos to mroigo;
- 2025.01.17: 🔥🔥🔥 Style mixing supported! Now, listen to the ASMR effect by simply specifying the style: af_sky.4+af_nicole.5;
- 2025.01.15: OpenAI-compatible server supported; the OpenAI format is still being polished!
- 2025.01.15: Phonemizer supported! Now Kokoros can run inference end to end without any other dependencies! Kudos to @tstm;
- 2025.01.13: Espeak-ng tokenizer and phonemizer supported! Kudos to @mindreframer;
- 2025.01.12: Released Kokoros;
Prerequisites
To build this project locally, you need the following system dependencies:
macOS
brew install pkg-config opus
Linux (Ubuntu/Debian)
sudo apt-get install pkg-config libopus-dev
Installation
- Download the required model and voice data files:
bash download_all.sh
This will download:
- The Kokoro ONNX model (checkpoints/kokoro-v1.0.onnx)
- The voices data file (data/voices-v1.0.bin)
Alternatively, you can download them separately:
bash scripts/download_models.sh
bash scripts/download_voices.sh
- Build the project:
cargo build --release
- (Optional) Install Python dependencies for OpenAI client examples:
pip install -r scripts/requirements.txt
- (Optional) Install the binary and voice data system-wide:
bash install.sh
This will copy the koko binary to /usr/local/bin (making it available system-wide as koko) and copy the voice data to $HOME/.cache/kokoros/.
Usage
View available options
./target/release/koko -h
Generate speech for some text
mkdir -p tmp
./target/release/koko text "Hello, this is a TTS test"
The generated audio will be saved to tmp/output.wav by default. You can customize the save location with the --output or -o option:
./target/release/koko text "I hope you're having a great day today!" --output greeting.wav
Generate speech for each line in a file
./target/release/koko file poem.txt
For a file with 3 lines of text, the default output files are tmp/output_0.wav, tmp/output_1.wav, and tmp/output_2.wav. You can customize the save location with the --output or -o option, using {line} as a placeholder for the line number:
./target/release/koko file lyrics.txt -o "song/lyric_{line}.wav"
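The {line} placeholder can be pictured as a plain string substitution. The Python sketch below is illustrative only and is not the crate's actual implementation: the function names expand_output_path and output_paths are hypothetical, and zero-based numbering is assumed from the default tmp/output_0.wav naming described above.

```python
def expand_output_path(template: str, line_index: int) -> str:
    """Replace every {line} placeholder with the given line index."""
    return template.replace("{line}", str(line_index))


def output_paths(template: str, line_count: int) -> list[str]:
    """Return one output path per input line, numbered from 0 (assumed)."""
    return [expand_output_path(template, i) for i in range(line_count)]


if __name__ == "__main__":
    # Three input lines produce three numbered output files.
    for path in output_paths("song/lyric_{line}.wav", 3):
        print(path)
```

A template without {line} would map every input line to the same path under this sketch, so it is worth including the placeholder whenever the input file has more than one line.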
Word-level timestamps (TSV sidecar)
Add --timestamps to produce a .tsv file with per-word timings alongside the WAV output. The TSV contains three columns: word, start_sec, end_sec.
Text mode example:
./target/release/koko text \
--output tmp/output.wav \
--timestamps \
"Hello from the timestamped model"
This creates:
tmp/output.wav
tmp/output.tsv
File mode example (one pair per line):
./target/release/koko file input.txt \
--output tmp/line_{line}.wav \
--timestamps
For each line N, this creates tmp/line_N.wav and tmp/line_N.tsv.
Notes:
- The sidecar path is derived automatically by replacing the .wav extension with .tsv.
- Sample rate is 24 kHz by default; times are in seconds with 3 decimal places.
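To consume the sidecar from another tool, a reader along the following lines may be enough. This is a minimal Python sketch under stated assumptions: the three columns are tab-separated in the order word, start_sec, end_sec as described above; whether the file starts with a header row is not confirmed here, so rows whose times are not numeric are simply skipped. WordTiming, sidecar_path, and parse_timestamps are hypothetical names, and the sample values are invented for illustration.

```python
from pathlib import Path
from typing import NamedTuple


class WordTiming(NamedTuple):
    word: str
    start_sec: float
    end_sec: float


def sidecar_path(wav_path: str) -> Path:
    """Derive the .tsv sidecar path by swapping the .wav extension."""
    return Path(wav_path).with_suffix(".tsv")


def parse_timestamps(tsv_text: str) -> list[WordTiming]:
    """Parse three-column TSV text into WordTiming rows.

    Lines that do not have exactly three fields, or whose time fields
    are not numeric (for example a header row), are skipped.
    """
    rows = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue
        word, start, end = fields
        try:
            rows.append(WordTiming(word, float(start), float(end)))
        except ValueError:
            continue
    return rows


if __name__ == "__main__":
    # Invented sample data, not real koko output.
    sample = "word\tstart_sec\tend_sec\nHello\t0.000\t0.350\nworld\t0.350\t0.800\n"
    for row in parse_timestamps(sample):
        print(f"{row.word}: {row.start_sec:.3f} to {row.end_sec:.3f}")
```

In practice the text would come from reading the file at sidecar_path("tmp/output.wav") after running koko with --timestamps.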
Quick start with the Hugging Face timestamped model (copy-paste)
Copy and paste the following to run an end-to-end example using the timestamped Kokoro ONNX model hosted on Hugging Face. This will download the model and voice data to the expected paths and generate both output.wav and output.tsv.
mkdir -p checkpoints data tmp
# 1) Download the timestamped ONNX model from Hugging Face
curl -L \
"https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX-timestamped/resolve/main/onnx/model.onnx" \
-o checkpoints/kokoro-v1.0.onnx
# 2) Download voices data (single binary used by existing models)
curl -L \
"https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin" \
-o data/voices-v1.0.bin
# 3) Build the binary
cargo build --release
# 4) Run: generates tmp/output.wav and tmp/output.tsv
./target/release/koko text \
--output tmp/output.wav \
--timestamps \
"Hello from the timestamped model"
Notes:
- We keep using the unified voices-v1.0.bin, which is compatible with the timestamped model.
- If the files already exist in checkpoints/ and data/, the CLI will use them directly.
Parallel Processing Configuration
Configure parallel TTS instances for the OpenAI-compatible server based on your performance preference:
# Best 0.5-2 seconds time-to-first-audio (lowest latency)
./target/release/koko openai --instances 1
# Balanced performance (default, 2 instances, usually best throughput for CPU processing)
./target/release/koko openai
# More instances for batch workloads (diminishing returns on CPU observed on a Mac M2; see the benchmark below)
./target/release/koko openai --instances 4
How do you determine the optimal number of instances for your system? Choose a configuration based on your use case:
- Single instance for real-time applications requiring immediate audio response irrespective of system configuration.
- Multiple instances for batch processing where total completion time matters more than initial latency.
- This was benchmarked on a Mac M2 with 8 cores and 24GB RAM.
- Tested with the message:
Welcome to our comprehensive technology demonstration session. Today we will explore advanced parallel processing systems thoroughly. These systems utilize multiple computational instances simultaneously for efficiency. Each instance processes different segments concurrently without interference. The coordination between instances ensures seamless output delivery consistently. Modern algorithms optimize resource utilization effectively across all components. Performance improvements are measurable and significant in real scenarios. Quality assurance validates each processing stage thoroughly before deployment. Integration testing confirms system reliability consistently under various conditions. User experience remains smooth throughout operation regardless of complexity. Advanced monitoring tracks system performance metrics continuously during execution.
- Benchmark results (avg of 5):

| No. of instances | TTFA | Total time |
|------------------|------|------------|
| 1 | 1.44s | 19.0s |
| 2 | 2.44s | 16.1s |
| 4 | 4.98s | 16.6s |

- If you run on CPU, memory bandwidth is usually the bottleneck. You will have to experiment to find the number of instances that gives the best throughput on your system.
- If you have an NVIDIA GPU, you can try increasing the number of instances; this is expected to improve throughput further.
- Attempts to make this work on CoreML would likely start with converting the ONNX model to CoreML or ORT.
Note: The --instances flag is currently supported only in API server mode. CLI text commands will support parallel processing in a future release.
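The trade-off in the benchmark can be made concrete with a small selection helper. The Python sketch below only encodes the three Mac M2 rows from the table above; pick_instances is a hypothetical helper, not part of koko, and the numbers will differ on other hardware.

```python
# Benchmark rows from the table above: instances -> (TTFA seconds, total seconds).
BENCHMARK = {
    1: (1.44, 19.0),
    2: (2.44, 16.1),
    4: (4.98, 16.6),
}


def pick_instances(prefer: str) -> int:
    """Return the instance count with the lowest TTFA or lowest total time."""
    if prefer == "latency":
        column = 0  # time-to-first-audio
    elif prefer == "throughput":
        column = 1  # total processing time
    else:
        raise ValueError("prefer must be 'latency' or 'throughput'")
    return min(BENCHMARK, key=lambda n: BENCHMARK[n][column])


if __name__ == "__main__":
    for goal in ("latency", "throughput"):
        n = pick_instances(goal)
        print(f"{goal}: ./target/release/koko openai --instances {n}")
```

On these measurements, one instance minimizes time-to-first-audio and two instances minimize total time; four instances were slightly slower overall than two. Re-measuring on your own machine and replacing the BENCHMARK entries is the intended use.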
OpenAI-Compatible Server
- Start the server:
./target/release/koko openai
- Make API requests using either curl or Python:
Using curl:
# Standard audio generation
curl -X POST http://localhost:3000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, this is a test of the Kokoro TTS system!",
"voice": "af_sky"
}' \
--output sky-says-hello.wav
# Streaming audio generation (PCM format only)
curl -X POST http://localhost:3000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "This is a streaming test with real-time audio gener
