WhisperLiveKit
Simultaneous speech-to-text models
Powered by Leading Research:
- Simul-Whisper/Streaming (SOTA 2025) - Ultra-low latency transcription using AlignAtt policy.
- NLLW (2025), based on distilled NLLB (2022, 2024) - Simultaneous translation from & to 200 languages.
- WhisperStreaming (SOTA 2023) - Low latency transcription using LocalAgreement policy.
- Streaming Sortformer (SOTA 2025) - Advanced real-time speaker diarization
- Diart (SOTA 2021) - Real-time speaker diarization
- Voxtral Mini (2025) - 4B-parameter multilingual speech model by Mistral AI
- Silero VAD (2024) - Enterprise-grade Voice Activity Detection
Why not just run a simple Whisper model on every audio batch? Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
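To make the idea concrete, here is a minimal sketch of the LocalAgreement policy mentioned above: the growing audio buffer is decoded repeatedly, and only the prefix on which two consecutive hypotheses agree is committed as final output. This is an illustration of the policy, not WhisperLiveKit's actual implementation.

```python
# Sketch of the LocalAgreement idea: commit only the tokens that two
# consecutive decoding passes agree on. Illustrative, not the real code.

def agreed_prefix(prev_hypothesis: list[str], new_hypothesis: list[str]) -> list[str]:
    """Return the longest common prefix of two token sequences."""
    prefix = []
    for a, b in zip(prev_hypothesis, new_hypothesis):
        if a != b:
            break
        prefix.append(a)
    return prefix

# Two decoding passes over a growing audio buffer:
h1 = "the quick brown fax".split()   # early pass mishears the last word
h2 = "the quick brown fox jumps".split()
committed = agreed_prefix(h1, h2)    # only stable tokens are emitted
print(" ".join(committed))           # -> the quick brown
```

Unstable trailing tokens stay in the buffer and are re-decoded with more context on the next pass, which is why streaming output lags slightly behind the audio.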
Architecture
<img alt="Architecture" src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png" />

The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.
Installation & Quick Start
```bash
pip install whisperlivekit
```
Quick Start
```bash
# Start the server — open http://localhost:8000 and start talking
wlk --model base --language en

# Auto-pull model and start server
wlk run whisper:tiny

# Transcribe a file (no server needed)
wlk transcribe meeting.wav

# Generate subtitles
wlk transcribe --format srt podcast.mp3 -o podcast.srt

# Manage models
wlk models          # See what's installed
wlk pull large-v3   # Download a model
wlk rm large-v3     # Delete a model

# Benchmark speed and accuracy
wlk bench
```
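The SRT subtitles produced by `wlk transcribe --format srt` follow the standard SubRip layout: a counter, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, then the text. The sketch below shows how timed segments map to that format; the `(start, end, text)` tuples are a hypothetical stand-in, not WhisperLiveKit's internal data structure.

```python
# Minimal sketch of SRT formatting for timed transcription segments.
# The segment representation here is illustrative only.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello world."), (2.5, 5.0, "Second line.")]))
```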
API Compatibility
WhisperLiveKit exposes multiple APIs so you can use it as a drop-in replacement:
```bash
# OpenAI-compatible REST API
curl http://localhost:8000/v1/audio/transcriptions -F file=@audio.wav
```

```python
# Works with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
```

```
# Deepgram-compatible WebSocket (use any Deepgram SDK)
# Just point your Deepgram client at localhost:8000

# Native WebSocket for real-time streaming
ws://localhost:8000/asr
```
See docs/API.md for the complete API reference.
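For the native WebSocket endpoint, a client simply streams audio bytes and reads back JSON results. The sketch below assumes 16-bit little-endian PCM input and uses the third-party `websockets` package; the exact audio encoding the server expects may differ, so treat this as an illustration and check docs/API.md.

```python
# Sketch of a native-WebSocket client. The PCM encoding and the use of the
# `websockets` package are illustrative assumptions, not the documented format.
import asyncio
import json
import struct

def float_to_pcm16(samples: list[float]) -> bytes:
    """Pack floats in [-1.0, 1.0] as 16-bit little-endian PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))

async def stream(samples: list[float], chunk: int = 1600) -> None:
    import websockets  # third-party: pip install websockets
    async with websockets.connect("ws://localhost:8000/asr") as ws:
        # Send 0.1 s chunks (1600 samples at 16 kHz), then read results.
        for i in range(0, len(samples), chunk):
            await ws.send(float_to_pcm16(samples[i:i + chunk]))
        async for message in ws:
            response = json.loads(message)
            print(response)
            if response.get("type") == "ready_to_stop":
                break

# Against a running server, e.g.:
#   asyncio.run(stream([0.0] * 16000))  # one second of silence at 16 kHz
```

The `"ready_to_stop"` sentinel matches the message the example server in this README sends when the results generator is exhausted.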
- See here for the list of all available languages.
- Check the troubleshooting guide for step-by-step fixes collected from recent GPU setup/env issues.
- For HTTPS requirements, see the Parameters section for SSL configuration options.
Optional Dependencies
| Feature | uv sync | pip install -e |
|-----------|-------------|-------------|
| Apple Silicon MLX Whisper backend | uv sync --extra mlx-whisper | pip install -e ".[mlx-whisper]" |
| Voxtral (MLX backend, Apple Silicon) | uv sync --extra voxtral-mlx | pip install -e ".[voxtral-mlx]" |
| CPU PyTorch stack | uv sync --extra cpu | pip install -e ".[cpu]" |
| CUDA 12.9 PyTorch stack | uv sync --extra cu129 | pip install -e ".[cu129]" |
| Translation | uv sync --extra translation | pip install -e ".[translation]" |
| Sentence tokenizer | uv sync --extra sentence_tokenizer | pip install -e ".[sentence_tokenizer]" |
| Voxtral (HF backend) | uv sync --extra voxtral-hf | pip install -e ".[voxtral-hf]" |
| Speaker diarization (Sortformer / NeMo) | uv sync --extra diarization-sortformer | pip install -e ".[diarization-sortformer]" |
| [Not recommended] Speaker diarization with Diart | uv sync --extra diarization-diart | pip install -e ".[diarization-diart]" |
Supported GPU profiles:
```bash
# Profile A: Sortformer diarization
uv sync --extra cu129 --extra diarization-sortformer

# Profile B: Voxtral HF + translation
uv sync --extra cu129 --extra voxtral-hf --extra translation
```
`voxtral-hf` and `diarization-sortformer` are intentionally incompatible extras and must be installed in separate environments.
See Parameters & Configuration below on how to use them.
Benchmarks

<p align="center"> <img src="benchmark_scatter_en_aware.png" alt="Speed vs Accuracy — English" width="700"> </p>
<p align="center"> <img src="benchmark_scatter_fr_aware.png" alt="Speed vs Accuracy — French" width="700"> </p>

Benchmarks use 6 minutes of public LibriVox audiobook recordings per language (30s + 60s + 120s + 180s), with ground truth from Project Gutenberg. Fully reproducible with `python scripts/run_scatter_benchmark.py`.
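The accuracy axis in plots like these is typically word error rate: the minimum number of word insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. Below is a generic sketch of that metric; the benchmark script's exact scoring (e.g. text normalization) may differ.

```python
# Generic word error rate via Levenshtein distance over words.
# Illustrative; not the benchmark script's exact scoring code.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fax"))  # -> 0.25
```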
We are actively looking for benchmark results on other hardware (NVIDIA GPUs, different Apple Silicon chips, cloud instances). If you run the benchmarks on your machine, please share your results via an issue or PR!
Browser Extension

Use the browser extension to capture audio from web pages. See chrome-extension for instructions.
Voxtral Backend
WhisperLiveKit supports Voxtral Mini,
a 4B-parameter speech model from Mistral AI that natively handles 100+ languages with automatic
language detection. Whisper also supports auto-detection (--language auto), but Voxtral's per-chunk
detection is more reliable and does not bias towards English.
```bash
# Apple Silicon (native MLX, recommended)
pip install -e ".[voxtral-mlx]"
wlk --backend voxtral-mlx

# Linux/GPU (HuggingFace transformers)
pip install transformers torch
wlk --backend voxtral
```
Voxtral uses its own streaming policy and does not use LocalAgreement or SimulStreaming. See BENCHMARK.md for performance numbers.
Usage Examples
Command-line Interface: Start the transcription server with various options:
```bash
# Large model, translating from French to Danish
wlk --model large-v3 --language fr --target-language da

# Diarization, with the server listening on *:80
wlk --host 0.0.0.0 --port 80 --model medium --diarization --language fr

# Voxtral multilingual (auto-detects language)
wlk --backend voxtral-mlx
```
Python API Integration: Check basic_server for a more complete example of how to use the functions and classes.
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from whisperlivekit import AudioProcessor, TranscriptionEngine, parse_args

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global transcription_engine
    transcription_engine = TranscriptionEngine(model_size="medium", diarization=True, lan="en")
    yield

app = FastAPI(lifespan=lifespan)

async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    while True:
        message = await websocket.receive_bytes()
        await audio_processor.process_audio(message)
```
Frontend Implementation: The package includes an HTML/JavaScript implementation here. You can also import it using `from whisperlivekit import …`.