
Accepted at ACL 2025 Findings 🏅

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

<div> <a href="https://mbzuai-oryx.github.io/LLMVoX/"><img src="https://img.shields.io/badge/Project-Page-blue" alt="Project Page"></a> <a href="https://arxiv.org/abs/2503.04724"><img src="https://img.shields.io/badge/arXiv-2503.04724-b31b1b.svg" alt="arXiv"></a> <a href="LICENSE.txt"> <img src="https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg" alt="License: CC BY-NC-SA 4.0"> </a> </div> <p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Sambal Shikhar, Mohammed Irfan K, Sahal Shaji Mullappilly, Fahad Khan, Jean Lahoud, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE

<p align="center"> <img src="assets/arch_diagram.svg" alt="LLMVoX Architecture" width="800px"> </p>

<video src="https://github.com/user-attachments/assets/6d305563-3c62-4f14-a8aa-acedf2143f76" width="200" controls></video>

Overview

LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency. Our approach achieves significantly lower Word Error Rate compared to speech-enabled LLMs while operating at comparable latency and speech quality.

Key features:

  • 🚀 Lightweight & Fast: Only 30M parameters, delivering speech with end-to-end latency as low as 300ms
  • 🔌 LLM-Agnostic: Plugs into any existing LLM or Vision-Language Model without fine-tuning or architectural modifications.
  • 🌊 Multi-Queue Streaming: Enables continuous, low-latency speech generation and infinite-length dialogues
  • 🌐 Multilingual Support: Adapts to new languages with only a dataset change, no architectural modifications needed

Requirements

# System requirements
# - CUDA 11.7 or higher
# - Flash Attention 2.0+ compatible GPU (Ampere architecture or newer)

# Clone the repository
git clone https://github.com/mbzuai-oryx/LLMVoX.git
cd LLMVoX

# Create and activate a conda environment
conda create -n llmvox python=3.9
conda activate llmvox

# Install PyTorch with CUDA 11.8 support
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Flash Attention
pip install flash-attn --no-build-isolation

# Install remaining dependencies
pip install -r requirements.txt


# Add WavTokenizer to the Python path to avoid import errors
export PYTHONPATH=./WavTokenizer/:$PYTHONPATH

# Download checkpoints (if not already in the repository)
mkdir -p CHECKPOINTS
# Download wavtokenizer_large_speech_320_24k.ckpt and ckpt_english_tiny.pt
# and place them in the CHECKPOINTS directory

Quick Start

Download Required Checkpoints

Download the necessary model checkpoints from Hugging Face:

🤗 Hugging Face Repository: MBZUAI/LLMVoX
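
For a scripted download, the checkpoints can also be fetched with the huggingface_hub library. This is a minimal sketch that assumes both files sit at the top level of the MBZUAI/LLMVoX repo; adjust the filenames or paths if the repo layout differs.

# Sketch: fetch the two checkpoints into the CHECKPOINTS directory.
# Assumes both files are stored at the top level of MBZUAI/LLMVoX.
from huggingface_hub import hf_hub_download

for filename in ["wavtokenizer_large_speech_320_24k.ckpt", "ckpt_english_tiny.pt"]:
    hf_hub_download(repo_id="MBZUAI/LLMVoX", filename=filename, local_dir="CHECKPOINTS")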

Configuration Basics

LLMVoX requires a few base paths to be set correctly in the inference configuration file at configs/inference_config.py:

  • wavtokenizer_model_path: Path to the pretrained WavTokenizer model checkpoint
  • llmvox_checkpoint_path: Path to the trained LLMVoX model checkpoint
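
As an illustration only (the real file may organize these differently), the relevant entries in configs/inference_config.py simply point at the downloaded checkpoints:

# Illustrative excerpt of configs/inference_config.py; key names follow the
# descriptions above and paths assume the CHECKPOINTS directory from setup.
wavtokenizer_model_path = "CHECKPOINTS/wavtokenizer_large_speech_320_24k.ckpt"
llmvox_checkpoint_path = "CHECKPOINTS/ckpt_english_tiny.pt"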

Running with Different Configurations

Voice Chat Configuration Guide

LLMVoX supports voice-based conversations through its streaming server. Here's how to configure and use the voice chat functionality:

Basic Usage
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct"
Configuration Parameters Explained
GPU Resource Allocation

LLMVoX uses a multi-queue approach with two TTS model replicas. You can specify which GPUs to use:

# Run TTS models on separate GPUs
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --tts_device_1 1 --tts_device_2 2

# Or run both on the same GPU (if memory allows)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --tts_device_1 0 --tts_device_2 0

# Specify GPU for LLM separately
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_device "cuda:0" --tts_device_1 1 --tts_device_2 2
Streaming Chunk Size Parameters

Control the balance between latency and quality:

# Lower latency setup (faster initial response but potentially lower quality)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 5 --initial_dump_size_2 40 --max_dump_size 320

# Higher quality setup (slightly higher latency but better speech)
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 20 --initial_dump_size_2 320 --max_dump_size 2560

# Default balanced setup
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --initial_dump_size_1 10 --initial_dump_size_2 160 --max_dump_size 1280
  • initial_dump_size_1: Number of speech tokens for the first chunk (smaller = faster first response)
  • initial_dump_size_2: Initial chunk size for the second TTS model (can be larger because it runs while the first chunk plays)
  • max_dump_size: Maximum chunk size that the system will scale up to (larger = better quality)
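The exact growth schedule is internal to the streaming server, but the idea can be sketched in a few lines of Python. The doubling rule below is an assumption for illustration, not LLMVoX's actual policy.

# Sketch of the chunk-size idea: start small for a fast first response, then
# grow each subsequent chunk up to max_dump_size. The doubling rule is an
# assumption for illustration, not the project's actual schedule.
def next_chunk_size(current, max_dump_size):
    return min(current * 2, max_dump_size)

size, schedule = 10, []                      # --initial_dump_size_1 10
for _ in range(8):
    schedule.append(size)
    size = next_chunk_size(size, 1280)       # --max_dump_size 1280
print(schedule)                              # [10, 20, 40, 80, 160, 320, 640, 1280]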
LLM-Specific Parameters

Different LLMs use different end-of-sequence tokens:

# For LLaMA models
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --eos_token "<|eot_id|>" --llm_max_tokens 1000

# For Mistral models
python streaming_server.py --chat_type voice --llm_checkpoint "mistralai/Mistral-7B-Instruct-v0.2" --eos_token "<|im_end|>" --llm_temperature 0.7

# For other models (check your model's documentation)
python streaming_server.py --chat_type voice --llm_checkpoint "your-model-name" --eos_token "<|end|>"
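
If you are unsure which end-of-sequence token a checkpoint uses, it can be read from the model's tokenizer with the transformers library (requires access to the checkpoint on Hugging Face):

# Look up a model's end-of-sequence token before passing it via --eos_token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(tok.eos_token)   # e.g. "<|eot_id|>" for LLaMA 3.1 Instruct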
ASR Configuration (for Speech Input)

LLMVoX uses Whisper for converting speech to text:

# Use a larger Whisper model for better transcription
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --asr_model "medium" --asr_device "cuda:3"

# Use a smaller model for faster processing
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --asr_model "tiny" --asr_device "cuda:0"
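
To sanity-check an ASR model size before wiring it into the server, the same model names can be run standalone with the openai-whisper package. A quick sketch (the audio file name is a placeholder):

# Standalone Whisper check; model names match the --asr_model values above.
import whisper

model = whisper.load_model("small")
result = model.transcribe("sample_question.wav")   # placeholder audio file
print(result["text"])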
System Prompt Customization

Control the LLM's response style:

# For concise responses
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --system_prompt "You are a friendly voicebot that answers questions in a concise way and does not use abbreviations. Keep responses brief."

# For more detailed explanations
python streaming_server.py --chat_type voice --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --system_prompt "You are a helpful AI assistant that provides detailed, thorough explanations. Avoid abbreviations when speaking."
Complete Example

Here's a complete example with all key parameters configured:

python streaming_server.py \
  --chat_type voice \
  --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" \
  --llm_device "cuda:0" \
  --tts_device_1 1 \
  --tts_device_2 2 \
  --asr_model "small" \
  --asr_device "cuda:3" \
  --initial_dump_size_1 10 \
  --initial_dump_size_2 160 \
  --max_dump_size 1280 \
  --max_audio_length 8000 \
  --eos_token "<|eot_id|>" \
  --system_prompt "You are a friendly voicebot that answers questions concisely without abbreviations."
How it Works

When you run voice chat:

  1. The ASR model transcribes your speech input
  2. The LLM generates a response text stream
  3. Two LLMVoX instances alternate processing text chunks at sentence boundaries
  4. Initial chunks are smaller for faster response, while later chunks are larger for better quality
  5. Audio is played in real-time while the rest of the response is still being generated

This multi-queue architecture enables both low latency (as fast as 300ms) and high-quality speech output.
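
A rough sketch of the multi-queue pattern in Python (illustrative only, not LLMVoX's implementation): the LLM feeds sentence-sized chunks into a text queue, two TTS workers drain it in parallel, and ordered audio chunks are handed to the player as soon as they are ready.

# Illustrative multi-queue sketch, not the project's code: two TTS workers
# consume text chunks so one can synthesize while the other's audio plays.
import queue
import threading

text_q, audio_q = queue.Queue(), queue.Queue()

def tts_worker(synthesize):
    while True:
        item = text_q.get()
        if item is None:                       # sentinel: no more text
            break
        idx, chunk = item
        audio_q.put((idx, synthesize(chunk)))  # keep ordering metadata

def fake_tts(text):                            # stand-in for the real TTS call
    return b"\x00" * len(text)                 # would be waveform bytes

workers = [threading.Thread(target=tts_worker, args=(fake_tts,)) for _ in range(2)]
for w in workers:
    w.start()

for i, sentence in enumerate(["Hello there.", "How can I help you today?"]):
    text_q.put((i, sentence))
for _ in workers:
    text_q.put(None)                           # stop both workers
for w in workers:
    w.join()

while not audio_q.empty():
    idx, audio = audio_q.get()
    print(f"chunk {idx}: {len(audio)} bytes ready")  # player would stream these in order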

Text Chat (Text-to-Speech)

# Basic text chat with LLaMA 3.1 8B
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_device "cuda:0"

# Customize LLM generation parameters
python streaming_server.py --chat_type text --llm_checkpoint "meta-llama/Llama-3.1-8B-Instruct" --llm_temperature 0.5 --llm_top_p 0.9 --llm_top_k 30

Visual Speech (Speech + Image → Speech)

# Using Qwen 2.5 VL as the vision-language model
python streaming_server.py --chat_type visual_speech --llm_checkpoint "Qwen/Qwen2.5-VL-7B-Instruct" --llm_device "cuda:0"  --asr_model "small" --eos_token "<|im_end|>"

Multimodal Chat without ASR for models like Phi-4-multimodal-instruct (Speech + Image → Speech)

# Using Phi-4-multimodal-instruct, which accepts speech, image, and text input directly (no separate ASR needed)
python streaming_server.py --chat_type multimodal --llm_checkpoint "microsoft/Phi-4-multimodal-instruct" --llm_device "cuda:0" --system_prompt "Answer the question in short responses." --eos_token "<|end|>"

# Using LLaVA
python streaming_server.py --chat_type multimodal --llm_checkpoint "llava-hf/llava-1.5-7b-hf" --llm_device "cuda:0"