SkillAgentSearch skills...

Vui

100M parameter lightweight conversational text-to-speech model with breaths, laughter, multi-speaker dialogue, voice cloning, and streaming. Llama-based, on-device.

Install / Use

/learn @fluxions-ai/Vui

README

Vui - 100M Parameter On-Device Conversational Text-to-Speech

Vui is a lightweight, open-source text-to-speech model with only 100M parameters, designed for natural conversational speech synthesis. Built on a Llama-style transformer architecture, it generates expressive multi-speaker dialogue with breaths, laughter, hesitations, and other non-verbal sounds.

Trained on 40,000 hours of real audio conversations. Runs on consumer GPUs.

Try the live demo on Hugging Face Spaces

Features

  • 100M parameters - small enough for on-device and edge deployment
  • Conversational speech - trained on real conversations, not studio recordings
  • Non-verbal sounds - generates [breath], [laugh], [sigh], [hesitate], [tut] naturally
  • Multi-speaker - COHOST model handles two-speaker dialogues
  • Voice cloning - clone from audio samples with the base model
  • Streaming - real-time streaming synthesis with CUDA graph acceleration
  • Custom audio codec - Fluac, a modified DAC with FSQ that reduces token rate from 86Hz to 21.5Hz (4x reduction)

Quick Start

from vui.model import Vui
from vui.inference import render
from torchcodec.encoders import AudioEncoder

model = Vui.from_pretrained(Vui.ABRAHAM).cuda().eval()

audio = render(model, """
So [breath] the thing about this is, it's not what you'd expect, right?
Um, it's actually [hesitate] completely different.
""")

encoder = AudioEncoder(audio[0], sample_rate=22050)
encoder.to_file("output.wav")

Comparison with Other Small TTS Models

| Model | Params | Conversational | Multi-Speaker | Voice Cloning | Breaths & Non-Verbal | Streaming | |---|---|---|---|---|---|---| | Vui | 100M | Yes | Yes | Yes | Yes | Yes | | Kokoro | 82M | No | No | No | No | No | | Pocket TTS | 100M | No | No | Yes | No | No | | KittenTTS | 14-80M | No | No | No | No | No | | Orpheus | 150M+ | Partial | No | No | Partial | No |

Models

| Model | Description | |---|---| | Vui.BASE | Base checkpoint trained on 40k hours of audio conversations | | Vui.ABRAHAM | Single-speaker model with context-aware replies | | Vui.COHOST | Two-speaker model for multi-speaker dialogue |

Architecture

Vui is a Llama-style causal transformer that predicts audio tokens from text:

  • Text encoder: ByT5 byte-level tokenizer
  • Decoder: 6-layer transformer, 512 dim, 8 heads, RMSNorm, SiLU, RoPE
  • Audio codec: Fluac - a modified Descript Audio Codec using Finite Scalar Quantization (FSQ) with 9 codebooks at 1000 entries each
  • Token rate: ~21.5 Hz (vs 86 Hz for standard DAC), enabling longer context windows
  • Inference: KV caching + CUDA graphs for fast autoregressive generation

Non-Verbal Sound Tags

Vui understands these inline tags for expressive speech:

[breath]    - breathing sounds
[laugh]     - laughter
[sigh]      - sighing
[hesitate]  - filled pauses / um / uh
[tut]       - tutting

Example:

And I'm Jamie um [laugh] today, we're diving into a [hesitate] topic
that's transforming customer service [breath] voice technology for agents.

Installation

Before running demo.py, you must accept model terms for Voice Activity Detection and Segmentation on Hugging Face.

Linux

uv pip install -e .

Windows

uv venv
.venv\Scripts\activate
uv pip install -e .
uv pip install triton_windows

Demo

python demo.py

Or try it on Hugging Face Spaces.

Voice Cloning

You can clone voices with the base model. Pass an audio sample and the model will adapt to the speaker's characteristics. Quality varies as the model hasn't been extensively trained for this task.

FAQ

  1. Developed on two 4090s: https://x.com/harrycblum/status/1752698806184063153
  2. The model does hallucinate occasionally - this is the best achievable with limited compute resources.
  3. VAD slows things down but is needed to remove silence regions.

Attributions

Citation

@software{vui_2025,
  author = {Coultas Blum, Harry},
  month = {01},
  title = {{Vui: 100M Parameter Conversational Text-to-Speech}},
  url = {https://github.com/fluxions-ai/vui},
  version = {1.0.0},
  year = {2025}
}
View on GitHub
GitHub Stars645
CategoryDevelopment
Updated1d ago
Forks63

Languages

Python

Security Score

100/100

Audited on Apr 1, 2026

No findings