VibeVoiceFusion

<div align="center"> <img src="frontend/public/icon-rect-pulse.svg" alt="VibeVoiceFusion Logo" width="120"/>

A Complete Web Application for Multi-Speaker Voice Generation

Built on Microsoft's VibeVoice Model


English | 简体中文

Features | Demo Samples | Get Started | Documentation | Community | Contributing

</div>

Overview

Purpose

VibeVoiceFusion is a web application for generating high-quality, multi-speaker synthetic speech with voice cloning capabilities. Built on Microsoft's VibeVoice model (AR + diffusion architecture), this project provides a complete full-stack solution with voice generation, LoRA fine-tuning, dataset management, batch generation, and advanced VRAM optimization features.

Key Goals:

  • Provide a user-friendly interface for voice generation without requiring coding knowledge
  • Enable efficient multi-speaker dialog synthesis with distinct voice characteristics
  • Support LoRA fine-tuning for custom voice adaptation and style transfer
  • Generate multiple audio variations in batch with different random seeds
  • Optimize memory usage for consumer-grade GPUs (10GB+ VRAM)
  • Support bilingual workflows (English/Chinese)
  • Offer both web UI and CLI interfaces for different use cases
<div align="center"> <a href="https://youtu.be/J9pmcOBWN4c" target="_blank"> <img src="docs/images/VibevoiceFusion.png" alt="Video Introduction" width="700"/> </a> </div>

Principle

VibeVoice combines autoregressive (AR) and diffusion techniques for text-to-speech synthesis:

  1. Text Processing: Input text is tokenized and processed through a Qwen-based language model backbone
  2. Voice Encoding: Reference voice samples are encoded into acoustic and semantic embeddings
  3. AR Generation: The model autoregressively generates speech tokens conditioned on text and voice embeddings
  4. Diffusion Refinement: A DPM-Solver-based diffusion head converts tokens to high-quality audio waveforms
  5. Voice Cloning: The unified processor preserves speaker characteristics from reference audio samples
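The five stages above can be sketched as a toy pipeline. The function names and stub implementations below are illustrative only, not the project's actual API; each stub stands in for a real model component:

```python
# Illustrative sketch of the five-stage pipeline above, with stubs
# standing in for the real model components. All names are hypothetical.

def tokenize_text(text):
    # Stage 1: text is tokenized for the Qwen-based backbone (stubbed).
    return text.split()

def encode_voice(reference_samples):
    # Stage 2: reference audio -> acoustic/semantic embeddings (stubbed).
    return [hash(s) % 997 for s in reference_samples]

def ar_generate(text_tokens, voice_embeddings):
    # Stage 3: autoregressively emit one speech token per text token,
    # conditioned on the voice embeddings (stubbed as a simple mix).
    return [(i + sum(voice_embeddings)) % 997 for i, _ in enumerate(text_tokens)]

def diffusion_refine(speech_tokens):
    # Stage 4: a diffusion head would turn tokens into a waveform;
    # here we just map each token to a float in [-1, 1].
    return [(t / 997.0) * 2 - 1 for t in speech_tokens]

def generate(text, reference_samples):
    tokens = tokenize_text(text)
    voice = encode_voice(reference_samples)  # Stage 5: cloning comes from
    speech = ar_generate(tokens, voice)      # conditioning on these embeddings
    return diffusion_refine(speech)

waveform = generate("Hello there, Speaker 1.", ["ref_a.wav"])
print(len(waveform))  # one "sample" per token in this toy sketch
```

The real model of course produces far more than one audio sample per text token; the sketch only shows how the stages feed into each other.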

Technical Highlights:

  • Model Architecture: Qwen backbone + VAE acoustic tokenizer + semantic encoder + diffusion head
  • Quantization: Float8 (FP8 E4M3FN) support for ~50% VRAM reduction with minimal quality loss
  • Layer Offloading: Dynamic CPU/GPU memory management for running on limited VRAM
  • Attention Mechanism: PyTorch native SDPA for maximum compatibility
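The ~50% VRAM reduction from Float8 follows directly from storage width: bfloat16 uses 2 bytes per parameter, FP8 (E4M3FN) uses 1. A back-of-envelope estimate, using an assumed round parameter count rather than the model's exact size:

```python
# Why FP8 roughly halves weight memory: 2 bytes/param (bf16) vs 1 byte (fp8).
# The parameter count is an assumed round figure for illustration only.

def weight_memory_gb(num_params, bytes_per_param):
    return num_params * bytes_per_param / 1024**3

params = 7e9                        # assumed ~7B parameters
bf16 = weight_memory_gb(params, 2)  # roughly matches the ~14GB figure
fp8 = weight_memory_gb(params, 1)   # roughly matches the ~7GB figure
print(f"bf16: {bf16:.1f} GB, fp8: {fp8:.1f} GB")
```

Activations and the KV cache add to this, which is why observed usage is somewhat higher than the weight memory alone.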

Features

Quick Generation

  • One-Click Generation: Generate voice without creating projects, speakers, or sessions
  • Voice Source Options:
    • Upload custom audio files (WAV, MP3, M4A, FLAC, WebM) - up to 4 files
    • Select from preset voice samples with language/gender filters
  • Auto Mode Detection: Automatically detects dialogue vs narration format
  • Multi-Voice Support: Use up to 4 voice prompts for generation
  • Generation History: Persistent history with expandable details, bulk delete
  • Per-Item Progress: Real-time progress tracking for each generating voice

Complete Web Application

  • Project Management: Organize voice generation projects with metadata and descriptions
  • Speaker/Voice Management:
    • Upload and manage reference voice samples (WAV, MP3, M4A, FLAC, WebM)
    • Audio preview with playback controls
    • Voice file replacement with automatic cache-busting
    • Audio trimming functionality
  • Dialog Editor:
    • Visual editor with drag-and-drop line reordering
    • Text editor mode for bulk editing
    • Support for multi-speaker dialogs (up to 4+ speakers)
    • Narration mode for single-speaker content (audiobooks, articles, podcasts)
    • Real-time preview and validation
  • Generation System:
    • Queue-based task management (prevents GPU conflicts)
    • Real-time progress monitoring with live updates
    • Configurable parameters (CFG scale, random seed, model precision)
    • Multi-Generation: Generate 2-20 audio variations in a single batch with different seeds
    • LoRA model support with configurable weight (0, 1]
    • Generation history with filtering, sorting, and pagination
    • Audio playback and download for completed generations
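The multi-generation feature boils down to re-running the same request with a fresh random seed per variation. A minimal sketch, where `run_generation` is a hypothetical stand-in for the actual generation call:

```python
# Sketch of multi-generation: N variations of one request, each with a
# different seed. `run_generation` is a hypothetical stand-in.
import random

def run_generation(text, seed):
    rng = random.Random(seed)
    return {"seed": seed, "audio_id": rng.randrange(10**6)}

def multi_generate(text, count, base_seed=None):
    if not 2 <= count <= 20:  # the UI allows 2-20 variations per batch
        raise ValueError("count must be between 2 and 20")
    rng = random.Random(base_seed)
    seeds = [rng.randrange(2**31) for _ in range(count)]
    return [run_generation(text, s) for s in seeds]

batch = multi_generate("Speaker 1: Hello!", count=3, base_seed=42)
print([item["seed"] for item in batch])
```

Deriving the per-item seeds from one base seed makes a whole batch reproducible while still giving each variation a distinct seed.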

LoRA Fine-Tuning

  • Dataset Management:

    • Create and manage training datasets with audio/text pairs
    • Import datasets from ZIP archives or local folders
    • JSONL format for efficient data handling
    • Pagination and search for large datasets
    • Export datasets for backup or sharing
  • Training System:

    • LoRA (Low-Rank Adaptation) fine-tuning for voice customization
    • Configurable training parameters (epochs, learning rate, LoRA rank, batch size)
    • Layer offloading support for training on consumer GPUs
    • Real-time training progress with tqdm-style progress bar
    • Live training metrics charts (Loss, Learning Rate, Timing)
    • TensorBoard integration for detailed metrics
    • Training history with status tracking (Prepare, Training, Completed, Failed)
    • OOM detection with helpful suggestions for recovery
  • LoRA Model Usage:

    • Select trained LoRA models during voice generation
    • Configurable LoRA weight for blending with base model
    • Multiple LoRA files per training job (epoch checkpoints + final)
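The JSONL dataset format mentioned above is simply one JSON object per line pairing an audio file with its transcript. The field names below are assumptions for illustration; check the project's importer for the exact schema it expects:

```python
# Minimal JSONL round-trip: one JSON object per line, audio + transcript.
# Field names ("audio", "text") are assumed, not the project's schema.
import json, io

entries = [
    {"audio": "clips/0001.wav", "text": "Hello and welcome."},
    {"audio": "clips/0002.wav", "text": "Today we discuss voice cloning."},
]

buf = io.StringIO()
for entry in entries:
    buf.write(json.dumps(entry, ensure_ascii=False) + "\n")

jsonl = buf.getvalue()
loaded = [json.loads(line) for line in jsonl.splitlines()]
print(loaded[0]["text"])
```

One-object-per-line is what makes pagination and search over large datasets cheap: any line can be parsed without reading the rest of the file.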

VRAM Optimization

  • Layer Offloading: Move transformer layers between CPU/GPU to reduce VRAM requirements
    • Balanced (12 GPU / 16 CPU layers): ~5GB VRAM savings, ~2.0x slower - RTX 3060 16GB, 4070
    • Aggressive (8 GPU / 20 CPU layers): ~6GB VRAM savings, ~2.5x slower - RTX 3060 12GB, 4060
    • Extreme (4 GPU / 24 CPU layers): ~7GB VRAM savings, ~3.5x slower - RTX 3060 10GB (minimum)
  • Float8 Quantization: Reduce model size from ~14GB to ~7GB with comparable quality (supported on RTX 40-series and newer GPUs)
  • Adaptive Configuration: Automatic VRAM estimation and optimal layer distribution
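The adaptive-configuration idea can be sketched as a simple linear estimate: a fixed base cost plus a per-layer cost for each layer kept on the GPU. The constants below are illustrative guesses tuned to roughly match the offloading presets above, not measurements from the project:

```python
# Rough sketch of VRAM estimation for a GPU/CPU layer split.
# PER_LAYER_GB and BASE_GB are illustrative assumptions, not measured values.

PER_LAYER_GB = 0.3  # assumed per-layer weight cost on GPU
BASE_GB = 4.0       # assumed non-layer VRAM (tokenizers, activations, cache)

def estimate_vram_gb(gpu_layers, total_layers=28):
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers out of range")
    return BASE_GB + gpu_layers * PER_LAYER_GB

for name, gpu_layers in [("none", 28), ("balanced", 12),
                         ("aggressive", 8), ("extreme", 4)]:
    print(f"{name}: ~{estimate_vram_gb(gpu_layers):.1f} GB")
```

A real implementation would also account for quantization (halving the per-layer cost under FP8) and measured activation sizes.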

VRAM Requirements:

| Configuration | GPU Layers | VRAM Usage | Speed | Target Hardware |
|---------------|------------|------------|-------|-----------------|
| No offloading | 28 | 11-14GB | 1.0x | RTX 4090, A100, 3090 |
| Balanced | 12 | 6-8GB | 0.70x | RTX 4070, 3080 16GB |
| Aggressive | 8 | 5-7GB | 0.55x | RTX 3060 12GB |
| Extreme | 4 | 4-5GB | 0.40x | RTX 3080 10GB |

Float8 quantization is only supported on RTX 40-series and 50-series NVIDIA cards.

Internationalization

  • Full Bilingual Support: Complete English/Chinese UI with 360+ translation keys
  • Auto-Detection: Automatically detects browser language on first visit
  • Persistent Preference: Language selection saved in localStorage
  • Backend i18n: API error messages and responses translated to user's language

Docker Deployment

  • Multi-Stage Build: Optimized Dockerfile with frontend build, Python venv, and model download
  • Self-Contained: Clones from GitHub and builds entirely from source
  • HuggingFace Integration: Automatically downloads model file (~3-4GB) during build

Additional Features

  • Responsive Design: Mobile-friendly interface with Tailwind CSS
  • Real-Time Updates: WebSocket-free polling with smart update intervals (2s active, 60s background)
  • Audio Cache-Busting: Ensures audio updates are immediately reflected
  • Toast Notifications: User-friendly feedback for all operations
  • Dark Mode Ready: Modern UI with consistent styling
  • Accessibility: Keyboard navigation and ARIA labels
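The smart polling described above reduces to picking an interval based on tab visibility: poll frequently while the user is watching, back off when the tab is in the background. Constants mirror the README; the function name is ours:

```python
# Sketch of the smart update intervals: 2 s while the tab is active,
# 60 s in the background. Function name is illustrative.

ACTIVE_INTERVAL_S = 2
BACKGROUND_INTERVAL_S = 60

def next_poll_interval(tab_visible: bool) -> int:
    return ACTIVE_INTERVAL_S if tab_visible else BACKGROUND_INTERVAL_S

print(next_poll_interval(True), next_poll_interval(False))
```

In the frontend this decision would hang off the browser's Page Visibility API; the same two-tier logic applies regardless of where it runs.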

Demo Samples

Listen to voice generation samples created with VibeVoiceFusion. Click the links below to download and play:

Single Speaker

🎧 Pandora's Box Story (BFloat16 Model)

Generated with bfloat16 precision model - Full quality, 14GB VRAM

🎧 Pandora's Box Story (Float8 Model)

Generated with float8 quantization - Optimized for 7GB VRAM with comparable quality

Multi-Speaker (3 Speakers)

🎭 东邪西毒 (Ashes of Time) - Journey to the West Version

Multi-speaker dialog with distinct voice characteristics for each character


Get Started

Prerequisites

  • Python: 3.9 or higher
  • Node.js: 16.x or higher (for frontend development)
  • CUDA: Compatible GPU with CUDA support (recommended)
  • VRAM: Minimum 6GB for extreme offloading, 14GB recommended for best performance
  • Docker: Optional, for containerized deployment

Installation

Option 1: Docker (Recommended for Production)

Build docker image

# Clone the repository
git clone https://github.com/zhao-kun/vibevoicefusion.git
cd vibevoicefusion
# Build the docker image
docker compose build vibevoice

After the build succeeds, run:

docker run -d \
  --name vibevoicefusion \
  --gpus all \
  -p 9527:9527 \
  -v $(pwd)/workspace:/wo