# VibeVoiceFusion
<div align="center">
  <img src="frontend/public/icon-rect-pulse.svg" alt="VibeVoiceFusion Logo" width="120"/>

  A Complete Web Application for Multi-Speaker Voice Generation
Built on Microsoft's VibeVoice Model
Features • Demo Samples • Get Started • Documentation • Community • Contributing
</div>

## Overview
### Purpose
VibeVoiceFusion is a web application for generating high-quality, multi-speaker synthetic speech with voice cloning capabilities. Built on Microsoft's VibeVoice model (AR + diffusion architecture), this project provides a complete full-stack solution with voice generation, LoRA fine-tuning, dataset management, batch generation, and advanced VRAM optimization features.
Key Goals:
- Provide a user-friendly interface for voice generation without requiring coding knowledge
- Enable efficient multi-speaker dialog synthesis with distinct voice characteristics
- Support LoRA fine-tuning for custom voice adaptation and style transfer
- Generate multiple audio variations in batch with different random seeds
- Optimize memory usage for consumer-grade GPUs (10GB+ VRAM)
- Support bilingual workflows (English/Chinese)
- Offer both web UI and CLI interfaces for different use cases
### Principle
VibeVoice combines autoregressive (AR) and diffusion techniques for text-to-speech synthesis:
- Text Processing: Input text is tokenized and processed through a Qwen-based language model backbone
- Voice Encoding: Reference voice samples are encoded into acoustic and semantic embeddings
- AR Generation: The model autoregressively generates speech tokens conditioned on text and voice embeddings
- Diffusion Refinement: A DPM-Solver-based diffusion head converts tokens to high-quality audio waveforms
- Voice Cloning: The unified processor preserves speaker characteristics from reference audio samples
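The five stages above can be sketched as a toy pipeline. Everything here is a simplified stand-in (hash-based fake embeddings, averaging as "denoising"), not VibeVoiceFusion's actual API:

```python
# Toy sketch of the AR + diffusion TTS pipeline described above.
# Every function here is a simplified stand-in, not the project's real API.

def tokenize(text):
    # Stand-in for the Qwen tokenizer: one token per word.
    return text.lower().split()

def encode_voice(reference_samples):
    # Stand-in for the acoustic/semantic encoders: one embedding per sample.
    return [hash(s) % 1000 / 1000.0 for s in reference_samples]

def ar_generate(tokens, voice_embeddings):
    # Autoregressive loop: each new speech token is conditioned on the
    # text tokens, the voice embeddings, and all previously emitted tokens.
    speech_tokens = []
    for tok in tokens:
        prev = speech_tokens[-1] if speech_tokens else 0.0
        speech_tokens.append((len(tok) + prev + sum(voice_embeddings)) % 1.0)
    return speech_tokens

def diffusion_refine(speech_tokens, steps=4):
    # Stand-in for the DPM-Solver diffusion head: iteratively smooth the
    # coarse tokens into a waveform-like sequence.
    audio = list(speech_tokens)
    for _ in range(steps):
        audio = [(a + b) / 2 for a, b in zip(audio, audio[1:] + audio[:1])]
    return audio

tokens = tokenize("Hello there, how are you?")
voice = encode_voice(["alice_sample.wav"])
audio = diffusion_refine(ar_generate(tokens, voice))
print(len(audio))  # one frame per text token in this toy model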
Technical Highlights:
- Model Architecture: Qwen backbone + VAE acoustic tokenizer + semantic encoder + diffusion head
- Quantization: Float8 (FP8 E4M3FN) support for ~50% VRAM reduction with minimal quality loss
- Layer Offloading: Dynamic CPU/GPU memory management for running on limited VRAM
- Attention Mechanism: PyTorch native SDPA for maximum compatibility
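As a rough illustration of the quantization step, here is a toy round-to-nearest quantizer for the E4M3FN value grid (1 sign, 4 exponent, 3 mantissa bits; max finite value 448; subnormals and the NaN encoding are ignored). The real conversion happens in optimized kernels; this only shows why one byte per weight roughly halves memory versus bfloat16's two:

```python
import math

def quantize_e4m3(x):
    # Toy round-to-nearest quantizer for the FP8 E4M3FN value grid.
    # Illustrative only -- real kernels do this conversion in hardware.
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)        # clamp to E4M3FN's max finite value
    exp = math.floor(math.log2(mag))
    exp = max(min(exp, 8), -6)      # normal exponent range: 2^-6 .. 2^8
    step = 2.0 ** exp / 8           # 3 mantissa bits => 8 steps per binade
    return sign * min(round(mag / step) * step, 448.0)

# Two bytes (bf16) -> one byte (fp8) per weight: ~50% memory for the weights.
weights = [0.1234, -1.7, 3.14159, 100.0]
print([quantize_e4m3(w) for w in weights])
```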
## Features

### Quick Generation
- One-Click Generation: Generate voice without creating projects, speakers, or sessions
- Voice Source Options:
- Upload custom audio files (WAV, MP3, M4A, FLAC, WebM) - up to 4 files
- Select from preset voice samples with language/gender filters
- Auto Mode Detection: Automatically detects dialogue vs narration format
- Multi-Voice Support: Use up to 4 voice prompts for generation
- Generation History: Persistent history with expandable details, bulk delete
- Per-Item Progress: Real-time progress tracking for each generating voice
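Auto mode detection can be approximated with a simple heuristic, sketched below. This is a hypothetical illustration; the project's actual detection logic may differ:

```python
import re

# Hypothetical sketch of "auto mode detection": if most non-empty lines start
# with a "Speaker N:" label, treat the script as dialogue; otherwise narration.
SPEAKER_RE = re.compile(r"^\s*Speaker\s+[1-4]\s*:", re.IGNORECASE)

def detect_mode(script: str) -> str:
    lines = [l for l in script.splitlines() if l.strip()]
    if not lines:
        return "narration"
    labelled = sum(1 for l in lines if SPEAKER_RE.match(l))
    return "dialogue" if labelled >= len(lines) / 2 else "narration"

print(detect_mode("Speaker 1: Hi!\nSpeaker 2: Hello."))
print(detect_mode("Once upon a time, in a land far away..."))
```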
### Complete Web Application
- Project Management: Organize voice generation projects with metadata and descriptions
- Speaker/Voice Management:
- Upload and manage reference voice samples (WAV, MP3, M4A, FLAC, WebM)
- Audio preview with playback controls
- Voice file replacement with automatic cache-busting
- Audio trimming functionality
- Dialog Editor:
- Visual editor with drag-and-drop line reordering
- Text editor mode for bulk editing
- Support for multi-speaker dialogs (up to 4+ speakers)
- Narration mode for single-speaker content (audiobooks, articles, podcasts)
- Real-time preview and validation
- Generation System:
- Queue-based task management (prevents GPU conflicts)
- Real-time progress monitoring with live updates
- Configurable parameters (CFG scale, random seed, model precision)
- Multi-Generation: Generate 2-20 audio variations in a single batch with different seeds
- LoRA model support with a configurable weight in (0, 1]
- Generation history with filtering, sorting, and pagination
- Audio playback and download for completed generations
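Batch multi-generation boils down to running the same request under a sequence of distinct seeds, so each variation stays individually reproducible. A minimal sketch, with `generate_audio` as a stand-in for the real generation call:

```python
import random

# Sketch of batch "multi-generation": the same dialog rendered N times with
# distinct random seeds, so each variation is reproducible on its own.

def generate_audio(text: str, seed: int) -> list:
    rng = random.Random(seed)                # seed only this generation
    return [rng.random() for _ in range(4)]  # fake "audio" samples

def multi_generate(text: str, count: int, base_seed: int = 42):
    assert 2 <= count <= 20, "batch size is limited to 2-20 variations"
    return {base_seed + i: generate_audio(text, base_seed + i)
            for i in range(count)}

batch = multi_generate("Speaker 1: Hello!", count=3)
print(sorted(batch))  # [42, 43, 44]
```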
### LoRA Fine-Tuning
- Dataset Management:
- Create and manage training datasets with audio/text pairs
- Import datasets from ZIP archives or local folders
- JSONL format for efficient data handling
- Pagination and search for large datasets
- Export datasets for backup or sharing
- Training System:
- LoRA (Low-Rank Adaptation) fine-tuning for voice customization
- Configurable training parameters (epochs, learning rate, LoRA rank, batch size)
- Layer offloading support for training on consumer GPUs
- Real-time training progress with tqdm-style progress bar
- Live training metrics charts (Loss, Learning Rate, Timing)
- TensorBoard integration for detailed metrics
- Training history with status tracking (Prepare, Training, Completed, Failed)
- OOM detection with helpful suggestions for recovery
- LoRA Model Usage:
- Select trained LoRA models during voice generation
- Configurable LoRA weight for blending with base model
- Multiple LoRA files per training job (epoch checkpoints + final)
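Blending a LoRA into the base model amounts to W' = W + α·(B·A), where α is the configurable LoRA weight and B·A is the low-rank update. A dependency-free sketch:

```python
# Toy sketch of LoRA blending: the effective weight matrix is
# W' = W + alpha * (B @ A), where alpha is the configurable LoRA weight.
# Plain-Python matrices keep the example dependency-free.

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def apply_lora(W, A, B, alpha):
    assert 0.0 < alpha <= 1.0, "LoRA weight must be in (0, 1]"
    delta = matmul(B, A)  # low-rank update, rank = len(A)
    return [[w + alpha * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 base weights
A = [[1.0, 2.0]]               # rank-1 factors: A is 1x2, B is 2x1
B = [[0.5], [1.0]]
print(apply_lora(W, A, B, alpha=0.5))
```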
### VRAM Optimization
- Layer Offloading: Move transformer layers between CPU/GPU to reduce VRAM requirements
- Balanced (12 GPU / 16 CPU layers): ~5GB VRAM savings, ~2.0x slower - RTX 3060 16GB, 4070
- Aggressive (8 GPU / 20 CPU layers): ~6GB VRAM savings, ~2.5x slower - RTX 3060 12GB, 4060
- Extreme (4 GPU / 24 CPU layers): ~7GB VRAM savings, ~3.5x slower - RTX 3060 10GB (minimum)
- Float8 Quantization: Reduce model size from ~14GB to ~7GB with comparable quality (supported on RTX 40-series and newer GPUs)
- Adaptive Configuration: Automatic VRAM estimation and optimal layer distribution
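The adaptive configuration idea can be approximated by a linear model: a fixed overhead plus a roughly constant cost per GPU-resident layer. The constants below are rough fits to the table in this README, not values from the actual implementation:

```python
# Toy estimator for the "adaptive configuration" idea: assume roughly constant
# VRAM per transformer layer plus a fixed overhead, then pick the largest GPU
# layer count that fits the available budget. Constants are assumptions.

OVERHEAD_GB = 3.2      # activations, KV cache, diffusion head, etc. (assumed)
PER_LAYER_GB = 0.33    # per transformer layer kept on the GPU (assumed)
TOTAL_LAYERS = 28

def estimate_vram_gb(gpu_layers: int) -> float:
    return OVERHEAD_GB + PER_LAYER_GB * gpu_layers

def pick_gpu_layers(available_gb: float) -> int:
    layers = int((available_gb - OVERHEAD_GB) // PER_LAYER_GB)
    return max(0, min(layers, TOTAL_LAYERS))

for budget in (5.0, 8.0, 14.0):
    n = pick_gpu_layers(budget)
    print(f"{budget:>4} GB -> {n} GPU layers (~{estimate_vram_gb(n):.1f} GB)")
```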
VRAM Requirements:
| Configuration | GPU Layers | VRAM Usage | Speed | Target Hardware |
|---------------|------------|------------|-------|-----------------|
| No offloading | 28 | 11-14GB | 1.0x | RTX 4090, A100, 3090 |
| Balanced | 12 | 6-8GB | 0.70x | RTX 4070, 3080 16GB |
| Aggressive | 8 | 5-7GB | 0.55x | RTX 3060 12GB |
| Extreme | 4 | 4-5GB | 0.40x | RTX 3080 10GB |
Note: Float8 quantization is only supported on NVIDIA RTX 40- and 50-series GPUs.
### Internationalization
- Full Bilingual Support: Complete English/Chinese UI with 360+ translation keys
- Auto-Detection: Automatically detects browser language on first visit
- Persistent Preference: Language selection saved in localStorage
- Backend i18n: API error messages and responses translated to user's language
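Backend i18n with an English fallback can be sketched as a nested dictionary lookup (the keys and strings below are made up for illustration):

```python
# Minimal sketch of the backend i18n idea: look a key up in the user's
# language, falling back to English when a translation is missing.
# Keys and strings here are hypothetical.

MESSAGES = {
    "en": {"gen.queued": "Generation queued", "gen.oom": "Out of GPU memory"},
    "zh": {"gen.queued": "生成任务已加入队列"},
}

def translate(key: str, lang: str = "en") -> str:
    table = MESSAGES.get(lang, {})
    return table.get(key, MESSAGES["en"].get(key, key))

print(translate("gen.queued", "zh"))  # Chinese translation exists
print(translate("gen.oom", "zh"))     # falls back to English
```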
### Docker Deployment
- Multi-Stage Build: Optimized Dockerfile with frontend build, Python venv, and model download
- Self-Contained: Clones from GitHub and builds entirely from source
- HuggingFace Integration: Automatically downloads model file (~3-4GB) during build
### Additional Features
- Responsive Design: Mobile-friendly interface with Tailwind CSS
- Real-Time Updates: WebSocket-free polling with smart update intervals (2s active, 60s background)
- Audio Cache-Busting: Ensures audio updates are immediately reflected
- Toast Notifications: User-friendly feedback for all operations
- Dark Mode Ready: Modern UI with consistent styling
- Accessibility: Keyboard navigation and ARIA labels
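The smart-interval policy reduces to a small decision function. The real frontend is TypeScript; this Python sketch only illustrates the rule, and the exact conditions are an assumption:

```python
# Sketch of the "smart polling" policy: poll every 2 s while a generation is
# active and the tab is visible, every 60 s otherwise (assumed conditions).

ACTIVE_INTERVAL_S = 2
BACKGROUND_INTERVAL_S = 60

def poll_interval(has_active_tasks: bool, tab_visible: bool) -> int:
    if has_active_tasks and tab_visible:
        return ACTIVE_INTERVAL_S
    return BACKGROUND_INTERVAL_S

print(poll_interval(True, True))
print(poll_interval(True, False))
print(poll_interval(False, True))
```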
## Demo Samples
Listen to voice generation samples created with VibeVoiceFusion. Click the links below to download and play:
### Single Speaker
🎧 Pandora's Box Story (BFloat16 Model)
Generated with bfloat16 precision model - Full quality, 14GB VRAM
🎧 Pandora's Box Story (Float8 Model)
Generated with float8 quantization - Optimized for 7GB VRAM with comparable quality
### Multi-Speaker (3 Speakers)
🎭 东邪西毒 - 西游版 (Journey to the West Version)
Multi-speaker dialog with distinct voice characteristics for each character
## Get Started

### Prerequisites
- Python: 3.9 or higher
- Node.js: 16.x or higher (for frontend development)
- CUDA: Compatible GPU with CUDA support (recommended)
- VRAM: Minimum 6GB for extreme offloading, 14GB recommended for best performance
- Docker: Optional, for containerized deployment
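A quick pre-flight check for these prerequisites might look like the following. This is a hypothetical helper, not a script shipped with the project:

```python
import shutil
import sys

# Hypothetical pre-flight check for the prerequisites listed above.

def check_prereqs() -> list:
    problems = []
    if sys.version_info < (3, 9):
        problems.append(f"Python 3.9+ required, found {sys.version.split()[0]}")
    if shutil.which("node") is None:
        problems.append("Node.js not found on PATH (needed for frontend dev)")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; a CUDA GPU is recommended")
    return problems

for line in check_prereqs() or ["All prerequisites look OK"]:
    print(line)
```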
### Installation

#### Option 1: Docker (Recommended for Production)
Build the Docker image:

```bash
# Clone the repository
git clone https://github.com/zhao-kun/vibevoicefusion.git
cd vibevoicefusion

# Build the docker image
docker compose build vibevoice
```
Once the build succeeds, start the container:

```bash
docker run -d \
  --name vibevoicefusion \
  --gpus all \
  -p 9527:9527 \
  -v $(pwd)/workspace:/wo
```
