VoiceAssistant
A functioning Sesame CSM project with a desktop GUI. Real-time factor: 0.6x on an RTX 4070 Ti Super. Requires only 8GB VRAM.
Sesame CSM Voice Assistant
Overview
A high-performance, local voice assistant with real-time transcription, LLM reasoning, and text-to-speech. It runs fully offline after setup and uses Sesame CSM for expressive speech synthesis. Real-time factor: 0.6x on an NVIDIA RTX 4070 Ti Super (lower is faster; below 1.0 means faster than real time).
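Assuming the usual definition of real-time factor (processing time divided by the duration of the audio produced), the quoted 0.6x works out to roughly 6 seconds of compute per 10 seconds of speech. The following minimal sketch shows that arithmetic. `real_time_factor` is an illustrative name and is not part of the project.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Illustrative helper: processing time / audio duration (assumed RTF definition).

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# At the quoted 0.6x, 10 s of speech takes about 6 s to generate.
rtf = real_time_factor(6.0, 10.0)
print(f"RTF = {rtf:.1f}x")  # RTF = 0.6x
```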
Features
- Real-time speech-to-text using distil-whisper
- On-device LLM using Llama 3.2 1B
- Natural TTS via Sesame CSM (senstella/csm-expressiva-1b)
- Desktop GUI with Tauri/React
- Conversation history and speaking animations
- GPU acceleration with CUDA
- Modular Docker-based backend
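Together, these features form a speech-to-text, then LLM, then text-to-speech loop with a running conversation history. The following is a hypothetical sketch of one conversational turn. `VoicePipeline` and its three stage callables are invented names, because the backend's real interfaces are not described here. Stub lambdas stand in for distil-whisper, Llama 3.2 1B, and Sesame CSM so the sketch runs without any models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stage signatures; not the project's actual API.
Transcriber = Callable[[bytes], str]                 # audio in -> text (e.g. distil-whisper)
Responder = Callable[[List[Tuple[str, str]]], str]   # history -> reply text (e.g. Llama 3.2 1B)
Synthesizer = Callable[[str], bytes]                 # text -> audio (e.g. Sesame CSM)


@dataclass
class VoicePipeline:
    transcribe: Transcriber
    respond: Responder
    synthesize: Synthesizer
    history: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, audio_in: bytes) -> bytes:
        """Run one STT -> LLM -> TTS turn and record both sides in the history."""
        user_text = self.transcribe(audio_in)
        self.history.append(("user", user_text))
        reply = self.respond(self.history)
        self.history.append(("assistant", reply))
        return self.synthesize(reply)


if __name__ == "__main__":
    # Stub stages: "audio" is just UTF-8 bytes so the loop runs without models.
    pipeline = VoicePipeline(
        transcribe=lambda audio: audio.decode("utf-8"),
        respond=lambda hist: f"You said: {hist[-1][1]}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(pipeline.turn(b"hello"))  # b'You said: hello'
    print(pipeline.history)
```

Keeping each stage as a plain callable makes it easy to swap a model or test the loop in isolation. How the real backend divides these stages across its Docker services is not specified here.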
Tech Stack
- Frontend: Tauri 2.5.1, React 18+, TypeScript
- Backend: Python 3.10, FastAPI, Uvicorn
- Models: distil-whisper (large-v3.5), Llama 3.2 1B (GGUF), Sesame CSM
Requirements
- NVIDIA GPU: 8GB+ VRAM
- 32GB RAM
- Docker Desktop
- NVIDIA GPU Drivers (CUDA 12.1+)
- NVIDIA Container Toolkit
- Node.js & npm (v18+)
- Rust & Cargo
- Hugging Face access to Llama 3.2 1B
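To confirm the 8GB VRAM requirement before building, you can query `nvidia-smi`, which ships with the NVIDIA driver. The following sketch parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one MiB value per GPU. The helper names and the 5% tolerance are assumptions: cards sold as 8 GB can report slightly less than 8192 MiB.

```python
import subprocess
from typing import List

# Assumed tolerance: allow ~5% below a nominal 8 GiB, since reported totals can fall short.
MIN_VRAM_MIB = int(8 * 1024 * 0.95)


def parse_total_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one integer MiB value per non-empty line (csv,noheader,nounits format)."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def meets_vram_requirement(nvidia_smi_output: str, minimum_mib: int = MIN_VRAM_MIB) -> bool:
    """True if any listed GPU reports at least minimum_mib of total memory."""
    return any(total >= minimum_mib for total in parse_total_vram_mib(nvidia_smi_output))


def query_total_vram() -> str:
    """Hypothetical helper: run nvidia-smi and return its raw stdout."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
```

On a machine with the driver installed, `meets_vram_requirement(query_total_vram())` returns `True` if at least one GPU meets the threshold.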
Setup
- Prerequisites:
  - Install Docker Desktop and ensure it's running
  - Install Rust, Tauri, and NVIDIA Container Toolkit
  - Request access to Llama 3.2 1B on Hugging Face
- Configuration:
  - Edit the .env file and set HUGGING_FACE_TOKEN=hf_yourTokenHere
- Backend:
  - Build: docker compose build
  - Run: docker compose up -d
- Frontend:
  - Install dependencies: cd frontend && npm install && npm install uuid
  - Start: npm run tauri dev
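You can sanity-check the Configuration step with a small script. The following is a hypothetical helper that is not part of the project. It assumes `.env` sits in the repository root and uses simple `KEY=VALUE` lines. It checks three things: `HUGGING_FACE_TOKEN` is set, it is not the placeholder, and it has the `hf_` prefix used by current Hugging Face user access tokens. It cannot tell whether the token has been granted access to Llama 3.2 1B.

```python
from pathlib import Path
from typing import Dict

PLACEHOLDER = "hf_yourTokenHere"  # value shown in the Configuration step


def parse_env(text: str) -> Dict[str, str]:
    """Minimal KEY=VALUE parser: skips blanks and comments, strips optional quotes."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_hf_token(env_text: str) -> str:
    """Return a problem description, or "" if HUGGING_FACE_TOKEN looks usable."""
    token = parse_env(env_text).get("HUGGING_FACE_TOKEN", "")
    if not token:
        return "HUGGING_FACE_TOKEN is missing or empty"
    if token == PLACEHOLDER:
        return "HUGGING_FACE_TOKEN still holds the placeholder value"
    if not token.startswith("hf_"):
        return "HUGGING_FACE_TOKEN does not start with 'hf_'"
    return ""


def check_env_file(path: str = ".env") -> str:
    """Hypothetical helper: assumes .env lives in the repository root."""
    return check_hf_token(Path(path).read_text(encoding="utf-8"))
```

Run `check_env_file()` from the repository root before `docker compose build`. An empty string means the token at least has the expected shape.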
Usage
All commands are run from the repository root.
- Add your Hugging Face token to .env and request access to the required models on Hugging Face (links to be added)
- Build backend: docker compose build
- Start backend: docker compose up -d
- Install frontend dependencies: cd frontend && npm install && npm install uuid
- Start frontend: cd frontend && npm run tauri dev
- View logs: docker compose logs -f
- Stop: docker compose down
