VoiceAssistant

A functioning Sesame CSM project with a desktop GUI. Real-time factor: 0.6x with 4070 Ti Super. Requires only 8GB VRAM.

Install / Use

/learn @ReisCook/VoiceAssistant

Sesame CSM Voice Assistant

Overview

A high-performance local voice assistant that combines real-time transcription, LLM reasoning, and text-to-speech. It runs fully offline after setup and uses Sesame CSM for expressive speech synthesis. Real-time factor: 0.6x on an NVIDIA RTX 4070 Ti Super.
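The README does not define "real-time factor". A minimal sketch, assuming the conventional definition (processing time divided by the duration of the audio produced, so values below 1.0 are faster than real time); the function name and the example timings are illustrative, not measurements from this project:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only: a 10-second clip generated in 6 seconds.
rtf = real_time_factor(processing_seconds=6.0, audio_seconds=10.0)
print(f"RTF = {rtf:.1f}x")  # prints "RTF = 0.6x"
```

Under that assumption, a factor of 0.6x would mean speech is generated faster than it plays back.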

Features

Real-time Speech-to-Text using distil-whisper
On-device LLM using Llama 3.2 1B
Natural TTS via Sesame CSM (senstella/csm-expressiva-1b)
Desktop GUI with Tauri/React
Conversation history and speaking animations
GPU acceleration with CUDA
Modular Docker-based backend

Tech Stack

Frontend: Tauri 2.5.1, React 18+, TypeScript
Backend: Python 3.10, FastAPI, Uvicorn
Models: distil-whisper (large-v3.5), Llama 3.2 1B (GGUF), Sesame CSM
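The three models listed above form a speech-to-text, LLM, text-to-speech chain. The sketch below only illustrates that data flow. The function names and the stand-in stages are hypothetical and do not come from this project's code; a real setup would call distil-whisper, Llama 3.2 1B, and Sesame CSM in their place.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one conversation turn: audio in, spoken reply out."""
    text = transcribe(audio)        # speech-to-text stage
    reply = generate_reply(text)    # LLM stage
    return synthesize(reply)        # text-to-speech stage


# Hypothetical stand-ins so the sketch runs without any models installed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


print(run_turn(b"hello", fake_transcribe, fake_reply, fake_synthesize))
```

Passing the stages in as callables keeps each model swappable, which matches the modular backend the feature list describes.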

Requirements

NVIDIA GPU: 8GB+ VRAM
32GB RAM
Docker Desktop
NVIDIA GPU Drivers (CUDA 12.1+)
NVIDIA Container Toolkit
Node.js (v18+) and npm
Rust & Cargo
Hugging Face access to Llama 3.2 1B
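Before starting setup, it can help to confirm that the command-line tools behind these requirements are on PATH. This is a rough preflight sketch, not part of the project. The command names in REQUIRED_TOOLS are assumptions about a typical install, and the check does not verify versions, VRAM, or the NVIDIA Container Toolkit.

```python
import shutil

# Assumed command names for the requirements above; adjust for your platform.
REQUIRED_TOOLS = ["docker", "nvidia-smi", "node", "npm", "cargo", "rustc"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```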

Setup

Prerequisites:
- Install Docker Desktop and ensure it's running
- Install Rust, Tauri, and NVIDIA Container Toolkit
- Request access to Llama 3.2 1B on Hugging Face
Configuration:
- Edit the .env file and set HUGGING_FACE_TOKEN=hf_yourTokenHere, replacing the placeholder with your own token
Backend:
- Build: docker compose build
- Run: docker compose up -d
Frontend:
- Install dependencies: cd frontend && npm install && npm install uuid
- Start: npm run tauri dev
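A common setup mistake is leaving the placeholder token in .env before building the backend. The sketch below is one way to check for that, assuming .env uses simple KEY=value lines and sits in the current directory. The helper names are hypothetical, and the "hf_" prefix check follows the placeholder format shown in the Configuration step.

```python
from pathlib import Path

PLACEHOLDER = "hf_yourTokenHere"  # placeholder value from the Configuration step


def read_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def token_looks_set(env: dict) -> bool:
    """True if HUGGING_FACE_TOKEN is present and is not the placeholder."""
    token = env.get("HUGGING_FACE_TOKEN", "")
    return token.startswith("hf_") and token != PLACEHOLDER


if __name__ == "__main__":
    env_path = Path(".env")  # assumed location
    if not env_path.exists():
        print("No .env file found in the current directory.")
    elif token_looks_set(read_env(env_path.read_text())):
        print("HUGGING_FACE_TOKEN looks set.")
    else:
        print("HUGGING_FACE_TOKEN is missing or still the placeholder.")
```

This only checks the shape of the value; it does not confirm that the token is valid or that access to Llama 3.2 1B has been granted.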

Usage

Add your Hugging Face token to the .env file and request access to the models (links still to be added)
Build backend: docker compose build
Start backend: docker compose up -d
Build frontend: cd frontend && npm install && npm install uuid
Start frontend: npm run tauri dev (from the frontend directory)
View logs: docker compose logs -f
Stop: docker compose down
