Susurrus
Speech-to-text GUI for various models (mostly Whisper, also Voxtral) and backends, including whisper.cpp, mlx-whisper, faster-whisper, and ctranslate2; uses pyannote for speaker diarization
Susurrus: Audio Transcription Suite
Susurrus is a professional, modular audio transcription application that leverages various AI models and backends to convert speech to text. Built with a clean architecture, it supports multiple Whisper implementations, speaker diarization, and extensive customization options.
✨ Features
Core Transcription
- Multiple Backend Support: mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper, Voxtral
- Flexible Input: Local files and URLs, including video sources
- Audio Format Support: MP3, WAV, FLAC, M4A, AAC, OGG, OPUS, WebM, MP4, WMA
- Language Detection: Automatic or manual language selection
- Time-based Trimming: Transcribe specific portions of audio
- Word-level Timestamps: Precise timing information (backend-dependent)
Speaker Diarization
- Multi-speaker Identification: Automatically detect and label different speakers
- Language-specific Models: Optimized models for English, German, Chinese, Spanish, Japanese
- Configurable Parameters: Set min/max speaker counts
- Multiple Output Formats: TXT, SRT, VTT, JSON with speaker labels
- PyAnnote.audio Integration: State-of-the-art diarization engine
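The speaker-labelled output formats listed above can be generated from plain (start, end, speaker, text) segments. A minimal sketch of the SRT case; the helper names are illustrative, not Susurrus APIs:

```python
# Illustrative helpers: turn diarized segments into SRT with speaker labels.
def to_srt_timestamp(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))  # SRT cue index
        lines.append(f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}")
        lines.append(f"{seg['speaker']}: {seg['text']}")
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)

print(segments_to_srt([{"start": 0.0, "end": 2.5,
                        "speaker": "SPEAKER_00", "text": "Hello."}]))
```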
Voxtral Support (New!)
- Voxtral Local: On-device inference with Mistral's speech model
- Voxtral API: Cloud-based inference via Mistral AI API
- Support for 8 Languages: EN, FR, ES, DE, IT, PT, PL, NL
- Long Audio Processing: Automatic chunking for files over 25 minutes
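The long-audio chunking above amounts to computing chunk boundaries before transcribing each piece. A sketch of that computation; the 25-minute limit comes from the feature description, while the overlap value and function name are hypothetical:

```python
# Sketch: chunk boundaries for long audio. CHUNK_SECONDS follows the
# 25-minute limit stated above; the overlap is an assumed value to
# avoid cutting words at chunk edges.
CHUNK_SECONDS = 25 * 60
OVERLAP_SECONDS = 5

def chunk_ranges(duration: float):
    ranges, start = [], 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        ranges.append((start, end))
        if end >= duration:
            break
        start = end - OVERLAP_SECONDS  # back up so chunks overlap slightly
    return ranges

print(chunk_ranges(3200.0))
```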
Advanced Features
- Proxy Support: HTTP/SOCKS5 proxy for network requests
- Device Selection: Auto-detect or manually choose CPU/GPU/MPS
- Model Conversion: Automatic CTranslate2 model conversion
- Progress Tracking: Real-time progress with ETA estimation
- Settings Persistence: Save your preferences between sessions
- Dependency Management: Built-in installer for missing components
- CUDA Diagnostics: Detailed GPU/CUDA troubleshooting tools
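The ETA estimation mentioned under progress tracking can be approximated from elapsed time and the fraction of work done, assuming roughly linear progress. A minimal sketch (the class name is illustrative, not the Susurrus implementation):

```python
import time

# Sketch: estimate time remaining from fraction complete,
# assuming progress is roughly linear.
class EtaTracker:
    def __init__(self):
        self.start = time.monotonic()

    def eta_seconds(self, fraction_done: float):
        if fraction_done <= 0:
            return None  # no estimate until some progress is made
        elapsed = time.monotonic() - self.start
        return elapsed * (1 - fraction_done) / fraction_done
```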
📦 Installation
Quick Start
# Clone the repository
git clone https://github.com/CrispStrobe/Susurrus.git
cd Susurrus
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the application
python main.py
# Or as a module:
python -m susurrus
Prerequisites
- Python 3.8+
- FFmpeg (for audio format conversion)
- Git
- C++ compiler (for whisper.cpp, optional)
- CUDA Toolkit (for GPU acceleration, optional)
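A quick way to verify the command-line prerequisites are on PATH before installing (illustrative script, not part of Susurrus):

```python
import shutil
import sys

# Check the prerequisites listed above that must be on PATH.
def check_prereqs():
    missing = [tool for tool in ("ffmpeg", "git") if shutil.which(tool) is None]
    if sys.version_info < (3, 8):
        missing.append("python>=3.8")
    return missing

print(check_prereqs() or "all prerequisites found")
```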
Platform-Specific Setup
Windows
# Install Chocolatey (if not installed)
Set-ExecutionPolicy Bypass -Scope Process -Force
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
# Install dependencies
choco install cmake ffmpeg git python
# For GPU support
choco install cuda
macOS
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install ffmpeg cmake python git
# For Apple Silicon optimization
pip install mlx mlx-whisper
Linux (Ubuntu/Debian)
# Install dependencies
sudo apt update
sudo apt install ffmpeg cmake build-essential python3 python3-pip git
# For GPU support
# Follow CUDA installation guide for your distribution
Optional Backend Installation
# MLX (Apple Silicon only)
pip install mlx-whisper
# Faster Whisper (recommended)
pip install faster-whisper
# Transformers
pip install transformers torch torchaudio
# Whisper.cpp (manual build required)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && mkdir build && cd build
cmake .. && make
# CTranslate2
pip install ctranslate2
# Whisper-JAX
pip install whisper-jax
# Insanely Fast Whisper
pip install insanely-fast-whisper
# Voxtral (requires dev transformers)
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
pip install mistral-common[audio] soundfile
Speaker Diarization Setup
# Install pyannote.audio
pip install pyannote.audio
# Get Hugging Face token
# 1. Sign up at https://huggingface.co
# 2. Create token at https://huggingface.co/settings/tokens
# 3. Accept license at https://huggingface.co/pyannote/speaker-diarization
# Set token (choose one method):
# Method 1: Environment variable
export HF_TOKEN="your_token_here" # Linux/macOS
setx HF_TOKEN "your_token_here" # Windows
# Method 2: Config file
mkdir -p ~/.huggingface
echo "your_token_here" > ~/.huggingface/token
# Method 3: Enter in GUI
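The three methods above imply a lookup order: environment variable first, then the config file, then a value typed into the GUI. A sketch of that resolution (the function name is hypothetical, not the Susurrus API):

```python
import os
from pathlib import Path

# Sketch of the token lookup order described above.
def resolve_hf_token(gui_value=None):
    token = os.environ.get("HF_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".huggingface" / "token"
    if token_file.exists():
        return token_file.read_text().strip()
    return gui_value  # fall back to a value entered in the GUI
```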
Voxtral API Setup
# Get Mistral API key from https://console.mistral.ai/
# Set API key (choose one method):
# Method 1: Environment variable
export MISTRAL_API_KEY="your_key_here" # Linux/macOS
setx MISTRAL_API_KEY "your_key_here" # Windows
# Method 2: Config file
mkdir -p ~/.mistral
echo "your_key_here" > ~/.mistral/api_key
# Method 3: Enter in GUI
🚀 Usage
GUI Application
# Start the application
python main.py
# Or as a module
python -m susurrus
Basic Workflow:
- Select Audio Source: Choose file or enter URL
- Choose Backend: Select transcription engine
- Configure Options: Set language, model, device
- Enable Diarization (optional): Identify speakers
- Start Transcription: Click "Transcribe"
- Save Results: Export to TXT, SRT, or VTT
Command Line Workers
Transcription Worker
python workers/transcribe_worker.py \
    --audio-input audio.mp3 \
    --backend faster-batched \
    --model-id large-v3 \
    --language en \
    --device auto
Diarization Worker
python workers/diarize_worker.py \
    --audio-input audio.mp3 \
    --hf-token YOUR_TOKEN \
    --transcribe \
    --model-id base \
    --backend faster-batched \
    --output-formats txt,srt,vtt
Python API
# Transcription backend example
from workers.transcription.backends import get_backend
backend = get_backend(
    'faster-batched',
    model_id='large-v3',
    device='auto',
    language='en'
)

for start, end, text in backend.transcribe('audio.mp3'):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
# Diarization example
from backends.diarization import DiarizationManager
manager = DiarizationManager(hf_token="YOUR_TOKEN")
segments, files = manager.diarize_and_split('audio.mp3')
for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")
🧪 Development
Architecture Overview
susurrus/
├── main.py # Application entry point
├── config.py # Central configuration
├── backends/ # Transcription & diarization backends
│ ├── diarization/ # Speaker diarization module
│ │ ├── manager.py # Diarization orchestration
│ │ └── progress.py # Enhanced progress tracking
│ └── transcription/ # Transcription backends
│ ├── voxtral_local.py # Voxtral local inference
│ └── voxtral_api.py # Voxtral API integration
├── gui/ # User interface components
│ ├── main_window.py # Main application window
│ ├── widgets/ # Custom widgets
│ │ ├── collapsible_box.py
│ │ ├── diarization_settings.py
│ │ ├── voxtral_settings.py
│ │ └── advanced_options.py
│ └── dialogs/ # Dialog windows
│ ├── dependencies_dialog.py
│ ├── installer_dialog.py
│ └── cuda_diagnostics_dialog.py
├── workers/ # Background processing
│ ├── transcription_thread.py # GUI thread wrapper
│ ├── transcribe_worker.py # Standalone transcription worker
│ ├── diarize_worker.py # Standalone diarization worker
│ └── transcription/ # Transcription backend implementations
│ ├── backends/
│ │ ├── base.py # Base backend interface
│ │ ├── mlx_backend.py
│ │ ├── faster_whisper_backend.py
│ │ ├── transformers_backend.py
│ │ ├── whisper_cpp_backend.py
│ │ ├── ctranslate2_backend.py
│ │ ├── whisper_jax_backend.py
│ │ ├── insanely_fast_backend.py
│ │ ├── openai_whisper_backend.py
│ │ └── voxtral_backend.py
│ └── utils.py
├── utils/ # Utility modules
│ ├── device_detection.py # CUDA/MPS/CPU detection
│ ├── audio_utils.py # Audio processing utilities
│ ├── download_utils.py # URL downloading
│ ├── dependency_check.py # Dependency verification
│ └── format_utils.py # Time formatting utilities
├── models/ # Model configuration
│ └── model_config.py # Model mappings & utilities
└── scripts/ # Standalone utility scripts
├── test_voxtral.py # Voxtral testing
└── pyannote_torch26.py # PyTorch 2.6+ compatibility
Running Tests
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_backends.py
# Run with coverage
pytest --cov=. --cov-report=html
Code Quality
# Format code
black .
# Lint
flake8 .
pylint susurrus/
# Type checking
mypy .
Adding a New Backend
- Create a new file in workers/transcription/backends/
- Inherit from TranscriptionBackend
- Implement required methods:

class MyBackend(TranscriptionBackend):
    def transcribe(self, audio_path):
        # Yield (start, end, text) tuples
        pass

    def preprocess_audio(self, audio_path):
        # Optional preprocessing
        return audio_path

    def cleanup(self):
        # Optional cleanup
        pass

- Register in workers/transcription/backends/__init__.py
- Add to BACKEND_MODEL_MAP in config.py
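A registered backend only has to yield (start, end, text) tuples from transcribe(). A self-contained toy version that can be exercised without any model (EchoBackend is illustrative and does no real transcription; the real class would inherit TranscriptionBackend):

```python
# Toy backend matching the generator contract above: yield
# (start, end, text) tuples. Purely illustrative -- no real inference.
class EchoBackend:
    def transcribe(self, audio_path):
        yield (0.0, 1.0, f"stub transcript for {audio_path}")

segments = list(EchoBackend().transcribe("audio.mp3"))
print(segments)
```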
🔧 Configuration
Settings Location
