Susurrus
Speech-to-text GUI for various models (mostly Whisper, also Voxtral) and backends, including whisper.cpp, mlx-whisper, faster-whisper, and ctranslate2; uses pyannote for speaker diarization
Susurrus: Audio Transcription Suite
Susurrus is a professional, modular audio transcription application that leverages various AI models and backends to convert speech to text. Built with a clean architecture, it supports multiple Whisper implementations, speaker diarization, and extensive customization options.
✨ Features
Core Transcription
- Multiple Backend Support: mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper, Voxtral
- Flexible Input: Local files and URLs, including video sources
- Audio Format Support: MP3, WAV, FLAC, M4A, AAC, OGG, OPUS, WebM, MP4, WMA
- Language Detection: Automatic or manual language selection
- Time-based Trimming: Transcribe specific portions of audio
- Word-level Timestamps: Precise timing information (backend-dependent)
Speaker Diarization
- Multi-speaker Identification: Automatically detect and label different speakers
- Language-specific Models: Optimized models for English, German, Chinese, Spanish, Japanese
- Configurable Parameters: Set min/max speaker counts
- Multiple Output Formats: TXT, SRT, VTT, JSON with speaker labels
- PyAnnote.audio Integration: State-of-the-art diarization engine
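The speaker-labelled output formats listed above can be generated from plain (start, end, speaker, text) segments. A minimal sketch of the SRT case; the helper names are illustrative, not Susurrus APIs:

```python
# Illustrative helpers: turn diarized segments into SRT with speaker labels.
def to_srt_timestamp(seconds: float) -> str:
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))  # SRT cue index
        lines.append(f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}")
        lines.append(f"{seg['speaker']}: {seg['text']}")
        lines.append("")  # blank line terminates the cue
    return "\n".join(lines)

print(segments_to_srt([{"start": 0.0, "end": 2.5,
                        "speaker": "SPEAKER_00", "text": "Hello."}]))
```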
Voxtral Support (New!)
- Voxtral Local: On-device inference with Mistral's speech model
- Voxtral API: Cloud-based inference via Mistral AI API
- Support for 8 Languages: EN, FR, ES, DE, IT, PT, PL, NL
- Long Audio Processing: Automatic chunking for files over 25 minutes
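The long-audio chunking above amounts to computing chunk boundaries before transcribing each piece. A sketch of that computation; the 25-minute limit comes from the feature description, while the overlap value and function name are hypothetical:

```python
# Sketch: chunk boundaries for long audio. CHUNK_SECONDS follows the
# 25-minute limit stated above; the overlap is an assumed value to
# avoid cutting words at chunk edges.
CHUNK_SECONDS = 25 * 60
OVERLAP_SECONDS = 5

def chunk_ranges(duration: float):
    ranges, start = [], 0.0
    while start < duration:
        end = min(start + CHUNK_SECONDS, duration)
        ranges.append((start, end))
        if end >= duration:
            break
        start = end - OVERLAP_SECONDS  # back up so chunks overlap slightly
    return ranges

print(chunk_ranges(3200.0))
```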
Advanced Features
- Proxy Support: HTTP/SOCKS5 proxy for network requests
- Device Selection: Auto-detect or manually choose CPU/GPU/MPS
- Model Conversion: Automatic CTranslate2 model conversion
- Progress Tracking: Real-time progress with ETA estimation
- Settings Persistence: Save your preferences between sessions
- Dependency Management: Built-in installer for missing components
- CUDA Diagnostics: Detailed GPU/CUDA troubleshooting tools
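The ETA estimation mentioned under progress tracking can be approximated from elapsed time and the fraction of work done, assuming roughly linear progress. A minimal sketch (the class name is illustrative, not the Susurrus implementation):

```python
import time

# Sketch: estimate time remaining from fraction complete,
# assuming progress is roughly linear.
class EtaTracker:
    def __init__(self):
        self.start = time.monotonic()

    def eta_seconds(self, fraction_done: float):
        if fraction_done <= 0:
            return None  # no estimate until some progress is made
        elapsed = time.monotonic() - self.start
        return elapsed * (1 - fraction_done) / fraction_done
```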
📦 Installation
Quick Start
# Clone the repository
git clone https://github.com/CrispStrobe/Susurrus.git
cd Susurrus
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the application
python main.py
# Or as a module:
python -m susurrus
Prerequisites
- Python 3.8+
- FFmpeg (for audio format conversion)
- Git
- C++ compiler (for whisper.cpp, optional)
- CUDA Toolkit (for GPU acceleration, optional)
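A quick way to verify the command-line prerequisites are on PATH before installing (illustrative script, not part of Susurrus):

```python
import shutil
import sys

# Check the prerequisites listed above that must be on PATH.
def check_prereqs():
    missing = [tool for tool in ("ffmpeg", "git") if shutil.which(tool) is None]
    if sys.version_info < (3, 8):
        missing.append("python>=3.8")
    return missing

print(check_prereqs() or "all prerequisites found")
```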
Platform-Specific Setup
Windows
# Install Chocolatey (if not installed)
Set-ExecutionPolicy Bypass -Scope Process -Force
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
# Install dependencies
choco install cmake ffmpeg git python
# For GPU support
choco install cuda
macOS
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install ffmpeg cmake python git
# For Apple Silicon optimization
pip install mlx mlx-whisper
Linux (Ubuntu/Debian)
# Install dependencies
sudo apt update
sudo apt install ffmpeg cmake build-essential python3 python3-pip git
# For GPU support
# Follow CUDA installation guide for your distribution
Optional Backend Installation
# MLX (Apple Silicon only)
pip install mlx-whisper
# Faster Whisper (recommended)
pip install faster-whisper
# Transformers
pip install transformers torch torchaudio
# Whisper.cpp (manual build required)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && mkdir build && cd build
cmake .. && make
# CTranslate2
pip install ctranslate2
# Whisper-JAX
pip install whisper-jax
# Insanely Fast Whisper
pip install insanely-fast-whisper
# Voxtral (requires dev transformers)
pip uninstall transformers -y
pip install git+https://github.com/huggingface/transformers.git
pip install mistral-common[audio] soundfile
Speaker Diarization Setup
# Install pyannote.audio
pip install pyannote.audio
# Get Hugging Face token
# 1. Sign up at https://huggingface.co
# 2. Create token at https://huggingface.co/settings/tokens
# 3. Accept license at https://huggingface.co/pyannote/speaker-diarization
# Set token (choose one method):
# Method 1: Environment variable
export HF_TOKEN="your_token_here" # Linux/macOS
setx HF_TOKEN "your_token_here" # Windows
# Method 2: Config file
mkdir -p ~/.huggingface
echo "your_token_here" > ~/.huggingface/token
# Method 3: Enter in GUI
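The three methods above imply a lookup order: environment variable first, then the config file, then a value typed into the GUI. A sketch of that resolution (the function name is hypothetical, not the Susurrus API):

```python
import os
from pathlib import Path

# Sketch of the token lookup order described above.
def resolve_hf_token(gui_value=None):
    token = os.environ.get("HF_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".huggingface" / "token"
    if token_file.exists():
        return token_file.read_text().strip()
    return gui_value  # fall back to a value entered in the GUI
```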
Voxtral API Setup
# Get Mistral API key from https://console.mistral.ai/
# Set API key (choose one method):
# Method 1: Environment variable
export MISTRAL_API_KEY="your_key_here" # Linux/macOS
setx MISTRAL_API_KEY "your_key_here" # Windows
# Method 2: Config file
mkdir -p ~/.mistral
echo "your_key_here" > ~/.mistral/api_key
# Method 3: Enter in GUI
🚀 Usage
GUI Application
# Start the application
python main.py
# Or as a module
python -m susurrus
Basic Workflow:
- Select Audio Source: Choose file or enter URL
- Choose Backend: Select transcription engine
- Configure Options: Set language, model, device
- Enable Diarization (optional): Identify speakers
- Start Transcription: Click "Transcribe"
- Save Results: Export to TXT, SRT, or VTT
Command Line Workers
Transcription Worker
python workers/transcribe_worker.py \
    --audio-input audio.mp3 \
    --backend faster-batched \
    --model-id large-v3 \
    --language en \
    --device auto
Diarization Worker
python workers/diarize_worker.py \
    --audio-input audio.mp3 \
    --hf-token YOUR_TOKEN \
    --transcribe \
    --model-id base \
    --backend faster-batched \
    --output-formats txt,srt,vtt
Python API
# Transcription backend example
from workers.transcription.backends import get_backend
backend = get_backend(
    'faster-batched',
    model_id='large-v3',
    device='auto',
    language='en'
)

for start, end, text in backend.transcribe('audio.mp3'):
    print(f"[{start:.2f}s -> {end:.2f}s] {text}")
# Diarization example
from backends.diarization import DiarizationManager
manager = DiarizationManager(hf_token="YOUR_TOKEN")
segments, files = manager.diarize_and_split('audio.mp3')
for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")
🧪 Development
Architecture Overview
susurrus/
├── main.py # Application entry point
├── config.py # Central configuration
├── backends/ # Transcription & diarization backends
│ ├── diarization/ # Speaker diarization module
│ │ ├── manager.py # Diarization orchestration
│ │ └── progress.py # Enhanced progress tracking
│ └── transcription/ # Transcription backends
│ ├── voxtral_local.py # Voxtral local inference
│ └── voxtral_api.py # Voxtral API integration
├── gui/ # User interface components
│ ├── main_window.py # Main application window
│ ├── widgets/ # Custom widgets
│ │ ├── collapsible_box.py
│ │ ├── diarization_settings.py
│ │ ├── voxtral_settings.py
│ │ └── advanced_options.py
│ └── dialogs/ # Dialog windows
│ ├── dependencies_dialog.py
│ ├── installer_dialog.py
│ └── cuda_diagnostics_dialog.py
├── workers/ # Background processing
│ ├── transcription_thread.py # GUI thread wrapper
│ ├── transcribe_worker.py # Standalone transcription worker
│ ├── diarize_worker.py # Standalone diarization worker
│ └── transcription/ # Transcription backend implementations
│ ├── backends/
│ │ ├── base.py # Base backend interface
│ │ ├── mlx_backend.py
│ │ ├── faster_whisper_backend.py
│ │ ├── transformers_backend.py
│ │ ├── whisper_cpp_backend.py
│ │ ├── ctranslate2_backend.py
│ │ ├── whisper_jax_backend.py
│ │ ├── insanely_fast_backend.py
│ │ ├── openai_whisper_backend.py
│ │ └── voxtral_backend.py
│ └── utils.py
├── utils/ # Utility modules
│ ├── device_detection.py # CUDA/MPS/CPU detection
│ ├── audio_utils.py # Audio processing utilities
│ ├── download_utils.py # URL downloading
│ ├── dependency_check.py # Dependency verification
│ └── format_utils.py # Time formatting utilities
├── models/ # Model configuration
│ └── model_config.py # Model mappings & utilities
└── scripts/ # Standalone utility scripts
├── test_voxtral.py # Voxtral testing
└── pyannote_torch26.py # PyTorch 2.6+ compatibility
Running Tests
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_backends.py
# Run with coverage
pytest --cov=. --cov-report=html
Code Quality
# Format code
black .
# Lint
flake8 .
pylint susurrus/
# Type checking
mypy .
Adding a New Backend
- Create a new file in workers/transcription/backends/
- Inherit from TranscriptionBackend
- Implement required methods:

class MyBackend(TranscriptionBackend):
    def transcribe(self, audio_path):
        # Yield (start, end, text) tuples
        pass

    def preprocess_audio(self, audio_path):
        # Optional preprocessing
        return audio_path

    def cleanup(self):
        # Optional cleanup
        pass

- Register in workers/transcription/backends/__init__.py
- Add to BACKEND_MODEL_MAP in config.py
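A registered backend only has to yield (start, end, text) tuples from transcribe(). A self-contained toy version that can be exercised without any model (EchoBackend is illustrative and does no real transcription; the real class would inherit TranscriptionBackend):

```python
# Toy backend matching the generator contract above: yield
# (start, end, text) tuples. Purely illustrative -- no real inference.
class EchoBackend:
    def transcribe(self, audio_path):
        yield (0.0, 1.0, f"stub transcript for {audio_path}")

segments = list(EchoBackend().transcribe("audio.mp3"))
print(segments)
```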
🔧 Configuration
Settings Location
