Whisperize
A Python application for real-time audio transcription and speaker diarization using Faster-Whisper and PyAnnote.
Features
- Real-time audio transcription with Apple Silicon support (MPS)
- Advanced speaker diarization using PyAnnote
- Support for microphone and audio file input
- Multiple Whisper model sizes and quantization options
- Configurable via JSON with text or JSON output formats
- Thread-safe parallel processing of transcription and diarization
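As a rough illustration of that last point (a sketch only, not Whisperize's actual internals), a capture loop can fan each audio buffer out to independent transcription and diarization workers over thread-safe queues:

```python
import queue
import threading
import time

# Illustrative only: names and structure are hypothetical, not Whisperize's code.
transcribe_q = queue.Queue()
diarize_q = queue.Queue()

def capture(n_chunks=3):
    for i in range(n_chunks):
        chunk = f"buffer-{i}"  # stands in for a few seconds of PCM audio
        transcribe_q.put(chunk)
        diarize_q.put(chunk)
    transcribe_q.put(None)  # sentinel: capture finished
    diarize_q.put(None)

def worker(name, q):
    while (chunk := q.get()) is not None:
        time.sleep(0.1)  # a real worker would run its model here
        print(f"{name} finished {chunk}")

threads = [
    threading.Thread(target=capture),
    threading.Thread(target=worker, args=("transcriber", transcribe_q)),
    threading.Thread(target=worker, args=("diarizer", diarize_q)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```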
Requirements
- Python 3.10
- FFmpeg (required for audio processing)
```bash
# On macOS using Homebrew
brew install ffmpeg

# On Ubuntu/Debian
sudo apt-get install ffmpeg

# On Windows using Chocolatey
choco install ffmpeg
```

- Apple Silicon Mac recommended for optimal performance with MPS acceleration
Installation
1. Clone the Repository
git clone https://github.com/francescopace/whisperize.git
cd whisperize
2. Create and Activate Virtual Environment
python -m venv .venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
3. Install Dependencies
pip install -r requirements.txt
4. Configure the Application
- Create a HuggingFace account at https://huggingface.co/
- Generate an access token at https://huggingface.co/settings/tokens
Edit config.json and update with your settings:
{
"huggingface_token": "your_token_here",
"output_folder": "transcripts/",
"output_format": "text",
"model": "turbo",
"whisper_force_cpu": false,
"language": "it",
"buffer_duration": 4
}
Configuration Parameters
- huggingface_token (required): Your HuggingFace API token for accessing PyAnnote models
- output_folder (required): Directory where transcripts will be saved
- output_format (optional): Output format - "text" or "json" (default: "text"). text creates a human-readable transcript with timestamps; json creates both a text file and a structured JSON file with metadata and word-level timestamps
- model (optional): Whisper model size - "tiny", "base", "small", "medium", "large", or "turbo" (default: "base")
- whisper_force_cpu (optional): Force CPU usage even if GPU/MPS is available (default: false)
- language (optional): Language code (e.g., "it", "en", "es"). If not specified, the language is auto-detected
- buffer_duration (optional): Audio buffer duration in seconds (default: 5.0)
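A minimal sketch of how these parameters could be loaded and defaulted (the field names come from config.json above; the loader itself is illustrative, not Whisperize's exact code):

```python
import json

# Defaults mirror the parameter list above.
DEFAULTS = {
    "output_format": "text",
    "model": "base",
    "whisper_force_cpu": False,
    "language": None,  # None means auto-detect
    "buffer_duration": 5.0,
}

def load_config(path="config.json"):
    with open(path) as f:
        config = {**DEFAULTS, **json.load(f)}
    for key in ("huggingface_token", "output_folder"):
        if not config.get(key):
            raise ValueError(f"missing required config key: {key}")
    return config
```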
Supported Models
Whisper Models
The application uses Faster-Whisper for transcription. Available models:
- tiny - Fastest, lowest accuracy
- base - Good balance of speed and accuracy
- small - Better accuracy, slower
- medium - High accuracy
- large - Highest accuracy, slowest
- turbo - Optimized large model
See Whisper documentation for language support and model details.
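Used on its own, the Faster-Whisper API looks like this (a minimal example of the library, independent of Whisperize's wiring; audio.wav is a placeholder path):

```python
from faster_whisper import WhisperModel

# int8 quantization keeps CPU inference fast; any of the sizes above works.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", word_timestamps=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```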
Diarization Model
The application uses PyAnnote speaker-diarization-3.1 for speaker identification. This model is automatically loaded and requires a HuggingFace token for access.
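Loading the pipeline directly follows the pattern from the model card; note that pyannote/speaker-diarization-3.1 is a gated model, so you must accept its user conditions on HuggingFace before your token will work:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_token_here",  # the token from config.json
)

diarization = pipeline("audio.wav")  # placeholder path
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s -> {turn.end:.1f}s] {speaker}")
```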
Usage
Basic Usage
Microphone Input (default):
python whisperize.py
# or explicitly
python whisperize.py microphone
Audio File Input:
python whisperize.py path/to/audio.wav
Note: Only WAV files (16-bit, mono or stereo) are currently supported.
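If you want to check a file against those constraints up front, the standard-library wave module can inspect the header; this helper is hypothetical, not part of Whisperize:

```python
import wave

def is_supported_wav(path: str) -> bool:
    """True for 16-bit PCM WAV files with one or two channels."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getsampwidth() == 2 and wav.getnchannels() in (1, 2)
    except (wave.Error, OSError):
        return False

print(is_supported_wav("path/to/audio.wav"))
```

Files in other formats can be converted to a compliant WAV with FFmpeg first.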
Output
Transcripts are saved in the output_folder specified in config.json:
Text Format (output_format: "text"):
# Transcript started at 2025-02-11 18:30:00
[00:00:02.500-00:00:05.300] [SPEAKER_00]: Hello, this is a test transcription.
[00:00:06.100-00:00:09.800] [SPEAKER_01]: Yes, I can hear you clearly.
JSON Format (output_format: "json"):
- Creates both a .txt file (for real-time monitoring) and a .json file
- JSON includes metadata, speaker labels, timestamps, and word-level details with confidence scores
Example JSON structure:
{
"metadata": {
"start_time": "2025-02-11T18:30:00",
"duration": 120.5,
"model": "turbo",
"language": "it",
"source": "microphone"
},
"segments": [
{
"speaker": "SPEAKER_00",
"start": 2.500,
"end": 5.300,
"text": "Hello, this is a test transcription.",
"words": [
{
"word": "Hello",
"start": 2.500,
"end": 2.800,
"probability": 0.95
}
]
}
]
}
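Because the JSON layout is fixed, downstream scripts can consume it directly. For example, a hypothetical post-processing step that totals speaking time per speaker (the file path is illustrative):

```python
import json
from collections import defaultdict

with open("transcripts/session.json") as f:  # illustrative path
    transcript = json.load(f)

talk_time = defaultdict(float)
for segment in transcript["segments"]:
    talk_time[segment["speaker"]] += segment["end"] - segment["start"]

for speaker, seconds in sorted(talk_time.items()):
    print(f"{speaker}: {seconds:.1f}s")
```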