Whisperer
Go from raw audio files to a text-audio dataset automatically with OpenAI's Whisper.
whisperer
Go from raw audio files to speaker-separated text-audio datasets automatically.

Summary
This repo takes a directory of audio files and converts them into a text-audio dataset whose audio lengths follow a roughly normal distribution. See AnalyzeDataset.ipynb for examples of the dataset distributions across audio and text length.
The output is a text-audio dataset that can be used for training a speech-to-text or text-to-speech model. The dataset structure is as follows:
│── /dataset
│   ├── metadata.txt
│   └── wavs/
│       ├── audio1.wav
│       └── audio2.wav
metadata.txt
peters_0.wav|Beautiful is better than ugly.
peters_1.wav|Explicit is better than implicit.
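To make the metadata.txt layout concrete, here is a minimal sketch of reading it back in Python. It assumes only what the example shows: one clip per line, with the file name and the transcript separated by a pipe. The helper names `parse_metadata` and `load_dataset` are hypothetical and are not part of whisperer_ml.

```python
from pathlib import Path


def parse_metadata(text):
    """Parse 'filename|transcript' lines into (filename, transcript) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first pipe so a transcript may itself contain '|'.
        filename, transcript = line.split("|", 1)
        pairs.append((filename, transcript))
    return pairs


def load_dataset(dataset_dir):
    """Read metadata.txt and pair each transcript with its path under wavs/."""
    root = Path(dataset_dir)
    text = (root / "metadata.txt").read_text(encoding="utf-8")
    return [(root / "wavs" / name, transcript)
            for name, transcript in parse_metadata(text)]


# The two example lines shown above.
sample = (
    "peters_0.wav|Beautiful is better than ugly.\n"
    "peters_1.wav|Explicit is better than implicit.\n"
)
print(parse_metadata(sample))
```

A line without a pipe raises `ValueError` in this sketch, which is a simple way to surface a malformed metadata file early.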
Key Features
- Audio files are automatically split by speakers
- Speakers are auto-labeled across the files
- Audio splits on silences
- Audio splitting is configurable
- The dataset is built so that clip lengths follow a Gaussian-like distribution, which in turn can lead to Gaussian-like distributions in the rest of the dataset statistics. Of course, this is highly dependent on your audio sources.
- Leverages the GPUs available on your machine. GPUs can also be set explicitly if you only want to use some of them.
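As an illustration of the "splits on silences" idea, the following is a generic amplitude-threshold sketch, not the method whisperer_ml actually uses. The function name and the `threshold` and `min_silence` parameters are hypothetical and do not correspond to settings in config.py. Real pipelines usually work on frame energies rather than raw samples.

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Return (start, end) index pairs of non-silent segments.

    A run of at least `min_silence` consecutive samples whose absolute
    value is below `threshold` is treated as a silence separating clips.
    Shorter quiet runs stay inside the surrounding segment.
    """
    segments = []
    start = None      # start index of the segment being built, if any
    last_loud = None  # index of the most recent loud sample in that segment
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            last_loud = i
        elif start is not None and i - last_loud >= min_silence:
            # The quiet run is long enough: close the segment after the last loud sample.
            segments.append((start, last_loud + 1))
            start = None
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments


signal = [0.0, 0.5, 0.4, 0.0, 0.0, 0.0, 0.3, 0.2, 0.0]
print(split_on_silence(signal))
```

Making the minimum silence length a parameter is what keeps short pauses inside a clip while longer gaps become split points.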
Installation
You have two options:
- Install from PyPi with pip
pip install whisperer-ml
- User-friendly web app: Whisperer Web
Take a look at the demo in your browser.
Note: under development, but ready to be used.
How to use:
- Create data folder and move audio files to it
mkdir data data/raw_files
- There are four commands
  - Convert
    whisperer_ml convert path/to/data/raw_files
  - Diarize
    whisperer_ml diarize path/to/data/raw_files
  - Auto-Label
    whisperer_ml auto-label path/to/data/raw_files number_speakers
  - Transcribe
    whisperer_ml transcribe path/to/data/raw_files your_dataset_name
  - Help lists all commands
    whisperer_ml --help
  - You can run help on a specific command
    whisperer_ml convert --help
- Use the AnalyseDataset.ipynb notebook to visualize the distribution of the dataset
- Use the AnalyseSilence.ipynb notebook to experiment with silence detection configuration
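If you prefer to drive the four commands from a script, a minimal sketch could look like the following. It assumes the steps are meant to run in the order they are listed (convert, diarize, auto-label, transcribe), which this README suggests but does not state outright. `build_pipeline` and `run_pipeline` are hypothetical helper names. Only the command lines themselves come from the text above.

```python
import subprocess


def build_pipeline(raw_dir, number_speakers, dataset_name):
    """Return the four whisperer_ml command lines as argument lists."""
    return [
        ["whisperer_ml", "convert", raw_dir],
        ["whisperer_ml", "diarize", raw_dir],
        ["whisperer_ml", "auto-label", raw_dir, str(number_speakers)],
        ["whisperer_ml", "transcribe", raw_dir, dataset_name],
    ]


def run_pipeline(raw_dir, number_speakers, dataset_name):
    """Run each step in turn, stopping at the first failure."""
    for cmd in build_pipeline(raw_dir, number_speakers, dataset_name):
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)


# Inspect the commands without executing them.
for cmd in build_pipeline("data/raw_files", 2, "my_dataset"):
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to review or log the commands before anything runs.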
Using Multiple GPUs
The code automatically detects how many GPUs are available and distributes the audio files in data/wav_files evenly across them.
The automatic detection is done through nvidia-smi.
You can make the available GPUs explicit by setting the environment variable CUDA_AVAILABLE_DEVICES.
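The even distribution described above can be sketched as follows. This is an illustration only, under two assumptions the README does not confirm: that the environment variable holds a comma-separated list of GPU indices, and that files are assigned round-robin. The function names are hypothetical, and the detected GPU count is passed in rather than queried from nvidia-smi.

```python
import os


def visible_gpus(detected_count, env=None):
    """Return GPU ids to use: the explicit list if set, else all detected ones."""
    env = os.environ if env is None else env
    # Variable name as given in this README; the value format is an assumption.
    explicit = env.get("CUDA_AVAILABLE_DEVICES", "").strip()
    if explicit:
        return [int(x) for x in explicit.split(",") if x.strip()]
    return list(range(detected_count))


def distribute(files, gpu_ids):
    """Assign files to GPUs round-robin so bucket sizes differ by at most one."""
    if not gpu_ids:
        raise ValueError("no GPUs to distribute files across")
    buckets = {gpu: [] for gpu in gpu_ids}
    for i, f in enumerate(files):
        buckets[gpu_ids[i % len(gpu_ids)]].append(f)
    return buckets


files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]
print(distribute(files, visible_gpus(2, env={})))
```

Round-robin assignment balances the number of files per GPU, not their total duration, so a few very long recordings could still leave one GPU busier than the others.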
Configuration
Modify the config.py file to change the parameters of the dataset creation, including silence detection.
To Do
- [x] Speech Diarization
- [x] Replace click with typer
Acknowledgements
