# 🎙️ Video Whisper

Local video/audio transcription on Apple Silicon using MLX Whisper.
No API keys. No cloud. No cost. Runs entirely on your Mac.

Supports YouTube, Bilibili (B站), Xiaohongshu (小红书), Douyin (抖音), podcasts, and local files.
## Quick Start

```bash
# Install dependencies
brew install yt-dlp ffmpeg
python3 -m venv ~/.openclaw/venvs/whisper
~/.openclaw/venvs/whisper/bin/pip install mlx-whisper

# Transcribe
bash scripts/transcribe.sh "https://www.youtube.com/watch?v=..."
```

Output: `/tmp/whisper_output.txt` (plain text) and `/tmp/whisper_output.json` (with timestamps).
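The JSON output can be post-processed however you like. As one sketch (assuming the JSON follows Whisper's usual result shape, a `segments` list of `{"start", "end", "text"}` dicts), here is a small stdlib-only Python helper that turns it into SRT subtitles:

```python
import json


def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT-style HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render Whisper-style segments as an SRT subtitle document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    try:
        with open("/tmp/whisper_output.json") as f:
            print(segments_to_srt(json.load(f)["segments"]))
    except FileNotFoundError:
        pass  # run scripts/transcribe.sh first to produce the JSON
```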
## Features

- 🚀 Apple Silicon optimized – MLX framework, fast inference on M1/M2/M3/M4
- 🌍 Multi-language – auto-detects language, strong Chinese/English/Japanese support
- 📺 Multi-platform – YouTube, Bilibili, Xiaohongshu, Douyin, and 1000+ sites
- 📁 Local files – MP4, MP3, WAV, M4A, etc.
- ⏱️ Timestamps – JSON output includes per-segment timing
- 🤖 OpenClaw ready – drop into `skills/` and let your AI agent transcribe & summarize
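Beyond the bundled script, `mlx-whisper` can also be called from Python directly via its `transcribe()` function. A minimal sketch (the `mlx-community/...` repo names and the `hf_repo_for` helper are assumptions for illustration; check the Hugging Face hub for the exact converted-model names):

```python
# Hugging Face repos for MLX-converted Whisper weights (assumed names on the
# mlx-community hub; verify against the hub before relying on them).
MLX_WHISPER_REPOS = {
    "tiny": "mlx-community/whisper-tiny-mlx",
    "base": "mlx-community/whisper-base-mlx",
    "small": "mlx-community/whisper-small-mlx",
    "medium": "mlx-community/whisper-medium-mlx",
    "large-v3": "mlx-community/whisper-large-v3-mlx",
}


def hf_repo_for(size: str) -> str:
    """Map a model size to its assumed mlx-community repo name."""
    return MLX_WHISPER_REPOS[size]


def transcribe(audio_path: str, size: str = "medium") -> str:
    """Transcribe a local audio/video file and return the full text.

    Requires Apple Silicon and `pip install mlx-whisper`.
    """
    import mlx_whisper  # lazy import: only needed when actually transcribing

    # transcribe() returns a dict with "text" plus per-segment "segments"
    result = mlx_whisper.transcribe(audio_path, path_or_hf_repo=hf_repo_for(size))
    return result["text"]
```

Usage: `transcribe("talk.m4a", size="small")` returns the transcript as one string.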
## Performance

On a Mac mini M4 (16GB), with the medium model:

| Video Length | Transcription Time |
|--------------|--------------------|
| 5 min        | ~30–40 s           |
| 10 min       | ~60–90 s           |
| 30 min       | ~3–4 min           |
| 60 min       | ~6–8 min           |
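The timings above work out to roughly 0.12× real time (e.g. ~420 s for a 60-min video). That ratio gives a quick back-of-the-envelope estimator for other lengths (the 0.12 factor is derived from this table only and will vary by machine and model):

```python
# Rough real-time factor for the medium model on an M4 Mac mini,
# derived from the table above (~7 min of compute per 60-min video).
MEDIUM_RTF = 0.12


def estimated_seconds(video_minutes: float, rtf: float = MEDIUM_RTF) -> float:
    """Back-of-the-envelope transcription time estimate, in seconds."""
    return video_minutes * 60 * rtf
```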
See `SKILL.md` for full docs, model options, and the integration guide.
## License

MIT
