# Meeting-to-Text: Local Speaker-Separated Transcription Skill

## The Problem
There is a frequent need to transcribe meeting audio or video recordings into text, especially when using AI assistants like OpenClaw that require structured, machine-readable formats. However:
- Commercial services (like Feishu/Lark Minutes or iFlytek) can be expensive.
- Open-source models are notoriously difficult to deploy.
- Video-to-audio extraction, speech-to-text (ASR), and speaker diarization (Speaker Separation) usually belong to entirely different modules and repositories, making integration frustrating.
- It is difficult for AI agents to call these fragmented tools seamlessly.
## The Solution
Meeting-to-Text solves this by packaging these complex capabilities into a single, cohesive Agent Skill. Follow the setup instructions to deploy the local dependencies and models, and you can perform speaker-separated transcription entirely on your machine. The system is also lightweight: after full deployment, everything (including all models) takes up less than 3 GB of disk space.
Read this document in other languages: 中文(简体)
## Features
- True Offline Processing: 100% local. No data leaves your machine.
- Auto Audio Extraction: Uses FFmpeg to extract audio streams from MP4, MKV, MOV, and more.
- Accurate ASR: Leverages Alibaba's `SenseVoiceSmall` for fast, accurate multilingual transcription.
- Speaker Diarization: Uses `3D-Speaker` to cluster voices and generate separated transcripts with speaker labels (e.g., `Speaker 1`, `Speaker 2`).
- Universal Agent & CLI Ready: Structured as a `.md` skill definition and a single Python entrypoint. While we use Codex/OpenClaw as examples, it can be attached to any AI agent, copilot, or CLI automation flow.
## Supported Formats

- Video: `.mp4`, `.mkv`, `.mov`, `.avi`, `.webm`
- Audio: `.wav`, `.mp3`, `.m4a`, `.aac`, `.flac`, `.ogg`
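Agents or wrapper scripts that route inputs before calling the skill can mirror the two lists above in a small dispatch helper. This is an illustrative sketch only — the function name and extension sets below are not part of the skill's API:

```python
from pathlib import Path

# Extension sets mirroring the supported-format lists above.
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".aac", ".flac", ".ogg"}

def classify_input(path: str) -> str:
    """Return 'video' or 'audio', or raise for unsupported files."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"   # needs an FFmpeg extraction pass first
    if ext in AUDIO_EXTS:
        return "audio"   # can be fed to the ASR stage directly
    raise ValueError(f"Unsupported input format: {ext}")
```

Matching is case-insensitive, so `meeting.MP4` is still recognized as video.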
## Quick Start (Agent Automated Install)
Instead of manually installing dependencies and models, copy and paste the following prompt to your AI assistant (e.g., Codex, OpenClaw, or Cursor):
> Please follow the instructions in this repository to install this skill for me, including its required models and dependencies: https://github.com/henrCh1/meeting-to-text
## Manual Installation Guide

### 1. Prerequisites
- Python 3.10+
- FFmpeg: Download a pre-compiled Windows build (e.g., from gyan.dev) and extract it. You will need the path to `ffmpeg.exe`.
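Before installing anything else, the two prerequisites above can be sanity-checked with a short sketch. `check_prerequisites` is a hypothetical helper, not part of the repository:

```python
import shutil
import sys
from typing import Optional

def check_prerequisites(ffmpeg_hint: Optional[str] = None) -> dict:
    """Check the prerequisites above: Python 3.10+ and an FFmpeg binary.

    ffmpeg_hint lets you pass an explicit path (e.g., to ffmpeg.exe) when
    FFmpeg is not on PATH; otherwise shutil.which is consulted.
    """
    return {
        "python_ok": sys.version_info >= (3, 10),
        "ffmpeg_path": ffmpeg_hint or shutil.which("ffmpeg"),  # None = not found
    }
```

If `ffmpeg_path` comes back `None`, point the hint (or later, the `MEETING_TO_TEXT_FFMPEG` variable) at your extracted `ffmpeg.exe`.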
### 2. Install Python Dependencies
It is highly recommended to use a virtual environment (venv or conda).
```powershell
# Create and activate your environment
python -m venv envs\asr
.\envs\asr\Scripts\activate

# Install requirements
pip install -r requirements.txt
```
### 3. Clone Required Repositories
You need the 3D-Speaker repository available locally for speaker diarization.
```powershell
mkdir repos
cd repos
git clone https://github.com/alibaba-damo-academy/3D-Speaker.git
```
### 4. Download Models

The system requires two local models from ModelScope/HuggingFace. Download them into your `models/` directory:

- `SenseVoiceSmall` (the ASR model)
- `fsmn-vad` (the voice activity detection model)

(Note: 3D-Speaker checkpoints are cached to the hub directory automatically on first run.)
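To confirm the downloads landed where the skill expects them, a sketch like the following can check the `models/` directory. The folder names are assumptions taken from the `MEETING_TO_TEXT_SENSEVOICE` and `MEETING_TO_TEXT_VAD` variables in the configuration section; `verify_models` itself is not part of the repository:

```python
from pathlib import Path

def verify_models(models_dir: str) -> list:
    """Return the names of required model folders missing under models_dir.

    Folder names are assumed from this skill's environment variables:
    SenseVoiceSmall (ASR) and fsmn-vad (voice activity detection).
    """
    required = ["SenseVoiceSmall", "fsmn-vad"]
    root = Path(models_dir)
    return [name for name in required if not (root / name).is_dir()]
```

An empty return list means both model folders are in place.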
## Configuration & Usage
The skill relies on absolute paths to function deterministically. You must configure the paths before using the skill.
### Standalone Python Setup
You can run the script manually by setting the paths as environment variables, or by editing the fallback paths in `skills/meeting-to-text/scripts/meeting_to_text.py`:
```powershell
$env:MEETING_TO_TEXT_FFMPEG="C:\path\to\ffmpeg\bin\ffmpeg.exe"
$env:MEETING_TO_TEXT_SENSEVOICE="C:\path\to\models\SenseVoiceSmall"
$env:MEETING_TO_TEXT_VAD="C:\path\to\models\fsmn-vad"
$env:MEETING_TO_TEXT_3D_SPEAKER="C:\path\to\repos\3D-Speaker"

python skills/meeting-to-text/scripts/meeting_to_text.py --input "C:\path\to\video.mp4" --output "C:\path\to\output.txt"
```
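The same invocation can also be scripted cross-platform. A minimal sketch, assuming the paths above; `build_transcribe_command` and `run_transcription` are illustrative names, not part of the skill:

```python
import os
import subprocess

def build_transcribe_command(python_exe, script_path, input_path, output_path):
    """Mirror the CLI invocation above as an argv list (avoids shell quoting)."""
    return [python_exe, script_path, "--input", input_path, "--output", output_path]

def run_transcription(cmd, path_overrides):
    """Run the skill with MEETING_TO_TEXT_* variables injected for this call only."""
    env = {**os.environ, **path_overrides}  # equivalent to the $env: lines above
    return subprocess.run(cmd, env=env, check=True)
```

Passing the four `MEETING_TO_TEXT_*` paths through `path_overrides` keeps them scoped to the subprocess instead of mutating your shell session.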
### Using as an Agent Skill (Codex, OpenClaw, AutoGPT, etc.)
1. Open `skills/meeting-to-text/SKILL.md`.
2. Find the `<YOUR_CONDA_ENV_PYTHON_PATH>` placeholder and replace it with the absolute path to your virtual environment's `python.exe`.
3. Find the `C:\path\to\your\meeting-to-text\scripts\meeting_to_text.py` placeholder and replace it with the absolute path to the script.
4. Replace `<YOUR_WORKSPACE_TEMP_PATH>` with a temporary directory path (e.g., `C:\temp\meeting_workspace`).
5. Register the `SKILL.md` with your agent platform.
## Output Format

The resulting `.txt` file will look like this:

```
[00:00:00 - 00:00:05] Speaker 1: Hello everyone, let's start the meeting.
[00:00:05 - 00:00:10] Speaker 2: Yes, I can hear you clearly. Let's begin.
```
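Downstream tooling can parse this fixed layout with a simple regular expression. A sketch, assuming timestamps stay in `HH:MM:SS` form and labels follow the `Speaker N` pattern shown above:

```python
import re

# Matches lines like: [00:00:00 - 00:00:05] Speaker 1: Hello everyone...
LINE_RE = re.compile(
    r"\[(\d{2}:\d{2}:\d{2}) - (\d{2}:\d{2}:\d{2})\] (Speaker \d+): (.*)"
)

def parse_transcript(text: str) -> list:
    """Turn the skill's .txt output into a list of structured records."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            start, end, speaker, utterance = m.groups()
            records.append(
                {"start": start, "end": end, "speaker": speaker, "text": utterance}
            )
    return records
```

Lines that do not match the pattern (blank lines, headers) are skipped rather than raising, so the parser tolerates minor formatting noise.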
## License
Provided under the MIT License. See LICENSE for details.
