GeminiASR
A Python tool that uses the Google Gemini API to transcribe video or audio files into SRT subtitle files.
🎙️ Gemini ASR Transcription Tool
✨ Features
- 🎥 Supports various video (mp4, avi, mkv) and audio (mp3, wav) formats.
- ✂️ Automatically splits long files into smaller chunks for processing.
- 🧵 Uses multi-threading for parallel processing to speed up transcription.
- 🔄 Optionally rotates between multiple Google API Keys to improve request success rate.
- ⏱️ Generates SRT subtitle files with precise timestamps (millisecond accuracy).
- 🎬 Option to clip specific time segments of videos or audio for transcription.
- 📄 Option to save original transcription text returned by Gemini API.
- 💬 Supports passing additional prompts (text or file path) to guide the transcription model.
- 📁 Can process a single file or all supported files in an entire directory.
- ⏩ Provides a `--skip-existing` option to avoid reprocessing files that already have subtitles.
- 🐞 Supports DEBUG mode for detailed logging output.
- 🌈 Uses colored logs for easy distinction between different message levels.
- 🔗 Supports a proxy or a custom server such as gemini-balance.
  - To use gemini-balance, set the `BASE_URL` environment variable to `https://your-custom-url.com/`.
  - Note: code execution should be disabled when using gemini-balance.
- ⚙️ TOML Configuration Support: Comprehensive configuration management system with multiple sources.
  - 📝 Configuration file support with automatic search in multiple locations
  - 🔄 Multi-source configuration merging (CLI > Environment > TOML > Defaults)
  - 🎛️ Easy management of all settings in a single configuration file
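To make the chunking and timestamp features above concrete, here is a minimal sketch of how a long file could be cut into fixed-length windows and how a time inside one window could be turned into an SRT timestamp for the whole file. The function names (`segment_bounds`, `srt_timestamp`, `to_global_timestamp`) are hypothetical and are not taken from the GeminiASR source; only the 900-second default and the `HH:MM:SS,mmm` SRT layout are given facts.

```python
def segment_bounds(total_seconds: float, duration: float = 900.0) -> list[tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows of at most `duration` seconds."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + duration, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_global_timestamp(segment_start: float, local_seconds: float) -> str:
    """Shift a time measured inside one segment by that segment's start offset."""
    return srt_timestamp(segment_start + local_seconds)
```

With this approach, a cue at 12.345 s inside the second 900-second segment would be written as `to_global_timestamp(900, 12.345)`, so each segment can be transcribed independently and merged afterwards.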
🔧 Installation
🛠️ Environment Setup
- Install Python: Python 3.10 or higher is recommended.
- Install uv: If you haven't installed `uv` yet, please refer to the uv official documentation for installation. `uv` is an extremely fast Python package installer and manager.
📦 Install
Option A: Editable install (recommended for development)
pip install -e .
Then run:
geminiasr -i video.mp4
Option B: One-shot run with uv
uv run gemini_asr.py -i video.mp4
This will automatically install all required dependencies and execute the script.
🔑 API Keys Configuration
- Get Google API Key: Go to Google AI Studio to obtain your API key. You can get multiple keys to improve processing efficiency.
- Configuration Methods (choose one):
Option A: TOML Configuration File (Recommended)
# Copy the example configuration file
cp config.example.toml config.toml
# Edit config.toml and add your API keys
nano config.toml
In config.toml:
[api]
source = "gemini" # "gemini" or "openai"
google_api_keys = ["YOUR_API_KEY_1", "YOUR_API_KEY_2", "YOUR_API_KEY_3"]
Option B: Environment Variables
# Set environment variable with comma-separated keys
export GOOGLE_API_KEY=YOUR_API_KEY_1,YOUR_API_KEY_2,YOUR_API_KEY_3
Option C: .env File
# Create .env file in project root
echo "GOOGLE_API_KEY=YOUR_API_KEY_1,YOUR_API_KEY_2,YOUR_API_KEY_3" > .env
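As an illustration of how a comma-separated `GOOGLE_API_KEY` value might be consumed, the sketch below splits the value into individual keys and hands them out in round-robin order. `parse_api_keys` and `KeyRotator` are hypothetical names, and round-robin is only one possible rotation policy; the actual strategy used by GeminiASR is not described here.

```python
import itertools
import threading


def parse_api_keys(raw: str) -> list[str]:
    """Split a comma-separated key string, dropping blanks and surrounding whitespace."""
    return [key.strip() for key in raw.split(",") if key.strip()]


class KeyRotator:
    """Hand out keys in round-robin order; the lock keeps it safe across worker threads."""

    def __init__(self, keys: list[str]) -> None:
        if not keys:
            raise ValueError("no API keys configured")
        self._cycle = itertools.cycle(keys)
        self._lock = threading.Lock()

    def next_key(self) -> str:
        with self._lock:
            return next(self._cycle)


# Example with placeholder keys (in practice the string would come from the environment).
rotator = KeyRotator(parse_api_keys("YOUR_API_KEY_1, YOUR_API_KEY_2,YOUR_API_KEY_3"))
first_four = [rotator.next_key() for _ in range(4)]
```

After the last key is used, the rotator wraps around to the first one, so `first_four` ends with `YOUR_API_KEY_1` again.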
⚙️ Configuration System
GeminiASR supports a flexible configuration system with the following priority order:
- Command-line arguments (highest priority)
- Environment variables
- TOML configuration file
- Default values (lowest priority)
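The priority order above can be sketched with `collections.ChainMap`, which looks up a key in the first mapping that contains it. This is an assumption about one way to implement the merge, not the project's actual code; `merge_settings` is a hypothetical helper, and treating `None` as "not set" is also an assumption. The default values shown are the ones documented in the command-line help.

```python
from collections import ChainMap

# Defaults as documented in the CLI help.
DEFAULTS = {"lang": "zh-TW", "model": "gemini-2.5-flash", "duration": 900}


def merge_settings(cli: dict, env: dict, toml: dict, defaults: dict = DEFAULTS) -> dict:
    """Merge settings so that CLI > environment > TOML > defaults.

    A value of None means "not provided" and falls through to the next layer.
    """
    layers = [
        {name: value for name, value in layer.items() if value is not None}
        for layer in (cli, env, toml)
    ]
    return dict(ChainMap(*layers, defaults))


merged = merge_settings(
    cli={"duration": 300, "lang": None},    # only --duration given on the command line
    env={"lang": "en"},                     # e.g. GEMINIASR_LANG=en
    toml={"lang": "ja", "model": "gemini-2.5-pro"},
)
```

Here `duration` comes from the command line, `lang` from the environment (it outranks the TOML value), and `model` from the TOML file.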
Configuration File Locations (searched in order):
- `./config.toml` (current directory)
- `./.geminiasr/config.toml`
- `~/.geminiasr/config.toml`
- `~/.config/geminiasr/config.toml`
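A search over these locations could look roughly like the following. `candidate_paths` and `find_config` are hypothetical names, and letting an explicit `--config` path bypass the search is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Optional


def candidate_paths(cwd: Optional[Path] = None, home: Optional[Path] = None) -> list[Path]:
    """Return the documented config locations in search order."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    return [
        cwd / "config.toml",
        cwd / ".geminiasr" / "config.toml",
        home / ".geminiasr" / "config.toml",
        home / ".config" / "geminiasr" / "config.toml",
    ]


def find_config(explicit: Optional[str] = None,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing config file, or None if none is found."""
    if explicit is not None:
        # Assumption: a path passed via --config skips the search entirely.
        return Path(explicit)
    for path in candidate_paths(cwd, home):
        if path.is_file():
            return path
    return None
```

Passing `cwd` and `home` explicitly keeps the function easy to test without touching the real home directory.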
Environment Variable Whitelist:
- `GOOGLE_API_KEY` (comma-separated keys)
- `GEMINIASR_LANG`, `GEMINIASR_MODEL`, `GEMINIASR_DURATION`
- `GEMINIASR_MAX_WORKERS`, `GEMINIASR_IGNORE_KEYS_LIMIT`, `GEMINIASR_DEBUG`
- `GEMINIASR_SAVE_RAW`, `GEMINIASR_SKIP_EXISTING`, `GEMINIASR_PREVIEW`
- `GEMINIASR_MAX_SEGMENT_RETRIES`, `GEMINIASR_EXTRA_PROMPT`
- `GEMINIASR_API_SOURCE`
- `GEMINIASR_BASE_URL` or `BASE_URL`
OpenAI-Compatible Endpoint:
- Set `api.source = "openai"` (or `GEMINIASR_API_SOURCE=openai`).
- If `advanced.base_url` stays at the Gemini default, it will switch to `https://generativelanguage.googleapis.com/v1beta/openai/`.
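The base URL switch described above can be pictured as a small rule: if the source is `openai` and the base URL was left at the Gemini default, use the OpenAI-compatible endpoint; otherwise keep whatever was configured. The sketch below is a hypothetical illustration (`resolve_base_url` is not a name from the project), and ignoring a trailing slash when comparing is an assumption.

```python
GEMINI_DEFAULT_BASE_URL = "https://generativelanguage.googleapis.com/"
OPENAI_COMPAT_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def resolve_base_url(source: str, base_url: str = GEMINI_DEFAULT_BASE_URL) -> str:
    """Pick the endpoint for the configured API source."""
    untouched = base_url.rstrip("/") == GEMINI_DEFAULT_BASE_URL.rstrip("/")
    if source == "openai" and untouched:
        # The user asked for the OpenAI-compatible API but did not set a custom URL.
        return OPENAI_COMPAT_BASE_URL
    return base_url
```

A custom server URL is therefore never overwritten, regardless of the selected source.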
Example Configuration (config.toml):
# Transcription Settings
[transcription]
duration = 900 # Segment duration in seconds
lang = "zh-TW" # Language code
model = "gemini-2.5-flash" # Gemini model
skip_existing = true # Skip files with existing SRT
max_segment_retries = 3 # Max retries per segment
# Processing Settings
[processing]
max_workers = 24 # Max concurrent threads
ignore_keys_limit = true # Ignore API key limits
# Logging Settings
[logging]
debug = true # Enable debug logging
# API Settings
[api]
source = "gemini" # "gemini" or "openai"
google_api_keys = ["key1", "key2", "key3"]
# Advanced Settings
[advanced]
extra_prompt = "prompt.md" # Path to prompt file
base_url = "https://generativelanguage.googleapis.com/"
# base_url = "https://generativelanguage.googleapis.com/v1beta/openai/"
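For readers who want to inspect such a file programmatically, here is a minimal sketch that parses a shortened copy of the example with the standard-library `tomllib` module (Python 3.11+; on Python 3.10 the third-party `tomli` backport offers the same interface). The `flatten` helper and the dotted key names are illustrative assumptions that mirror the `api.source` and `advanced.base_url` notation used above; they are not the project's loader.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API for Python 3.10

EXAMPLE = """
[transcription]
duration = 900
lang = "zh-TW"
model = "gemini-2.5-flash"

[processing]
max_workers = 24

[api]
source = "gemini"
google_api_keys = ["key1", "key2", "key3"]
"""


def flatten(config: dict) -> dict:
    """Turn {"api": {"source": ...}} into {"api.source": ...} for easy lookup."""
    flat = {}
    for section, values in config.items():
        for name, value in values.items():
            flat[f"{section}.{name}"] = value
    return flat


settings = flatten(tomllib.loads(EXAMPLE))
```

TOML types are preserved, so `settings["transcription.duration"]` is an integer and `settings["api.google_api_keys"]` is a list of strings.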
📋 Usage
⌨️ Command Line Arguments
geminiasr [-h] -i INPUT [-d DURATION] [-l LANG] [-m MODEL]
[--start START] [--end END] [--save-raw]
[--skip-existing | --no-skip-existing]
[--debug] [--max-workers MAX_WORKERS]
[--extra-prompt EXTRA_PROMPT]
[--ignore-keys-limit] [--preview]
[--max-segment-retries MAX_SEGMENT_RETRIES]
[--config CONFIG]
arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
Input video, audio file, or folder containing media files
-d DURATION, --duration DURATION
Duration of each segment in seconds (default: 900)
-l LANG, --lang LANG Language code (default: zh-TW)
-m MODEL, --model MODEL
Gemini model (default: gemini-2.5-flash)
--start START Start time in seconds
--end END End time in seconds
--save-raw Save raw transcription results
--skip-existing Skip processing if SRT subtitle file already exists
--no-skip-existing Overwrite existing SRT files
--debug Enable DEBUG level logging
--max-workers MAX_WORKERS
Maximum number of worker threads (default: based on CPU and API keys)
--extra-prompt EXTRA_PROMPT
Additional prompt or path to a file containing prompts
--ignore-keys-limit Ignore the API key quantity limit on maximum worker threads
--preview Print a preview of raw transcription content
--max-segment-retries MAX_SEGMENT_RETRIES
Max retries per segment
--config CONFIG Path to config.toml
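The paired `--skip-existing | --no-skip-existing` flags in the usage line match what `argparse.BooleanOptionalAction` (Python 3.9+) generates. The sketch below rebuilds a small subset of the interface to show the pattern; it is an assumption about how such a parser could be written, not a copy of the real one, and `build_parser` is a hypothetical name. Leaving the flag's default as `None` is one way to tell "not given on the command line" apart from an explicit choice.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser covering a few of the documented options."""
    parser = argparse.ArgumentParser(prog="geminiasr")
    parser.add_argument("-i", "--input", required=True,
                        help="Input video, audio file, or folder containing media files")
    parser.add_argument("-d", "--duration", type=int, default=900,
                        help="Duration of each segment in seconds")
    parser.add_argument("-l", "--lang", default="zh-TW", help="Language code")
    # BooleanOptionalAction registers both --skip-existing and --no-skip-existing.
    parser.add_argument("--skip-existing", action=argparse.BooleanOptionalAction,
                        default=None,
                        help="Skip processing if SRT subtitle file already exists")
    return parser


args = build_parser().parse_args(["-i", "video.mp4", "-d", "300", "--no-skip-existing"])
```

In this example `args.skip_existing` is `False`, while omitting both flags would leave it as `None`.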
💡 Usage Examples
- Using TOML Configuration (Recommended):
# All settings from config.toml - just specify input
geminiasr -i video.mp4
# Process entire directory with TOML settings
geminiasr -i /path/to/media/folder
- Traditional Command-Line Usage:
# Basic transcription
geminiasr -i video.mp4
# With custom settings
geminiasr -i video.mp4 -d 300 --debug
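When the input is a folder, the tool has to decide which files to transcribe. The sketch below shows one plausible selection rule, with several assumptions stated up front: the extension set contains only the formats named in the feature list (the real list may be longer), the scan is not recursive, and an existing subtitle is detected as a `.srt` file with the same stem next to the media file. `files_to_process` is a hypothetical helper.

```python
from pathlib import Path

# Only the formats named in the feature list; the real tool may accept more.
SUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mkv", ".mp3", ".wav"}


def files_to_process(input_path: str, skip_existing: bool = True) -> list[Path]:
    """Return supported media files, optionally skipping ones that already have an SRT."""
    path = Path(input_path)
    if path.is_file():
        candidates = [path]
    else:
        candidates = sorted(p for p in path.iterdir() if p.is_file())

    selected = []
    for media in candidates:
        if media.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        # Assumption: the subtitle sits beside the media file, e.g. talk.mp4 -> talk.srt.
        if skip_existing and media.with_suffix(".srt").exists():
            continue
        selected.append(media)
    return selected
```

Calling it with `skip_existing=False` corresponds to `--no-skip-existing`, where existing subtitles would be overwritten.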
🔍 Technical Details About Audio Processing
[!NOTE] The old default model (`gemini-2.5-pro`) is free but has some limits. Now the default model is `gemini-2.5-flash`.
[!IMPORTANT] Although `gemini-3-pro-preview` and `gemini-3-flash-preview` have been released, under the current prompt template, their timestamp accuracy is far inferior to `gemini-2.5-pro` and even `gemini-2.5-flash`. Therefore, considering all factors, we still recommend using the `gemini-2.5-flash` model.
- 🧮 Token Usage: Gemini uses 32 tokens per second of audio (1,920 tokens/minute). For more details on audio processing capabilities, see Gemini Audio Documentation.
- 📈 Output Tokens: Gemini 2.5 Pro/Flash has a limit of 65,536 output tokens per request, which affects the maximum duration of processable audio. See Gemini Models Documentation for details.
- 📊 Rate Limits: The former default model (`gemini-2.5-pro`) is free during the preview period but subject to specific limits: 250,000 TPM (tokens per minute), 5 RPM (requests per minute), and 100 RPD (requests per day). See Rate Limits Documentation for details.
- 💰 Pricing: Paid tier costs $1.25 per million tokens (≤200k tokens) or $2.50 per million tokens (>200k tokens). For audio longer than 2 hours, it is recommended to split the file to avoid excessive token usage and potential cost overruns. See Gemini Developer API Pricing for complete pricing information.
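The figures above lend themselves to a quick back-of-the-envelope check. The sketch below uses the stated 32 tokens per second and the two price tiers to estimate the audio input tokens and cost for a given duration. It is deliberately simplified: it assumes the quoted rates apply to audio input tokens and ignores prompt and output tokens, and the helper names are hypothetical.

```python
import math

TOKENS_PER_SECOND = 32            # audio token rate stated above
TIER_THRESHOLD_TOKENS = 200_000   # boundary between the two price tiers
LOW_TIER_USD_PER_MILLION = 1.25
HIGH_TIER_USD_PER_MILLION = 2.50


def audio_tokens(seconds: float) -> int:
    """Tokens consumed by `seconds` of audio."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def input_cost_usd(seconds: float) -> float:
    """Rough input cost, assuming the quoted rates apply to audio tokens only."""
    tokens = audio_tokens(seconds)
    if tokens <= TIER_THRESHOLD_TOKENS:
        rate = LOW_TIER_USD_PER_MILLION
    else:
        rate = HIGH_TIER_USD_PER_MILLION
    return tokens / 1_000_000 * rate


default_segment_tokens = audio_tokens(900)    # one default 900 s segment
two_hour_tokens = audio_tokens(2 * 60 * 60)   # a single 2-hour request
```

A default 900-second segment is 28,800 tokens, while a single 2-hour request is 230,400 tokens, which already crosses the 200k boundary into the higher price tier. Under these assumptions, that is the arithmetic behind the advice to split long audio.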
🤝 Contributing
Thanks for your interest in GeminiASR! This guide keeps contributions simple and consistent.
Quick Start
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Development Notes
- Target Python: 3.10+
- Configuration: use `config.example.toml` as a template
- Avoid committing secrets: use `.env` or untracked `config.toml`
Lint & Test
ruff check .
pytest
Pull Request Checklist
- [ ] Code is formatted and linted
- [ ] Tests added or updated when applicable
- [ ] README or docs updated if behavior changes
- [ ] No secrets or credentials included
📄 License
MIT License. See LICENSE.
