# Voxt

A macOS menu bar voice input and translation app. Hold to speak, release to paste. <br>AI transcription with different rules for different apps and URLs.
English · 简体中文 · [Report Issues][github-issues-link] · Prompt · Meeting · Rewrite
[![][github-release-shield]][github-release-link] [![][macos-version-shield]][macos-version-link] [![][license-shield]][license-link] [![][release-date-shield]][release-date-link]
<img width="2028" height="1460" alt="image" src="https://github.com/user-attachments/assets/ee90a432-746a-457a-96b7-b67713dd49d9" /> </div>

## ✨ Feature Overview

### Speak, don't type (`fn`)

Speak and turn voice into text.
- Live transcription while you speak, with real-time text preview.
- Result enhancement: remove filler words, add punctuation automatically, and customize prompts your own way.
- App Branch groups let different apps or URLs use different enhancement rules and prompts, for coding, chat, email, and more.
- Personal dictionary support can inject exact terms into prompts and optionally auto-correct high-confidence near matches before output.
- Multilingual support with smooth mixed-language input.
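The dictionary near-match behavior above can be sketched with standard fuzzy matching. This is an illustrative sketch only, not Voxt's internals: the dictionary terms, the 0.8 cutoff, and the function names are all assumptions for the example.

```python
import difflib

# Illustrative sketch of high-confidence near-match auto-correction.
# DICTIONARY and the 0.8 cutoff are assumptions, not Voxt's real values.
DICTIONARY = ["Kubernetes", "PostgreSQL", "WhisperKit"]
_LOWER = {term.lower(): term for term in DICTIONARY}

def correct(word: str, cutoff: float = 0.8) -> str:
    """Replace a word with a dictionary term only on a confident match."""
    hits = difflib.get_close_matches(word.lower(), _LOWER, n=1, cutoff=cutoff)
    return _LOWER[hits[0]] if hits else word

def apply_dictionary(text: str) -> str:
    return " ".join(correct(word) for word in text.split())

print(apply_dictionary("deploy to kubernets with postgresql"))
# prints: deploy to Kubernetes with PostgreSQL
```

A high cutoff is the point of the feature: low-confidence matches pass through unchanged rather than risking a wrong substitution.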
### Speak and translate right away (`fn+shift`)
- AI translation immediately after transcription.
- Selected-text translation: highlight text and translate it directly with a shortcut.
- Custom translation prompts and terminology guidance, so output matches your habits.
- Separate model selection for translation, so you can pick the strongest or fastest model for the job.
### Use voice as a prompt (`fn+control`)
- Example: "Help me write a 200-word self-introduction." Your speech becomes the prompt, and the result is inserted automatically.
- Rewrite selected text by voice, for example: "Make this shorter and smoother."
- Optional rewrite answer card keeps generated content visible even when no writable input is focused.
- More than voice input: it also works like a voice-driven AI assistant.
### Meeting Notes (Beta, `fn+option`)
- A dedicated floating meeting card for long-running conversation capture.
- Current beta uses dual-source capture:
  - microphone audio is labeled as `Me`
  - system audio is labeled as `Them`
- Meeting mode follows the current ASR engine: `Whisper`, `MLX Audio`, or `Remote ASR`.
- Realtime behavior follows the current engine/model/provider configuration when available.
- The live meeting card is configured as non-shareable at the window level so it should stay out of normal screen sharing / window sharing output.
## Download / Install

Install via Homebrew:

```shell
brew tap hehehai/tap
brew install --cask voxt
```
## Model Support

<img width="1041" height="744" alt="image" src="https://github.com/user-attachments/assets/30d9e4fa-d88e-44db-8ab7-9d216c6a03d8" />

Voxt separates ASR provider models from LLM provider models: ASR models handle speech-to-text, while LLM models drive the text enhancement, translation, and rewrite flows.
System dictation is also supported through Apple Dictation, though multilingual coverage is more limited.
### Local Models

With macOS 15.0 or later and local model support, Voxt currently ships with:

- MLX Audio local ASR models
- Whisper via WhisperKit, as a separate local ASR engine
- a set of downloadable local LLM models for enhancement, translation, and rewriting
Whisper is not a sub-mode of MLX Audio. In the main window's Model page it appears as its own engine, with its own model list, download flow, and runtime options.
> [!NOTE]
> "Current status / errors" below comes from the current project code. "Language support / speed / recommendation" is summarized from model cards plus project descriptions. Speed and recommendation are for model selection guidance, not a unified benchmark.
Voxt also supports Direct Dictation via Apple SFSpeechRecognizer:
- Best for: quick setup when you do not want to download local models yet.
- Limitation: relatively limited multilingual support.
- Requirements: microphone permission plus speech recognition permission.
- Common error: `Speech Recognition permission is required for Direct Dictation.`
### Local ASR Models
Voxt's current MLX Audio catalog is broader than the short default picker suggests. The app currently exposes the following local STT families:
| Family | Built-in Variants | Language / Runtime Notes | Recommendation |
| --- | --- | --- | --- |
| Qwen3-ASR 0.6B | 4bit, 6bit, 8bit, bf16 | Multilingual general-purpose ASR with the lowest Qwen3 footprint | Default local ASR family; best overall balance |
| Qwen3-ASR 1.7B | 4bit, 6bit, 8bit, bf16 | Larger multilingual Qwen3 family with higher accuracy and memory cost | Accuracy-first local ASR |
| Voxtral Realtime Mini 4B | 4bit, 6bit, fp16 | Multilingual realtime-oriented family; these are the MLX Audio models Voxt currently treats as realtime-capable | Best when you want local realtime behavior |
| Parakeet | tdt_ctc-110m, tdt-0.6b-v2, tdt-0.6b-v3, ctc-0.6b, rnnt-0.6b, tdt-1.1b, tdt_ctc-1.1b, ctc-1.1b, rnnt-1.1b | English-first family with both lightweight and higher-capacity options | Best for English-heavy workflows and fast local iteration |
| GLM-ASR Nano | 2512-4bit | Smallest current footprint; model card positions it around Chinese and English usage | Good low-friction starter model |
| Granite Speech 4.0 | 1b-speech-5bit | Compact multilingual speech model between nano-tier and larger multilingual stacks | Balanced alternative when you want more quality than nano-tier |
| FireRed ASR 2 | AED-mlx | Offline-focused beam-search ASR path | Use when offline quality matters more than lightness |
| SenseVoice | SenseVoiceSmall | Fast multilingual model with language and event detection | Good utility choice for mixed-language or event-heavy audio |
Notes for the current MLX Audio integration:
- Voxt stores MLX Audio downloads under its `mlx-audio` model storage root and checks canonical model identifiers before deciding whether a model is already installed.
- Older saved model IDs are auto-migrated to the current canonical IDs for `Parakeet`, `GLM-ASR Nano`, `Voxtral Realtime`, and `FireRed ASR 2`, so existing settings should continue working after upgrade.
- Alignment-only repositories are rejected explicitly; for example, `Qwen3-ForcedAligner` is not treated as a transcription model.
- The current package source is the Voxt mirror fork `hehehai/mlx-audio-swift` pinned to `0.1.2-voxt.1`. See docs/MLXAudioDependency.md for the tag policy.
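The ID auto-migration described above amounts to a lookup with a pass-through fallback. The sketch below illustrates the shape of that behavior; the map entries are placeholders, not Voxt's actual legacy or canonical identifiers.

```python
# Sketch of legacy-ID migration: look the saved ID up in a legacy map
# and fall back to the saved value when no migration applies.
# The entries below are illustrative placeholders only.
LEGACY_TO_CANONICAL = {
    "parakeet-old-id": "example/parakeet-canonical-id",
    "glm-asr-old-id": "example/glm-asr-canonical-id",
}

def migrate_model_id(saved_id: str) -> str:
    """Return the canonical ID for a known legacy ID, else the ID unchanged."""
    return LEGACY_TO_CANONICAL.get(saved_id, saved_id)

print(migrate_model_id("parakeet-old-id"))   # example/parakeet-canonical-id
print(migrate_model_id("already-canonical"))  # already-canonical
```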
### Whisper (WhisperKit)
Voxt also supports Whisper as a separate on-device ASR engine through WhisperKit.
- Built-in model list: `tiny`, `base`, `small`, `medium`, `large-v3`
- Current download source: Hugging Face style model paths via `argmaxinc/whisperkit-coreml`
- China mirror: supported through the app's mirror setting
- Common runtime options: `Realtime` toggle (enabled by default), `VAD`, `Timestamps`, `Temperature`
- Current behavior:
  - standard transcription uses Whisper's `transcribe` task
  - the translation hotkey can optionally use Whisper's built-in translate-to-English task when the Translation provider is set to `Whisper`
  - if Whisper translation is unavailable for the current case, Voxt falls back to the selected LLM translation provider
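The fallback behavior above can be sketched as a small routing decision. The function and route names here are illustrative assumptions, not Voxt's API; the constraint they encode is that Whisper's built-in translate task only produces English from speech.

```python
# Illustrative routing sketch (names are assumptions, not Voxt's API):
# Whisper's built-in translate task only covers speech -> English, so
# any other case must go to the configured LLM translation provider.
def pick_translation_route(provider: str, target_lang: str, from_speech: bool) -> str:
    whisper_can_translate = (
        provider == "Whisper" and target_lang == "en" and from_speech
    )
    return "whisper-translate" if whisper_can_translate else "llm-translate"

print(pick_translation_route("Whisper", "en", True))   # whisper-translate
print(pick_translation_route("Whisper", "zh", True))   # llm-translate
print(pick_translation_route("Whisper", "en", False))  # selected text -> llm-translate
```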
Curated Whisper model list in Voxt:
| Model | Approx. Download Size | Recommendation | Notes |
| --- | --- | --- | --- |
| Whisper Tiny | about 76.6 MB | Medium | Smallest footprint, best for quick local drafts |
| Whisper Base | about 146.7 MB | High | Default Whisper balance for quality and speed |
| Whisper Small | about 486.5 MB | High | Better recognition quality with moderate local cost |
| Whisper Medium | about 1.53 GB | Very high | Accuracy-first local option with heavier download and memory use |
| Whisper Large-v3 | about 3.09 GB | Very high | Largest local Whisper option, best suited to Apple Silicon Macs with enough disk and memory headroom |
Whisper-specific notes:
- Whisper follows your selected main language for simplified/traditional Chinese output normalization.
- Whisper translation is only direct for speech-to-English scenarios; selected-text translation still uses the normal text translation flow.
- If a Whisper model download is interrupted or corrupted, Voxt now treats it as incomplete and requires a clean re-download instead of trying to load a broken model.
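The incomplete-download handling above can be sketched as a presence check over expected weight files before attempting a load. The component names below are assumptions based on typical WhisperKit Core ML model layouts, not a confirmed manifest from Voxt.

```python
from pathlib import Path

# Hypothetical completeness check: a model directory is loadable only
# if every expected Core ML component is present; otherwise the model
# is treated as incomplete and must be re-downloaded cleanly.
# Component names are assumptions for illustration.
EXPECTED_COMPONENTS = [
    "MelSpectrogram.mlmodelc",
    "AudioEncoder.mlmodelc",
    "TextDecoder.mlmodelc",
]

def model_is_complete(model_dir: Path) -> bool:
    """Check that every expected component exists under model_dir."""
    return all((model_dir / name).exists() for name in EXPECTED_COMPONENTS)
```

Validating up front and forcing a clean re-download is cheaper than letting a partially written model fail unpredictably at load time.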
Common local ASR errors / states:
- `Invalid model identifier`
- `Model repository unavailable (..., HTTP 401/404)`
- `Download failed (...)`
- `Model load failed (...)`
- `Size unavailable`
- If you accidentally point to an alignment-only repo, Voxt reports it as alignment-only and not supported by Voxt transcription
- Whisper may additionally surface incomplete-download or broken-model errors if required Core ML weight files are missing
### Local LLM Models
| Model | Repository ID | Size | Language Bias | Speed | Recommendation | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2 1.5B Instruct | Qwen/Qwen2-1.5B-Instruct | 1.5B | Balanced Chinese / English | Fast | High | Lightweight cleanup and simple translation |
| Qwen2.5 3B Instruct | Qwen/Qwen2.5-3B-Instruct | 3B | Balanced Chinese / English | Medium-fast | High | More stable enhancement and formatting |
| Qwen3 4B (4bit) | mlx-community/Qwen3-4B-4bit | 4B / 4bit | Chinese / English / multilingual | Medium-fast | Very high | Best overall local balance for enhancement and translation |
| Qwen3 8B (4bit) | mlx-community/Qwen3-8B-4bit | 8B / 4bit | Chinese / English / multilingual | Medium-slow | Very high | Stronger rewriting, translation, and structured output |
| GLM-4 9B (4bit) | `mlx