# Voxt

A macOS menu bar voice input and translation app. Hold to speak, release to paste. <br>AI transcription with different rules for different apps and URLs.
English · 简体中文 · [Report Issues][github-issues-link] · Prompt · Meeting · Rewrite
[![][github-release-shield]][github-release-link] [![][macos-version-shield]][macos-version-link] [![][license-shield]][license-link] [![][release-date-shield]][release-date-link]
<img width="2028" height="1460" alt="image" src="https://github.com/user-attachments/assets/ee90a432-746a-457a-96b7-b67713dd49d9" /> </div>

## ✨ Feature Overview

### Speak, don't type (`fn`)

Speak and turn voice into text.
- Live transcription while you speak, with real-time text preview.
- Result enhancement: remove filler words, add punctuation automatically, and customize prompts your own way.
- App Branch groups let different apps or URLs use different enhancement rules and prompts, for coding, chat, email, and more.
- Personal dictionary support can inject exact terms into prompts and optionally auto-correct high-confidence near matches before output.
- Multilingual support with smooth mixed-language input.
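The dictionary near-match behavior above can be sketched with standard fuzzy matching. This is an illustrative sketch only, not Voxt's internals: the dictionary terms, the 0.8 cutoff, and the function names are all assumptions for the example.

```python
import difflib

# Illustrative sketch of high-confidence near-match auto-correction.
# DICTIONARY and the 0.8 cutoff are assumptions, not Voxt's real values.
DICTIONARY = ["Kubernetes", "PostgreSQL", "WhisperKit"]
_LOWER = {term.lower(): term for term in DICTIONARY}

def correct(word: str, cutoff: float = 0.8) -> str:
    """Replace a word with a dictionary term only on a confident match."""
    hits = difflib.get_close_matches(word.lower(), _LOWER, n=1, cutoff=cutoff)
    return _LOWER[hits[0]] if hits else word

def apply_dictionary(text: str) -> str:
    return " ".join(correct(word) for word in text.split())

print(apply_dictionary("deploy to kubernets with postgresql"))
# prints: deploy to Kubernetes with PostgreSQL
```

A high cutoff is the point of the feature: low-confidence matches pass through unchanged rather than risking a wrong substitution.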
### Speak and translate right away (`fn+shift`)
- AI translation immediately after transcription.
- Selected-text translation: highlight text and translate it directly with a shortcut.
- Custom translation prompts and terminology guidance, so output matches your habits.
- Separate model selection for translation, so you can pick the strongest or fastest model for the job.
### Use voice as a prompt (`fn+control`)
- Example: "Help me write a 200-word self-introduction." Your speech becomes the prompt, and the result is inserted automatically.
- Rewrite selected text by voice, for example: "Make this shorter and smoother."
- Optional rewrite answer card keeps generated content visible even when no writable input is focused.
- More than voice input: it also works like a voice-driven AI assistant.
### Meeting Notes (Beta, `fn+option`)
- A dedicated floating meeting card for long-running conversation capture.
- Current beta uses dual-source capture:
  - microphone audio is labeled as `Me`
  - system audio is labeled as `Them`
- Meeting mode follows the current ASR engine: `Whisper`, `MLX Audio`, or `Remote ASR`.
- Realtime behavior follows the current engine/model/provider configuration when available.
- The live meeting card is configured as non-shareable at the window level so it should stay out of normal screen sharing / window sharing output.
## Download / Install

Install via Homebrew:

```shell
brew tap hehehai/tap
brew install --cask voxt
```
## Model Support

<img width="1041" height="744" alt="image" src="https://github.com/user-attachments/assets/30d9e4fa-d88e-44db-8ab7-9d216c6a03d8" />

Voxt separates ASR provider models from LLM provider models: ASR models handle speech-to-text, while LLM models drive the text enhancement, translation, and rewrite flows.
System dictation is also supported through Apple Dictation, though multilingual coverage is more limited.
### Local Models

With macOS 15.0 or later and local model support, Voxt currently ships with:

- MLX Audio local ASR models
- Whisper via WhisperKit, as a separate local ASR engine
- a set of downloadable local LLM models for enhancement, translation, and rewriting
Whisper is not a sub-mode of MLX Audio. In the main window's Model page it appears as its own engine, with its own model list, download flow, and runtime options.
> [!NOTE]
> "Current status / errors" below comes from the current project code. "Language support / speed / recommendation" is summarized from model cards plus project descriptions. Speed and recommendation are for model selection guidance, not a unified benchmark.
Voxt also supports Direct Dictation via Apple SFSpeechRecognizer:
- Best for: quick setup when you do not want to download local models yet.
- Limitation: relatively limited multilingual support.
- Requirements: microphone permission plus speech recognition permission.
- Common error: `Speech Recognition permission is required for Direct Dictation.`
### Local ASR Models
Voxt's current MLX Audio catalog is broader than the short default picker suggests. The app currently exposes the following local STT families:
| Family | Built-in Variants | Language / Runtime Notes | Recommendation |
| --- | --- | --- | --- |
| Qwen3-ASR 0.6B | 4bit, 6bit, 8bit, bf16 | Multilingual general-purpose ASR with the lowest Qwen3 footprint | Default local ASR family; best overall balance |
| Qwen3-ASR 1.7B | 4bit, 6bit, 8bit, bf16 | Larger multilingual Qwen3 family with higher accuracy and memory cost | Accuracy-first local ASR |
| Voxtral Realtime Mini 4B | 4bit, 6bit, fp16 | Multilingual realtime-oriented family; these are the MLX Audio models Voxt currently treats as realtime-capable | Best when you want local realtime behavior |
| Parakeet | tdt_ctc-110m, tdt-0.6b-v2, tdt-0.6b-v3, ctc-0.6b, rnnt-0.6b, tdt-1.1b, tdt_ctc-1.1b, ctc-1.1b, rnnt-1.1b | English-first family with both lightweight and higher-capacity options | Best for English-heavy workflows and fast local iteration |
| GLM-ASR Nano | 2512-4bit | Smallest current footprint; model card positions it around Chinese and English usage | Good low-friction starter model |
| Granite Speech 4.0 | 1b-speech-5bit | Compact multilingual speech model between nano-tier and larger multilingual stacks | Balanced alternative when you want more quality than nano-tier |
| FireRed ASR 2 | AED-mlx | Offline-focused beam-search ASR path | Use when offline quality matters more than lightness |
| SenseVoice | SenseVoiceSmall | Fast multilingual model with language and event detection | Good utility choice for mixed-language or event-heavy audio |
Notes for the current MLX Audio integration:
- Voxt stores MLX Audio downloads under its `mlx-audio` model storage root and checks canonical model identifiers before deciding whether a model is already installed.
- Older saved model IDs are auto-migrated to the current canonical IDs for `Parakeet`, `GLM-ASR Nano`, `Voxtral Realtime`, and `FireRed ASR 2`, so existing settings should continue working after upgrade.
- Alignment-only repositories are rejected explicitly; for example, `Qwen3-ForcedAligner` is not treated as a transcription model.
- The current package source is the Voxt mirror fork `hehehai/mlx-audio-swift` pinned to `0.1.2-voxt.1`. See docs/MLXAudioDependency.md for the tag policy.
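The ID auto-migration described above amounts to a lookup with a pass-through fallback. The sketch below illustrates the shape of that behavior; the map entries are placeholders, not Voxt's actual legacy or canonical identifiers.

```python
# Sketch of legacy-ID migration: look the saved ID up in a legacy map
# and fall back to the saved value when no migration applies.
# The entries below are illustrative placeholders only.
LEGACY_TO_CANONICAL = {
    "parakeet-old-id": "example/parakeet-canonical-id",
    "glm-asr-old-id": "example/glm-asr-canonical-id",
}

def migrate_model_id(saved_id: str) -> str:
    """Return the canonical ID for a known legacy ID, else the ID unchanged."""
    return LEGACY_TO_CANONICAL.get(saved_id, saved_id)

print(migrate_model_id("parakeet-old-id"))   # example/parakeet-canonical-id
print(migrate_model_id("already-canonical"))  # already-canonical
```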
### Whisper (WhisperKit)
Voxt also supports Whisper as a separate on-device ASR engine through WhisperKit.
- Built-in model list: `tiny`, `base`, `small`, `medium`, `large-v3`
- Current download source: Hugging Face style model paths via `argmaxinc/whisperkit-coreml`
- China mirror: supported through the app's mirror setting
- Common runtime options: `Realtime` toggle (enabled by default), `VAD`, `Timestamps`, `Temperature`
- Current behavior:
  - standard transcription uses Whisper's `transcribe` task
  - the translation hotkey can optionally use Whisper's built-in translate-to-English task when the Translation provider is set to `Whisper`
  - if Whisper translation is unavailable for the current case, Voxt falls back to the selected LLM translation provider
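The fallback behavior above can be sketched as a small routing decision. The function and route names here are illustrative assumptions, not Voxt's API; the constraint they encode is that Whisper's built-in translate task only produces English from speech.

```python
# Illustrative routing sketch (names are assumptions, not Voxt's API):
# Whisper's built-in translate task only covers speech -> English, so
# any other case must go to the configured LLM translation provider.
def pick_translation_route(provider: str, target_lang: str, from_speech: bool) -> str:
    whisper_can_translate = (
        provider == "Whisper" and target_lang == "en" and from_speech
    )
    return "whisper-translate" if whisper_can_translate else "llm-translate"

print(pick_translation_route("Whisper", "en", True))   # whisper-translate
print(pick_translation_route("Whisper", "zh", True))   # llm-translate
print(pick_translation_route("Whisper", "en", False))  # selected text -> llm-translate
```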
Curated Whisper model list in Voxt:
| Model | Approx. Download Size | Recommendation | Notes |
| --- | --- | --- | --- |
| Whisper Tiny | about 76.6 MB | Medium | Smallest footprint, best for quick local drafts |
| Whisper Base | about 146.7 MB | High | Default Whisper balance for quality and speed |
| Whisper Small | about 486.5 MB | High | Better recognition quality with moderate local cost |
| Whisper Medium | about 1.53 GB | Very high | Accuracy-first local option with heavier download and memory use |
| Whisper Large-v3 | about 3.09 GB | Very high | Largest local Whisper option, best suited to Apple Silicon Macs with enough disk and memory headroom |
Whisper-specific notes:
- Whisper follows your selected main language for simplified/traditional Chinese output normalization.
- Whisper translation is only direct for speech-to-English scenarios; selected-text translation still uses the normal text translation flow.
- If a Whisper model download is interrupted or corrupted, Voxt now treats it as incomplete and requires a clean re-download instead of trying to load a broken model.
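The incomplete-download handling above can be sketched as a presence check over expected weight files before attempting a load. The component names below are assumptions based on typical WhisperKit Core ML model layouts, not a confirmed manifest from Voxt.

```python
from pathlib import Path

# Hypothetical completeness check: a model directory is loadable only
# if every expected Core ML component is present; otherwise the model
# is treated as incomplete and must be re-downloaded cleanly.
# Component names are assumptions for illustration.
EXPECTED_COMPONENTS = [
    "MelSpectrogram.mlmodelc",
    "AudioEncoder.mlmodelc",
    "TextDecoder.mlmodelc",
]

def model_is_complete(model_dir: Path) -> bool:
    """Check that every expected component exists under model_dir."""
    return all((model_dir / name).exists() for name in EXPECTED_COMPONENTS)
```

Validating up front and forcing a clean re-download is cheaper than letting a partially written model fail unpredictably at load time.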
Common local ASR errors / states:
- `Invalid model identifier`
- `Model repository unavailable (..., HTTP 401/404)`
- `Download failed (...)`
- `Model load failed (...)`
- `Size unavailable`
- If you accidentally point to an alignment-only repo, Voxt reports it as alignment-only and not supported by Voxt transcription
- Whisper may additionally surface incomplete-download or broken-model errors if required Core ML weight files are missing
### Local LLM Models
| Model | Repository ID | Size | Language Bias | Speed | Recommendation | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2 1.5B Instruct | Qwen/Qwen2-1.5B-Instruct | 1.5B | Balanced Chinese / English | Fast | High | Lightweight cleanup and simple translation |
| Qwen2.5 3B Instruct | Qwen/Qwen2.5-3B-Instruct | 3B | Balanced Chinese / English | Medium-fast | High | More stable enhancement and formatting |
| Qwen3 4B (4bit) | mlx-community/Qwen3-4B-4bit | 4B / 4bit | Chinese / English / multilingual | Medium-fast | Very high | Best overall local balance for enhancement and translation |
| Qwen3 8B (4bit) | mlx-community/Qwen3-8B-4bit | 8B / 4bit | Chinese / English / multilingual | Medium-slow | Very high | Stronger rewriting, translation, and structured output |
| GLM-4 9B (4bit) | `mlx