Vibevoice 🎙️
Fast local speech-to-text for any app using faster-whisper
Hi, I'm Marc Päpper and I wanted to vibe code like Karpathy ;D, so I looked around and found the cool work of Vlad. I extended it to run with a local whisper model, so I don't need to pay for OpenAI tokens. I hope you have fun with it!
What it does 🚀

Simply run cli.py and start dictating text anywhere in your system:
- Hold down right control key (Ctrl_r)
- Speak your text
- Release the key
- Watch as your spoken words are transcribed and automatically typed!
Works in any application or window - your text editor, browser, chat apps, anywhere you can type!
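Under the hood this is a simple hold-to-record, release-to-transcribe loop. A minimal sketch of that flow (the `PushToTalk` class and its callback names are illustrative, not the actual cli.py code):

```python
# Sketch of the hold-to-talk flow: press starts recording, audio chunks are
# buffered while the key is held, release hands the buffer to the transcriber
# and types out the result. (Hypothetical names, not this repo's real code.)

class PushToTalk:
    def __init__(self, transcribe, type_out):
        self.transcribe = transcribe  # e.g. a faster-whisper call
        self.type_out = type_out      # e.g. a keyboard-typing callback
        self.recording = False
        self.frames = []

    def on_press(self):
        if not self.recording:        # ignore key auto-repeat events
            self.recording = True
            self.frames = []

    def on_audio(self, chunk):
        if self.recording:
            self.frames.append(chunk)

    def on_release(self):
        self.recording = False
        self.type_out(self.transcribe(self.frames))
```

In the real app, `on_press`/`on_release` would be wired to a keyboard listener (e.g. pynput) and `transcribe` to a faster-whisper model; keeping them as injected callbacks makes the flow itself easy to follow.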
NEW: LLM voice command mode:
- Hold down the scroll_lock key (I think it's normally not used anymore, which is why I chose it)
- Speak what you want the LLM to do
- The LLM receives your transcribed text and a screenshot of your current view
- The LLM's answer is typed out at your cursor (streamed)
Works everywhere on your system and the LLM always has the screen context
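Concretely, command mode boils down to one chat request against Ollama's HTTP API, with the screenshot attached as a base64-encoded image. A sketch of building such a request body (`build_command_payload` is an illustrative helper, not a function from this repo):

```python
import base64

def build_command_payload(model, transcript, screenshot_png=None):
    """Build a request body for Ollama's /api/chat endpoint.

    The screenshot (raw PNG bytes) is attached as a base64 string in the
    message's "images" list, which is how Ollama accepts images for
    multimodal models. "stream": True asks for a chunked response.
    """
    message = {"role": "user", "content": transcript}
    if screenshot_png is not None:
        message["images"] = [base64.b64encode(screenshot_png).decode("ascii")]
    return {"model": model, "messages": [message], "stream": True}
```

The resulting dict would be POSTed to http://localhost:11434/api/chat; with streaming enabled, Ollama returns the answer chunk by chunk, which is what allows the response to be typed out as it arrives.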
Installation 🛠️
```shell
git clone https://github.com/mpaepper/vibevoice.git
cd vibevoice
pip install -r requirements.txt
python src/vibevoice/cli.py
```
Requirements 📋
Python Dependencies
- Python 3.13 or higher
System Requirements
- CUDA-capable GPU (recommended); CPU use can be enabled in server.py
- CUDA 12.x
- cuBLAS
- cuDNN 9.x
- In case you get the error `OSError: PortAudio library not found`, run `sudo apt install libportaudio2`
- Ollama for AI command mode (with multimodal models for screenshot support)
Setting up Ollama
- Install Ollama by following the instructions at ollama.com
- Pull a model that supports both text and images for best results:

  ```shell
  ollama pull gemma3:27b  # Great model which can run on RTX 3090 or similar
  ```

- Make sure Ollama is running in the background:

  ```shell
  ollama serve
  ```
Handling the CUDA requirements
- Make sure that you have CUDA >= 12.4 and cuDNN >= 9.x
- I had some trouble at first with Ubuntu 24.04, so I did the following:
- Attention: DO NOT do this if you are a WSL user (https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
```shell
sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb && sudo apt update
sudo apt install cuda-toolkit-12-8
```
or alternatively:
```shell
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cudnn9-cuda-12
```
- Then after rebooting, it worked well.
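To sanity-check that the CUDA libraries are actually discoverable after the reboot, a small probe like this can help (`cuda_libs_present` is an illustrative helper, not part of the project):

```python
import ctypes.util

def cuda_libs_present():
    """Report whether the cuBLAS and cuDNN shared libraries can be found
    by the dynamic linker. Both must resolve for GPU inference to work."""
    return {name: ctypes.util.find_library(name) is not None
            for name in ("cublas", "cudnn")}
```

If either entry comes back False after installing the packages above, check that the library directories are on your loader path (e.g. via `ldconfig`).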
Usage 💡
- Start the application:
  ```shell
  python src/vibevoice/cli.py
  ```
- Hold down right control key (Ctrl_r) while speaking
- Release to transcribe
- Your text appears wherever your cursor is!
Configuration
You can customize various aspects of VibeVoice with the following environment variables:
Keyboard Controls
- VOICEKEY: Change the dictation activation key (default: "ctrl_r")

  ```shell
  export VOICEKEY="ctrl"  # Use left control instead
  ```

- VOICEKEY_CMD: Set the key for AI command mode (default: "scroll_lock")

  ```shell
  export VOICEKEY_CMD="ctrl"  # Use left control instead of the Scroll Lock key
  ```
AI and Screenshot Features
- OLLAMA_MODEL: Specify which Ollama model to use (default: "gemma3:27b")

  ```shell
  export OLLAMA_MODEL="gemma3:4b"  # Use a smaller VLM in case you have less GPU RAM
  ```

- INCLUDE_SCREENSHOT: Enable or disable screenshots in AI command mode (default: "true")

  ```shell
  export INCLUDE_SCREENSHOT="false"  # Disable screenshots (they stay local anyway)
  ```

- SCREENSHOT_MAX_WIDTH: Set the maximum width for screenshots (default: "1024")

  ```shell
  export SCREENSHOT_MAX_WIDTH="800"  # Smaller screenshots
  ```
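Taken together, these variables can be read with plain environment lookups plus the defaults listed above. A sketch (`load_config` and its key names are illustrative, not the project's actual code):

```python
import os

def load_config(env=os.environ):
    """Read VibeVoice settings from environment variables, falling back to
    the documented defaults. `env` is injectable to ease testing."""
    return {
        "voice_key": env.get("VOICEKEY", "ctrl_r"),
        "voice_key_cmd": env.get("VOICEKEY_CMD", "scroll_lock"),
        "ollama_model": env.get("OLLAMA_MODEL", "gemma3:27b"),
        "include_screenshot": env.get("INCLUDE_SCREENSHOT", "true").lower() == "true",
        "screenshot_max_width": int(env.get("SCREENSHOT_MAX_WIDTH", "1024")),
    }
```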
Screenshot Dependencies
To use the screenshot functionality:
```shell
sudo apt install gnome-screenshot
```
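For the SCREENSHOT_MAX_WIDTH setting, the usual aspect-preserving downscale math looks like this (`scaled_size` is an illustrative helper; how the project resizes internally may differ):

```python
def scaled_size(width, height, max_width=1024):
    """Compute the target size for a screenshot: shrink it to max_width
    while keeping the aspect ratio, and never upscale a smaller image."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, round(height * scale)
```

Smaller screenshots mean fewer image tokens for the multimodal model, which speeds up AI command mode at the cost of some on-screen detail.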
Usage Modes 💡
VibeVoice supports two modes:
1. Dictation Mode
- Hold down the dictation key (default: right Control)
- Speak your text
- Release to transcribe
- Your text appears wherever your cursor is!
2. AI Command Mode
- Hold down the command key (default: Scroll Lock)
- Ask a question or give a command
- Release the key
- The AI will analyze your request (and current screen if enabled) and type a response
Credits 🙏
- Original inspiration: whisper-keyboard by Vlad
- Faster Whisper for the optimized Whisper implementation
- Built by Marc Päpper