pywhispercpp
Python bindings for whisper.cpp with a simple Pythonic API on top of it.
Installation
From source
- For the best performance, you need to install the package from source:
pip install git+https://github.com/absadiki/pywhispercpp
Pre-built wheels
- Otherwise, basic pre-built CPU wheels are available on PyPI:
pip install pywhispercpp # or pywhispercpp[examples] to install the extra dependencies needed for the examples
[Optional] To transcribe files other than wav, you need to install ffmpeg:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
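As a minimal sketch, a script could check up front whether ffmpeg is needed and reachable before handing a file to the model. The helper names `needs_ffmpeg`, `ffmpeg_available`, and `check_media` are hypothetical and not part of pywhispercpp; the sketch assumes that ffmpeg only has to be discoverable on PATH.

```python
import shutil
from pathlib import Path


def needs_ffmpeg(media_path: str) -> bool:
    """Return True for anything that is not a .wav file (hypothetical helper)."""
    return Path(media_path).suffix.lower() != ".wav"


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def check_media(media_path: str) -> None:
    """Raise early with a clear message instead of failing mid-transcription."""
    if needs_ffmpeg(media_path) and not ffmpeg_available():
        raise RuntimeError(
            f"{media_path!r} is not a wav file and ffmpeg was not found on PATH"
        )
```

Calling `check_media('file.mp3')` before transcription turns a missing ffmpeg into an explicit error; wav files pass through without any check.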
NVIDIA GPU support
To install the package with CUDA support, make sure you have CUDA installed and set GGML_CUDA=1:
GGML_CUDA=1 pip install git+https://github.com/absadiki/pywhispercpp
CoreML support
Install the package with WHISPER_COREML=1:
WHISPER_COREML=1 pip install git+https://github.com/absadiki/pywhispercpp
Vulkan support
Install the package with GGML_VULKAN=1:
GGML_VULKAN=1 pip install git+https://github.com/absadiki/pywhispercpp
OpenBLAS support
If OpenBLAS is installed, you can use GGML_BLAS=1. The extra pip flags (--no-cache, --force-reinstall, -v) force a fresh build with the chosen settings and print verbose output for sanity checking.
GGML_BLAS=1 pip install git+https://github.com/absadiki/pywhispercpp --no-cache --force-reinstall -v
OpenVINO support
Follow the steps to download the correct OpenVINO package (https://github.com/ggerganov/whisper.cpp?tab=readme-ov-file#openvino-support).
Then initialize the OpenVINO environment and build:
source ~/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh
WHISPER_OPENVINO=1 pip install git+https://github.com/absadiki/pywhispercpp --no-cache --force-reinstall
Note that the toolkit for Ubuntu 22 also works on Ubuntu 24.
Feel free to update this list and submit a PR if you have tested the package on other backends.
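The backend flags above are ordinary environment variables, so the same installs can be scripted from Python, which may help on shells that do not support the `VAR=1 command` syntax. The following is a sketch only: `build_install`, `install`, and the backend keys are hypothetical names, while the variable names, pip flags, and repository URL are taken from the commands above.

```python
import os
import subprocess
import sys

REPO = "git+https://github.com/absadiki/pywhispercpp"

# Backend key (hypothetical label) -> environment variable from the sections above.
BACKEND_FLAGS = {
    "cuda": "GGML_CUDA",
    "coreml": "WHISPER_COREML",
    "vulkan": "GGML_VULKAN",
    "blas": "GGML_BLAS",
    "openvino": "WHISPER_OPENVINO",
}


def build_install(backend, base_env=None):
    """Return (command, env) for a from-source install with one backend enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env[BACKEND_FLAGS[backend]] = "1"
    command = [
        sys.executable, "-m", "pip", "install", REPO,
        "--no-cache", "--force-reinstall", "-v",
    ]
    return command, env


def install(backend):
    """Run the install; this builds from source and may take several minutes."""
    command, env = build_install(backend)
    subprocess.run(command, env=env, check=True)
```

For example, `install('vulkan')` would mirror the Vulkan command above. For OpenVINO, the environment still has to be initialized with `setupvars.sh` before the script runs.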
Quick start
from pywhispercpp.model import Model
model = Model('base.en')
segments = model.transcribe('file.wav')
for segment in segments:
    print(segment.text)
You can also assign a custom new_segment_callback:
from pywhispercpp.model import Model
model = Model('base.en', print_realtime=False, print_progress=False)
segments = model.transcribe('file.mp3', new_segment_callback=print)
- The model will be downloaded automatically, or you can use the path to a local model.
- You can pass any whisper.cpp parameter as a keyword argument to the Model class or to the transcribe function.
- Check the Model class documentation for more details.
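A minimal sketch of a custom callback: instead of `print`, the callback below appends each segment's text to a list as it arrives. It assumes, as in the examples above, that the callback receives one segment object with a `text` attribute. `make_collector` and `transcribe_to_lines` are hypothetical helper names, and `n_threads` is passed as an example keyword argument on the assumption that `Model` accepts it like other whisper.cpp parameters.

```python
def make_collector():
    """Return (callback, lines); the callback stores each segment's stripped text."""
    lines = []

    def on_segment(segment):
        # Assumes the segment exposes `.text`, as in the quick start loop.
        lines.append(segment.text.strip())

    return on_segment, lines


def transcribe_to_lines(media_path, model_name="base.en", n_threads=4):
    """Transcribe a file and return the collected lines (downloads the model if needed)."""
    # Imported lazily so make_collector can be used without pywhispercpp installed.
    from pywhispercpp.model import Model

    model = Model(model_name, n_threads=n_threads,
                  print_realtime=False, print_progress=False)
    on_segment, lines = make_collector()
    model.transcribe(media_path, new_segment_callback=on_segment)
    return lines
```

Keeping the collector separate from the model call makes the callback easy to exercise with stand-in objects before running a real transcription.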
Examples
CLI
A straightforward example command-line interface. You can use it as follows:
pwcpp file.wav -m base --output-srt --print_realtime true
Run pwcpp --help to see the full help message:
usage: pwcpp [-h] [-m MODEL] [--version] [--processors PROCESSORS] [-otxt] [-ovtt] [-osrt] [-ocsv] [--strategy STRATEGY]
[--n_threads N_THREADS] [--n_max_text_ctx N_MAX_TEXT_CTX] [--offset_ms OFFSET_MS] [--duration_ms DURATION_MS]
[--translate TRANSLATE] [--no_context NO_CONTEXT] [--single_segment SINGLE_SEGMENT] [--print_special PRINT_SPECIAL]
[--print_progress PRINT_PROGRESS] [--print_realtime PRINT_REALTIME] [--print_timestamps PRINT_TIMESTAMPS]
[--token_timestamps TOKEN_TIMESTAMPS] [--thold_pt THOLD_PT] [--thold_ptsum THOLD_PTSUM] [--max_len MAX_LEN]
[--split_on_word SPLIT_ON_WORD] [--max_tokens MAX_TOKENS] [--audio_ctx AUDIO_CTX]
[--prompt_tokens PROMPT_TOKENS] [--prompt_n_tokens PROMPT_N_TOKENS] [--language LANGUAGE] [--suppress_blank SUPPRESS_BLANK]
[--suppress_non_speech_tokens SUPPRESS_NON_SPEECH_TOKENS] [--temperature TEMPERATURE] [--max_initial_ts MAX_INITIAL_TS]
[--length_penalty LENGTH_PENALTY] [--temperature_inc TEMPERATURE_INC] [--entropy_thold ENTROPY_THOLD]
[--logprob_thold LOGPROB_THOLD] [--no_speech_thold NO_SPEECH_THOLD] [--greedy GREEDY] [--beam_search BEAM_SEARCH]
media_file [media_file ...]
positional arguments:
media_file The path of the media file or a list of files separated by space
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
Path to the `ggml` model, or just the model name
--version show program's version number and exit
--processors PROCESSORS
number of processors to use during computation
-otxt, --output-txt output result in a text file
-ovtt, --output-vtt output result in a vtt file
-osrt, --output-srt output result in a srt file
-ocsv, --output-csv output result in a CSV file
--strategy STRATEGY Available sampling strategies: GreedyDecoder -> 0, BeamSearchDecoder -> 1
--n_threads N_THREADS
Number of threads to allocate for the inference, default to min(4, available hardware_concurrency)
--n_max_text_ctx N_MAX_TEXT_CTX
max tokens to use from past text as prompt for the decoder
--offset_ms OFFSET_MS
start offset in ms
--duration_ms DURATION_MS
audio duration to process in ms
--translate TRANSLATE
whether to translate the audio to English
--no_context NO_CONTEXT
do not use past transcription (if any) as initial prompt for the decoder
--single_segment SINGLE_SEGMENT
force single segment output (useful for streaming)
--print_special PRINT_SPECIAL
print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
--print_progress PRINT_PROGRESS
print progress information
--print_realtime PRINT_REALTIME
print results from within whisper.cpp (avoid it, use callback instead)
--print_timestamps PRINT_TIMESTAMPS
print timestamps for each text segment when printing realtime
--token_timestamps TOKEN_TIMESTAMPS
enable token-level timestamps
--thold_pt THOLD_PT timestamp token probability threshold (~0.01)
--thold_ptsum THOLD_PTSUM
timestamp token sum probability threshold (~0.01)
--max_len MAX_LEN max segment length in characters
--split_on_word SPLIT_ON_WORD
split on word rather than on token (when used with max_len)
--max_tokens MAX_TOKENS
max tokens per segment (0 = no limit)
--audio_ctx AUDIO_CTX
overwrite the audio context size (0 = use default)
--prompt_tokens PROMPT_TOKENS
tokens to provide to the whisper decoder as initial prompt
--prompt_n_tokens PROMPT_N_TOKENS
tokens to provide to the whisper decoder as initial prompt
--language LANGUAGE for auto-detection, set to None, "" or "auto"
--suppress_blank SUPPRESS_BLANK
common decoding parameters
--suppress_non_speech_tokens SUPPRESS_NON_SPEECH_TOKENS
common decoding parameters
--temperature TEMPERATURE
initial decoding temperature
--max_initial_ts MAX_INITIAL_TS
max_initial_ts
--length_penalty LENGTH_PENALTY
length_penalty
--temperature_inc TEMPERATURE_INC
temperature_inc
--entropy_thold ENTROPY_THOLD
similar to OpenAI's "compression_ratio_threshold"
--logprob_thold LOGPROB_THOLD
logprob_thold
--no_speech_thold NO_SPEECH_THOLD
no_speech_thold
--greedy GREEDY greedy
--beam_search BEAM_SEARCH
beam_search
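For batch jobs, the CLI can also be driven from a script. The sketch below only assembles an argument list from flags that appear in the help message (`-m`, `--output-srt`, `--print_realtime`). The function names `build_pwcpp_command` and `run_pwcpp` are hypothetical, and the sketch assumes `pwcpp` is available on PATH.

```python
import subprocess


def build_pwcpp_command(media_files, model="base", output_srt=True, print_realtime=False):
    """Assemble a pwcpp invocation using only flags shown in `pwcpp --help`."""
    if not media_files:
        raise ValueError("at least one media file is required")
    command = ["pwcpp", *media_files, "-m", model]
    if output_srt:
        command.append("--output-srt")
    if print_realtime:
        # Mirrors the usage example above: `--print_realtime true`.
        command += ["--print_realtime", "true"]
    return command


def run_pwcpp(media_files, **options):
    """Run pwcpp and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_pwcpp_command(media_files, **options), check=True)
```

With these defaults, `build_pwcpp_command(['file.wav'], print_realtime=True)` reproduces the command line shown at the start of this section.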
GUI
If you prefer a graphical user interface, you can use the pwcpp-gui command, which launches a simple graphical interface built with PyQt5.
- First you need to install the GUI dependencies:
pip install pywhispercpp[gui]
- Then you can run the GUI with:
pwcpp-gui
The GUI provides a user-friendly way to:
- Select audio files
- Choose models
- Adjust basic transcription settings
- View and export transcription results
Assistant
This is a simple example showcasing how pywhispercpp can be used to build an assistant-like application.
The idea is to use a Voice Activity Detect
