pywhispercpp

Python bindings for whisper.cpp with a simple Pythonic API on top of it.

Installation
Quick start
Examples
- CLI
- GUI
- Assistant
Advanced usage
Discussions and contributions
License

Installation

From source

For the best performance, you need to install the package from source:

pip install git+https://github.com/absadiki/pywhispercpp

Pre-built wheels

Otherwise, basic pre-built CPU wheels are available on PyPI:

pip install pywhispercpp # or pywhispercpp[examples] to install the extra dependencies needed for the examples

[Optional] To transcribe files other than WAV, you need to install ffmpeg:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
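
Before transcribing non-WAV files, it can help to confirm that ffmpeg is actually reachable. The following is a minimal sketch, assuming ffmpeg is looked up through the PATH environment variable; the helper name ffmpeg_available is hypothetical and not part of pywhispercpp.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH.

    Hypothetical helper for illustration; not part of pywhispercpp.
    """
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found: non-WAV inputs should be usable")
    else:
        print("ffmpeg not found: install it or convert inputs to WAV first")
```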

NVIDIA GPU support

To install the package with CUDA support, make sure you have CUDA installed and use GGML_CUDA=1:

GGML_CUDA=1 pip install git+https://github.com/absadiki/pywhispercpp

CoreML support

Install the package with WHISPER_COREML=1:

WHISPER_COREML=1 pip install git+https://github.com/absadiki/pywhispercpp

Vulkan support

Install the package with GGML_VULKAN=1:

GGML_VULKAN=1 pip install git+https://github.com/absadiki/pywhispercpp

OpenBLAS support

If OpenBLAS is installed, you can use GGML_BLAS=1. The other flags ensure a fresh install built with the correct flags, and print verbose output for sanity checking.

GGML_BLAS=1 pip install git+https://github.com/absadiki/pywhispercpp --no-cache --force-reinstall -v

OpenVINO support

Follow the steps to download the correct OpenVINO package (https://github.com/ggerganov/whisper.cpp?tab=readme-ov-file#openvino-support).

Then initialize the OpenVINO environment and build:

source ~/l_openvino_toolkit_ubuntu22_2023.0.0.10926.b4452d56304_x86_64/setupvars.sh 
WHISPER_OPENVINO=1 pip install git+https://github.com/absadiki/pywhispercpp --no-cache --force-reinstall

Note that the toolkit for Ubuntu 22 also works on Ubuntu 24.

Feel free to update this list and submit a PR if you have tested the package on other backends.
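
The backend flags listed in this section are plain environment variables, so the same install can be scripted. The sketch below only builds the pip command and environment for a chosen backend; the names BACKEND_FLAGS and build_install are hypothetical, and invoking pip through sys.executable is an assumption made for illustration, not something the project prescribes.

```python
import os
import sys

REPO_URL = "git+https://github.com/absadiki/pywhispercpp"

# Environment variables taken from the installation notes in this README.
BACKEND_FLAGS = {
    "cuda": "GGML_CUDA",
    "coreml": "WHISPER_COREML",
    "vulkan": "GGML_VULKAN",
    "openblas": "GGML_BLAS",
    "openvino": "WHISPER_OPENVINO",
}


def build_install(backend: str):
    """Return (command, environment) for a from-source install of one backend.

    Hypothetical helper: it does not run anything, it only prepares the call.
    """
    flag = BACKEND_FLAGS[backend]
    env = dict(os.environ)
    env[flag] = "1"
    cmd = [
        sys.executable, "-m", "pip", "install", REPO_URL,
        "--no-cache", "--force-reinstall", "-v",
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_install("vulkan")
    print("GGML_VULKAN =", env["GGML_VULKAN"])
    print(" ".join(cmd))
    # To actually install: subprocess.run(cmd, env=env, check=True)
```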

Quick start

from pywhispercpp.model import Model

model = Model('base.en')
segments = model.transcribe('file.wav')
for segment in segments:
    print(segment.text)
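
If you also want timestamps next to the text, a small formatter can be applied to each segment. This is a sketch under stated assumptions: it assumes each segment exposes integer t0 and t1 attributes in units of 10 ms in addition to text, which you should verify against the Model class documentation. FakeSegment, format_timestamp and format_segment are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Hypothetical stand-in for a transcription segment (assumed fields)."""
    t0: int   # assumed: start time in units of 10 ms
    t1: int   # assumed: end time in units of 10 ms
    text: str


def format_timestamp(ticks: int) -> str:
    """Convert a count of 10 ms ticks to HH:MM:SS.mmm."""
    total_ms = ticks * 10
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{ms:03d}"


def format_segment(segment) -> str:
    """Render one segment as '[start --> end] text'."""
    start = format_timestamp(segment.t0)
    end = format_timestamp(segment.t1)
    return f"[{start} --> {end}] {segment.text.strip()}"


if __name__ == "__main__":
    print(format_segment(FakeSegment(0, 150, " Hello world ")))
```

With a real model, the same function could be called inside the loop above as print(format_segment(segment)), provided the attribute assumption holds.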

You can also assign a custom new_segment_callback:

from pywhispercpp.model import Model

model = Model('base.en', print_realtime=False, print_progress=False)
segments = model.transcribe('file.mp3', new_segment_callback=print)

The model will be downloaded automatically, or you can use the path to a local model.
You can pass any whisper.cpp parameter as a keyword argument to the Model class or to the transcribe function.
Check the Model class documentation for more details.
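
As an alternative to passing print, the callback can be any callable that accepts the new segment. The sketch below collects segment texts into a list; SegmentCollector and transcribe_collecting are hypothetical names, and the code assumes the callback is invoked with one segment object carrying a text attribute (it falls back to str(segment) otherwise).

```python
class SegmentCollector:
    """Hypothetical callback object that accumulates segment texts."""

    def __init__(self):
        self.texts = []

    def __call__(self, segment):
        # Assumption: the callback receives one segment with a `text` attribute.
        text = getattr(segment, "text", str(segment))
        self.texts.append(text.strip())


def transcribe_collecting(media_path, model_name="base.en"):
    """Transcribe a file and return the collected segment texts."""
    # Imported lazily so this module can be loaded without pywhispercpp installed.
    from pywhispercpp.model import Model

    collector = SegmentCollector()
    model = Model(model_name, print_realtime=False, print_progress=False)
    model.transcribe(media_path, new_segment_callback=collector)
    return collector.texts
```

Calling transcribe_collecting('file.mp3') would then return the texts in the order the segments were produced, assuming the model and ffmpeg are available.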

Examples

CLI

A straightforward example command-line interface. You can use it as follows:

pwcpp file.wav -m base --output-srt --print_realtime true

Run pwcpp --help to get the help message:

usage: pwcpp [-h] [-m MODEL] [--version] [--processors PROCESSORS] [-otxt] [-ovtt] [-osrt] [-ocsv] [--strategy STRATEGY]
             [--n_threads N_THREADS] [--n_max_text_ctx N_MAX_TEXT_CTX] [--offset_ms OFFSET_MS] [--duration_ms DURATION_MS]
             [--translate TRANSLATE] [--no_context NO_CONTEXT] [--single_segment SINGLE_SEGMENT] [--print_special PRINT_SPECIAL]
             [--print_progress PRINT_PROGRESS] [--print_realtime PRINT_REALTIME] [--print_timestamps PRINT_TIMESTAMPS]
             [--token_timestamps TOKEN_TIMESTAMPS] [--thold_pt THOLD_PT] [--thold_ptsum THOLD_PTSUM] [--max_len MAX_LEN]
             [--split_on_word SPLIT_ON_WORD] [--max_tokens MAX_TOKENS] [--audio_ctx AUDIO_CTX]
             [--prompt_tokens PROMPT_TOKENS] [--prompt_n_tokens PROMPT_N_TOKENS] [--language LANGUAGE] [--suppress_blank SUPPRESS_BLANK]
             [--suppress_non_speech_tokens SUPPRESS_NON_SPEECH_TOKENS] [--temperature TEMPERATURE] [--max_initial_ts MAX_INITIAL_TS]
             [--length_penalty LENGTH_PENALTY] [--temperature_inc TEMPERATURE_INC] [--entropy_thold ENTROPY_THOLD]
             [--logprob_thold LOGPROB_THOLD] [--no_speech_thold NO_SPEECH_THOLD] [--greedy GREEDY] [--beam_search BEAM_SEARCH]
             media_file [media_file ...]

positional arguments:
  media_file            The path of the media file or a list of filesseparated by space

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to the `ggml` model, or just the model name
  --version             show program's version number and exit
  --processors PROCESSORS
                        number of processors to use during computation
  -otxt, --output-txt   output result in a text file
  -ovtt, --output-vtt   output result in a vtt file
  -osrt, --output-srt   output result in a srt file
  -ocsv, --output-csv   output result in a CSV file
  --strategy STRATEGY   Available sampling strategiesGreefyDecoder -> 0BeamSearchDecoder -> 1
  --n_threads N_THREADS
                        Number of threads to allocate for the inferencedefault to min(4, available hardware_concurrency)
  --n_max_text_ctx N_MAX_TEXT_CTX
                        max tokens to use from past text as prompt for the decoder
  --offset_ms OFFSET_MS
                        start offset in ms
  --duration_ms DURATION_MS
                        audio duration to process in ms
  --translate TRANSLATE
                        whether to translate the audio to English
  --no_context NO_CONTEXT
                        do not use past transcription (if any) as initial prompt for the decoder
  --single_segment SINGLE_SEGMENT
                        force single segment output (useful for streaming)
  --print_special PRINT_SPECIAL
                        print special tokens (e.g. <SOT>, <EOT>, <BEG>, etc.)
  --print_progress PRINT_PROGRESS
                        print progress information
  --print_realtime PRINT_REALTIME
                        print results from within whisper.cpp (avoid it, use callback instead)
  --print_timestamps PRINT_TIMESTAMPS
                        print timestamps for each text segment when printing realtime
  --token_timestamps TOKEN_TIMESTAMPS
                        enable token-level timestamps
  --thold_pt THOLD_PT   timestamp token probability threshold (~0.01)
  --thold_ptsum THOLD_PTSUM
                        timestamp token sum probability threshold (~0.01)
  --max_len MAX_LEN     max segment length in characters
  --split_on_word SPLIT_ON_WORD
                        split on word rather than on token (when used with max_len)
  --max_tokens MAX_TOKENS
                        max tokens per segment (0 = no limit)
  --audio_ctx AUDIO_CTX
                        overwrite the audio context size (0 = use default)
  --prompt_tokens PROMPT_TOKENS
                        tokens to provide to the whisper decoder as initial prompt
  --prompt_n_tokens PROMPT_N_TOKENS
                        tokens to provide to the whisper decoder as initial prompt
  --language LANGUAGE   for auto-detection, set to None, "" or "auto"
  --suppress_blank SUPPRESS_BLANK
                        common decoding parameters
  --suppress_non_speech_tokens SUPPRESS_NON_SPEECH_TOKENS
                        common decoding parameters
  --temperature TEMPERATURE
                        initial decoding temperature
  --max_initial_ts MAX_INITIAL_TS
                        max_initial_ts
  --length_penalty LENGTH_PENALTY
                        length_penalty
  --temperature_inc TEMPERATURE_INC
                        temperature_inc
  --entropy_thold ENTROPY_THOLD
                        similar to OpenAI's "compression_ratio_threshold"
  --logprob_thold LOGPROB_THOLD
                        logprob_thold
  --no_speech_thold NO_SPEECH_THOLD
                        no_speech_thold
  --greedy GREEDY       greedy
  --beam_search BEAM_SEARCH
                        beam_search
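
The CLI can also be driven from a script, for example to process several files in a batch. The following sketch only assembles the command line using flags shown in the help message above (-m, --output-srt, --print_realtime); build_pwcpp_command and run_pwcpp are hypothetical helper names, and running the command assumes pwcpp is installed and on PATH.

```python
import shlex
import subprocess


def build_pwcpp_command(media_files, model="base", output_srt=True, print_realtime=True):
    """Assemble a pwcpp invocation as an argument list (hypothetical helper)."""
    cmd = ["pwcpp", *media_files, "-m", model]
    if output_srt:
        cmd.append("--output-srt")
    if print_realtime:
        # Mirrors the usage example above: `--print_realtime true`.
        cmd += ["--print_realtime", "true"]
    return cmd


def run_pwcpp(media_files, **options):
    """Run pwcpp and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_pwcpp_command(media_files, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without pwcpp.
    print(shlex.join(build_pwcpp_command(["file.wav"])))
```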

GUI

If you prefer a graphical user interface, you can use the pwcpp-gui command, which launches a simple graphical interface built with PyQt5.

First you need to install the GUI dependencies:

pip install pywhispercpp[gui]

Then you can run the GUI with:

pwcpp-gui

The GUI provides a user-friendly way to:

- Select audio files
- Choose models
- Adjust basic transcription settings
- View and export transcription results

Assistant

This is a simple example showcasing the use of pywhispercpp to create an assistant-like application. The idea is to use a Voice Activity Detect
