whisper.cpp

High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model:

Plain C/C++ implementation without dependencies
Apple Silicon first-class citizen - optimized via ARM NEON, Accelerate framework, Metal and Core ML
AVX intrinsics support for x86 architectures
VSX intrinsics support for POWER architectures
Mixed F16 / F32 precision
Integer quantization support
Zero memory allocations at runtime
Vulkan support
Support for CPU-only inference
Efficient GPU support for NVIDIA
OpenVINO Support
Ascend NPU Support
Moore Threads GPU Support
C-style API
Voice Activity Detection (VAD)

Supported platforms:

[x] Mac OS (Intel and Arm)
[x] iOS
[x] Android
[x] Java
[x] Linux / FreeBSD
[x] WebAssembly
[x] Windows (MSVC and MinGW)
[x] Raspberry Pi
[x] Docker

The entire high-level implementation of the model is contained in whisper.h and whisper.cpp. The rest of the code is part of the ggml machine learning library.

Having such a lightweight implementation of the model allows to easily integrate it in different platforms and applications. As an example, here is a video of running the model on an iPhone 13 device - fully offline, on-device: whisper.objc

https://user-images.githubusercontent.com/1991296/197385372-962a6dea-bca1-4d50-bf96-1d8c27b98c81.mp4

You can also easily make your own offline voice assistant application: command

https://user-images.githubusercontent.com/1991296/204038393-2f846eae-c255-4099-a76d-5735c25c49da.mp4

On Apple Silicon, the inference runs fully on the GPU via Metal:

https://github.com/ggml-org/whisper.cpp/assets/1991296/c82e8f86-60dc-49f2-b048-d2fdbd6b5225

Quick start

First clone the repository:

git clone https://github.com/ggml-org/whisper.cpp.git

Navigate into the directory:

cd whisper.cpp

Then, download one of the Whisper models converted in ggml format. For example:

sh ./models/download-ggml-model.sh base.en

Now build the whisper-cli example and transcribe an audio file like this:

# build the project
cmake -B build
cmake --build build -j --config Release

# transcribe an audio file
./build/bin/whisper-cli -f samples/jfk.wav

For a quick demo, simply run make base.en.

The command downloads the base.en model converted to custom ggml format and runs the inference on all .wav samples in the folder samples.

For detailed usage instructions, run: ./build/bin/whisper-cli -h

Note that the whisper-cli example currently runs only with 16-bit WAV files, so make sure to convert your input before running the tool. For example, you can use ffmpeg like this:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav

More audio samples

If you want some extra audio samples to play with, simply run:

make -j samples

This will download a few more audio files from Wikipedia and convert them to 16-bit WAV format via ffmpeg.

You can download and run the other models as follows:

make -j tiny.en
make -j tiny
make -j base.en
make -j base
make -j small.en
make -j small
make -j medium.en
make -j medium
make -j large-v1
make -j large-v2
make -j large-v3
make -j large-v3-turbo

Memory usage

| Model | Disk | Mem | | ------ | ------- | ------- | | tiny | 75 MiB | ~273 MB | | base | 142 MiB | ~388 MB | | small | 466 MiB | ~852 MB | | medium | 1.5 GiB | ~2.1 GB | | large | 2.9 GiB | ~3.9 GB |

POWER VSX Intrinsics

whisper.cpp supports POWER architectures and includes code which significantly speeds operation on Linux running on POWER9/10, making it capable of faster-than-realtime transcription on underclocked Raptor Talos II. Ensure you have a BLAS package installed, and replace the standard cmake setup with:

# build with GGML_BLAS defined
cmake -B build -DGGML_BLAS=1
cmake --build build -j --config Release
./build/bin/whisper-cli [ .. etc .. ]

Quantization

whisper.cpp supports integer quantization of the Whisper ggml models. Quantized models require less memory and disk space and depending on the hardware can be processed more efficiently.

Here are the steps for creating and using a quantized model:

# quantize a model with Q5_0 method
cmake -B build
cmake --build build -j --config Release
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run the examples as usual, specifying the quantized model file
./build/bin/whisper-cli -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav

Core ML support

On Apple Silicon devices, the Encoder inference can be executed on the Apple Neural Engine (ANE) via Core ML. This can result in significant speed-up - more than x3 faster compared with CPU-only execution. Here are the instructions for generating a Core ML model and using it with whisper.cpp:

Install Python dependencies needed for the creation of the Core ML model:
```
pip install ane_transformers
pip install openai-whisper
pip install coremltools
```
- To ensure coremltools operates correctly, please confirm that Xcode is installed and execute xcode-select --install to install the command-line tools.
- Python 3.11 is recommended.
- MacOS Sonoma (version 14) or newer is recommended, as older versions of MacOS might experience issues with transcription hallucination.
- [OPTIONAL] It is recommended to utilize a Python version management system, such as Miniconda for this step:
  - To create an environment, use: conda create -n py311-whisper python=3.11 -y
  - To activate the environment, use: conda activate py311-whisper
Generate a Core ML model. For example, to generate a base.en model, use:
```
./models/generate-coreml-model.sh base.en
```
This will generate the folder models/ggml-base.en-encoder.mlmodelc

Build whisper.cpp with Core ML support:

# using CMake
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

Run the examples as usual. For example:

$ ./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav

...

whisper_init_state: loading Core ML model from 'models/ggml-base.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded

system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 1 |

...

The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format. Next runs are faster.

For more information about the Core ML implementation please refer to PR #566.

OpenVINO support

On platforms that support OpenVINO, the Encoder inference can be executed on OpenVINO-supported devices including x86 CPUs and Intel GPUs (integrated & discrete).

This can result in significant speedup in encoder performance. Here are the instructions for generating the OpenVINO model and using it with whisper.cpp:

First, setup python virtual env. and install python dependencies. Python 3.10 is recommended.

Windows:

cd models
python -m venv openvino_conv_env
openvino_conv_env\Scripts\activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt

Linux and macOS:

cd models
python3 -m venv openvino_conv_env
source openvino_conv_env/bin/activate
python -m pip install --upgrade pip
pip install -r requirements-openvino.txt

Generate an OpenVINO encoder model. For example, to generate a base.en model, use:
```
python convert-whisper-to-openvino.py --model base.en
```
This will produce ggml-base.en-encoder-openvino.xml/.bin IR model files. It's recommended to relocate these to the same folder as ggml models, as that is the default location that the OpenVINO extension will search at runtime.
Build whisper.cpp with OpenVINO support:

Download OpenVINO package from release page.

Whisper.cpp

Install / Use

README