openWakeWord
openWakeWord is an open-source wakeword library that can be used to create voice-enabled applications and interfaces. It includes pre-trained models for common words & phrases that work well in real-world environments.
Updates
2024/02/11
- v0.6.0 of openWakeWord released. See the releases for a full description of new features and changes.
2023/11/09
- Added example scripts under examples/web that demonstrate streaming audio from a web application into openWakeWord.
2023/10/11
- Significant improvements to the process of training new models, including an example Google Colab notebook demonstrating how to train a basic wake word model in <1 hour.
2023/06/15
- v0.5.0 of openWakeWord released. See the releases for a full description of new features and changes.
Demo
You can try an online demo of the included pre-trained models via HuggingFace Spaces right here!
Note that real-time detection of a microphone stream can occasionally behave strangely in Spaces. For the most reliable testing, perform a local installation as described below.
Installation
Installing openWakeWord is simple and has minimal dependencies:
pip install openwakeword
On Linux systems, both the onnxruntime package and tflite-runtime packages will be installed as dependencies since both inference frameworks are supported. On Windows, only onnxruntime is installed due to a lack of support for modern versions of tflite.
To (optionally) use Speex noise suppression on Linux systems to improve performance in noisy environments, install the Speex dependencies and then the pre-built Python package (see the assets here for all .whl versions), adjusting for your Python version and system architecture as needed.
sudo apt-get install libspeexdsp-dev
pip install https://github.com/dscripka/openWakeWord/releases/download/v0.1.1/speexdsp_ns-0.1.2-cp38-cp38-linux_x86_64.whl
Many thanks to TeaPoly for their Python wrapper of the Speex noise suppression libraries.
Usage
For quick local testing, clone this repository and use the included example script to try streaming detection from a local microphone. You can individually download pre-trained models from current and past releases, or you can download them using Python (see below).
Adding openWakeWord to your own Python code requires just a few lines:
import openwakeword
from openwakeword.model import Model
# One-time download of all pre-trained models (or only select models)
openwakeword.utils.download_models()
# Instantiate the model(s)
model = Model(
wakeword_models=["path/to/model.tflite"], # can also leave this argument empty to load all of the included pre-trained models
)
# Get a frame of 16-bit, 16 kHz PCM audio data from a file, microphone, network stream, etc.
# For the best efficiency and latency, audio frames should be multiples of 80 ms, with longer frames
# increasing overall efficiency at the cost of detection latency
frame = my_function_to_get_audio_frame()
# Get predictions for the frame
prediction = model.predict(frame)
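To make the framing math concrete: at 16 kHz, an 80 ms frame is 1280 samples. The sketch below shows how a caller might size frames and act on the returned scores, assuming predict() returns a dictionary mapping model names to confidence scores; the threshold value and model names here are illustrative, not part of the library.

```python
SAMPLE_RATE = 16000  # Hz, the input rate openWakeWord expects
FRAME_MS = 80        # base frame size processed by the models

def frame_length(multiple: int = 1) -> int:
    """Number of PCM samples in a frame that is `multiple` x 80 ms."""
    return SAMPLE_RATE * FRAME_MS * multiple // 1000

def activated_models(prediction: dict, threshold: float = 0.5) -> list:
    """Return the names of models whose score meets or exceeds the threshold."""
    return [name for name, score in prediction.items() if score >= threshold]

# An 80 ms frame at 16 kHz contains 1280 samples; a 4x frame contains 5120.
print(frame_length())   # 1280
print(frame_length(4))  # 5120

# Example prediction dictionary of the shape Model.predict() is assumed to return
scores = {"alexa": 0.91, "hey_jarvis": 0.07}
print(activated_models(scores))  # ['alexa']
```

Larger multiples of 80 ms reduce per-frame overhead but delay detection by the same amount, which is the efficiency/latency trade-off noted in the comments above.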
Additionally, openWakeWord provides other useful utility functions. For example:
# Get predictions for individual WAV files (16-bit, 16 kHz PCM)
from openwakeword.model import Model
model = Model()
model.predict_clip("path/to/wav/file")
# Get predictions for a large number of files using multiprocessing
from openwakeword.utils import bulk_predict
bulk_predict(
file_paths = ["path/to/wav/file/1", "path/to/wav/file/2"],
wakeword_models = ["hey jarvis"],
ncpu=2
)
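When scoring whole clips, it is often useful to collapse the per-frame results into a single score per model. The helper below is a generic sketch (not an openWakeWord utility), assuming predict_clip() returns a list of per-frame score dictionaries.

```python
def max_scores(frame_predictions):
    """Collapse per-frame prediction dictionaries into one max score per model."""
    best = {}
    for frame in frame_predictions:
        for model_name, score in frame.items():
            best[model_name] = max(score, best.get(model_name, 0.0))
    return best

# Illustrative per-frame output of the assumed predict_clip() shape
frames = [
    {"hey_jarvis": 0.01},
    {"hey_jarvis": 0.86},
    {"hey_jarvis": 0.12},
]
print(max_scores(frames))  # {'hey_jarvis': 0.86}
```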
See openwakeword/utils.py and openwakeword/model.py for the full specification of class methods and utility functions.
Recommendations for Usage
Noise Suppression and Voice Activity Detection (VAD)
While openWakeWord's default settings work well in many cases, there are adjustable parameters that can improve performance in some deployment scenarios.
On supported platforms (currently only x86 and arm64 Linux), Speex noise suppression can be enabled by setting enable_speex_noise_suppression=True when instantiating an openWakeWord model. This can improve performance when relatively constant background noise is present.
Additionally, a voice activity detection (VAD) model from Silero is included with openWakeWord and can be enabled by setting the vad_threshold argument to a value between 0 and 1 when instantiating an openWakeWord model. A positive prediction is then only allowed when the VAD model's score simultaneously exceeds the specified threshold, which can significantly reduce false-positive activations in the presence of non-speech noise.
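The gating behavior just described can be sketched in isolation: a wake word score is only allowed through when the VAD score also clears its threshold. This is plain Python illustrating the logic, not openWakeWord's internal implementation; the function and argument names are hypothetical.

```python
def gated_score(wakeword_score: float, vad_score: float,
                vad_threshold: float = 0.5) -> float:
    """Suppress the wake word score unless speech activity is also detected.

    When vad_threshold is 0 the gate is effectively disabled and the raw
    score passes through unchanged.
    """
    if vad_threshold > 0 and vad_score < vad_threshold:
        return 0.0
    return wakeword_score

# Non-speech noise triggers the wake word model but not the VAD: suppressed.
print(gated_score(0.9, vad_score=0.1))  # 0.0
# Real speech clears both thresholds: the score passes through.
print(gated_score(0.9, vad_score=0.8))  # 0.9
```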
Threshold Scores for Activation
All of the included openWakeWord models were trained to work well with a default threshold of 0.5 for a positive prediction, but you are encouraged to determine the best threshold for your environment and use-case through testing. For certain deployments, using a lower or higher threshold in practice may result in significantly better performance.
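One simple way to pick a threshold is to score a set of labeled clips (positive examples of the wake word plus negative background audio) and sweep candidate thresholds, measuring false rejects and false accepts at each. This is a generic evaluation sketch, not a utility provided by openWakeWord; the per-clip scores below are made up for illustration.

```python
def false_reject_rate(positive_scores, threshold):
    """Fraction of known wake word clips whose score falls below the threshold."""
    missed = sum(1 for s in positive_scores if s < threshold)
    return missed / len(positive_scores)

def false_accept_count(negative_scores, threshold):
    """Number of background clips whose score crosses the threshold."""
    return sum(1 for s in negative_scores if s >= threshold)

# Illustrative per-clip maximum scores from hypothetical test data
positives = [0.92, 0.81, 0.67, 0.45, 0.88]
negatives = [0.02, 0.10, 0.55, 0.03]

for threshold in (0.3, 0.5, 0.7):
    fr = false_reject_rate(positives, threshold)
    fa = false_accept_count(negatives, threshold)
    print(f"threshold={threshold}: false-reject rate={fr:.0%}, false accepts={fa}")
```

Raising the threshold trades missed activations for fewer spurious ones; the best operating point depends on how costly each failure mode is in your deployment.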
User-specific models
If the baseline performance of openWakeWord models is not sufficient for a given application (specifically, if the false activation rate is unacceptably high), it is possible to train custom verifier models for specific voices that act as a second-stage filter on predictions (i.e., only allow activations through that were likely spoken by a known set of voices). This can greatly improve performance, at the cost of making the openWakeWord system less likely to respond to new voices.
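The two-stage filtering described above can be sketched as follows: the wake word model's activation is only passed through when a verifier model also scores the audio above its own threshold. This is a hand-written illustration of the control flow with hypothetical names, not openWakeWord's actual verifier API.

```python
def confirmed_activation(wakeword_score: float, verifier_score: float,
                         wakeword_threshold: float = 0.5,
                         verifier_threshold: float = 0.7) -> bool:
    """Report an activation only when both stages agree.

    Stage 1: the wake word model detects the target phrase.
    Stage 2: the verifier confirms the phrase was likely spoken by a known voice.
    """
    if wakeword_score < wakeword_threshold:
        return False
    return verifier_score >= verifier_threshold

# Known voice says the wake word: both stages pass.
print(confirmed_activation(0.9, verifier_score=0.85))  # True
# Unknown voice (e.g. a TV) says the wake word: the verifier blocks it.
print(confirmed_activation(0.9, verifier_score=0.2))   # False
```

This control flow is also why verifier models reduce responsiveness to new voices: any speaker not represented in the verifier's training data is filtered at stage 2 regardless of how cleanly they say the wake word.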
Project Goals
openWakeWord has four high-level goals, which combine to (hopefully!) produce a framework that is simple to use and extend.
- Be fast enough for real-world usage, while maintaining ease of use and development. For example, a single core of a Raspberry Pi 3 can run 15-20 openWakeWord models simultaneously in real-time. However, the models are likely still too large for less powerful systems or microcontrollers. Commercial options like Picovoice Porcupine or Fluent Wakeword are likely better suited for highly constrained hardware environments.
- Be accurate enough for real-world usage. The included models typically have false-accept and false-reject rates below the annoyance threshold for the average user. This is obviously subjective, but a false-accept rate of <0.5 per hour and a false-reject rate of <5% is often reasonable in practice. See the Performance & Evaluation section for details about how well the included models can be expected to perform in practice.
- Have a simple model architecture and inference process. Models process a stream of audio data in 80 ms frames and return a score between 0 and 1 for each frame, indicating the confidence that a wake word/phrase has been detected. All models also share a feature-extraction backbone, so each additional model adds only a small amount of overall system complexity and resource usage.
- Require little to no manual data collection to train new models. The included models (see the Pre-trained Models section for more details) were all trained with 100% synthetic speech generated from text-to-speech models. Training a new model is as simple as generating new clips for the target wake word/phrase and training a small model on top of the frozen shared feature extractor. See the Training New Models section for more details.
Future releases of openWakeWord will aim to stay aligned with these goals, even when adding new functionality.
Pre-Trained Models
openWakeWord comes with pre-trained models for common words & phrases. Currently, only English models are supported, but they should be reasonably robust across different speaker accents and pronunciations.
The table below lists each model, examples of the word/phrases it is trained to recognize, and the associated documentation page for additional detail. Many of these models are trained on multiple variations of the same word/phrase; see the individual documentation pages for each model to see all supported word & phrase variations.
| Model | Detected Speech | Documentation Page |
| ------------- | ------------- | ------------- |
| alexa | "alexa" | docs |
| hey mycroft | "hey mycroft" | docs |
| hey jarvis | "hey jarvis" | docs |
| hey rhasspy | "hey rhasspy" | TBD |
| current weather | "what's the weather" | docs |
| timers | "set a 10 minute timer" | docs |
Based on the methods discussed in performance testing, each included model aims to meet the target performance criteria of <5% false-reject rates and <0.5/hour false-accept rates with appropriate threshold tuning. These levels are subjective, but are hopefully below the annoyance threshold where the average user becomes frustrated with a system that frequently misses intended activations and/or activates unexpectedly.