
Chaplin

A real-time silent speech recognition tool.

Install / Use

/learn @amanvirparhar/Chaplin
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Chaplin


A visual speech recognition (VSR) tool that reads your lips in real-time and types whatever you silently mouth. Runs fully locally.

Relies on a model trained on the Lip Reading Sentences 3 dataset as part of the Auto-AVSR project.

Watch a demo of Chaplin here.

Setup

  1. Clone the repository, and cd into it:
    git clone https://github.com/amanvirparhar/chaplin
    cd chaplin
    
  2. Run the setup script...
    ./setup.sh
    
    ...which will automatically download the required model files from Hugging Face Hub and place them in the appropriate directories:
    chaplin/
    ├── benchmarks/
    │   └── LRS3/
    │       ├── language_models/
    │       │   └── lm_en_subword/
    │       └── models/
    │           └── LRS3_V_WER19.1/
    ├── ...
    
  3. Install and run ollama, and pull the qwen3:4b model (ollama pull qwen3:4b).
  4. Install uv.
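
After setup, it can help to confirm the model files landed where the layout above expects them. The sketch below is not part of Chaplin; the helper name is hypothetical, and the paths are simply the directories shown in the tree above.

```python
from pathlib import Path

# Directories that setup.sh should have created (taken from the README tree).
EXPECTED_DIRS = [
    "benchmarks/LRS3/language_models/lm_en_subword",
    "benchmarks/LRS3/models/LRS3_V_WER19.1",
]

def missing_model_dirs(repo_root):
    """Return the expected model directories that are absent under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Running `missing_model_dirs(".")` from the repository root before launching `main.py` gives an early, readable failure instead of a missing-file traceback.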

Usage

  1. Run the following command:
    uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
    
  2. Once the camera feed is displayed, you can start "recording" by pressing the option key (Mac) or the alt key (Windows/Linux), and start mouthing words.
  3. To stop recording, press the option key (Mac) or the alt key (Windows/Linux) again. The raw VSR output will get logged in your terminal, and the LLM-corrected version will be typed at your cursor.
  4. To exit gracefully, focus on the window displaying the camera feed and press q.
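
The press-to-toggle flow in steps 2-3 can be sketched as a small state machine: one key flips between idle and recording, and stopping hands the captured frames off for VSR and LLM correction. This is a hypothetical illustration, not Chaplin's actual implementation; the class and method names are invented.

```python
class RecordingToggle:
    """Hypothetical sketch of the option/alt press-to-toggle recording flow."""

    def __init__(self):
        self.recording = False
        self.frames = []

    def on_toggle_key(self):
        """Called when option (Mac) or alt (Windows/Linux) is pressed."""
        self.recording = not self.recording
        if self.recording:
            self.frames = []   # start a fresh clip
            return None
        return self.frames     # clip ready for VSR, then LLM correction

    def on_frame(self, frame):
        """Buffer camera frames only while recording is active."""
        if self.recording:
            self.frames.append(frame)
```

A single key doing double duty keeps the interaction hands-free apart from one press at each end of an utterance, which suits silent dictation at the cursor.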
View on GitHub
GitHub Stars: 717
Category: Development
Updated: 1h ago
Forks: 81

Languages

Python

Security Score

100/100

Audited on Apr 1, 2026

No findings