HeadAudio

Introduction

HeadAudio is an audio worklet node/processor for audio-driven, real-time viseme detection and lip-sync in browsers. It uses MFCC feature vectors and Gaussian prototypes with a Mahalanobis-distance classifier. As output, it generates Oculus viseme blend-shape values in real time and can be integrated into an existing 3D animation loop.
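The classification step can be illustrated with a small sketch (illustrative only; the prototype data shapes and function names below are assumptions, not HeadAudio's internal API): each Gaussian prototype stores a mean vector and an inverse covariance matrix, and an incoming MFCC feature vector is assigned to the prototype with the smallest squared Mahalanobis distance.

```javascript
// Illustrative sketch of Mahalanobis-distance classification over
// Gaussian prototypes (NOT HeadAudio's actual internals).
// Each prototype: { name, mean: number[], invCov: number[][] }.

// Squared Mahalanobis distance: (x - mean)^T * invCov * (x - mean).
function mahalanobisSq(x, mean, invCov) {
  const d = x.map((v, i) => v - mean[i]);
  let sum = 0;
  for (let i = 0; i < d.length; i++) {
    for (let j = 0; j < d.length; j++) {
      sum += d[i] * invCov[i][j] * d[j];
    }
  }
  return sum;
}

// Pick the prototype with the smallest distance to the feature vector.
function classify(x, prototypes) {
  let best = null, bestDist = Infinity;
  for (const p of prototypes) {
    const dist = mahalanobisSq(x, p.mean, p.invCov);
    if (dist < bestDist) { bestDist = dist; best = p.name; }
  }
  return best;
}
```

With identity covariance this reduces to nearest-mean (Euclidean) classification; the inverse covariance matrix is what lets each prototype weight feature dimensions differently.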

  • Pros: Audio-driven lip-sync works with any audio stream or TTS output without requiring text transcripts or timestamps. It is fast, fully in-browser, and requires no server.

  • Cons: Voice activity detection (VAD) and prediction accuracy are far from optimal, especially when the signal-to-noise ratio (SNR) is low. In general, the audio-driven approach is less accurate and computationally more demanding than TalkingHead's text-driven approach.

The solution is fully compatible with the TalkingHead. It doesn't have any external dependencies, and it is MIT licensed.

HeadTTS, webpack, and jest were used during development, training, and testing.

The implementation has been tested with the latest versions of Chrome, Edge, Firefox, and Safari desktop browsers, as well as on iPad/iPhone.

[!IMPORTANT] The model's accuracy will hopefully improve over time. However, since all audio processing occurs fully in-browser and in real time, it will never be perfect and may not be suitable for all use cases. Some precision will always need to be sacrificed to stay within the real-time processing budget.


Demo / Test App

App | Description
--- | ---
<span style="display: block; min-width:400px"><img src="images/openai.jpg" width="400"/></span> | A demo web app using HeadAudio, TalkingHead, and the OpenAI Realtime API (WebRTC). It supports speech-to-speech, moods, hand gestures, and facial expressions through function calling. [Run] [Code]<br/><br/>Note: The app uses OpenAI's gpt-realtime-mini model and requires an OpenAI API key. The "mini" model is a cost-effective version of GPT Realtime, but still relatively expensive for extended use.
<img src="images/tester.jpg" width="400"/> | A test app for HeadAudio that lets you experiment with audio-stream processing and various parameters using HeadTTS (an in-browser neural text-to-speech engine), your own audio file(s), or microphone input. [Run] [Code]


Using the HeadAudio Worklet Node/Processor

The steps needed to set up and use HeadAudio:

  1. Import the Audio Worklet Node HeadAudio from "./modules/headaudio.mjs". Alternatively, use the minified version "./dist/headaudio.min.mjs" or a CDN build.

  2. Register the Audio Worklet Processor from "./modules/headworklet.mjs". Alternatively, use the minified version "./dist/headworklet.min.mjs" or a CDN build.

  3. Create a new HeadAudio instance.

  4. Load a pre-trained viseme model containing Gaussian prototypes, e.g., "./dist/model-en-mixed.bin".

  5. Connect your speech audio node to the HeadAudio node. The node has a single mono input and does not output any audio.

  6. Optional: To compensate for processing latency (50–100 ms), add delay to your speech-audio path using the browser's standard DelayNode.

  7. Assign an onvalue callback function (key, value) that updates your avatar's blend-shape key (an Oculus viseme name, e.g., "viseme_aa") to the given value in the range [0,1].

  8. Call the node's update method inside your 3D animation loop, passing the delta time (in milliseconds).

  9. Optional: Set up any additional user event handlers as needed.

Here is a simplified code example using the above steps with a TalkingHead class instance head:

// 1. Import
import { TalkingHead } from "talkinghead";
import { HeadAudio } from "./modules/headaudio.mjs";

// 2. Register processor
const head = new TalkingHead( /* Your normal parameters */ );
await head.audioCtx.audioWorklet.addModule("./modules/headworklet.mjs");

// 3. Create new HeadAudio node
const headaudio = new HeadAudio(head.audioCtx, {
  processorOptions: { },
  parameterData: {
    vadGateActiveDb: -40,
    vadGateInactiveDb: -60
  }
});

// 4. Load a pre-trained model
await headaudio.loadModel("./dist/model-en-mixed.bin");

// 5. Connect TalkingHead's speech gain node to HeadAudio node
head.audioSpeechGainNode.connect(headaudio);

// 6. OPTIONAL: Add some delay between gain and reverb nodes
const delayNode = new DelayNode( head.audioCtx, { delayTime: 0.1 });
head.audioSpeechGainNode.disconnect(head.audioReverbNode);
head.audioSpeechGainNode.connect(delayNode);
delayNode.connect(head.audioReverbNode);

// 7. Register callback function to set blend shape values
headaudio.onvalue = (key,value) => {
  Object.assign( head.mtAvatar[ key ],{ newvalue: value, needsUpdate: true });
};

// 8. Link node's `update` method to TalkingHead's animation loop
head.opt.update = headaudio.update.bind(headaudio);

// 9. OPTIONAL: Take eye contact and make a hand gesture when new sentence starts
let lastEnded = 0;
headaudio.onended = () => {
  lastEnded = Date.now();
};

headaudio.onstarted = () => {
  const duration = Date.now() - lastEnded;
  if ( duration > 150 ) { // New sentence, if 150 ms pause (adjust, if needed)
    head.lookAtCamera(500);
    head.speakWithHands();
  }
};

See the test app source code for more details.

The supported processorOptions are:

Option | Description | Default
--- | --- | ---
frameEventsEnabled | If true, sends frame user-event objects containing a downsampled samples array and timestamp: { event: 'frame', frame, t }. NOTE: Mainly for testing. | false
vadEventsEnabled | If true, sends vad user-event objects with status counters and current log-energy in decibels: { event: 'vad', active, inactive, db, t }. NOTE: Mainly for testing. | false
featureEventsEnabled | If true, sends feature user-event objects with the normalized feature vector, log-energy, timestamp, and duration: { event: 'feature', vector, le, t, d }. NOTE: Mainly for testing. | false
visemeEventsEnabled | If true, sends viseme user-event objects containing extended viseme information, including the predicted viseme, feature vector, distance array, timestamp, and duration: { event: 'viseme', viseme, vector, distances, t, d }. NOTE: Mainly for testing. | false
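For example, a debugging configuration that enables the viseme and VAD user events might look like this (a sketch only; it assumes the same constructor shape as in the code example above):

```javascript
// Hypothetical debug configuration: enable extra user events.
// These flags are mainly for testing and add processing overhead.
const debugOptions = {
  processorOptions: {
    visemeEventsEnabled: true, // extended viseme info per prediction
    vadEventsEnabled: true     // VAD status counters + current dB level
  }
};
// In the browser: const headaudio = new HeadAudio(audioCtx, debugOptions);
```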

The supported parameterData are:

Parameter | Description | Default
--- | --- | ---
vadMode | 0 = Disabled, 1 = Gate. If disabled, processing relies only on silence prototypes (see silMode). Gate mode is a simple energy-based VAD suitable for low and stable noise floors with high SNR. | 1
vadGateActiveDb | Decibel threshold above which audio is classified as active. | -40
vadGateActiveMs | Duration (ms) required before switching from inactive to active. | 10
vadGateInactiveDb | Decibel threshold below which audio is classified as inactive. | -50
vadGateInactiveMs | Duration (ms) required before switching from active to inactive. | 10
silMode | 0 = Disabled, 1 = Manual calibration, 2 = Auto (NOT IMPLEMENTED). If disabled, only trained SIL prototypes are used. In manual mode, the app must perform silence calibration. Auto mode is currently not implemented. | 1
silCalibrationWindowSec | Silence-calibration window in seconds. | 3.0
silSensitivity | Sensitivity to silence. | 1.2
speakerMeanHz | Estimated speaker mean frequency in Hz [50–500]. Adjusting this gently stretches/compresses the Mel spacing and frequency range to better match the speaker's vocal-tract resonances and harmonic structure. Typical values: adult male 100–130, adult female 200–250, child 300–400. EXPERIMENTAL | 150
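The gate-style VAD described by the vadGate* parameters can be sketched as a small hysteresis state machine (illustrative only; HeadAudio's internal implementation may differ):

```javascript
// Illustrative energy-gate VAD with hysteresis (not HeadAudio's actual code).
// The options mirror the vadGateActiveDb/InactiveDb/ActiveMs/InactiveMs
// parameters described above.
function createVadGate({ activeDb = -40, inactiveDb = -50,
                         activeMs = 10, inactiveMs = 10 } = {}) {
  let active = false; // current VAD state
  let holdMs = 0;     // how long the opposite condition has held

  // Feed one measurement: current level in dB and elapsed time in ms.
  // Returns the (possibly updated) active state.
  return function update(db, dtMs) {
    if (!active && db > activeDb) {
      holdMs += dtMs;
      if (holdMs >= activeMs) { active = true; holdMs = 0; }
    } else if (active && db < inactiveDb) {
      holdMs += dtMs;
      if (holdMs >= inactiveMs) { active = false; holdMs = 0; }
    } else {
      holdMs = 0; // condition not held continuously; reset the timer
    }
    return active;
  };
}
```

The gap between the two thresholds (here -40 dB vs. -50 dB) is what prevents the gate from chattering when the level hovers near a single threshold.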

[!TIP] All audio parameters can be changed in real-time, e.g.: headaudio.parameters.get("vadMode").value = 0;

Supported HeadAudio class events:

Event | Description
--- | ---
onvalue(key, value) | Called when a viseme blend-shape value is updated. key is one of: 'viseme_aa', 'viseme_E', 'viseme_I', 'viseme_O', 'viseme_U', 'viseme_PP', 'viseme_SS', 'viseme_TH', 'viseme_DD', 'viseme_FF', 'viseme_kk', 'viseme_nn', 'viseme_RR', 'viseme_CH', 'viseme_sil'. value is in the range [0,1].
onstarted(data) | Speech start event { event: "start", t }.
onended(data) | Speech end event { event: "end", t }.
onframe(data) | Frame event { event: "frame", frame, t }. Contains 32-bit float 16 kHz mono samples. Requires frameEventsEnabled to be true.
onvad(data) | VAD event { event: "vad", t, db, active, inactive }. Requires vadEventsEnabled to be true.
onfeature(data) | Feature event { event: "feature", vector, t, d }. Requires featureEventsEnabled to be true.
onviseme(data) | Viseme event { event: "viseme", viseme, t, d, vector, distances }. Requires visemeEventsEnabled to be true.
oncalibrated(data) | Calibration event { event: "calibrated", t, [error] }.
onprocessorerror(event) | Fired when an internal processor error occurs.
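As a sketch of how the start/end events can be used, the helper below accumulates speech segments from them (illustrative; the event object shapes follow the table above, but the segment bookkeeping is our own addition):

```javascript
// Illustrative: track speech segments from start/end user events.
// Event shapes follow the table above ({ event, t }).
function createSegmentTracker() {
  const segments = [];
  let startT = null;
  return {
    // Wire these to headaudio.onstarted / headaudio.onended.
    onstarted(data) { startT = data.t; },
    onended(data) {
      if (startT !== null) {
        segments.push({ start: startT, end: data.t });
        startT = null; // ignore an end without a matching start
      }
    },
    segments
  };
}

// Usage:
// const tracker = createSegmentTracker();
// headaudio.onstarted = tracker.onstarted;
// headaudio.onended = tracker.onended;
```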


Training

[!IMPORTANT] You do NOT need to train your own model as a pre-trained model is provided. However, if you want to train a custom model, the process below d
