
HeadTTS

HeadTTS: Free neural text-to-speech (Kokoro) with timestamps and visemes for lip-sync. Runs in-browser (WebGPU/WASM) or on local Node.js WebSocket/REST server (CPU).

Install / Use

/learn @met4citizen/HeadTTS

# <img src="logo.png" width="100"/> HeadTTS

HeadTTS is a free JavaScript text-to-speech (TTS) solution that provides phoneme-level timestamps and Oculus visemes for lip-sync, in addition to audio output (WAV/PCM). It uses the Kokoro neural model and voices, and inference can run entirely in the browser (WebGPU or WASM) or, alternatively, on a Node.js WebSocket/RESTful server (WebGPU or CPU).

  • Pros: Free. Doesn't require a server in in-browser mode. WebGPU support. Uses neural voices with a StyleTTS 2 model. Great for lip-sync use cases and fully compatible with TalkingHead. MIT licensed; doesn't depend on eSpeak or any other GPL-licensed module.

  • Cons: Only the latest desktop browsers have WebGPU support enabled by default, and the WASM fallback is much slower. Kokoro is a lightweight model, but it still takes time to load on first use and consumes a lot of memory. English is currently the only supported language.

👉 If you're using a desktop browser, check out the IN-BROWSER DEMO! If your browser doesn't have WebGPU support enabled, the demo app falls back to WASM.

The project uses websockets/ws (MIT License), huggingface/transformers.js (with ONNX Runtime) (Apache 2.0 License), and onnx-community/Kokoro-82M-v1.0-ONNX-timestamped (Apache 2.0 License) as runtime dependencies. For information on language modules and dictionaries, see Appendix B. Jest is used for testing.

You can find the list of supported English voices and voice samples here.


## In-browser Module: headtts.mjs

The HeadTTS JavaScript module enables in-browser text-to-speech using Module Web Workers and WebGPU/WASM inference. Alternatively, it can connect to and use the HeadTTS Node.js WebSocket/RESTful server.

Create a new HeadTTS class instance:

```javascript
import { HeadTTS } from "./modules/headtts.mjs";

const headtts = new HeadTTS({
  endpoints: ["ws://127.0.0.1:8882", "webgpu"], // Endpoints in order of priority
  languages: ["en-us"], // Language modules to pre-load (in-browser)
  voices: ["af_bella", "am_fenrir"] // Voices to pre-load (in-browser)
});
```

Beware that if you import the HeadTTS module from a CDN, you may need to set the workerModule and dictionaryURL options explicitly, as the default relative paths will likely not work:

```javascript
import { HeadTTS } from "https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/+esm";

const headtts = new HeadTTS({
  /* ... */
  workerModule: "https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/modules/worker-tts.mjs",
  dictionaryURL: "https://cdn.jsdelivr.net/npm/@met4citizen/headtts@1.3/dictionaries/"
});
```
<details> <summary>CLICK HERE to see all the OPTIONS.</summary>

Option | Description | Default value
--- | --- | ---
`endpoints` | List of WebSocket/RESTful servers or backends `webgpu` or `wasm`, in order of priority. If one fails, the next is used. | `["webgpu", "wasm"]`
`audioCtx` | Audio context for creating audio buffers. If `null`, a new one is created. | `null`
`workerModule` | URL of the HeadTTS Web Worker module. Enables use from a CDN. If set to `null`, the relative path/file `./worker-tts.mjs` is used. | `null`
`transformersModule` | URL of the transformers.js module to load. | `"https://cdn.jsdelivr.net/npm/@huggingface/transformers@4.0.0/dist/transformers.min.js"`
`model` | Kokoro text-to-speech ONNX model (timestamped) used for in-browser inference. | `"onnx-community/Kokoro-82M-v1.0-ONNX-timestamped"`
`dtypeWebgpu` | Data type precision for WebGPU inference: `"fp32"` (recommended), `"fp16"`, `"q8"`, `"q4"`, or `"q4f16"`. | `"fp32"`
`dtypeWasm` | Data type precision for WASM inference: `"fp32"`, `"fp16"`, `"q8"`, `"q4"`, or `"q4f16"`. | `"q4"`
`styleDim` | Style embedding dimension for inference. | `256`
`audioSampleRate` | Audio sample rate in Hz for inference. | `24000`
`frameRate` | Frame rate in FPS for inference. | `40`
`languages` | Language modules to be pre-loaded. | `["en-us"]`
`dictionaryURL` | URL to language dictionaries. Set to `null` to disable dictionaries. | `"../dictionaries"`
`voiceURL` | URL for loading voices. If the given value is a relative URL, it should be relative to the worker file location. | `"https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX/resolve/main/voices"`
`voices` | Voices to preload (e.g., `["af_bella", "am_fenrir"]`). | `[]`
`splitSentences` | Whether to split text into sentences. | `true`
`splitLength` | Maximum length (in characters) of each text chunk. | `500`
`deltaStart` | Adjustment (in ms) to viseme start times. | `-10`
`deltaEnd` | Adjustment (in ms) to viseme end times. | `10`
`defaultVoice` | Default voice to use. | `"af_bella"`
`defaultLanguage` | Default language to use. | `"en-us"`
`defaultSpeed` | Speaking speed. Range: 0.25–4. | `1`
`defaultAudioEncoding` | Default audio format: `"wav"` or `"pcm"` (PCM 16-bit LE). | `"wav"`
`trace` | Bitmask for debugging subsystems (0=none, 255=all):<ul><li>Bit 0 (1): Connection</li><li>Bit 1 (2): Messages</li><li>Bit 2 (4): Events</li><li>Bit 3 (8): G2P</li><li>Bit 4 (16): Language modules</li></ul> | `0`

Note: Model related options apply only to in-browser inference. If inference is performed on a server, server-specific settings will apply instead.

</details>
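As an illustration, several of the options above can be combined into a single configuration object. This is a sketch only; the specific values chosen (endpoint order, precision, chunk length) are illustrative examples, not recommendations from the project:

```javascript
// Example HeadTTS option object combining settings from the options table.
// All values here are illustrative choices, not recommendations.
const options = {
  endpoints: ["webgpu", "wasm"], // try WebGPU first, fall back to WASM
  dtypeWebgpu: "fp32",           // recommended precision for WebGPU
  dtypeWasm: "q4",               // default quantized precision for WASM
  splitSentences: true,          // split long input into sentences
  splitLength: 500,              // max characters per text chunk
  defaultVoice: "af_bella",
  defaultLanguage: "en-us",
  defaultSpeed: 1                // range 0.25–4
};
// const headtts = new HeadTTS(options);
```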

Connect to the first supported/available endpoint:

```javascript
try {
  await headtts.connect();
} catch(error) {
  console.error(error);
}
```

Create an onmessage event handler to handle response messages. In this example, we use a TalkingHead instance head to play the incoming audio and lip-sync data:

```javascript
// Speak and lipsync
headtts.onmessage = (message) => {
  if ( message.type === "audio" ) {
    try {
      head.speakAudio( message.data, {}, (word) => {
        console.log(word);
      });
    } catch(error) {
      console.error(error);
    }
  } else if ( message.type === "custom" ) {
    console.log("Received custom message, data=", message.data);
  } else if ( message.type === "error" ) {
    console.error("Received error message, error=", message.data.error);
  }
};
```
<details> <summary>CLICK HERE to see all the available class EVENTS.</summary>

Event handler | Description
--- | ---
`onstart` | Triggered when the first message is added and all message queues were previously empty.
`onmessage` | Handles incoming messages of type `audio`, `error`, and `custom`. For details, see the API section.
`onend` | Triggered when all message queues become empty.
`onerror` | Handles system or class-level errors. If this handler is not set, such errors are thrown as exceptions. Note: Errors related to TTS conversion are sent to the `onmessage` handler (if defined) as messages of type `error`.

</details>
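The queue lifecycle events can be used, for example, to toggle a "speaking" indicator in a UI. A minimal sketch using the event names from the table above; `headtts` here is a plain placeholder object standing in for a connected HeadTTS instance, and the handler signatures are assumptions:

```javascript
// Placeholder object standing in for a HeadTTS instance (illustration only).
const headtts = {};

// Fires when the first message is queued and all queues were previously empty.
headtts.onstart = () => console.log("Synthesis queue active");

// Fires when all message queues become empty again.
headtts.onend = () => console.log("Synthesis queue empty");

// System/class-level errors; without this handler they are thrown instead.
headtts.onerror = (error) => console.error("System error:", error);
```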

Set up the voice:

```javascript
headtts.setup({
  voice: "af_bella",
  language: "en-us",
  speed: 1,
  audioEncoding: "wav"
});
```

The HeadTTS client is stateful, so you don't need to call setup again unless you want to change a setting. For example, if you want to increase the speed, simply call headtts.setup({ speed: 1.5 }).

Synthesize speech using the current voice setup:

```javascript
headtts.synthesize({
  input: "Test sentence."
});
```

The approach above relies on the onmessage event handler to receive and handle response messages, and it is the recommended approach for real-time use cases. An alternative approach is to await all the related audio messages:

```javascript
try {
  const messages = await headtts.synthesize({
    input: "Some long text..."
  });
  console.log(messages); // [{type: 'audio', data: {…}, ref: 1}, {…}, ...]
} catch(error) {
  console.error(error);
}
```

The input property can be a string or, alternatively, an array of strings or input items.

<details> <summary>CLICK HERE to see the available input ITEM TYPES.</summary>

Type | Description | Example
---|---|---
`text` | Speak the text in `value`. This is equivalent to giving a pure string input. | <pre>{<br> type: "text",<br> value: "This is an example."<br>}</pre>
`speech` | Speak the text in `value` with corresponding subtitles in `subtitles` (optional). This type allows the spoken words to differ from the subtitles. | <pre>{<br> type: "speech",<br> value: "One two three",<br> subtitles: "123"<br>}</pre>
`phonetic` | Speak the model-specific phonetic alphabets in `value` with corresponding subtitles (optional). | <pre>{<br> type: "phonetic",<br> value: "mˈɜɹʧəndˌIz",<br> subtitles: "merchandise"<br>}</pre>
`characters` | Speak the `value` character by character with corresponding subtitles (optional). Also supports numbers, which are read digit by digit. | <pre>{<br> type: "characters",<br> value: "ABC-123-8",<br> subtitles: "ABC-123-8"<br>}</pre>
`number` | Speak the number in `value` with corresponding subtitles (optional). The number should be presented as a string. | <pre>{<br> type: "number",<br> value: "123.5",<br> subtitles: "123.5"<br>}</pre>
`date` | Speak the date in `value` with corresponding subtitles (optional). The date is presented as milliseconds from epoch. | <pre>{<br> type: "date",<br> value: Date.now(),<br> subtitles: "02/05/2025"<br>}</pre>
`time` | Speak the time in `value` with corresponding subtitles (optional). The time is presented as milliseconds from epoch. | <pre>{<br> type: "time",<br> value: Date.now(),<br> subtitles: "6:45 PM"<br>}</pre>
`break` | The length of the break in milliseconds in `value`, with corresponding subtitles (optional). | 

</details>
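To illustrate, item types can be mixed with plain strings in a single input array passed to synthesize. This is a sketch only; the object shape follows the synthesize examples earlier, and the field values are illustrative:

```javascript
// A mixed input array using item types from the table above; plain strings
// behave like { type: "text" } items. Values are illustrative.
const input = [
  "Your order number is",
  { type: "characters", value: "ABC-123-8", subtitles: "ABC-123-8" },
  { type: "number", value: "123.5", subtitles: "123.5" }
];
// headtts.synthesize({ input });
```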
