Moshi: a speech-text foundation model for real time dialogue

[[Read the paper]][moshi] [Demo] [Hugging Face]

[Moshi][moshi] is a speech-text foundation model and full-duplex spoken dialogue framework. It uses [Mimi][moshi], a state-of-the-art streaming neural audio codec. Talk to Moshi now in our live demo.

Organisation of the repository

There are three separate versions of the Moshi inference stack in this repo.

  • PyTorch: for research and tinkering. The code is in the moshi/ directory.
  • MLX: for on-device inference on iPhone and Mac. The code is in the moshi_mlx/ directory.
  • Rust: for production. The code is in the rust/ directory. This contains in particular a Mimi implementation in Rust, with Python bindings available as rustymimi.

Finally, the code for the web UI client used in the Moshi demo is provided in the client/ directory.

If you want to fine-tune Moshi, head over to kyutai-labs/moshi-finetune.

Other Kyutai models

The Moshi codebase is also used to run related models from Kyutai that use a multi-stream architecture similar to Moshi.

Model architecture

Moshi models two streams of audio: one corresponds to Moshi speaking, and the other one to the user speaking. Along with these two audio streams, Moshi predicts text tokens corresponding to its own speech, its inner monologue, which greatly improves the quality of its generation. A small Depth Transformer models inter-codebook dependencies for a given time step, while a large, 7B-parameter Temporal Transformer models the temporal dependencies. Moshi achieves a theoretical latency of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an L4 GPU.

<p align="center"> <img src="./moshi.png" alt="Schema representing the structure of Moshi. Moshi models two streams of audio: one corresponds to Moshi, and the other one to the user. At inference, the audio stream of the user is taken from the audio input, and the audio stream for Moshi is sampled from the model's output. Along that, Moshi predicts text tokens corresponding to its own speech for improved accuracy. A small Depth Transformer models inter codebook dependencies for a given step." width="650px"></p>
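The latency figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the constant names are illustrative, and the values are the ones stated in this section.

```python
# Figures quoted in the text; the variable names are illustrative.
MIMI_FRAME_MS = 80        # duration of one Mimi frame
ACOUSTIC_DELAY_MS = 80    # acoustic delay
PRACTICAL_MS_L4 = 200     # best practical latency reported on an L4 GPU

# Theoretical latency is the sum of the two fixed delays.
theoretical_ms = MIMI_FRAME_MS + ACOUSTIC_DELAY_MS

# Whatever remains is compute and I/O overhead in the best reported case.
overhead_ms = PRACTICAL_MS_L4 - theoretical_ms

print(f"theoretical latency: {theoretical_ms} ms")
print(f"best-case overhead on an L4: {overhead_ms} ms")
```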

Mimi

Mimi is a neural audio codec that compresses 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (latency of 80 ms, the frame size), yet performs better than existing non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) or SemantiCodec (50 Hz, 1.3 kbps).
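The numbers in this paragraph are tied together by simple arithmetic, sketched below. The 16-bit mono PCM baseline used for the compression ratio is an assumption for illustration and is not stated in this README.

```python
# Values quoted in the text; names are illustrative.
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # Mimi frame rate
BITRATE_BPS = 1_100       # 1.1 kbps

# Each frame covers this many input samples and this much time.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
frame_ms = 1000 / FRAME_RATE_HZ

# Bit budget available to encode a single frame.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ

# ASSUMPTION: compare against uncompressed 16-bit mono PCM.
pcm_bps = SAMPLE_RATE_HZ * 16
compression_ratio = pcm_bps / BITRATE_BPS

print(samples_per_frame, frame_ms, bits_per_frame, round(compression_ratio))
```

The 80 ms frame duration matches the streaming latency quoted above. A budget of 88 bits per frame would be consistent with, for example, 8 codebooks of 2048 entries (11 bits each); that split is an assumption here, not something this README states.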

Mimi builds on previous neural audio codecs such as SoundStream and EnCodec, adding a Transformer both in the encoder and decoder, and adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the average frame rate of text tokens (~3-4 Hz), and limit the number of autoregressive steps in Moshi. Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match a self-supervised representation from WavLM, which allows modeling semantic and acoustic information with a single model. Finally, and similarly to EBEN, Mimi uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality despite its low bitrate.

<p align="center"> <img src="./mimi.png" alt="Schema representing the structure of Mimi, our proposed neural codec. Mimi contains a Transformer in both its encoder and decoder, and achieves a frame rate closer to that of text tokens. This allows us to reduce the number of auto-regressive steps taken by Moshi, thus reducing the latency of the model." width="800px"></p>
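To make the effect of the lower frame rate concrete, the sketch below counts temporal steps for an utterance at 12.5 Hz and at 50 Hz, and the number of audio frames per text token. It assumes one autoregressive temporal step per codec frame; the function names are hypothetical.

```python
def temporal_steps(duration_s: float, frame_rate_hz: float) -> int:
    """Number of temporal steps, assuming one step per codec frame."""
    return round(duration_s * frame_rate_hz)


def frames_per_text_token(frame_rate_hz: float, text_rate_hz: float) -> float:
    """Average number of audio frames emitted per text token."""
    return frame_rate_hz / text_rate_hz


duration_s = 10.0
steps_mimi = temporal_steps(duration_s, 12.5)   # Mimi frame rate
steps_50hz = temporal_steps(duration_s, 50.0)   # a 50 Hz codec

print(f"12.5 Hz: {steps_mimi} steps, 50 Hz: {steps_50hz} steps")
print(f"reduction factor: {steps_50hz / steps_mimi:.1f}x")

# Text tokens arrive at roughly 3-4 Hz according to the text above.
for text_rate in (3.0, 4.0):
    ratio = frames_per_text_token(12.5, text_rate)
    print(f"text at {text_rate} Hz: {ratio:.2f} audio frames per text token")
```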

Models

We release three models:

  • Moshi fine-tuned on a male synthetic voice (Moshiko),
  • Moshi fine-tuned on a female synthetic voice (Moshika),
  • Mimi, our speech codec.

Depending on the backend, the available file formats and quantization will vary. Here is the list of Hugging Face repos for each model. Mimi is bundled in each of them and always uses the same checkpoint format.

All models are released under the CC-BY 4.0 license.

Requirements

You will need at least Python 3.10, with 3.12 recommended. For specific requirements, please check the individual backend directories. You can install the PyTorch and MLX clients with the following:

pip install -U moshi      # moshi PyTorch, from PyPI
pip install -U moshi_mlx  # moshi MLX, from PyPI, best with Python 3.12.
# Or the bleeding edge versions for Moshi and Moshi-MLX.
pip install -U -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
pip install -U -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"

pip install rustymimi  # mimi, rust implementation with Python bindings from PyPI

If you are not using Python 3.12, you might get an error when installing moshi_mlx or rustymimi (which moshi_mlx depends on). In that case, you will need to either install the Rust toolchain or switch to Python 3.12.
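Before installing, it can help to check which case your interpreter falls into. The helper below is a hypothetical sketch that only encodes the version constraints stated above; it is not part of the moshi packages.

```python
import sys

MIN_VERSION = (3, 10)   # minimum stated in the requirements
RECOMMENDED = (3, 12)   # recommended version


def check_python(version=sys.version_info[:2]) -> str:
    """Classify a (major, minor) tuple against the stated requirements."""
    if version < MIN_VERSION:
        return "unsupported"
    if version == RECOMMENDED:
        return "recommended"
    # Supported, but moshi_mlx or rustymimi may need the Rust toolchain.
    return "supported"


print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {check_python()}")
```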

While we hope that the present codebase will work on Windows, we do not provide official support for it. We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).
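A rough estimate shows why an unquantized model needs this much memory. The sketch assumes bf16 weights (2 bytes per parameter, suggested by the `kyutai/moshika-pytorch-bf16` repository name) and counts only the 7B-parameter Temporal Transformer; the Depth Transformer, Mimi, activations, and attention caches are ignored, so the real footprint is higher.

```python
# ASSUMPTIONS: bf16 weights, and only the 7B Temporal Transformer is counted.
TEMPORAL_PARAMS = 7_000_000_000
BYTES_PER_PARAM_BF16 = 2
GPU_MEMORY_GB = 24   # memory size quoted in the text

weights_gb = TEMPORAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e9
headroom_gb = GPU_MEMORY_GB - weights_gb

print(f"weights alone: {weights_gb:.1f} GB")
print(f"left for everything else on a {GPU_MEMORY_GB} GB GPU: {headroom_gb:.1f} GB")
```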

To use the Rust backend, you will need a recent version of the Rust toolchain. To compile with GPU support, you will also need CUDA properly installed for your GPU, in particular with nvcc.

PyTorch implementation

The PyTorch based API can be found in the moshi directory. It provides a streaming version of the audio tokenizer (mimi) and the language model (moshi).

To run in interactive mode, you need to start a server that runs the model. You can then use either the web UI or a command-line client.

Start the server with:

python -m moshi.server [--gradio-tunnel] [--hf-repo kyutai/moshika-pytorch-bf16]

Then access the web UI at localhost:8998. If your GPU is on a remote machine, connecting to it directly will not work: for security reasons, websites served over plain HTTP are not allowed to use the microphone. There are two ways to get around this:

  • Forward the remote port 8998 to your localhost using the ssh -L flag. Then connect to localhost:8998 as mentioned previously.
  • Use the --gradio-tunnel argument, setting up a tunnel with a URL accessible from anywhere. Keep in mind that this tunnel goes through the US and can add significant latency (up to 500ms from Europe). You can use --gradio-tunnel-token to set a fixed secret token and reuse the same address over time.

You can use --hf-repo to select a different pretrained model, by setting the proper Hugging Face repository.

Accessing a server that is not localhost via http may cause issues with using the microphone in the web UI (in some browsers this is only allowed using https).
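The rule of thumb behind this restriction can be sketched as a small check: browsers generally grant microphone access only over HTTPS or on loopback addresses. The function below is a hypothetical heuristic for illustration; actual behavior depends on the browser.

```python
from urllib.parse import urlsplit

# Loopback hosts that browsers typically treat as trustworthy over plain HTTP.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def microphone_likely_allowed(url: str) -> bool:
    """Heuristic: True for https URLs or http URLs on a loopback host.

    The URL must include a scheme (for example "http://").
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return True
    host = parts.hostname or ""
    return parts.scheme == "http" and host in LOOPBACK_HOSTS


for url in (
    "http://localhost:8998",       # local server or ssh -L forwarding
    "http://192.168.1.20:8998",    # remote machine over plain HTTP
    "https://example.com",         # any HTTPS endpoint, such as a tunnel
):
    print(url, microphone_likely_allowed(url))
```

This is why both workarounds listed above help: port forwarding makes the server appear on localhost, and the tunnel exposes it through a URL reachable from anywhere.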

A command-line client is also available:

python -m moshi.client [--url URL_TO_GRADIO]

However note that, unlike the web browser, this client is barebones: it does not perform any echo cancellation, nor does it try to compensate for a growing lag by skipping frames.
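For readers wondering what lag compensation by skipping frames could look like, here is a minimal sketch. It is a hypothetical illustration, not the web client's actual logic: a bounded buffer that drops the oldest audio frames once the backlog exceeds a chosen limit.

```python
from collections import deque


class FrameBuffer:
    """Hypothetical playback buffer that skips old frames to bound lag."""

    def __init__(self, max_lag_frames: int):
        self._frames = deque()
        self._max_lag_frames = max_lag_frames
        self.dropped = 0

    def push(self, frame) -> None:
        """Queue a frame, discarding the oldest ones if the backlog is too long."""
        self._frames.append(frame)
        while len(self._frames) > self._max_lag_frames:
            self._frames.popleft()
            self.dropped += 1

    def pop(self):
        """Return the next frame to play, or None if the buffer is empty."""
        return self._frames.popleft() if self._frames else None


# With 80 ms frames, a limit of 3 frames bounds the backlog to 240 ms.
buf = FrameBuffer(max_lag_frames=3)
for i in range(5):
    buf.push(i)

print("dropped:", buf.dropped)
print("next frame:", buf.pop())
```

Dropping frames trades short audible gaps for bounded delay; the command-line client makes no such trade, so any lag it accumulates persists.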

For more information, in particular on how to use the API directly, please check out moshi/README.md.

MLX implementation for local inference on macOS

Once you have inst
