
OpenOmni

(NIPS 2025) OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis


<div align=center> <img src="assets/logo.png" width="140px"> </div>

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

<font size=5><div align='center' > [📖 arXiv Paper] [📊 Datasets] [🏆 Models] </div></font>

OpenOmni is a fully open-source, end-to-end pioneering method that incorporates image, speech, and text into an omnimodal large language model. Its design for speech generation, through language bridging and text-guided speech, allows it to be trained quickly even when omnimodal data and VRAM are scarce. OpenOmni supports not only omnimodal understanding but also two real-time emotional speech generation modes, CTC mode and AR mode, so users can flexibly trade off generation speed against quality. The flexible framework design allows OpenOmni to be applied easily and quickly to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to come and experience it now!

🔥 Update

  • [2025/09/22]🔥After a year of community evaluation, our work has been accepted by NIPS 2025. Congratulations!
  • [2025/05/26]🔥Our [OmniCharacter], built on the MMEvol and OpenOmni series, has been accepted to the main track of ACL 2025. You're all welcome to give it a try!
  • [2025/05/15]🔥Two papers building on our findings (LLaMA-Omni2 and OmniCharacter) have been accepted to the ACL 2025 main track. We warmly welcome everyone to use our work.
  • [2025/05/05]🔥Our gate fusion technique for more accurate speech content generation has been adopted by LLaMA-Omni2.
  • [2025/02/12]🔥Added missing files and fixed known bugs.
  • [2025/01/13]🔥OpenOmni is coming! We release the code, models, and data.
  • [2025/01/09]🔥After two months of company audit, we release the paper.
  • [2024/11/14]🔥We submit the paper for peer review on OpenReview.
  • [2024/09/15]🔥We write the first line of the OpenOmni project: a fully open-source, end-to-end pioneering OmniLLM.

<font style="color:rgb(31, 35, 40);">👀</font><font style="color:rgb(31, 35, 40);"> Contents</font>

  • <font style="color:rgb(31, 35, 40);">Setup</font>
  • <font style="color:rgb(31, 35, 40);">Model</font>
  • <font style="color:rgb(31, 35, 40);">Preparation</font>
  • <font style="color:rgb(31, 35, 40);">Train</font>
  • <font style="color:rgb(31, 35, 40);">Evaluation</font>
  • <font style="color:rgb(31, 35, 40);">Example</font>
  • <font style="color:rgb(31, 35, 40);">Citation</font>

<font style="color:rgb(31, 35, 40);">📷</font><font style="color:rgb(31, 35, 40);"> Setup</font>

<font style="color:rgb(31, 35, 40);">Please follow the instructions below to install the required packages.</font>

  1. <font style="color:rgb(31, 35, 40);">Clone this repository</font>

```bash
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
```

  2. <font style="color:rgb(31, 35, 40);">Install Package</font>

```bash
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip  # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
```

  3. <font style="color:rgb(31, 35, 40);">Install additional packages for training</font>

```bash
pip install flash-attn --no-build-isolation
```
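Once the environment is built, a quick sanity check can confirm that the key packages are importable before moving on to training. The snippet below is a hypothetical helper (not part of the repository) that reports which packages are present:

```python
# Hypothetical helper (not part of OpenOmni): verify the conda env
# has the packages the training scripts rely on.
import importlib.util

def check_packages(names):
    """Map each package name to True/False depending on importability."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages(["torch", "transformers", "flash_attn"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Using `find_spec` rather than a bare `import` means a missing package is reported instead of raising at the first failure.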

🔥 Fast Usage

After downloading the weights, configure the paths accordingly. Two open-source speech tokenizers with different vocabulary sizes are needed for speech discretization and reconstruction: CosyVoice for the 6K CTC mode and GLM-4-Voice for the 16K AR mode.

Fast inference for omnimodal input (speech, text, image, and video):

```bash
python inference.py
```

Fast interaction for omnimodal input (speech, text, image, and video):

```bash
python demo.py
```
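The CTC mode's speed advantage comes from non-autoregressive decoding: the model emits a frame-level sequence of speech units that is then collapsed by the standard CTC rule (merge consecutive repeats, then drop blanks), whereas the AR mode generates units one at a time. As a rough illustration of the collapse step (a generic sketch of standard CTC decoding, not OpenOmni's actual decoder):

```python
def ctc_collapse(tokens, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in tokens:
        # Keep a token only when it differs from its predecessor
        # (de-duplication) and is not the blank symbol.
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

print(ctc_collapse([0, 5, 5, 0, 7, 7, 7, 0]))  # → [5, 7]
```

Because every frame is predicted in parallel and the collapse is a single linear pass, CTC decoding is much faster than autoregressive generation, at some cost in quality.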

<font style="color:rgb(31, 35, 40);">Model</font>

<font style="color:rgb(31, 35, 40);">Here are the pretrained weights and instruction-tuning weights.</font>

| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
| --- | --- | --- | --- | --- | --- |
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |

<font style="color:rgb(31, 35, 40);">Preparation</font>

<font style="color:rgb(31, 35, 40);">Dataset</font>

<font style="color:rgb(31, 35, 40);">Please follow MMEvol to prepare the corresponding images-text datasets. Here we only provide the details of speech-text datasets.</font>

The following is the data directory tree of OpenOmni

<font style="color:rgb(31, 35, 40);">data structure</font>

```
datasets
├── json                # data recipes
│   ├── openomni_stage1-1.json   # speech2text pretraining
│   ├── openomni_stage2-1.json   # image2text pretraining
│   ├── openomni_stage2-2.json   # image2text instruction tuning
│   ├── openomni_stage3-1.json   # text2speech pretraining
│   └── openomni_stage3-2.json   # text2speech emotional injection
├── asr                 # classic bilingual speech corpus
│   ├── AISHELL-4
│   ├── LibriSpeech
│   └── WeNetSpeech
├── audio_en            # synthetic English speech corpus (questions)
├── audio_llava         # synthetic bilingual speech corpus (answers)
├── audio_zh            # synthetic Chinese speech corpus (questions)
├── audio_unit          # synthetic bilingual speech corpus (answers)
├── audio_prefer        # synthetic emotional bilingual speech corpus (answers)
├── audio_reject        # synthetic emotional bilingual speech corpus (answers)
├── audio_ultrachat     # synthetic bilingual speech corpus (answers)
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......
```


  • All files and paths starting with "audio" are self-synthesized.
  • The DPO data contains approximately 9k "prefer"/"reject" pairs, covering 9 emotion types.
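When assembling the directory above, a small check script can catch missing folders before a training run fails midway. This is a hypothetical helper based only on the tree shown here (`ai2d` and the elided entries are dataset-specific, so extend `EXPECTED` as needed):

```python
from pathlib import Path

# Top-level folders taken from the data tree above.
EXPECTED = [
    "json", "asr", "audio_en", "audio_llava", "audio_zh",
    "audio_unit", "audio_prefer", "audio_reject", "audio_ultrachat",
]

def missing_dirs(root):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]

if __name__ == "__main__":
    gone = missing_dirs("datasets")
    print("missing:", gone or "none")
```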

More details about data curation can be found in our paper.

<font style="color:rgb(31, 35, 40);">Train</font>

<font style="color:rgb(31, 35, 40);">Speech2Text Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please download MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and the OpenOmni data, and organize them following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
```

<font style="color:rgb(31, 35, 40);">Image2Text Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
```

<font style="color:rgb(31, 35, 40);">Image2Text Instruction Tuning</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetune.sh
```

<font style="color:rgb(31, 35, 40);">Text2Speech Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>
