
OpenOmni

(NIPS 2025) OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis


<div align=center> <img src="assets/logo.png" width="140px"> </div>

OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

<font size=5><div align='center' > [📖 arXiv Paper] [📊 Datasets] [🏆 Models] </div></font>

OpenOmni is a fully open-source, end-to-end pioneering method that incorporates image, speech, and text into an omnimodal large language model. Its design for speech generation, through language bridging and text-guided speech, allows it to be trained quickly even when omnimodal data and VRAM are scarce. OpenOmni supports not only omnimodal understanding but also two real-time emotional speech generation modes, CTC mode and AR mode, so users can flexibly trade off generation speed against quality. The flexible framework design allows OpenOmni to be applied easily and quickly to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to come and experience it now!

🔥 Update

  • [2025/09/22]🔥After a year of community evaluation, our work has been accepted by NIPS 2025. Congratulations!
  • [2025/05/26]🔥Our [OmniCharacter], built on the MMEvol and OpenOmni series, has been accepted to the main track of ACL 2025. You're all welcome to give it a try!
  • [2025/05/15]🔥Two papers building on our findings (LLaMA-Omni2 and OmniCharacter) have been accepted to the ACL 2025 main track. We warmly welcome everyone to use our work.
  • [2025/05/05]🔥Our gate fusion technique for more accurate speech content generation has been adopted by LLaMA-Omni2.
  • [2025/02/12]🔥Added missing files and fixed known bugs.
  • [2025/01/13]🔥OpenOmni is coming! We release the code, models, and data.
  • [2025/01/09]🔥After two months of company audit, we release the paper.
  • [2024/11/14]🔥We submit the paper for peer review on OpenReview.
  • [2024/09/15]🔥We write the first line of the OpenOmni project: a fully open-source, end-to-end pioneering OmniLLM.

<font style="color:rgb(31, 35, 40);">👀</font><font style="color:rgb(31, 35, 40);"> Contents</font>

  • <font style="color:rgb(31, 35, 40);">Setup</font>
  • <font style="color:rgb(31, 35, 40);">Model</font>
  • <font style="color:rgb(31, 35, 40);">Preparation</font>
  • <font style="color:rgb(31, 35, 40);">Train</font>
  • <font style="color:rgb(31, 35, 40);">Evaluation</font>
  • <font style="color:rgb(31, 35, 40);">Example</font>
  • <font style="color:rgb(31, 35, 40);">Citation</font>

<font style="color:rgb(31, 35, 40);">📷</font><font style="color:rgb(31, 35, 40);"> Setup</font>

<font style="color:rgb(31, 35, 40);">Please follow the instructions below to install the required packages.</font>

  1. <font style="color:rgb(31, 35, 40);">Clone this repository</font>

```bash
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
```

  2. <font style="color:rgb(31, 35, 40);">Install Package</font>

```bash
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip  # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
```

  3. <font style="color:rgb(31, 35, 40);">Install additional packages for training</font>

```bash
pip install flash-attn --no-build-isolation
```
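Once the environment is built, a quick sanity check can confirm that the key packages are importable before moving on to training. The snippet below is a hypothetical helper (not part of the repository) that reports which packages are present:

```python
# Hypothetical helper (not part of OpenOmni): verify the conda env
# has the packages the training scripts rely on.
import importlib.util

def check_packages(names):
    """Map each package name to True/False depending on importability."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, found in check_packages(["torch", "transformers", "flash_attn"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Using `find_spec` rather than a bare `import` means a missing package is reported instead of raising at the first failure.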

🔥 Fast Usage

After downloading the weights, configure the paths accordingly. Two open-source speech tokenizers with different vocabulary sizes are needed for speech discretization and reconstruction: CosyVoice for the 6K CTC mode and GLM-4-Voice for the 16K AR mode.

Fast inference for omnimodal input (speech, text, image, and video):

```bash
python inference.py
```

Fast interaction for omnimodal input (speech, text, image, and video):

```bash
python demo.py
```
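The CTC mode's speed advantage comes from non-autoregressive decoding: the model emits a frame-level sequence of speech units that is then collapsed by the standard CTC rule (merge consecutive repeats, then drop blanks), whereas the AR mode generates units one at a time. As a rough illustration of the collapse step (a generic sketch of standard CTC decoding, not OpenOmni's actual decoder):

```python
def ctc_collapse(tokens, blank=0):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in tokens:
        # Keep a token only when it differs from its predecessor
        # (de-duplication) and is not the blank symbol.
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

print(ctc_collapse([0, 5, 5, 0, 7, 7, 7, 0]))  # → [5, 7]
```

Because every frame is predicted in parallel and the collapse is a single linear pass, CTC decoding is much faster than autoregressive generation, at some cost in quality.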

<font style="color:rgb(31, 35, 40);">Model</font>

<font style="color:rgb(31, 35, 40);">Here are the pretrained weights and instruction-tuning weights.</font>

| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
| --- | --- | --- | --- | --- | --- |
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |

<font style="color:rgb(31, 35, 40);">Preparation</font>

<font style="color:rgb(31, 35, 40);">Dataset</font>

<font style="color:rgb(31, 35, 40);">Please follow MMEvol to prepare the corresponding images-text datasets. Here we only provide the details of speech-text datasets.</font>

The following is the data directory tree of OpenOmni

<font style="color:rgb(31, 35, 40);">data structure</font>

```
datasets
├── json                # data recipes
│   ├── openomni_stage1-1.json   # speech2text pretraining
│   ├── openomni_stage2-1.json   # image2text pretraining
│   ├── openomni_stage2-2.json   # image2text instruction tuning
│   ├── openomni_stage3-1.json   # text2speech pretraining
│   └── openomni_stage3-2.json   # text2speech emotional injection
├── asr                 # classic bilingual speech corpus
│   ├── AISHELL-4
│   ├── LibriSpeech
│   └── WeNetSpeech
├── audio_en            # synthetic English speech corpus (questions)
├── audio_llava         # synthetic bilingual speech corpus (answers)
├── audio_zh            # synthetic Chinese speech corpus (questions)
├── audio_unit          # synthetic bilingual speech corpus (answers)
├── audio_prefer        # synthetic emotional bilingual speech corpus (answers)
├── audio_reject        # synthetic emotional bilingual speech corpus (answers)
├── audio_ultrachat     # synthetic bilingual speech corpus (answers)
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......
```


  • All files and paths starting with "audio" are self-synthesized.
  • The DPO data contains approximately 9k "prefer"/"reject" pairs, covering 9 emotion types.
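When assembling the directory above, a small check script can catch missing folders before a training run fails midway. This is a hypothetical helper based only on the tree shown here (`ai2d` and the elided entries are dataset-specific, so extend `EXPECTED` as needed):

```python
from pathlib import Path

# Top-level folders taken from the data tree above.
EXPECTED = [
    "json", "asr", "audio_en", "audio_llava", "audio_zh",
    "audio_unit", "audio_prefer", "audio_reject", "audio_ultrachat",
]

def missing_dirs(root):
    """Return the expected sub-directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]

if __name__ == "__main__":
    gone = missing_dirs("datasets")
    print("missing:", gone or "none")
```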

More details about data curation can be found in our paper.

<font style="color:rgb(31, 35, 40);">Train</font>

<font style="color:rgb(31, 35, 40);">Speech2Text Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please download MMEvol, AISHELL-4, LibriSpeech, WeNetSpeech, and the OpenOmni data, and organize them following Preparation before training. Make sure to set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh
```

<font style="color:rgb(31, 35, 40);">Image2Text Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh
```

<font style="color:rgb(31, 35, 40);">Image2Text Instruction Tuning</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>

```bash
bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetune.sh
```

<font style="color:rgb(31, 35, 40);">Text2Speech Pretrain</font>

<font style="color:rgb(31, 35, 40);">Please make sure you download and organize the data following Preparation before training, and set up the corresponding training script with the correct settings (data path, weight path, and hyper-parameters).</font>
