# Lyra

[ICCV 2025] Official implementation of "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"
<img src="assets/lyra.svg" alt="icon" width="30" height="30"> <span style="font-size:30px;">Lyra: An Efficient and Speech-Centric Framework <br>for Omni-Cognition</span>
<a href='https://huggingface.co/papers/2412.09501'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Discussion-orange'></a> <a href='https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a> <a href='https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Evaluation-yellow'></a><br> <a href='https://arxiv.org/pdf/2412.09501.pdf'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://www.youtube.com/watch?v=7kh-M0jmmtI'><img src='https://img.shields.io/badge/Video-YouTube-red'></a> <a href='https://103.170.5.190:17860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://lyra-omni.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
Overview of Lyra:
<div align=center> <img width="98%" src="assets/overview.png"/> </div>

Compared with leading omni-models, Lyra offers:

- Stronger performance: achieves SOTA results across a variety of speech-centric tasks.
- Greater versatility: supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
- Higher efficiency: requires less training data and supports faster training and inference.
## Release
- [12/12] 🔥 Lyra is coming! We release the paper, demo, code, models, training data and evaluation data. More related checkpoints will be released soon!
## Demo
We provide a video demo here for a better experience and illustration. More examples can be found on our project page, and feel free to try our online demo! Due to computing costs, the GPU memory of the demo machine (GeForce RTX 3090), and upload-storage limits, the long-speech function is not supported in the current online demo. 😰
❗❗❗For the online demo, start by selecting the instruction type (either speech or text) in the top-left corner.
<p align="center" width="98%"> <a href="https://youtu.be/7kh-M0jmmtI" target="_blank"> <img src="https://raw.githubusercontent.com/dvlab-research/Lyra/main/assets/video.png" alt="Lyra" style="width: 98%; min-width: 300px; display: block; margin: auto;"> </a> </p>

## Install
Please follow the instructions below to install the required packages.
- Clone this repository:

```bash
git clone https://github.com/dvlab-research/Lyra.git
```
- Install the package:

```bash
conda create -n lyra python=3.10 -y
conda activate lyra
cd Lyra
pip install --upgrade pip
pip install -e .
```
- Install optional packages for simultaneous text-speech generation:

```bash
pip install pip==24.0
pip install fairseq==0.12.2
pip install --upgrade pip
```
## Model
<div align=center> <img width="98%" src="assets/framework.png"/> </div>

Lyra supports multi-modal inputs. When the data contains a speech modality, we use the latent cross-modality regularizer to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, multi-modality LoRA and latent multi-modality extraction modules operate synergistically, facilitating the simultaneous generation of both speech and text outputs.
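The flow described above can be sketched in a few lines. This is purely illustrative: every module name, dictionary key, and signature below is a hypothetical stand-in, not Lyra's actual API.

```python
# Hypothetical sketch of the multi-modal flow described above; all names
# here are illustrative stand-ins, not Lyra's real modules.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class OmniOutput:
    text_tokens: List[int]
    speech_units: Optional[List[int]]  # None when speech output is not requested

def omni_forward(
    encoders: Dict[str, Callable],
    projectors: Dict[str, Callable],
    llm: Callable,
    inputs: Dict[str, object],
    generate_speech: bool = True,
) -> OmniOutput:
    # 1. Each modality passes through its own encoder and projector.
    embeddings = []
    for modality, data in inputs.items():
        features = encoders[modality](data)
        embeddings.append(projectors[modality](features))
    # 2. The LLM (with multi-modality LoRA applied) consumes the joint sequence.
    hidden = llm(embeddings)
    # 3. Latent multi-modality extraction yields speech units alongside text tokens.
    speech = hidden["speech"] if generate_speech else None
    return OmniOutput(text_tokens=hidden["text"], speech_units=speech)
```

The point of the sketch is the per-modality encoder/projector stage feeding one shared LLM, with speech units produced only when speech output is requested.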
We provide all our fully finetuned models:
| Model | Base LLM | Vision Encoder | Speech Encoder | Projector | Full CKPT |
| --- | --- | --- | --- | --- | --- |
| Lyra_Mini_3B | Qwen2VL_2B_LLM | Qwen2VL_2B_ViT | whisper-large-v3-turbo | 3B_proj | 3B_ckpt |
| Lyra_Base_9B | Qwen2VL_7B_LLM | Qwen2VL_7B_ViT | whisper-large-v3 | 9B_proj | 9B_ckpt |
| Lyra_Pro_74B | Qwen2VL_70B_LLM | Qwen2VL_70B_ViT | whisper-large-v3 | 74B_proj | 74B_ckpt |
## Preparation
### Training Data
We provide the processed data for model training. All speech-related training data can be downloaded from Lyra-Data.
For the model pretraining data, please download the following multi-modality training data and organize it as:
⇒ means put the data in the local folder. The pretraining JSON file can be downloaded from Lyra_Pretrain.

- LibriSpeech ⇒ `data/Lyra_Pretrain/LibriSpeech`, ⇒ `data/Lyra_SFT/multi_modality_speech/LibriSpeech`, and ⇒ `data/Lyra_Eval/LibriSpeech`; download all training and development data.
- Common Voice ⇒ `data/Lyra_Pretrain/CommonVoice`; download the English Common Voice Corpus.
During the pretraining process, we filtered out some noisy and short audio speech data.
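The exact filtering criteria are not spelled out here; as a rough illustration, a duration-based filter over WAV files might look like the sketch below. The `min_sec`/`max_sec` thresholds are hypothetical, not the values used for Lyra.

```python
# Hypothetical duration-based filter for speech clips (not Lyra's actual rule).
import wave

def speech_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def keep_utterance(path: str, min_sec: float = 1.0, max_sec: float = 30.0) -> bool:
    """Keep only utterances within a plausible duration range."""
    duration = speech_duration(path)
    return min_sec <= duration <= max_sec
```

A noise filter (e.g. based on signal energy or an ASR confidence score) would sit alongside this length check, but that part is dataset-specific.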
For the image part of the finetuning data, similar to Mini-Gemini, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- COCO train2017 ⇒ `data/Lyra_SFT/multi_modality_image/coco`
- GQA ⇒ `data/Lyra_SFT/multi_modality_image/gqa`
- OCR-VQA (we save all files as `.jpg`) ⇒ `data/Lyra_SFT/multi_modality_image/ocr_vqa`
- TextVQA (not included for training) ⇒ `data/Lyra_SFT/multi_modality_image/textvqa`
- VisualGenome part1, VisualGenome part2 ⇒ `data/Lyra_SFT/multi_modality_image/vg`
- ShareGPT4V-100K ⇒ `data/Lyra_SFT/multi_modality_image/sam,share_textvqa,wikiart, ...`
- LAION GPT4V ⇒ `data/Lyra_SFT/multi_modality_image/gpt4v-dataset`
- ALLaVA Instruction ⇒ `data/Lyra_SFT/multi_modality_image/ALLaVA-4V`
- DocVQA ⇒ `data/Lyra_SFT/multi_modality_image/docvqa`
- ChartQA ⇒ `data/Lyra_SFT/multi_modality_image/chartqa`
- DVQA ⇒ `data/Lyra_SFT/multi_modality_image/dvqa`
- AI2D ⇒ `data/Lyra_SFT/multi_modality_image/ai2d`
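With this many directories it is easy to miss one, so a quick sanity check of the layout can save a failed training run. The helper below is a hypothetical sketch (the function name and `root` argument are illustrative; the folder names are taken from the list above, omitting the multi-folder ShareGPT4V entry):

```python
# Hypothetical layout check for the image finetuning data described above.
from pathlib import Path

EXPECTED_IMAGE_DIRS = [
    "coco", "gqa", "ocr_vqa", "textvqa", "vg",
    "gpt4v-dataset", "ALLaVA-4V", "docvqa", "chartqa", "dvqa", "ai2d",
]

def missing_image_dirs(root: str = ".") -> list:
    """Return the expected image-data folders that are absent under root."""
    base = Path(root) / "data" / "Lyra_SFT" / "multi_modality_image"
    return [name for name in EXPECTED_IMAGE_DIRS if not (base / name).is_dir()]
```

Running `missing_image_dirs()` from the repository root before launching training reports any folders that still need to be downloaded.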
For the audio part of the finetuning data, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- Lyra_MultiModal ⇒ `data/Lyra_SFT/multi_modality_speech/Lyra_MM`

For reproduction details, please refer to the Lyra multi-modality preparation.
For the long-speech audio finetuning data, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- Lyra_LongSpeech ⇒ `data/Lyra_SFT/long_speech/Lyra_LongSpeech`

For reproduction details, please refer to the Lyra long-speech preparation.
For the text-speech generation data, please download the following instruction data and organize it as: