# Lyra

[ICCV 2025] Official implementation of "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"
<img src="assets/lyra.svg" alt="icon" width="30" height="30"> <span style="font-size:30px;">Lyra: An Efficient and Speech-Centric Framework <br>for Omni-Cognition</span>
<a href='https://huggingface.co/papers/2412.09501'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Discussion-orange'></a> <a href='https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a> <a href='https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Evaluation-yellow'></a><br> <a href='https://arxiv.org/pdf/2412.09501.pdf'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://www.youtube.com/watch?v=7kh-M0jmmtI'><img src='https://img.shields.io/badge/Video-YouTube-red'></a> <a href='https://103.170.5.190:17860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://lyra-omni.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
Overview of Lyra:
<div align=center> <img width="98%" src="assets/overview.png"/> </div>

Compared with leading omni-models, Lyra offers:

- Stronger performance: achieves SOTA results across a variety of speech-centric tasks.
- Greater versatility: supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
- Higher efficiency: requires less training data and supports faster training and inference.
## Release
- [12/12] 🔥 Lyra is coming! We release the paper, demo, code, models, training data and evaluation data. More related checkpoints will be released soon!
## Demo
We provide a video demo here for a better experience and illustration. More examples can be found on our project page, and feel free to try our online demo! Due to computing costs, the GPU memory of the demo machine (GeForce RTX 3090), and upload-storage limits, the long-speech function is not supported in the current online demo. 😰
❗❗❗For the online demo, start by selecting the instruction type (either speech or text) in the top-left corner.
<p align="center" width="98%"> <a href="https://youtu.be/7kh-M0jmmtI" target="_blank"> <img src="https://raw.githubusercontent.com/dvlab-research/Lyra/main/assets/video.png" alt="Lyra" style="width: 98%; min-width: 300px; display: block; margin: auto;"> </a> </p>

## Install
Please follow the instructions below to install the required packages.
- Clone this repository:

```bash
git clone https://github.com/dvlab-research/Lyra.git
```
- Install the package:

```bash
conda create -n lyra python=3.10 -y
conda activate lyra
cd Lyra
pip install --upgrade pip
pip install -e .
```
- Install optional packages for simultaneous text-speech generation:

```bash
pip install pip==24.0
pip install fairseq==0.12.2
pip install --upgrade pip
```
## Model
<div align=center> <img width="98%" src="assets/framework.png"/> </div>

Lyra supports multi-modal inputs. When the data contains a speech modality, we use the latent cross-modality regularizer to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, multi-modality LoRA and latent multi-modality extraction modules operate synergistically, facilitating the simultaneous generation of both speech and text outputs.
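The flow described above can be sketched in a few lines. This is purely illustrative: every module name, dictionary key, and signature below is a hypothetical stand-in, not Lyra's actual API.

```python
# Hypothetical sketch of the multi-modal flow described above; all names
# here are illustrative stand-ins, not Lyra's real modules.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class OmniOutput:
    text_tokens: List[int]
    speech_units: Optional[List[int]]  # None when speech output is not requested

def omni_forward(
    encoders: Dict[str, Callable],
    projectors: Dict[str, Callable],
    llm: Callable,
    inputs: Dict[str, object],
    generate_speech: bool = True,
) -> OmniOutput:
    # 1. Each modality passes through its own encoder and projector.
    embeddings = []
    for modality, data in inputs.items():
        features = encoders[modality](data)
        embeddings.append(projectors[modality](features))
    # 2. The LLM (with multi-modality LoRA applied) consumes the joint sequence.
    hidden = llm(embeddings)
    # 3. Latent multi-modality extraction yields speech units alongside text tokens.
    speech = hidden["speech"] if generate_speech else None
    return OmniOutput(text_tokens=hidden["text"], speech_units=speech)
```

The point of the sketch is the per-modality encoder/projector stage feeding one shared LLM, with speech units produced only when speech output is requested.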
We provide all our fully finetuned models:
| Model | Base LLM | Vision Encoder | Speech Encoder | Projector | Full CKPT |
| --- | --- | --- | --- | --- | --- |
| Lyra_Mini_3B | Qwen2VL_2B_LLM | Qwen2VL_2B_ViT | whisper-large-v3-turbo | 3B_proj | 3B_ckpt |
| Lyra_Base_9B | Qwen2VL_7B_LLM | Qwen2VL_7B_ViT | whisper-large-v3 | 9B_proj | 9B_ckpt |
| Lyra_Pro_74B | Qwen2VL_70B_LLM | Qwen2VL_70B_ViT | whisper-large-v3 | 74B_proj | 74B_ckpt |
## Preparation
### Training Data
We provide the processed data for model training. All speech-related training data can be downloaded from Lyra-Data.
For the model pretraining data, please download the following multi-modality training data and organize it as:
⇒ means put the data in the local folder. The pretraining JSON file can be downloaded from Lyra_Pretrain.

- LibriSpeech ⇒ `data/Lyra_Pretrain/LibriSpeech`, ⇒ `data/Lyra_SFT/multi_modality_speech/LibriSpeech`, and ⇒ `data/Lyra_Eval/LibriSpeech`; download all training and development data.
- Common Voice ⇒ `data/Lyra_Pretrain/CommonVoice`; download the English Common Voice Corpus.
During the pretraining process, we filtered out some noisy and short audio speech data.
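The exact filtering criteria are not spelled out here; as a rough illustration, a duration-based filter over WAV files might look like the sketch below. The `min_sec`/`max_sec` thresholds are hypothetical, not the values used for Lyra.

```python
# Hypothetical duration-based filter for speech clips (not Lyra's actual rule).
import wave

def speech_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def keep_utterance(path: str, min_sec: float = 1.0, max_sec: float = 30.0) -> bool:
    """Keep only utterances within a plausible duration range."""
    duration = speech_duration(path)
    return min_sec <= duration <= max_sec
```

A noise filter (e.g. based on signal energy or an ASR confidence score) would sit alongside this length check, but that part is dataset-specific.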
For the image part of the finetuning data, similar to Mini-Gemini, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- COCO train2017 ⇒ `data/Lyra_SFT/multi_modality_image/coco`
- GQA ⇒ `data/Lyra_SFT/multi_modality_image/gqa`
- OCR-VQA (we save all files as `.jpg`) ⇒ `data/Lyra_SFT/multi_modality_image/ocr_vqa`
- TextVQA (not included for training) ⇒ `data/Lyra_SFT/multi_modality_image/textvqa`
- VisualGenome part1, VisualGenome part2 ⇒ `data/Lyra_SFT/multi_modality_image/vg`
- ShareGPT4V-100K ⇒ `data/Lyra_SFT/multi_modality_image/sam,share_textvqa,wikiart, ...`
- LAION GPT4V ⇒ `data/Lyra_SFT/multi_modality_image/gpt4v-dataset`
- ALLaVA Instruction ⇒ `data/Lyra_SFT/multi_modality_image/ALLaVA-4V`
- DocVQA ⇒ `data/Lyra_SFT/multi_modality_image/docvqa`
- ChartQA ⇒ `data/Lyra_SFT/multi_modality_image/chartqa`
- DVQA ⇒ `data/Lyra_SFT/multi_modality_image/dvqa`
- AI2D ⇒ `data/Lyra_SFT/multi_modality_image/ai2d`
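With this many directories it is easy to miss one, so a quick sanity check of the layout can save a failed training run. The helper below is a hypothetical sketch (the function name and `root` argument are illustrative; the folder names are taken from the list above, omitting the multi-folder ShareGPT4V entry):

```python
# Hypothetical layout check for the image finetuning data described above.
from pathlib import Path

EXPECTED_IMAGE_DIRS = [
    "coco", "gqa", "ocr_vqa", "textvqa", "vg",
    "gpt4v-dataset", "ALLaVA-4V", "docvqa", "chartqa", "dvqa", "ai2d",
]

def missing_image_dirs(root: str = ".") -> list:
    """Return the expected image-data folders that are absent under root."""
    base = Path(root) / "data" / "Lyra_SFT" / "multi_modality_image"
    return [name for name in EXPECTED_IMAGE_DIRS if not (base / name).is_dir()]
```

Running `missing_image_dirs()` from the repository root before launching training reports any folders that still need to be downloaded.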
For the audio part of the finetuning data, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- Lyra_MultiModal ⇒ `data/Lyra_SFT/multi_modality_speech/Lyra_MM`

For reproduction details, please refer to the Lyra multi-modality preparation.
For the long-speech audio finetuning data, please download the following instruction data and organize it as:
⇒ means put the data in the local folder.
- Lyra_LongSpeech ⇒ `data/Lyra_SFT/long_speech/Lyra_LongSpeech`

For reproduction details, please refer to the Lyra long-speech preparation.
For the text-speech generation data, please download the following instruction data and organize it as: