Lyra

[ICCV 2025] Official Implementation for "Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition"

<img src="assets/lyra.svg" alt="icon" width="30" height="30"> <span style="font-size:30px;">Lyra: An Efficient and Speech-Centric Framework <br>for Omni-Cognition</span>

<a href='https://huggingface.co/papers/2412.09501'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Discussion-orange'></a> <a href='https://huggingface.co/collections/zszhong/lyra-model-674ea5bb3b39ff8f15de75fc'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/zszhong/lyra-data-675d80fbab80334eb52cdd82'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a> <a href='https://huggingface.co/collections/zszhong/lyra-evaluation-675d7f038747ba865932a149'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Evaluation-yellow'></a><br> <a href='https://arxiv.org/pdf/2412.09501.pdf'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a> <a href='https://www.youtube.com/watch?v=7kh-M0jmmtI'><img src='https://img.shields.io/badge/Video-YouTube-red'></a> <a href='https://103.170.5.190:17860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://lyra-omni.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>

Overview of Lyra:

<div align=center> <img width="98%" src="assets/overview.png"/> </div>

Compared with leading omni-models, Lyra is:

  1. Stronger in performance: it achieves SOTA results across a variety of speech-centric tasks.
  2. More versatile: it supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
  3. More efficient: it needs less training data and supports faster training and inference.

Release

Contents

Demo

We provide a video demo here for a better experience and illustration. More examples can be found on our project page, and feel free to try our online demo! Because of the computing cost, the GPU memory of the demo machine (GeForce RTX 3090), and upload storage limits, the online demo does not currently support the long-speech function. 😰

❗❗❗For the online demo, start by selecting the instruction type (either speech or text) in the top-left corner.

<p align="center" width="98%"> <a href="https://youtu.be/7kh-M0jmmtI" target="_blank"> <img src="https://raw.githubusercontent.com/dvlab-research/Lyra/main/assets/video.png" alt="Lyra" style="width: 98%; min-width: 300px; display: block; margin: auto;"> </a> </p>

Install

Please follow the instructions below to install the required packages.

  1. Clone this repository:

         git clone https://github.com/dvlab-research/Lyra.git

  2. Install the package:

         conda create -n lyra python=3.10 -y
         conda activate lyra
         cd Lyra
         pip install --upgrade pip
         pip install -e .

  3. Install optional packages for simultaneous text-speech generation:

         pip install pip==24.0
         pip install fairseq==0.12.2
         pip install --upgrade pip
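After the optional step, it can be useful to confirm that the pinned `fairseq` version is the one actually installed in the `lyra` environment. The following is a minimal sketch using only the Python standard library. The helper name `check_optional_packages` is hypothetical and not part of the Lyra codebase.

```python
from importlib import metadata

# Version pinned by the optional install step above.
EXPECTED = {"fairseq": "0.12.2"}


def check_optional_packages(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            installed = None
        report[name] = (installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_optional_packages().items():
        status = "ok" if ok else "missing or version mismatch"
        print(f"{name}: installed={installed} -> {status}")
```

Run it inside the activated `lyra` environment so that the check reflects the packages that training and inference will actually use.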

Model

<div align=center> <img width="98%" src="assets/framework.png"/> </div>

Lyra supports multi-modal inputs. When the data contains a speech modality, we use the latent cross-modality regularizer to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, multi-modality LoRA and latent multi-modality extraction modules operate synergistically, facilitating the simultaneous generation of both speech and text outputs.

We provide all our fully finetuned models:

| Model | Base LLM | Vision Encoder | Speech Encoder | Projector | Full CKPT |
| ------------ | --------------- | --------------- | ---------------------- | -------- | -------- |
| Lyra_Mini_3B | Qwen2VL_2B_LLM | Qwen2VL_2B_ViT | whisper-large-v3-turbo | 3B_proj | 3B_ckpt |
| Lyra_Base_9B | Qwen2VL_7B_LLM | Qwen2VL_7B_ViT | whisper-large-v3 | 9B_proj | 9B_ckpt |
| Lyra_Pro_74B | Qwen2VL_70B_LLM | Qwen2VL_70B_ViT | whisper-large-v3 | 74B_proj | 74B_ckpt |

Preparation

Training Data

We provide the processed data for model training. All speech-related training data can be downloaded from Lyra-Data.

For model pretraining, please download the following multi-modality training data and organize it as follows:

⇒ means put the data in the local folder. The pretraining json file can be downloaded from Lyra_Pretrain.

  • LibriSpeech ⇒ data/Lyra_Pretrain/LibriSpeech and ⇒ data/Lyra_SFT/multi_modality_speech/LibriSpeech and ⇒ data/Lyra_Eval/LibriSpeech: download all training and development data.

  • Common Voice ⇒ data/Lyra_Pretrain/CommonVoice: download the English Common Voice Corpus.
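LibriSpeech is expected in three separate folders. One way to avoid storing three copies is to download it once and symlink it into each location. The sketch below assumes that the training and evaluation code follows symlinks, which this README does not confirm; if it does not, copy the data instead. The helper name `link_dataset` is hypothetical.

```python
from pathlib import Path

# Target folders listed above for LibriSpeech (relative to the repository root).
LIBRISPEECH_TARGETS = [
    "data/Lyra_Pretrain/LibriSpeech",
    "data/Lyra_SFT/multi_modality_speech/LibriSpeech",
    "data/Lyra_Eval/LibriSpeech",
]


def link_dataset(source, repo_root, targets=LIBRISPEECH_TARGETS):
    """Symlink one downloaded dataset folder into every target location.

    Assumption: downstream code follows symlinks. Existing targets are left
    untouched, so the function is safe to re-run.
    """
    source = Path(source).resolve()
    created = []
    for rel in targets:
        target = Path(repo_root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        # Skip anything already present, including dangling symlinks.
        if not target.exists() and not target.is_symlink():
            target.symlink_to(source, target_is_directory=True)
            created.append(target)
    return created
```

For example, `link_dataset("/path/to/downloads/LibriSpeech", ".")` run from the repository root would create the three links; the download path here is a placeholder.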

During the pretraining process, we filtered out some noisy and short audio speech data.
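The exact filtering criteria are not stated here. As an illustration of the "short audio" part only, the sketch below drops clips under a minimum duration. The threshold `min_seconds=1.0` and the helper name `filter_short_clips` are hypothetical and are not the values or code used for Lyra; noise filtering is not shown.

```python
def filter_short_clips(clips, min_seconds=1.0):
    """Keep (path, duration_seconds) pairs that last at least `min_seconds`.

    `min_seconds` is a hypothetical threshold chosen for illustration.
    """
    return [(path, dur) for path, dur in clips if dur >= min_seconds]


# Toy metadata: (file name, duration in seconds).
clips = [("a.flac", 0.4), ("b.flac", 3.2), ("c.flac", 12.5)]
print(filter_short_clips(clips))
# → [('b.flac', 3.2), ('c.flac', 12.5)]
```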

For the image part of the finetuning data, similar to Mini-Gemini, please download the following instruction data and organize it as follows:

⇒ means put the data in the local folder.

  • COCO train2017 ⇒ data/Lyra_SFT/multi_modality_image/coco
  • GQA ⇒ data/Lyra_SFT/multi_modality_image/gqa
  • OCR-VQA (we save all files as .jpg) ⇒ data/Lyra_SFT/multi_modality_image/ocr_vqa
  • TextVQA (not included for training) ⇒ data/Lyra_SFT/multi_modality_image/textvqa
  • VisualGenome part1, VisualGenome part2 ⇒ data/Lyra_SFT/multi_modality_image/vg
  • ShareGPT4V-100K ⇒ data/Lyra_SFT/multi_modality_image/sam, share_textvqa, wikiart, ...
  • LAION GPT4V ⇒ data/Lyra_SFT/multi_modality_image/gpt4v-dataset
  • ALLaVA Instruction ⇒ data/Lyra_SFT/multi_modality_image/ALLaVA-4V
  • DocVQA ⇒ data/Lyra_SFT/multi_modality_image/docvqa
  • ChartQA ⇒ data/Lyra_SFT/multi_modality_image/chartqa
  • DVQA ⇒ data/Lyra_SFT/multi_modality_image/dvqa
  • AI2D ⇒ data/Lyra_SFT/multi_modality_image/ai2d
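With this many folders, a quick layout check before launching finetuning can catch a misplaced dataset early. The sketch below only tests that each folder named above exists; it does not validate the contents. The helper name `missing_image_dirs` is hypothetical, and the ShareGPT4V-100K entry covers only the three subfolders spelled out above, since the rest of that list is elided.

```python
from pathlib import Path

# Folder names taken from the list above, under data/Lyra_SFT/multi_modality_image.
IMAGE_DIRS = [
    "coco", "gqa", "ocr_vqa", "textvqa", "vg",
    "sam", "share_textvqa", "wikiart",  # ShareGPT4V-100K (remaining folders elided above)
    "gpt4v-dataset", "ALLaVA-4V", "docvqa", "chartqa", "dvqa", "ai2d",
]


def missing_image_dirs(repo_root, names=IMAGE_DIRS):
    """Return the expected image folders that do not exist under repo_root."""
    base = Path(repo_root) / "data" / "Lyra_SFT" / "multi_modality_image"
    return [name for name in names if not (base / name).is_dir()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_image_dirs(".")
    print("all image folders present" if not missing else f"missing: {missing}")
```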

For the audio part of the finetuning data, please download the following instruction data and organize it as follows:

⇒ means put the data in the local folder.

For the long-speech audio finetuning data, please download the following instruction data and organize it as follows:

⇒ means put the data in the local folder.

For the text-speech generation data, please download the following the instructio
