EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
📄 Technical Report (arXiv) • 🤗 Model (EchoInk-R1-7B) • 🤗 Dataset (AVQA-R1-6K)
Overview
EchoInk-R1 is the first general framework for unified audio-visual reasoning via reinforcement learning, built upon Qwen2.5-Omni-7B and optimized using Group Relative Policy Optimization (GRPO). It supports structured reasoning over synchronized audio-image inputs through multiple-choice question answering.
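The core idea of GRPO can be sketched in a few lines: for each question, a group of candidate responses is sampled, and each response's reward is normalized against the group's mean and standard deviation to form its advantage. This is a minimal illustration of the group-relative normalization only, not the repo's actual training loop:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's statistics (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one question, rewarded 1.0 if correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and are reinforced; below-average responses are suppressed, with no learned value function needed.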
We introduce AVQA-R1-6K, a dataset derived from OmniInstruct-v1, comprising:
- 4,490 training samples
- 1,911 validation samples
- Each sample includes a synchronized audio-image pair with a multiple-choice question and four options.
Beyond our core study, EchoInk-R1 provides an extensible RL fine-tuning framework for Qwen2.5-Omni, enabling easy adaptation to new multimodal reasoning tasks with minimal modifications.
Performance
EchoInk-R1-7B achieves 85.77% accuracy on the AVQA-R1-6K validation set, surpassing the base Qwen2.5-Omni-7B model (80.53%) using only 562 RL steps.
All code, models, and data are released to support transparency and reproducibility.
News
- [2025/05/08] Released the AVQA-R1-6K dataset, EchoInk-R1-7B model, full training & evaluation pipeline, and technical report.
Highlights
- Built on Qwen2.5-Omni-7B with GRPO-based RL
- Supports audio, image, video, and text modalities
- Provides a complete pipeline: dataset, training, and evaluation
Reflective Reasoning: Aha Moments
During training, EchoInk-R1 exhibits reflective reasoning behaviors, where it revisits initial assumptions and refines its responses under ambiguous multimodal cues. These "aha moments" reveal its capacity for belief revision and deeper cross-modal understanding.
<p align="center"> <img src="./images/case_1.png" width="650px" alt="Case 1 reasoning" /> </p> <hr style="width:650px; height:3px; background-color:#666; border:none; margin: 20px auto;" /> <p align="center"> <img src="./images/case_3.png" width="650px" alt="Case 2 reasoning" /> </p>

Learning Dynamics
- Accuracy reward steadily improves throughout training, indicating that GRPO effectively guides the model toward more accurate and reasoned outputs.
- Completion length exhibits a two-phase trend: an initial increase as the model explores elaborated reasoning, followed by a gradual decline toward more concise and efficient answers.
- Format reward converges rapidly, showing that the model quickly internalizes the required response structure.
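The accuracy and format rewards above can be illustrated with a minimal sketch. The exact reward implementations live in the training code; the `<think>`/`<answer>` tag convention below is an assumption borrowed from common R1-style setups, not a confirmed detail of this repo:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows a <think>...</think><answer>...</answer> structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted option letter matches the ground-truth choice."""
    m = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if m and m.group(1) == gold else 0.0

out = "<think>The barking sound matches the dog in the image.</think><answer>B</answer>"
```

The format reward converging quickly is expected under such a scheme: a regex-checkable structure is far easier to satisfy than producing the correct answer.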
Setup & Installation
Environment Setup
git clone https://github.com/HarryHsing/EchoInk
cd EchoInk
conda create -n echoink-r1 python=3.11
conda activate echoink-r1
bash setup.sh
Download Dataset
To download and extract the AVQA-R1-6K dataset:
git lfs install
git clone https://huggingface.co/datasets/harryhsing/AVQA-R1-6K
cd AVQA-R1-6K
tar -xzvf AVQA_R1.tar.gz
Dataset Structure
AVQA_R1/
├── train/
│   ├── audios/
│   ├── images/
│   └── omni_rl_format_train.json
└── valid/
    ├── audios/
    ├── images/
    └── omni_rl_format_valid.json
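A quick sanity check after extraction can confirm the layout and sample counts. The directory and file names follow the tree above; the assumption that each JSON file holds a flat list of samples is illustrative:

```python
import json
import os

def check_layout(root: str) -> dict:
    """Verify the extracted AVQA_R1 layout and return per-split sample counts.

    Assumes each omni_rl_format_*.json file is a list of samples.
    """
    counts = {}
    for split, json_name in [("train", "omni_rl_format_train.json"),
                             ("valid", "omni_rl_format_valid.json")]:
        for sub in ("audios", "images"):
            if not os.path.isdir(os.path.join(root, split, sub)):
                raise FileNotFoundError(f"missing {split}/{sub}/")
        with open(os.path.join(root, split, json_name)) as f:
            counts[split] = len(json.load(f))
    return counts

# Usage: check_layout("AVQA_R1") should report 4,490 train / 1,911 valid samples.
```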
Training
Download Qwen2.5-Omni-7B Model
First, download the base model: Qwen2.5-Omni-7B
Modify config.json of Qwen2.5-Omni-7B to include "hidden_size": 3584 at the root level.
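Rather than editing the file by hand, the change can be applied with a short script (the path below is wherever you downloaded the model):

```python
import json

def add_hidden_size(config_path: str, hidden_size: int = 3584) -> None:
    """Insert "hidden_size" at the root level of config.json, as required above."""
    with open(config_path) as f:
        config = json.load(f)
    config["hidden_size"] = hidden_size
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

# Usage: add_hidden_size("Qwen2.5-Omni-7B/config.json")
```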
Launch GRPO Training
bash ./src/scripts/run_grpo_image_audio_avqa.sh
💡 Set `per_device_train_batch_size=1` as in previous R1-V setups.
💡 To use custom data, follow the JSON format in `./src/make_omniInstruct_r1_dataset.py` for audio–image or audio–video tasks.
💡 See Qwen2.5-Omni issue #205 if you run into a dtype mismatch error.
⚙️ Trained on 8×A100 (80G) GPUs; also supported on 4×A100 (80G).
Evaluation
Evaluate on the AVQA-R1-6K validation set:
python ./src/omniInstruct-v1_eval_valid.py # Run the model on the validation set
python ./src/omniInstruct-v1_cal_metrics_valid.py # Compute accuracy
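Conceptually, the metrics script reduces to comparing predicted option letters against the ground truth. This is a sketch of that computation, not the actual script's schema:

```python
def accuracy(predictions, references):
    """Fraction of multiple-choice answers matching the ground truth."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

acc = accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])  # 3 of 4 correct
```

The reported 85.77% for EchoInk-R1-7B corresponds to this ratio over the 1,911 validation samples.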
Acknowledgements
We thank the open-source community. This work builds on Qwen2.5-Omni, Video-R1, Open-R1-Video, R1-V, and DeepSeek-R1.
Citation
If you find EchoInk-R1 useful, please cite:
@article{xing2025echoink,
title={{EchoInk-R1}: Exploring Audio-Visual Reasoning in Multimodal {LLMs} via Reinforcement Learning},
author={Zhenghao Xing and Xiaowei Hu and Chi-Wing Fu and Wenhai Wang and Jifeng Dai and Pheng-Ann Heng},
year={2025},
journal={arXiv preprint arXiv:2505.04623}
}
