EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
📄 Technical Report (arXiv) • 🤗 Model (EchoInk-R1-7B) • 🤗 Dataset (AVQA-R1-6K)
Overview
EchoInk-R1 is the first general framework for unified audio-visual reasoning via reinforcement learning, built upon Qwen2.5-Omni-7B and optimized using Group Relative Policy Optimization (GRPO). It supports structured reasoning over synchronized audio-image inputs through multiple-choice question answering.
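The core idea of GRPO can be sketched in a few lines: for each question, a group of candidate responses is sampled, and each response's reward is normalized against the group's mean and standard deviation to form its advantage. This is a minimal illustration of the group-relative normalization only, not the repo's actual training loop:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's statistics (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one question, rewarded 1.0 if correct.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and are reinforced; below-average responses are suppressed, with no learned value function needed.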
We introduce AVQA-R1-6K, a dataset derived from OmniInstruct-v1, comprising:
- 4,490 training samples
- 1,911 validation samples
- Each sample includes a synchronized audio-image pair with a multiple-choice question and four options.
Beyond our core study, EchoInk-R1 provides an extensible RL fine-tuning framework for Qwen2.5-Omni, enabling easy adaptation to new multimodal reasoning tasks with minimal modifications.
Performance
EchoInk-R1-7B achieves 85.77% accuracy on the AVQA-R1-6K validation set, surpassing the base Qwen2.5-Omni-7B model (80.53%) using only 562 RL steps.
All code, models, and data are released to support transparency and reproducibility.
News
- [2025/05/08] Released the AVQA-R1-6K dataset, EchoInk-R1-7B model, full training & evaluation pipeline, and technical report.
Highlights
- Built on Qwen2.5-Omni-7B with GRPO-based RL
- Supports audio, image, video, and text modalities
- Provides a complete pipeline: dataset, training, and evaluation
Reflective Reasoning: Aha Moments
During training, EchoInk-R1 exhibits reflective reasoning behaviors, where it revisits initial assumptions and refines its responses under ambiguous multimodal cues. These "aha moments" reveal its capacity for belief revision and deeper cross-modal understanding.
<p align="center"> <img src="./images/case_1.png" width="650px" alt="Case 1 reasoning" /> </p> <hr style="width:650px; height:3px; background-color:#666; border:none; margin: 20px auto;" /> <p align="center"> <img src="./images/case_3.png" width="650px" alt="Case 2 reasoning" /> </p>

Learning Dynamics
- Accuracy reward steadily improves throughout training, indicating that GRPO effectively guides the model toward more accurate and reasoned outputs.
- Completion length exhibits a two-phase trend: an initial increase as the model explores elaborated reasoning, followed by a gradual decline toward more concise and efficient answers.
- Format reward converges rapidly, showing that the model quickly internalizes the required response structure.
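The accuracy and format rewards above can be illustrated with a minimal sketch. The exact reward implementations live in the training code; the `<think>`/`<answer>` tag convention below is an assumption borrowed from common R1-style setups, not a confirmed detail of this repo:

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion follows a <think>...</think><answer>...</answer> structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 if the extracted option letter matches the ground-truth choice."""
    m = re.search(r"<answer>\s*([A-D])", completion)
    return 1.0 if m and m.group(1) == gold else 0.0

out = "<think>The barking sound matches the dog in the image.</think><answer>B</answer>"
```

The format reward converging quickly is expected under such a scheme: a regex-checkable structure is far easier to satisfy than producing the correct answer.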
Setup & Installation
Environment Setup
git clone https://github.com/HarryHsing/EchoInk
cd EchoInk
conda create -n echoink-r1 python=3.11
conda activate echoink-r1
bash setup.sh
Download Dataset
To download and extract the AVQA-R1-6K dataset:
git lfs install
git clone https://huggingface.co/datasets/harryhsing/AVQA-R1-6K
cd AVQA-R1-6K
tar -xzvf AVQA_R1.tar.gz
Dataset Structure
AVQA_R1/
├── train/
│   ├── audios/
│   ├── images/
│   └── omni_rl_format_train.json
└── valid/
    ├── audios/
    ├── images/
    └── omni_rl_format_valid.json
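A quick sanity check after extraction can confirm the layout and sample counts. The directory and file names follow the tree above; the assumption that each JSON file holds a flat list of samples is illustrative:

```python
import json
import os

def check_layout(root: str) -> dict:
    """Verify the extracted AVQA_R1 layout and return per-split sample counts.

    Assumes each omni_rl_format_*.json file is a list of samples.
    """
    counts = {}
    for split, json_name in [("train", "omni_rl_format_train.json"),
                             ("valid", "omni_rl_format_valid.json")]:
        for sub in ("audios", "images"):
            if not os.path.isdir(os.path.join(root, split, sub)):
                raise FileNotFoundError(f"missing {split}/{sub}/")
        with open(os.path.join(root, split, json_name)) as f:
            counts[split] = len(json.load(f))
    return counts

# Usage: check_layout("AVQA_R1") should report 4,490 train / 1,911 valid samples.
```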
Training
Download Qwen2.5-Omni-7B Model
First, download the base model: Qwen2.5-Omni-7B
Modify config.json of Qwen2.5-Omni-7B to include "hidden_size": 3584 at the root level.
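Rather than editing the file by hand, the change can be applied with a short script (the path below is wherever you downloaded the model):

```python
import json

def add_hidden_size(config_path: str, hidden_size: int = 3584) -> None:
    """Insert "hidden_size" at the root level of config.json, as required above."""
    with open(config_path) as f:
        config = json.load(f)
    config["hidden_size"] = hidden_size
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

# Usage: add_hidden_size("Qwen2.5-Omni-7B/config.json")
```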
Launch GRPO Training
bash ./src/scripts/run_grpo_image_audio_avqa.sh
💡 Set `per_device_train_batch_size=1` as in previous R1-V setups.
💡 To use custom data, follow the JSON format in `./src/make_omniInstruct_r1_dataset.py` for audio–image or audio–video tasks.
💡 See Qwen2.5-Omni issue #205 if you run into a dtype mismatch error.
⚙️ Trained on 8×A100 (80G) GPUs; also supported on 4×A100 (80G).
Evaluation
Evaluate on the AVQA-R1-6K validation set:
python ./src/omniInstruct-v1_eval_valid.py # Run the model on the validation set
python ./src/omniInstruct-v1_cal_metrics_valid.py # Compute accuracy
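Conceptually, the metrics script reduces to comparing predicted option letters against the ground truth. This is a sketch of that computation, not the actual script's schema:

```python
def accuracy(predictions, references):
    """Fraction of multiple-choice answers matching the ground truth."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

acc = accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])  # 3 of 4 correct
```

The reported 85.77% for EchoInk-R1-7B corresponds to this ratio over the 1,911 validation samples.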
Acknowledgements
We thank the open-source community. This work builds on Qwen2.5-Omni, Video-R1, Open-R1-Video, R1-V, and DeepSeek-R1.
Citation
If you find EchoInk-R1 useful, please cite:
@article{xing2025echoink,
title={{EchoInk-R1}: Exploring Audio-Visual Reasoning in Multimodal {LLMs} via Reinforcement Learning},
author={Zhenghao Xing and Xiaowei Hu and Chi-Wing Fu and Wenhai Wang and Jifeng Dai and Pheng-Ann Heng},
year={2025},
journal={arXiv preprint arXiv:2505.04623}
}
