# v1: Learning to Point Visual Tokens <br> for Multimodal Grounded Reasoning
<p align="left"> <a href='https://jiwanchung.github.io/' target='_blank'>Jiwan Chung<sup>*</sup></a>  <a href='https://junhyeok.kim/' target='_blank'>Junhyeok Kim<sup>*</sup></a>  <a href='https://scholar.google.com/citations?user=w3hOuRoAAAAJ' target='_blank'>Siyeol Kim</a>  <a href='https://jaeyoung-l.github.io/' target='_blank'>Jaeyoung Lee</a>  <a href="https://scholar.google.com/citations?user=Og3gN_AAAAAJ" target='_blank'>Minsoo Kim</a>  <a href='https://mirlab.yonsei.ac.kr/' target='_blank'>Youngjae Yu</a> </p>

<p align="center"> <img src="assets/figure.png"> </p>

## Installation
```bash
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
## Demo

### Gradio Web UI
Highly recommended: the copy tokens are displayed directly on the image.
<p align="center"> <img src="assets/demo.png"> </p>

```bash
python run_gradio.py
```
## Inference

```bash
python inference.py
```
The script uses a default image URL and text prompt. To use your own inputs, modify the `image` entry in the `messages` list and the `text` field of the user prompt.
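As a rough guide, the `messages` structure typically follows the Qwen-VL-style chat format; the sketch below is hypothetical (the URL, prompt, and exact keys are placeholders — check `inference.py` for the actual names):

```python
# Hypothetical sketch of custom inputs for inference.py.
# Replace the placeholder URL and prompt with your own; the key names
# ("type", "image", "text") follow the common Qwen-VL messages format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/your_image.jpg"},
            {"type": "text", "text": "Describe the objects in this image."},
        ],
    }
]
```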
## Data

We have released a 100-item sample of our v1g dataset on the Hugging Face Hub. You can load it easily using the `datasets` library:
```python
from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")
```
## Coming Soon
- [x] Inference code
- [x] Training data sample
- [ ] Training data
- [ ] Evaluation code
- [ ] Training code
## Citation

If you find our work valuable, please cite:
```bibtex
@misc{chung2025v1learningpointvisual,
      title={v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning},
      author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
      year={2025},
      eprint={2505.18842},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18842},
}
```