SkillAgentSearch skills...

V1

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Install / Use

/learn @jun297/V1
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

v1: Learning to Point Visual Tokens <br> for Multimodal Grounded Reasoning

<p align="left"> <a href='https://jiwanchung.github.io/' target='_blank'>Jiwan Chung<sup>*</sup></a>&emsp; <a href='https://junhyeok.kim/' target='_blank'>Junhyeok Kim<sup>*</sup></a>&emsp; <a href='https://scholar.google.com/citations?user=w3hOuRoAAAAJ' target='_blank'>Siyeol Kim</a>&emsp; <a href='https://jaeyoung-l.github.io/' target='_blank'>Jaeyoung Lee</a>&emsp; <a href="https://scholar.google.com/citations?user=Og3gN_AAAAAJ" target='_blank'>Minsoo Kim</a>&emsp; <a href='https://mirlab.yonsei.ac.kr/' target='_blank'>Youngjae Yu</a> </p>

arXiv Model Data

<p align="center"> <img src="assets/figure.png"> </p>

Installation

conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Demo

Gradio Web UI

Highly Recommended as the copy tokens are displayed on image.

<p align="center"> <img src="assets/demo.png"> </p>
python run_gradio.py

Inference

python inference.py

The script uses a default image URL and text prompt. To use your own inputs, you can modify the image variable within the messages list and the text field for the user prompt.

Data

We have released a 100-item sample of our v1g dataset on the Hugging Face Hub. You can load it easily using the datasets library:

from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")

Coming Soon

  • [x] Inference code
  • [x] Training data sample
  • [ ] Training data
  • [ ] Evaluation code
  • [ ] Training code

Citation

If you find our work valuable, please cite:

@misc{chung2025v1learningpointvisual,
      title={v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning}, 
      author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
      year={2025},
      eprint={2505.18842},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18842}, 
}

Related Skills

View on GitHub
GitHub Stars19
CategoryEducation
Updated29d ago
Forks1

Languages

Python

Security Score

75/100

Audited on Mar 10, 2026

No findings