FALCON
[ICCV 2025] Official repository of "FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers"
<sup>1</sup>Harbin Institute of Technology, Shenzhen<br> <sup>2</sup>Huawei Noah's Ark Lab<br> †Corresponding author
</div> </div>If you find this work useful for your research, please kindly cite our paper and star our repo.
Updates
- [01/2026] :fire: The extended paper of FALCON++ is released on TechRxiv.
- [12/2025] :fire: Checkpoint released. Enjoy it!
- [07/2025] :fire: The code and project page are released. Enjoy it!
- [06/2025] :fire: The arXiv paper is updated.
- [06/2025] FALCON is accepted to ICCV 2025!
- [01/2025] arXiv paper released.
Introduction
This is the official GitHub repository for "FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers". In this work, we propose FALCON, a model that introduces a novel visual register technique to address both visual redundancy and fragmentation in the high-resolution visual encoding of multimodal large language models (MLLMs).
<div align="center"> <img src='assets/FALCON_arch.png' width='100%'> </div>
Installation
- Clone this repository and navigate to the folder
git clone git@github.com:JiuTian-VL/JiuTian-FALCON.git
cd JiuTian-FALCON
- Install Package
conda create -n falcon python=3.10 -y
conda activate falcon
pip install --upgrade pip
pip install -e .
- Install additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
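After installation, a quick sanity check can confirm that the environment is usable before training. The sketch below is not part of the repository. It assumes that the editable install exposes a top-level package named jiutian (as the import in Quick Start suggests) and that flash-attn provides a module named flash_attn. The helper name check_environment is hypothetical.

```python
import importlib.util
import sys


def check_environment(modules=("jiutian", "flash_attn")):
    """Report the Python version and which of the given modules can be found.

    The default module names are assumptions: `jiutian` is taken from the
    import used in Quick Start, and `flash_attn` is the import name of the
    flash-attn package.
    """
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A value of False for either module suggests the corresponding pip install step did not complete in the active conda environment.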
Quick Start
The JiutianHDInfer class in jiutian/eval/model_infer.py wraps model loading and inference behind a single interface.
The example below shows how to use it: construct a JiutianHDInfer instance, then call its inference method with an image path and a question to obtain the model's result.
from jiutian.eval.model_infer import JiutianHDInfer
model_infer = JiutianHDInfer(
    model_path='/path/to/ckpt',
    model_base=None,  # or '/path/to/base_ckpt'
    conv_mode='llama_3_1',
)
image_file = '/path/to/image'
question = 'question'
model_infer.inference(image_file, question)
Evaluations
See docs/Evaluation.md for details.
Citation
If you find this work useful for your research, please kindly cite our paper:
@inproceedings{zhang2025falcon,
title={Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers},
author={Zhang, Renshan and Shao, Rui and Chen, Gongwei and Zhang, Miao and Zhou, Kaiwen and Guan, Weili and Nie, Liqiang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={23530--23540},
year={2025}
}
