<div align="center"> <h1> [AAAI2025] ChatterBox </h1> <h3><img src="assets/chatterbox_logo_1.png" alt="Alt text for the image" width="25" height="25"> ChatterBox: Multi-round Multimodal Referring and Grounding</h3>

Yunjie Tian*<sup>1</sup>, Tianren Ma*<sup>1</sup>, Lingxi Xie<sup>2</sup>, Jihao Qiu<sup>1</sup>, Xi Tang<sup>1</sup>, Yuan Zhang<sup>1</sup>, Jianbin Jiao<sup>1</sup>, Qi Tian<sup>2</sup>, Qixiang Ye<sup>1</sup>

<sup>1</sup> University of Chinese Academy of Sciences, <sup>2</sup> HUAWEI Inc.

Paper: arXiv:2401.13307

</div>

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.

Overview

<p align="center"> <img src="assets/figure-structure-1.jpg" width="80%"></a> <br> The architecture of the ChatterBox model. </p>

Key Contributions:

  • CB-300K - We establish the CB-300K benchmark to facilitate research on multi-round referring and grounding.
  • ChatterBox Model - We build the ChatterBox model with a dual-branch architecture to solve the multi-round referring and grounding problem.
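The dual-branch data flow described in the abstract (region boxes tokenized for the language branch; a query embedding handed to the vision branch through a token receiver for grounding) can be sketched roughly as below. This is an illustrative sketch only: the matrices, dimensions, and function names are random stand-ins, not the actual ChatterBox weights or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative projection matrices (random stand-ins for learned weights).
W_region = rng.normal(size=(4, 64))     # box (x1, y1, x2, y2) -> language-space token
W_receiver = rng.normal(size=(64, 32))  # query embedding -> vision-branch feature
W_ground = rng.normal(size=(32, 4))     # vision feature -> predicted box

def refer_and_ground(boxes, query_embedding):
    """boxes: (N, 4) array of regions; query_embedding: (64,) array."""
    # Referring: each instance region is tokenized so the language
    # branch can perceive referential information.
    region_tokens = boxes @ W_region          # (N, 64)
    # Grounding: a language-side query embedding crosses into the
    # vision branch via the token receiver, and a head predicts a box.
    vis_feat = query_embedding @ W_receiver   # (32,)
    pred_box = vis_feat @ W_ground            # (4,)
    return region_tokens, pred_box
```

The point of the sketch is only the direction of information flow between the two branches; the real model uses a vision backbone (GroundingDINO) and an LLM, not linear maps.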

Updates

  • Jan. 24th, 2024: The paper, code, and dataset are released.

Install

  1. Clone this repository and navigate to the ChatterBox folder
git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox
  2. Install packages
conda create -n chatterbox python=3.11.5 
conda activate chatterbox
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install

Train

We build the visual branch of ChatterBox on GroundingDINO and DINO; the GroundingDINO version is provided for now.

  • Prepare datasets/models:

Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, groundingdino_swinb.

├── datasets
|   ├── CB-300K
|   |    ├── CB-MRG
|   |    ├── CB-LC
│   │    └── ...
|   ├── VG
|   |    ├── VG_100K
|   |    ├── VG_100K_2
│   │    └── ...
│   ├── MSCOCO2017
|   |    ├── train2017
│   │    └── ...
│   ├── MSCOCO2014
|   |    ├── train2014
│   │    └── ...
│   ├── Flickr30K
|   |    ├── flickr30k-images
│   │    └── ...
│   ├── llava_instruct_150k.json
|   ├── CB_materials
|            ├── CB-refcoco-GND
|            ├── CB-coco-GND
|            ├── CB-refcoco-REF
│            └── ...
│── clip-vit-large-patch14
|             ├── config.json
│             └── ...
│── llava-llama-2-13b-chat-lightning-preview
|                      ├── config.json
│                      └── ...
│── OpenSource
|        ├── finetune_refcoco_train.json
|        ├── finetune_refcoco+_train.json
│        └── ...
├── groundingdino_swinb_cogcoor.pth

  • Train ChatterBox on 8xA800 GPUs (80GB).
python startup_stage1.py  # stage1
python startup_stage2.py  # stage2

Evaluation

See details at evaluation.

Citation

If this project has been helpful or if you've used our dataset, please cite:

@inproceedings{tian2025chatterbox,
  title={ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions},
  author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Ye, Qixiang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={7},
  pages={7401--7409},
  year={2025}
}

Acknowledgment

This project is based on LLaVA (paper, code), LISA (paper, code), and GPT4RoI (paper, code); thanks for their excellent work.
