<div align="center"> <h1> [AAAI2025] ChatterBox </h1> <h3><img src="assets/chatterbox_logo_1.png" alt="Alt text for the image" width="25" height="25"> ChatterBox: Multi-round Multimodal Referring and Grounding</h3>

Yunjie Tian*<sup>1</sup>, Tianren Ma*<sup>1</sup>, Lingxi Xie<sup>2</sup>, Jihao Qiu<sup>1</sup>, Xi Tang<sup>1</sup>, Yuan Zhang<sup>1</sup>, Jianbin Jiao<sup>1</sup>, Qi Tian<sup>2</sup>, Qixiang Ye<sup>1</sup>

<sup>1</sup> University of Chinese Academy of Sciences, <sup>2</sup> HUAWEI Inc.

Paper: arXiv:2401.13307

</div>

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions.

Overview

<p align="center"> <img src="assets/figure-structure-1.jpg" width="80%"></a> <br> The architecture of the ChatterBox model. </p>

Key Contributions:

  • CB-300K - We establish the CB-300K benchmark to facilitate research on multi-round referring and grounding.
  • ChatterBox Model - We build the ChatterBox model with a dual-branch architecture to solve the multi-round referring and grounding problem.
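The dual-branch data flow described in the abstract (region boxes tokenized for the language branch; a query embedding handed to the vision branch through a token receiver for grounding) can be sketched roughly as below. This is an illustrative sketch only: the matrices, dimensions, and function names are random stand-ins, not the actual ChatterBox weights or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative projection matrices (random stand-ins for learned weights).
W_region = rng.normal(size=(4, 64))     # box (x1, y1, x2, y2) -> language-space token
W_receiver = rng.normal(size=(64, 32))  # query embedding -> vision-branch feature
W_ground = rng.normal(size=(32, 4))     # vision feature -> predicted box

def refer_and_ground(boxes, query_embedding):
    """boxes: (N, 4) array of regions; query_embedding: (64,) array."""
    # Referring: each instance region is tokenized so the language
    # branch can perceive referential information.
    region_tokens = boxes @ W_region          # (N, 64)
    # Grounding: a language-side query embedding crosses into the
    # vision branch via the token receiver, and a head predicts a box.
    vis_feat = query_embedding @ W_receiver   # (32,)
    pred_box = vis_feat @ W_ground            # (4,)
    return region_tokens, pred_box
```

The point of the sketch is only the direction of information flow between the two branches; the real model uses a vision backbone (GroundingDINO) and an LLM, not linear maps.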

Updates

  • Jan. 24th, 2024: The paper, code, and dataset are released.

Install

  1. Clone this repository and navigate to the ChatterBox folder
git clone https://github.com/sunsmarterjie/ChatterBox
cd ChatterBox
  2. Install packages
conda create -n chatterbox python=3.11.5 
conda activate chatterbox
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
pip install deepspeed==0.11.1
unzip mmcv-1.4.7.zip
cd mmcv-1.4.7/
MMCV_WITH_OPS=1 pip install -e .
cd ../model/GroundingDINO/ops
python setup.py build install

Train

We build the visual branch of ChatterBox on GroundingDINO and DINO; the GroundingDINO version is provided for now.

  • Prepare datasets/models:

Download CB-300K, VG, COCO2017, COCO2014, RefCOCO, RefCOCO+, RefCOCOg, Flickr30K, OpenSource, clip-vit-large-patch14, LLaVA-Instruct-150K, llava-llama-2-13b, CB-materials, groundingdino_swinb.

├── datasets
|   ├── CB-300K
|   |    ├── CB-MRG
|   |    ├── CB-LC
│   │    └── ...
|   ├── VG
|   |    ├── VG_100K
|   |    ├── VG_100K_2
│   │    └── ...
│   ├── MSCOCO2017
|   |    ├── train2017
│   │    └── ...
│   ├── MSCOCO2014
|   |    ├── train2014
│   │    └── ...
│   ├── Flickr30K
|   |    ├── flickr30k-images
│   │    └── ...
│   ├── llava_instruct_150k.json
|   ├── CB_materials
|            ├── CB-refcoco-GND
|            ├── CB-coco-GND
|            ├── CB-refcoco-REF
│            └── ...
│── clip-vit-large-patch14
|             ├── config.json
│             └── ...
│── llava-llama-2-13b-chat-lightning-preview
|                      ├── config.json
│                      └── ...
│── OpenSource
|        ├── finetune_refcoco_train.json
|        ├── finetune_refcoco+_train.json
│        └── ...
├── groundingdino_swinb_cogcoor.pth

  • Train ChatterBox on 8xA800 GPUs (80GB).
python startup_stage1.py  # stage1
python startup_stage2.py  # stage2

Evaluation

See details at evaluation.

Citation

If this project has been helpful or if you've used our dataset, please cite:

@inproceedings{tian2025chatterbox,
  title={ChatterBox: Multimodal Referring and Grounding with Chain-of-Questions},
  author={Tian, Yunjie and Ma, Tianren and Xie, Lingxi and Ye, Qixiang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={7},
  pages={7401--7409},
  year={2025}
}

Acknowledgment

This project is based on LLaVA (paper, code), LISA (paper, code), and GPT4RoI (paper, code); thanks for their excellent work.
