
LISA: Reasoning Segmentation via Large Language Model

<font size=7><div align='center'><b>LISA</b>: Large <b>L</b>anguage <b>I</b>nstructed <b>S</b>egmentation <b>A</b>ssistant</div></font>

<font size=7><div align='center'> <a href="https://arxiv.org/pdf/2308.00692.pdf"><strong>Paper</strong></a> | <a href="https://huggingface.co/xinlai"><strong>Models</strong></a> | <a href="#training"><strong>Training</strong></a> | <a href="#inference"><strong>Inference</strong></a> | <a href="#deployment"><strong>Local Deployment</strong></a> | <a href="#dataset"><strong>Dataset</strong></a> | <a href=""><strong>Online Demo</strong></a> | <a href="https://huggingface.co/collections/Senqiao/lisa-67713837a32d6abf516a162e"><strong>LISA++ Dataset</strong></a> | <a href="https://huggingface.co/Senqiao/LISA_Plus_7b"><strong>LISA++ Model</strong></a>

</div></font> <!-- <p align="center"> <img src="imgs/teaser.jpg" width="100%"> </p> --> <table class="center"> <tr> <td style="text-align:center;"><b>Input</b></td> <td style="text-align:center;"><b>Output</b></td> <td style="text-align:center;"><b>Input</b></td> <td style="text-align:center;"><b>Output</b></td> </tr> <tr> <td><img src="imgs/obama.jpg"></td> <td><img src="vis_output/obama.jpg"></td> <td><img src="imgs/trump.jpg"></td> <td><img src="vis_output/trump.jpg"></td> </tr> <tr> <td width=25% style="text-align:center;color:gray;">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain the reason."</td> <td width=25% style="text-align:center;">"Sure, the segmentation result is [SEG]. The President of the United States in the image is <ins>President Obama.</ins>”</td> <td width=25% style="text-align:center;color:gray">"Who was <ins>the president of the US</ins> in this image? Please output segmentation mask and explain why."</td> <td width=25% style="text-align:center;">"Sure, [SEG]. In the image, the President of the United States is <ins>President Trump.</ins>"</td> </tr> <tr> <td><img src="imgs/stand_higher.jpg"></td> <td><img src="vis_output/stand_higher.jpg"></td> <td><img src="imgs/camera_lens.jpg"></td> <td><img src="vis_output/camera_lens.jpg"></td> </tr> <tr> <td width=25% style="text-align:center;color:gray;">"What can <ins>make the woman stand higher</ins>? Please output segmentation mask and explain why."</td> <td width=25% style="text-align:center;">"Sure, [SEG]. 
The woman is <ins>standing higher by using a ladder</ins>..."</td> <td width=25% style="text-align:center;color:gray">"Can you segment <ins>the camera lens that is more suitable for photographing nearby objects</ins> in this image?"</td> <td width=25% style="text-align:center;">"Sure, [SEG]."</td> </tr> <tr> <td><img src="imgs/dog_with_horn.jpg"></td> <td><img src="vis_output/dog_with_horn.jpg"></td> <td><img src="imgs/wash_hands.jpg"></td> <td><img src="vis_output/wash_hands.jpg"></td> </tr> <tr> <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the unusual part</ins> in this image and explain why."</td> <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the unusual part is <ins>the dog wearing a reindeer antler headband</ins>..."</td> <td width=25% style="text-align:center;color:gray">"Where to <ins>wash hands</ins> in this image? Please output segmentation mask."</td> <td width=25% style="text-align:center;">"Sure, [SEG]."</td> </tr> <tr> <td><img src="imgs/jackma.jpg"></td> <td><img src="vis_output/jackma.jpg"></td> <td><img src="imgs/blackpink.jpg"></td> <td><img src="vis_output/blackpink.jpg"></td> </tr> <tr> <td width=25% style="text-align:center;color:gray;">"Can you segment <ins>the founder of Alibaba</ins> in this image and explain why?"</td> <td width=25% style="text-align:center;">"Sure, it is [SEG]. In the image, the man wearing a red tie and sitting on one of the chairs is <ins>Jack Ma</ins>, the co-founder of Alibaba Group..."</td> <td width=25% style="text-align:center;color:gray">"Please segment <ins>Lisa</ins> in this figure."</td> <td width=25% style="text-align:center;">"Sure, [SEG]."</td> </tr> </table> <p align="center"> <img src="imgs/fig_overview.jpg" width="100%"> </p>

News

  • [x] [2024.12.30] We released the LISA++ model and datasets, available here. Our findings show that incorporating Visual CoT data can further enhance the model's global understanding. We will update the paper soon; stay tuned!
  • [x] [2024.6.21] LISA is selected as Oral Presentation in CVPR 2024!
  • [x] [2023.8.30] Released three new models: LISA-7B-v1, LISA-7B-v1-explanatory, and LISA-13B-llama2-v1-explanatory. Welcome to check them out!
  • [x] [2023.8.23] Refactored the code and released the new model LISA-13B-llama2-v1. Welcome to check it out!
  • [x] [2023.8.9] Training code is released!
  • [x] [2023.8.4] Online Demo is released!
  • [x] [2023.8.4] ReasonSeg Dataset and the LISA-13B-llama2-v0-explanatory model are released!
  • [x] [2023.8.3] Inference code and the LISA-13B-llama2-v0 model are released. Welcome to check them out!
  • [x] [2023.8.2] Paper is released and GitHub repo is created.

LISA: Reasoning Segmentation via Large Language Model [Paper] <br /> Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia<br />

LISA++: An Improved Baseline for Reasoning Segmentation with Large Language Model [Paper] <br /> Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, Jiaya Jia<br />

Abstract

In this work, we propose a new segmentation task --- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: Large-language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. For more details, please refer to the paper.
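To make the task concrete, here is a minimal sketch of what one benchmark sample looks like: an image paired with an implicit, reasoning-heavy query whose answer is a segmentation mask. The class and field names below are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReasonSegSample:
    """Hypothetical shape of one reasoning-segmentation sample."""
    image_path: str                            # input image
    query: str                                 # implicit query requiring reasoning
    mask_polygon: List[Tuple[float, float]]    # ground-truth region outline

# Example mirroring one of the demo prompts above (polygon coordinates
# are made up for illustration).
sample = ReasonSegSample(
    image_path="imgs/wash_hands.jpg",
    query="Where to wash hands in this image? Please output segmentation mask.",
    mask_polygon=[(10.0, 12.0), (48.0, 12.0), (48.0, 40.0), (10.0, 40.0)],
)
```

Unlike classic referring segmentation, the query never names the target object directly; the model must infer it (here, a sink) before producing the mask.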

Highlights

LISA unlocks new segmentation capabilities for multi-modal LLMs, and can handle cases involving:

  1. complex reasoning;
  2. world knowledge;
  3. explanatory answers;
  4. multi-turn conversation.

LISA also demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement.
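Per the paper, LISA's core mechanism is an embedding-as-mask paradigm: the LLM emits a special [SEG] token, and that token's hidden embedding is decoded against image features (via a SAM-style mask decoder in the real model) to produce the mask. The toy below illustrates only the final scoring step with plain dot products; all dimensions and numbers are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_mask(seg_embedding, image_features):
    """Toy embedding-as-mask step: score every per-pixel feature vector
    against the [SEG] token embedding, then squash scores to (0, 1)."""
    return [[sigmoid(dot(seg_embedding, pixel)) for pixel in row]
            for row in image_features]

# A 2x2 "image" with 3-dim per-pixel features (made-up values).
feats = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
         [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
mask = decode_mask([2.0, -2.0, 0.0], feats)
# Pixels whose features align with the [SEG] embedding score near 1;
# anti-aligned pixels score near 0.
```

In the actual model the [SEG] embedding is projected and fed to a pretrained mask decoder rather than dotted directly with features, but the idea is the same: the mask is read out of a single token's embedding.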

Experimental results

<p align="center"> <img src="imgs/table1.jpg" width="80%"> </p>

Installation

```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
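After installing, a quick sanity check that the heavyweight dependencies are importable can save a failed training run later. The package names below are assumptions based on typical LISA requirements; adjust to match your `requirements.txt`.

```python
import importlib.util

def is_installed(pkg: str) -> bool:
    """Return True if a package is resolvable on the current path."""
    return importlib.util.find_spec(pkg) is not None

# Assumed core dependencies; flash_attn is the module name that
# `pip install flash-attn` provides.
for pkg in ["torch", "transformers", "flash_attn"]:
    print(f"{pkg}: {'ok' if is_installed(pkg) else 'missing'}")
```

`find_spec` does not import the package, so this check is cheap and will not trigger CUDA initialization.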

Training

Training Data Preparation

The training data consists of four types of datasets:

  1. Semantic segmentation datasets: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images

    Note: For COCO-Stuff, we use the annotation file stuffthingmaps_trainval2017.zip. We only use the PACO-LVIS part in PACO. COCO Images should be put into the dataset/coco/ directory.

  2. Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF ([sa
