L2G-Det

From Local Matches to Global Masks: Novel Instance Detection in Open-World Scenes

arXiv, Project

Detecting and segmenting novel object instances in open-world environments is a fundamental problem in robotic perception. Given only a small set of template images, a robot must locate and segment a specific object instance in a cluttered, previously unseen scene. Existing proposal-based approaches are highly sensitive to proposal quality and often fail under occlusion and background clutter. We propose L2G-Det, a local-to-global instance detection framework that bypasses explicit object proposals by leveraging dense patch-level matching between templates and the query image. Locally matched patches generate candidate points, which are refined through a candidate selection module to suppress false positives. The filtered points are then used to prompt an augmented Segment Anything Model (SAM) with instance-specific object tokens, enabling reliable reconstruction of complete instance masks. Experiments demonstrate improved performance over proposal-based methods in challenging open-world settings.
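The local-to-global idea above can be sketched in a few lines. This is an illustrative example only, not the authors' implementation: it assumes patch features for the templates and the query image have already been extracted (e.g., by a ViT backbone), and shows how dense cosine matching turns locally matched patches into candidate point prompts.

```python
import numpy as np

def candidate_points(template_feats, query_feats, grid_w, patch=16, thresh=0.6):
    """Dense patch matching: cosine similarity between query and template
    patch features; query patches whose best template match exceeds the
    threshold become candidate point prompts (in pixel coordinates)."""
    t = template_feats / np.linalg.norm(template_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ t.T                      # (num_query_patches, num_template_patches)
    best = sim.max(axis=1)             # best template match per query patch
    idx = np.where(best > thresh)[0]   # locally matched query patches
    ys, xs = np.divmod(idx, grid_w)    # patch-grid coordinates
    # centre of each matched patch in pixel coordinates
    return np.stack([xs * patch + patch // 2, ys * patch + patch // 2], axis=1)
```

In the full pipeline these points would then pass through the candidate selection module before prompting SAM; the threshold and patch size here are placeholder values.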

Framework

Overview of the L2G framework (figure).

📸 Detection Examples

RoboTools

<p align="center"> <img src="assets/RoboTools.png" width="100%"> </p>

High Resolution

<p align="center"> <img src="assets/High_Res.png" width="100%"> </p>

Getting Started

Prerequisites

  • Python 3.10
  • torch (tested with 2.6.0)
  • torchvision

Installation

We tested the code on Ubuntu 20.04.

git clone https://github.com/IRVLUTD/L2G.git
cd L2G
# Create the conda env
conda create -n L2G python=3.10
conda activate L2G
# Install PyTorch
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 torchaudio==2.6.0+cu118 --index-url https://download.pytorch.org/whl/cu118
# Install other packages
pip install -e .

Preparing models

Download the pretrained models and place them in the "checkpoints" folder as follows:

checkpoints/
├── dinov3/
│   └── dinov3_vitl16_pretrain_*.pt
│
├── SAM/
│   └── sam2.1_hiera_large.pt
│
├── Adapter/
│   ├── High_Res_Adapter.pt
│   └── RoboTools_Adapter.pt
│
├── Object_tokens_High_Res/
│   ├── full_mask_tokens_000001.pt
│   ├── full_mask_tokens_000002.pt
│   ├── ...
│
└── Object_tokens_RoboTools/
    ├── full_mask_tokens_000001.pt
    ├── full_mask_tokens_000002.pt
    ├── ...

Preparing Datasets

<details> <summary> Setting Up Detection Datasets </summary>

The RoboTools dataset is divided into 24 scenes (Scenes 1–24). Download the dataset:

The High_Resolution dataset is divided into 22 scenes (Hard: Scenes 1–10; Easy: Scenes 11–22). Download the dataset:

Please put them into the "data" folder as follows:

data/
│
├── Query/
│   ├── High_Resolution/
│   │   ├── 000001/
│   │   ├── 000002/
│   │   └── ...
│   │
│   └── RoboTools/
│       ├── 000001/
│       ├── 000002/
│       └── ...
│
└── Templates/
    ├── High_Resolution_all/
    │   ├── rgb/
    │   │   ├── 000001/
    │   │   ├── 000002/
    │   │   └── ...
    │   └── mask/
    │       ├── 000001/
    │       ├── 000002/
    │       └── ...
    │
    └── RoboTools_all/
        ├── rgb/
        │   ├── 000001/
        │   ├── 000002/
        │   └── ...
        │
        └── mask/
            ├── 000001/
            ├── 000002/
            └── ...
</details>

Usage

Demo

You can run the demo directly:

python run.py --config Demo.yaml

or check inference on a single image.

Benchmark

Sample the template images:

cd tools

# --n 8          : Number of templates to sample per object
# --datasets     : Dataset name (e.g., RoboTools; High_Resolution)
python sample_templates.py --n 8 --datasets RoboTools

Run L2G on the Benchmark:

python run.py --config RoboTools.yaml  # or High_Res.yaml

# Then merge the results using tools/utils/merge.py.

We include the ground-truth files and our predictions in this link. You can run eval_results.py to evaluate them.
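The evaluation script is not reproduced here, but instance-mask evaluation ultimately reduces to IoU between predicted and ground-truth masks. A minimal sketch of that core computation (illustrative, not eval_results.py itself):

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two boolean instance masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union else 0.0
```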

Create the template-based training images

Download the background images via the link. Among them, Backgrounds_2048 is constructed by cropping local regions from the original high-resolution background images, yielding images of size 2048 × 1536.

# Create the template-based training images on RoboTools
python tools/Compose_objects.py \
--objects-root data/Templates/RoboTools_all \
--backgrounds Backgrounds_2048 \
--out-root RoboTools_create \
--bbox-out-root RoboTools_create_bbox \
--start-object-id 1 \
--end-object-id 20
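At its core, composing training images means alpha-pasting masked template crops onto a background and recording the resulting bounding box. The following is a simplified sketch of that compositing step (NumPy only; function and argument names are illustrative, not those of Compose_objects.py):

```python
import numpy as np

def paste_object(background, obj_rgb, obj_mask, x, y):
    """Paste a masked object crop onto a copy of the background at (x, y);
    return the composite image and the object's bounding box (x0, y0, x1, y1).
    Assumes the mask is non-empty and the crop fits inside the background."""
    bg = background.copy()
    h, w = obj_mask.shape
    region = bg[y:y + h, x:x + w]
    region[obj_mask > 0] = obj_rgb[obj_mask > 0]  # copy only masked pixels
    ys, xs = np.nonzero(obj_mask)
    bbox = (x + xs.min(), y + ys.min(), x + xs.max() + 1, y + ys.max() + 1)
    return bg, bbox
```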

Training

Check the training demo in notebooks.

Real-World Robot Experiment

Click the following image to watch the video.

Watch the video

Acknowledgments

This project is based on the following repositories:
