Uni3D

[ICLR'24 Spotlight] Uni3D: 3D Visual Representation from BAAI

<div align='center'> <h2><a href="https://arxiv.org/abs/2310.06773">Uni3D: Exploring Unified 3D Representation at Scale</a></h2>

Junsheng Zhou<sup>1,2*</sup>, Jinsheng Wang<sup>1*</sup>, Baorui Ma<sup>1*</sup>, Yu-Shen Liu<sup>2</sup>, Tiejun Huang<sup>1,3</sup>, Xinlong Wang<sup>1</sup>

<sup>1</sup>BAAI, <sup>2</sup>THU, <sup>3</sup>PKU <br><sup>*</sup> Equal Contribution

ICLR 2024 (Spotlight)

</div> <p align="center"> <img src="assets/overview.jpg" alt="overview" width="800" /> </p>

We present Uni3D, a unified and scalable 3D pretraining framework for large-scale 3D representation learning, and explore its limits at the scale of one billion parameters. Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features. With this simple architecture and pretext task, Uni3D can use abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the potential of 2D models and scaling-up strategies for the 3D world. We efficiently scale Uni3D up to one billion parameters and set new records on a broad range of 3D tasks.
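To make the alignment idea concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss between a batch of point cloud embeddings and a batch of target (image or text) embeddings. It is an illustration only, not the training code in this repository: the function names, the plain-Python implementation, and the temperature of 0.07 are all assumptions, and the exact objective is defined in the paper and the pretraining code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def alignment_loss(point_feats, target_feats, temperature=0.07):
    """Symmetric contrastive loss; pair i of each batch is the positive pair.

    `temperature` is an assumed value, not taken from the Uni3D configuration.
    """
    p = [l2_normalize(v) for v in point_feats]
    t = [l2_normalize(v) for v in target_feats]
    # logits[i][j]: scaled cosine similarity of point cloud i and target j
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t] for pi in p]
    logits_t = [list(col) for col in zip(*logits)]  # target-to-point direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each point cloud embedding is closest to its own paired target and large when the pairs are mismatched, which is the behavior the pretext task relies on.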

Schedule

We are committed to open-sourcing Uni3D related materials, including:

  • [x] Extended Uni3D to a 3D metric (Uni3D-score) for enhanced semantic coherence in text-to-3D tasks. For details, see GeoDream.
  • [x] Model weights ranging from 6M to 1B parameters.
  • [x] Evaluation code
  • [x] Evaluation data
  • [x] Pretraining code
  • [ ] Pretraining data

We hope to foster the growth of our community through open-sourcing and promoting collaboration👬. Let's step towards multimodal intelligence together🍻.

Installation

Clone this repository and install the required packages:

git clone https://github.com/baaivision/Uni3D.git
cd Uni3D

conda create -n uni3d python=3.8
conda activate uni3d
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt

# install pointnet2 extensions from https://github.com/erikwijmans/Pointnet2_PyTorch
pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
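After installation, a quick check that the key modules can be located may save a failed run later. The sketch below is not part of the repository; `check_modules` is a hypothetical helper, and the import names `torch`, `torchvision` and `pointnet2_ops` are assumed from the install commands above.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names; adjust them if your environment differs.
    for name, found in check_modules(["torch", "torchvision", "pointnet2_ops"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only confirms that the modules are discoverable; it does not verify versions or that the CUDA extensions were compiled correctly.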

Core packages:

Model Zoo

| Model | Training Data | Objaverse-LVIS Top1 (Top5) | ModelNet40 Top1 (Top5) | ScanObjectNN Top1 (Top5) |
| :------: | :------: | :------: | :------: | :------: |
| Uni3d-B | Ensembled w/o LVIS | 45.9 (74.8) | 86.1 (98.7) | 61.7 (89.5) |
| Uni3d-B | Ensembled | 51.7 (80.8) | 86.3 (97.9) | 63.8 (90.2) |
| Uni3d-L | Ensembled w/o LVIS | 46.2 (74.7) | 86.6 (97.8) | 58.4 (90.1) |
| Uni3d-L | Ensembled | 53.1 (81.5) | 86.3 (98.3) | 58.2 (89.4) |
| Uni3d-g | Ensembled w/o LVIS | 47.2 (76.1) | 86.8 (98.4) | 66.5 (90.1) |
| Uni3d-g | Ensembled | 53.5 (82.0) | 87.3 (99.2) | 63.9 (91.7) |
| Uni3d-g 🔥 | Ensembled | 55.3 (82.9) | 88.2 (99.3) | 65.3 (92.7) |

Evaluation of Zero-shot 3D classification

We evaluate the zero-shot 3D classification performance on three datasets: Objaverse-LVIS, ModelNet40 and ScanObjectNN.

  1. Please refer to DATASETS.md for evaluation dataset preparation.
  2. [Recommended 🤗] Download the CLIP model and put it in the `/path/to/clip_model` folder.
  3. Download the model zoo weights and put them in the `/path/to/checkpoints` folder.
  4. Run `bash scripts/inference.sh [scale]` to evaluate the model on the above datasets, e.g., `bash scripts/inference.sh giant`.
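For readers unfamiliar with the Top1 (Top5) numbers reported in the Model Zoo, the sketch below shows how top-k accuracy is typically computed in zero-shot classification: each point cloud embedding is compared with one text embedding per class, and a sample counts as correct if its true class is among the k most similar. It is illustrative only and assumes precomputed embeddings; `cosine` and `topk_accuracy` are hypothetical names, and the actual evaluation is performed by `scripts/inference.sh`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def topk_accuracy(point_feats, text_feats, labels, k=1):
    """Fraction of samples whose true class is among the k most similar classes.

    point_feats: one embedding per 3D sample
    text_feats:  one embedding per class name (index = class id)
    labels:      ground-truth class id for each sample
    """
    hits = 0
    for feat, label in zip(point_feats, labels):
        scores = [cosine(feat, t) for t in text_feats]
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


if __name__ == "__main__":
    # Toy embeddings, made up for illustration.
    text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    points = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.1, 0.5]]
    print(topk_accuracy(points, text, [0, 1, 2], k=1))
    print(topk_accuracy(points, text, [0, 1, 2], k=2))
```

In the toy data the third sample is ranked second for its true class, so it counts as a miss for k=1 and a hit for k=2.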

Pre-training

  1. Please refer to DATASETS.md for pre-train dataset preparation.
  2. [Recommended 🤗] Download the CLIP model and put it in the `/path/to/clip_model` folder.
  3. [Recommended 🤗] Download the initialization model and put it in the `/path/to/init_model` folder.
  4. Run `bash scripts/pretrain.sh` to pre-train the model on the ensembled datasets.

Visualization

Open-world Understanding

<p align="center"> <img src="assets/scene_understanding.jpg" alt="scene" width="800" /> </p>

One-shot Part Segmentation

<p align="center"> <img src="assets/vis_part.jpg" alt="partseg" width="800" /> </p>

Point Cloud Painting

<p align="center"> <img src="assets/editing.jpg" alt="editing" width="800" /> </p>

Cross-modal Retrieval

<p align="center"> <img src="assets/retrival_text.jpg" alt="retrival_text" width="800" /> </p> <p align="center"> <img src="assets/retrival.jpg" alt="retrival" width="800" /> </p>

Acknowledgement

Uni3D is built using the awesome EVA, OpenCLIP, timm, DeepSpeed, ULIP and OpenShape.

This work is supported by the National Science and Technology Major Project of New Generation Artificial Intelligence (No. 2022ZD0116314).

Citation

@inproceedings{zhou2023uni3d,
  title={Uni3d: Exploring unified 3d representation at scale},
  author={Zhou, Junsheng and Wang, Jinsheng and Ma, Baorui and Liu, Yu-Shen and Huang, Tiejun and Wang, Xinlong},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}