GLD

Official implementation of "Repurposing Geometric Foundation Models for Multi-view Diffusion"

Geometric Latent Diffusion: Repurposing Geometric Foundation Models for Multi-view Diffusion

Wooseok Jang¹, Seonghu Jeon¹, Jisang Han¹, Jinhyeok Choi¹, Minkyung Kwon¹, Seungryong Kim¹, Saining Xie², Sainan Liu³

¹KAIST ²New York University ³Intel Labs

News

2026-03-25: Cleaned up camera conventions and removed unused debugging code. All input cameras are now expected in OpenCV convention (X-right, Y-down, Z-forward). The checkpoint has also been updated.
2026-03-24: Initial code and model release.
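
Since input cameras are expected in OpenCV convention, poses exported from OpenGL-style tools (X-right, Y-up, Z-backward) need converting first. The sketch below is not part of the GLD codebase. It assumes poses are 4x4 camera-to-world matrices stored as nested lists, and the function name is hypothetical. The conversion negates the camera's Y and Z axes and leaves the camera position untouched.

```python
def opengl_to_opencv_c2w(c2w):
    """Convert a 4x4 camera-to-world pose from OpenGL to OpenCV convention.

    Assumption: `c2w` is a 4x4 nested list whose first three columns are the
    camera's X, Y, Z axes in world coordinates and whose last column is the
    camera position. OpenGL uses X-right, Y-up, Z-backward; OpenCV uses
    X-right, Y-down, Z-forward, so the Y and Z axis columns change sign.
    """
    out = [list(row) for row in c2w]  # copy so the input is not modified
    for r in range(3):
        out[r][1] = -out[r][1]  # flip camera Y axis (up -> down)
        out[r][2] = -out[r][2]  # flip camera Z axis (backward -> forward)
    return out


# Example: a camera at (1, 2, 3) with identity rotation in OpenGL convention.
pose_gl = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
pose_cv = opengl_to_opencv_c2w(pose_gl)
```

The same sign flip converts back from OpenCV to OpenGL, so applying the function twice returns the original pose. If your poses are world-to-camera matrices instead, negate the second and third rows rather than the columns.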

Overview

GLD performs multi-view diffusion in the feature space of geometric foundation models (Depth Anything 3 / VGGT). This enables novel view synthesis with zero-shot geometry, and the model is trained from scratch without text-to-image pretraining.

4.4× faster training convergence vs. VAE-based approaches
Zero-shot depth & 3D from synthesized latents via frozen decoders
State-of-the-art on RE10K and DL3DV benchmarks

Requirements

GPU: 48GB+ VRAM recommended (e.g., A6000, A100). Cascade mode loads two DiT models simultaneously.
Python: 3.10+
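
To check which local GPUs meet the VRAM recommendation before launching cascade mode, a small helper like the one below may be useful. It is a sketch, not part of the repository. It assumes `nvidia-smi` is on the PATH, and the 46,000 MiB threshold is an assumption chosen because cards sold as 48 GB can report slightly less than 48 × 1024 MiB.

```python
import subprocess

# Assumed threshold: a little below 48 * 1024 MiB, since nominal 48 GB cards
# may report a slightly smaller total.
MIN_VRAM_MIB = 46_000


def parse_memory_totals(nvidia_smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_meeting_requirement(totals_mib, min_mib=MIN_VRAM_MIB):
    """Return indices of GPUs whose total memory is at least `min_mib`."""
    return [i for i, total in enumerate(totals_mib) if total >= min_mib]


def query_gpu_memory():
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_totals(result.stdout)
```

Calling `gpus_meeting_requirement(query_gpu_memory())` lists candidate GPU indices in `nvidia-smi` order. That order can differ from CUDA's default device order unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.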

Installation

conda env create -f environment.yml
conda activate gld

Pretrained Models

Download all checkpoints from HuggingFace:

# Download all model weights
python -c "from huggingface_hub import snapshot_download; snapshot_download('SeonghuJeon/GLD', local_dir='.')"

This places files as follows:

pretrained_models/
  da3/
    model.safetensors              # DA3-Base encoder weights
    dpt_decoder.pt                 # DPT decoder (depth + geometry)
  mae_decoder.pt                   # DA3 MAE decoder (RGB)
  vggt/
    mae_decoder.pt                 # VGGT MAE decoder (RGB)

checkpoints/
  da3_level1.pt                    # DA3 level-1 diffusion
  da3_cascade.pt                   # DA3 cascade (level-1 → level-0)
  vggt_level1.pt                   # VGGT level-1 diffusion
  vggt_cascade.pt                  # VGGT cascade (level-1 → level-0)

model_stats/                       # Latent normalization statistics
  da3/normalization_stats_level{0,1}.pt
  vggt/normalization_stats_level{0,1}.pt
  vggt/special_stats_level{0,1}.pt

Note: model_stats/ and configs/ are already included in the Git repository and are not downloaded from HuggingFace. Clone the repository first, then run snapshot_download inside the cloned directory.
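
After cloning and downloading, it can help to confirm that every file in the layout above is in place before running the demo or evaluation. The following is a minimal sketch, not a script shipped with the repository. The function name is hypothetical, and the file list is copied from the layout above with the `level{0,1}` entries expanded.

```python
from pathlib import Path

# Paths taken from the layout listed above, relative to the repository root.
EXPECTED_FILES = [
    "pretrained_models/da3/model.safetensors",
    "pretrained_models/da3/dpt_decoder.pt",
    "pretrained_models/mae_decoder.pt",
    "pretrained_models/vggt/mae_decoder.pt",
    "checkpoints/da3_level1.pt",
    "checkpoints/da3_cascade.pt",
    "checkpoints/vggt_level1.pt",
    "checkpoints/vggt_cascade.pt",
    "model_stats/da3/normalization_stats_level0.pt",
    "model_stats/da3/normalization_stats_level1.pt",
    "model_stats/vggt/normalization_stats_level0.pt",
    "model_stats/vggt/normalization_stats_level1.pt",
    "model_stats/vggt/special_stats_level0.pt",
    "model_stats/vggt/special_stats_level1.pt",
]


def missing_files(repo_root="."):
    """Return the expected paths that do not exist as files under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```

Missing entries under `model_stats/` usually mean the script was run outside the cloned repository, while missing entries under `pretrained_models/` or `checkpoints/` point to an incomplete download.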

Quick Demo

# DA3 backbone
./run_demo.sh da3

# VGGT backbone
./run_demo.sh vggt

This runs novel view synthesis (NVS) on the included demo scenes and generates 3D reconstructions (GLB + COLMAP). To specify a GPU: ./run_demo.sh da3 <GPU_ID>

Note: 3D reconstruction is currently supported for DA3 only. Reconstruction code for the VGGT checkpoint will be added soon.

Training

Stage 2: Multi-view Diffusion

# DA3 level-1
./run_train.sh da3 level1

# DA3 cascade (level-1 → level-0)
./run_train.sh da3 cascade

# VGGT level-1
./run_train.sh vggt level1

Multi-GPU: edit --nproc_per_node in run_train.sh.

Stage 1: Decoder Training

Train the MAE decoder (RGB reconstruction) on frozen DA3 encoder features with GAN + LPIPS losses:

./scripts/run_train_stage1_mae.sh [NUM_GPUS] [RESUME_CKPT]

# Example: 4 GPUs
./scripts/run_train_stage1_mae.sh 4

# Resume from checkpoint
./scripts/run_train_stage1_mae.sh 4 results/stage1-mae/.../checkpoints/0050000.pt

See configs/training/DA3_stage1_mae.yaml for training hyperparameters.
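
When resuming, the most recent checkpoint has to be passed as RESUME_CKPT. The helper below is a sketch for picking it automatically. It assumes checkpoints are named by zero-padded training step, as in the 0050000.pt example above, and the function name is hypothetical. The checkpoint directory is left for you to supply, since the full results path depends on your run.

```python
from pathlib import Path


def latest_checkpoint(ckpt_dir):
    """Return the .pt file with the highest numeric step name, or None.

    Assumption: checkpoint files are named like 0050000.pt, i.e. the stem is
    the training step. Files with non-numeric stems are ignored.
    """
    candidates = [p for p in Path(ckpt_dir).glob("*.pt") if p.stem.isdigit()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: int(p.stem))
```

The returned path can then be given as the second argument to ./scripts/run_train_stage1_mae.sh.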

Evaluation

# DA3 cascade (default)
./eval_gld.sh da3 cascade

# VGGT cascade
./eval_gld.sh vggt cascade

# Independent (single level, no cascade)
./eval_gld.sh da3 independent

Project Structure

├── src/
│   ├── stage1/                    # Feature encoder (DA3/VGGT) + decoders (MAE/DPT)
│   ├── stage2/                    # DiT diffusion transformer
│   ├── utils/                     # Metrics, camera, config, validation
│   ├── datasets/                  # Eval dataset adapter
│   ├── video/                     # Training data loaders (CUT3R format)
│   ├── train_multiview_da3.py     # Stage 2 training
│   ├── train_stage1_mae.py        # Stage 1 decoder training
│   └── eval_gld_metric.py
├── configs/
│   ├── training/                  # Model configs (DA3/VGGT × level1/cascade)
│   └── eval/                      # Evaluation configs
├── demo/                          # Demo scenes (RE10K + DL3DV)
├── scripts/                       # 3D reconstruction utilities
├── run_train.sh
├── eval_gld.sh
├── run_demo.sh
└── environment.yml

Citation

@article{jang2026gld,
  title={Repurposing Geometric Foundation Models for Multi-view Diffusion},
  author={Jang, Wooseok and Jeon, Seonghu and Han, Jisang and Choi, Jinhyeok and Kwon, Minkyung and Kim, Seungryong and Xie, Saining and Liu, Sainan},
  journal={arXiv preprint arXiv:2603.22275},
  year={2026}
}

Acknowledgements

Built upon RAE, Depth Anything 3, VGGT, CUT3R, and SiT.
