BridgeDepth
Official implementation of the paper:
BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment, ICCV 2025 (Highlight)<br/>
Tongfan Guan, Jiaxin Guo, Chen Wang, Yun-Hui Liu
Abstract
Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry yet struggle with ambiguities such as reflective or textureless surfaces. Despite post-hoc synergies, these paradigms remain largely disjoint in practice. We introduce a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with stereo hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry within a single network. Extensive experiments demonstrate state-of-the-art results: it reduces zero-shot generalization error by >40% on Middlebury and ETH3D, while addressing longstanding failures on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, our approach enables robust 3D perception that transcends modality-specific limitations.

TLDR: A unified framework combines monocular and stereo depth estimation through iterative bidirectional alignment of latent representations, achieving state-of-the-art results and addressing ambiguities in stereo vision.
Get Started
Installation
- Clone BridgeDepth:

```bash
git clone https://github.com/aeolusguan/BridgeDepth
cd BridgeDepth
```

- Create the environment; we recommend conda:

```bash
conda create -n bridgedepth python=3.10
conda activate bridgedepth
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126  # use the CUDA version matching your system
pip install -r requirement.txt
```
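Before downloading any checkpoints, you may want to sanity-check the install. A minimal check (not part of the repo) that PyTorch sees your GPU and matches the CUDA build you installed:

```python
# Quick sanity check (not part of the repo): verify the PyTorch/CUDA install.
import torch

print(torch.__version__)          # expect 2.7.0 with the command above
print(torch.cuda.is_available())  # expect True on a CUDA machine
print(torch.version.cuda)         # expect 12.6 for the cu126 wheel
```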
Checkpoints
We provide several pre-trained models:
| Model name | Benchmark | Training resolutions | Stereo encoder | Training Config |
|------------|-----------|----------------------|----------------|-----------------|
| sf.pth | Scene Flow | 368x784 | BasicEncoder | default.py |
| l_sf.pth | Scene Flow | 368x784 | ConvNeXt-Tiny | l_train.yaml |
| kitti.pth | KITTI 2012/2015 | 304x1152 | ConvNeXt-Tiny | kitti_mix_train.yaml |
| eth3d_pretrain.pth, eth3d.pth | ETH3D | 384x512 | ConvNeXt-Tiny | eth3d_pretrain.yaml, eth3d.yaml |
| middlebury_pretrain.pth, middlebury.pth | Middlebury | 384x512, 512x768 | ConvNeXt-Tiny | middlebury_pretrain.yaml, middlebury.yaml |
| rvc_pretrain.pth, rvc.pth | Robust Vision Challenge | 384x768, 384x768 | ConvNeXt-Tiny | rvc_pretrain.yaml, rvc.yaml |
Run demo
```bash
python demo.py --model_name rvc_pretrain  # also try [rvc | eth3d_pretrain | middlebury_pretrain]
# If you hit network issues, download the checkpoint manually and pass its path:
# python demo.py --checkpoint_path $checkpoint
```
You should see the output disparity visualization:
<p align="center"> <img src="./assets/vis.png"> </p>

Point cloud output (without denoising):
<p align="center"> <img src="./assets/cloud.gif"> </p>
To test on your own stereo image pairs, placed in $left_directory and $right_directory respectively, run

```bash
python infer.py --input $left_directory $right_directory --output $output_directory --from-pretrained rvc_pretrain  # also try [rvc | eth3d_pretrain | middlebury_pretrain]
```
Tips:
- For in-the-wild deployment, we generally recommend the rvc_pretrain.pth checkpoint. You are encouraged to also try the other models for your best fit (middlebury_pretrain.pth, eth3d_pretrain.pth, or rvc.pth may be your favorite).
- For high-resolution images (>720p), we strongly suggest running at a smaller scale, e.g., downsampled to 720p, not only for faster inference but also for better accuracy; see the sketch below.
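For the second tip, a minimal preprocessing sketch (assuming OpenCV is installed; the paths and resize policy are placeholders, not part of the repo):

```python
# Hedged sketch: downscale a stereo pair before running infer.py.
# Both views must be resized by the same factor to keep disparities
# consistent; the predicted disparity then scales with the image width.
import cv2

def downscale(path_in, path_out, target_short_side=720):
    img = cv2.imread(path_in)
    scale = target_short_side / min(img.shape[:2])
    if scale < 1.0:  # only shrink, never upsample
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)
    cv2.imwrite(path_out, img)

# Placeholder filenames for illustration
downscale("left/0001.png", "left_720p/0001.png")
downscale("right/0001.png", "right_720p/0001.png")
```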
ONNX Export
We provide a demo ONNX export script:

```bash
python scripts/make_onnx.py --model_name rvc_pretrain --height 540 --width 960  # adjust the input size and model name as needed
```
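To smoke-test the exported model, you can run it with onnxruntime. The file name, two-input layout, and NCHW float32 shapes below are assumptions for illustration; query the session for what the export script actually emits:

```python
# Hedged sketch: run the exported ONNX model with onnxruntime.
# "bridgedepth.onnx" and the two-input NCHW layout are assumptions; inspect
# session.get_inputs() for the names/shapes the export script actually uses.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("bridgedepth.onnx",
                               providers=["CPUExecutionProvider"])
inputs = session.get_inputs()
for inp in inputs:
    print(inp.name, inp.shape)  # discover the real input names and shapes

left = np.random.rand(1, 3, 540, 960).astype(np.float32)   # stand-in left image
right = np.random.rand(1, 3, 540, 960).astype(np.float32)  # stand-in right image
outputs = session.run(None, {inputs[0].name: left, inputs[1].name: right})
print(outputs[0].shape)  # expected: the predicted disparity map
```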
Datasets
To train/evaluate BridgeDepth, you first need to prepare datasets following this guide.
Evaluation
To evaluate on the Scene Flow test set, run

```bash
python main.py --num-gpus 4 --eval-only --from-pretrained sf  # adjust the number of GPUs as needed
# If you hit network issues, download the checkpoint manually and pass its path:
# python main.py --num-gpus 4 --eval-only --from-pretrained $checkpoint
```

or

```bash
python main.py --num-gpus 4 --eval-only --from-pretrained l_sf
```
For zero-shot generalization evaluation, run

```bash
python main.py --num-gpus 4 --eval-only --config-file configs/zero_shot_evaluation.yaml --from-pretrained sf
```
For submission to the KITTI 2012/2015, ETH3D, and Middlebury online test sets, you can run:

```bash
python infer.py --dataset-name kitti_2015 --from-pretrained kitti  # produces kitti_2015_submission in the current working directory
python infer.py --dataset-name kitti_2012 --from-pretrained kitti  # produces kitti_2012_submission in the current working directory
python infer.py --dataset-name eth3d --output eth3d_submission --from-pretrained eth3d  # try --from-pretrained rvc for an _RVC submission
python infer.py --dataset-name middlebury_H --output middlebury_submission --from-pretrained middlebury  # try --from-pretrained rvc for an _RVC submission
```
Training
First, download the Depth Anything V2 (DAv2) model:

```bash
mkdir checkpoints; cd checkpoints
wget https://huggingface.co/depth-anything/Depth-Anything-V2-Large/resolve/main/depth_anything_v2_vitl.pth
cd ..
```
Train on Scene Flow:

```bash
python main.py --num-gpus 4 --checkpoint-dir checkpoints/sf
python main.py --num-gpus 4 --config-file configs/L_train.yaml --checkpoint-dir checkpoints/l_sf
```
Finetune for Benchmarks
```bash
# KITTI
python main.py --num-gpus 4 --config-file configs/kitti_mix_train.yaml --checkpoint-dir checkpoints/kitti SOLVER.RESUME checkpoints/l_sf/step_300000.pth

# ETH3D
python main.py --num-gpus 4 --config-file configs/eth3d_pretrain.yaml --checkpoint-dir checkpoints/eth3d_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/eth3d.yaml --checkpoint-dir checkpoints/eth3d SOLVER.RESUME checkpoints/eth3d_pretrain/step_300000.pth

# Middlebury
python main.py --num-gpus 4 --config-file configs/middlebury_pretrain.yaml --checkpoint-dir checkpoints/middlebury_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/middlebury.yaml --checkpoint-dir checkpoints/middlebury SOLVER.RESUME checkpoints/middlebury_pretrain/step_200000.pth

# RVC
python main.py --num-gpus 4 --config-file configs/rvc_pretrain.yaml --checkpoint-dir checkpoints/rvc_pretrain SOLVER.RESUME checkpoints/l_sf/step_300000.pth
python main.py --num-gpus 4 --config-file configs/rvc.yaml --checkpoint-dir checkpoints/rvc SOLVER.RESUME checkpoints/rvc_pretrain/step_200000.pth
```
We support TensorBoard for monitoring the training process. Start a TensorBoard session with

```bash
tensorboard --logdir checkpoints
```

and then open http://localhost:6006 in your browser.
BibTeX

```bibtex
@inproceedings{guan2025bridgedepth,
    author    = {Guan, Tongfan and Guo, Jiaxin and Wang, Chen and Liu, Yun-Hui},
    title     = {BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {27681-27691}
}
```
Acknowledgement
Thanks to the authors of DepthAnything V2, NMRF, DEFOM-Stereo, and FoundationStereo for releasing their code. Finally, thanks to the ICCV reviewers and area chairs for their appreciation of this work and constructive feedback.