# <img src="assets/badges/lotus_icon.png" alt="lotus" style="height:1em; vertical-align:bottom;"/> Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

The official implementation of Lotus.
Jing He<sup>1<span style="color:red;">✱</span></sup>, Haodong Li<sup>1<span style="color:red;">✱</span></sup>, Wei Yin<sup>2</sup>, Yixun Liang<sup>1</sup>, Leheng Li<sup>1</sup>, Kaiqiang Zhou<sup>3</sup>, Hongbo Zhang<sup>3</sup>, Bingbing Liu<sup>3</sup>,<br> Ying-Cong Chen<sup>1,4✉</sup>
<span class="author-block"><sup>1</sup>HKUST(GZ)</span> <span class="author-block"><sup>2</sup>University of Adelaide</span> <span class="author-block"><sup>3</sup>Noah's Ark Lab</span> <span class="author-block"><sup>4</sup>HKUST</span><br> <span class="author-block"> <sup style="color:red;">✱</sup>Both authors contributed equally. <sup>✉</sup>Corresponding author. </span>
🔥🔥🔥 Please also check our latest Lotus-2! Useful links: Project Page, Github Repo. 🔥🔥🔥

We present Lotus, a diffusion-based visual foundation model for dense geometry prediction. With minimal training data, Lotus achieves SoTA performance in two key geometry perception tasks, i.e., zero-shot depth and normal estimation. "Avg. Rank" indicates the average ranking across all metrics, where lower values are better. Bar length represents the amount of training data used.
## 📢 News
- 2025-04-03: The training code of Lotus (Generative & Discriminative) is now available!
- 2025-01-17: Please check out our latest models (lotus-normal-g-v1-1, lotus-normal-d-v1-1), which were trained with aligned surface normals, leading to improved performance!
- 2024-11-13: The demo now supports video depth estimation!
- 2024-11-13: The Lotus disparity models (Generative & Discriminative) are now available, which achieve better performance!
- 2024-10-06: The demos are now available (Depth & Normal). Please have a try!
- 2024-10-05: The inference code is now available!
- 2024-09-26: Paper released. Click here if you are curious about the 3D point clouds of the teaser's depth maps!
## 🛠️ Setup
This installation was tested on: Ubuntu 20.04 LTS, Python 3.10, CUDA 12.3, NVIDIA A800-SXM4-80GB.
- Clone the repository (requires git):
  ```shell
  git clone https://github.com/EnVision-Research/Lotus.git
  cd Lotus
  ```
- Install dependencies (requires conda):
  ```shell
  conda create -n lotus python=3.10 -y
  conda activate lotus
  pip install -r requirements.txt
  ```
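After installation, you can sanity-check that the environment resolved correctly. This is a minimal sketch, not part of the official setup; the packages checked here (`torch`, `diffusers`, `transformers`) are our assumption of the key dependencies from `requirements.txt`:

```python
# Minimal environment sanity check: report whether the key
# dependencies (assumed from requirements.txt) are importable.
import importlib.util

for pkg in ("torch", "diffusers", "transformers"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'MISSING'}")
```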
## 🤗 Gradio Demo
- For depth estimation, run:
  ```shell
  python app.py depth
  ```
- For normal estimation, run:
  ```shell
  python app.py normal
  ```
## 🔥 Training
- Initialize your Accelerate environment with:
  ```shell
  accelerate config --config_file=$PATH_TO_ACCELERATE_CONFIG_FILE
  ```
  Please make sure you have installed the `accelerate` package. We have tested our training scripts with `accelerate` version 0.29.3.
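For reference, a typical single-node multi-GPU configuration file produced by `accelerate config` looks roughly like the following. The values are illustrative (they assume a 4-GPU machine), not a required setting; generate your own file interactively with the command above:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 4   # one process per GPU
gpu_ids: all
```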
- Prepare your training data:
  - Hypersim:
    - Download this script into your `$PATH_TO_RAW_HYPERSIM_DATA` directory for data downloading.
    - Run the following commands to download the data:
      ```shell
      cd $PATH_TO_RAW_HYPERSIM_DATA
      # Download the tone-mapped images
      python ./download.py --contains scene_cam_ --contains final_preview --contains tonemap.jpg --silent
      # Download the depth maps
      python ./download.py --contains scene_cam_ --contains geometry_hdf5 --contains depth_meters --silent
      # Download the normal maps
      python ./download.py --contains scene_cam_ --contains geometry_hdf5 --contains normal --silent
      ```
    - Download the split file from here and put it in the `$PATH_TO_RAW_HYPERSIM_DATA` directory.
    - Process the data with the command `bash utils/process_hypersim.sh`.
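Note that Hypersim's `depth_meters` HDF5 files store the Euclidean distance from each pixel to the camera center, not planar depth along the optical axis. The following is a reference sketch of the standard conversion, assuming Hypersim's default 1024x768 resolution and its nominal focal length of about 886.81 pixels; the helper name is ours, and the processing script above may already perform this step:

```python
import numpy as np

def hypersim_distance_to_depth(dist, width=1024, height=768, focal=886.81):
    """Convert Hypersim ray distance (depth_meters) to planar depth.

    `dist` is a (height, width) array of Euclidean distances from the
    camera center; the result is depth along the optical axis.
    """
    u = np.linspace(-0.5 * width + 0.5, 0.5 * width - 0.5, width)
    v = np.linspace(-0.5 * height + 0.5, 0.5 * height - 0.5, height)
    uu, vv = np.meshgrid(u, v)              # pixel offsets from the image center
    ray = np.stack([uu, vv, np.full_like(uu, focal)], axis=-1)
    ray_len = np.linalg.norm(ray, axis=-1)  # length of each viewing ray
    return dist / ray_len * focal           # planar depth <= ray distance
```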
  - Virtual KITTI:
    - Download the `rgb`, `depth`, and `textgz` archives into the `$PATH_TO_VKITTI_DATA` directory and unzip them.
    - Make sure the directory structure is as follows:
      ```
      SceneX/Y/frames/rgb/Camera_Z/rgb_%05d.jpg
      SceneX/Y/frames/depth/Camera_Z/depth_%05d.png
      SceneX/Y/colors.txt
      SceneX/Y/extrinsic.txt
      SceneX/Y/intrinsic.txt
      SceneX/Y/info.txt
      SceneX/Y/bbox.txt
      SceneX/Y/pose.txt
      ```
      where $`X \in \{01, 02, 06, 18, 20\}`$ represents one of the 5 different locations and $`Y \in \{\texttt{15-deg-left}, \texttt{15-deg-right}, \texttt{30-deg-left}, \texttt{30-deg-right}, \dots\}`$ denotes the scene variation.
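The Virtual KITTI 2 depth PNGs are 16-bit images encoding depth in centimeters, with the value 65535 marking sky / out-of-range pixels (depth is clipped at 655.35 m). A minimal sketch of the decoding step, assuming the image has already been loaded as a `uint16` array; the function name is ours, and the data-preparation scripts above may already handle this internally:

```python
import numpy as np

def decode_vkitti_depth(depth_png: np.ndarray):
    """Decode a Virtual KITTI 2 depth image loaded as a uint16 array.

    Depth is stored in centimeters; 65535 marks sky / out-of-range pixels.
    Returns depth in meters plus a validity mask.
    """
    depth_m = depth_png.astype(np.float32) / 100.0  # cm -> m
    valid = depth_png < 65535                       # mask out sky pixels
    return depth_m, valid
```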
