<img src="assets/badges/lotus_icon.png" alt="lotus" style="height:1em; vertical-align:bottom;"/> Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Project Page · Paper · HuggingFace Demos (Depth & Normal) · ComfyUI · Replicate

Jing He<sup>1*</sup>, Haodong Li<sup>1*</sup>, Wei Yin<sup>2</sup>, Yixun Liang<sup>1</sup>, Leheng Li<sup>1</sup>, Kaiqiang Zhou<sup>3</sup>, Hongbo Zhang<sup>3</sup>, Bingbing Liu<sup>3</sup>,<br> Ying-Cong Chen<sup>1,4✉</sup>

<span class="author-block"><sup>1</sup>HKUST(GZ)</span> <span class="author-block"><sup>2</sup>University of Adelaide</span> <span class="author-block"><sup>3</sup>Noah's Ark Lab</span> <span class="author-block"><sup>4</sup>HKUST</span><br> <span class="author-block"> <sup>*</sup>Both authors contributed equally. <sup>✉</sup>Corresponding author. </span>

🔥🔥🔥 Please also check our latest Lotus-2! Useful links: Project Page, Github Repo. 🔥🔥🔥


We present Lotus, a diffusion-based visual foundation model for dense geometry prediction. With minimal training data, Lotus achieves SoTA performance in two key geometry perception tasks, i.e., zero-shot depth and normal estimation. "Avg. Rank" indicates the average ranking across all metrics, where lower values are better. Bar length represents the amount of training data used.

📢 News

  • 2025-04-03: The training code of Lotus (Generative & Discriminative) is now available!
  • 2025-01-17: Please check out our latest models (lotus-normal-g-v1-1, lotus-normal-d-v1-1), which were trained with aligned surface normals, leading to improved performance!
  • 2024-11-13: The demo now supports video depth estimation!
  • 2024-11-13: The Lotus disparity models (Generative & Discriminative) are now available, which achieve better performance!
  • 2024-10-06: The demos are now available (Depth & Normal). Please have a try!
  • 2024-10-05: The inference code is now available!
  • 2024-09-26: Paper released. Click here if you are curious about the 3D point clouds of the teaser's depth maps!

🛠️ Setup

This installation was tested on: Ubuntu 20.04 LTS, Python 3.10, CUDA 12.3, NVIDIA A800-SXM4-80GB.

  1. Clone the repository (requires git):
git clone https://github.com/EnVision-Research/Lotus.git
cd Lotus
  2. Install dependencies (requires conda):
conda create -n lotus python=3.10 -y
conda activate lotus
pip install -r requirements.txt 

🤗 Gradio Demo

  1. Online demo: Depth & Normal
  2. Local demo
  • For depth estimation, run:
    python app.py depth
    
  • For normal estimation, run:
    python app.py normal
    
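The demos return per-pixel predictions that you will usually want to inspect as images. As a minimal post-processing sketch (not Lotus's own code — just the common affine normalization used when visualizing relative depth), a predicted float depth map can be rescaled to an 8-bit image like this, using only NumPy and a synthetic array in place of a real prediction:

```python
import numpy as np

def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Normalize a float depth map to [0, 255] for visualization.

    Uses min-max (affine) normalization, which is standard when
    inspecting relative-depth predictions.
    """
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max - d_min < 1e-8:  # constant map: avoid divide-by-zero
        return np.zeros_like(depth, dtype=np.uint8)
    norm = (depth - d_min) / (d_max - d_min)
    return (norm * 255.0).round().astype(np.uint8)

# Synthetic prediction; a real one would come from the model.
pred = np.linspace(0.0, 10.0, 16).reshape(4, 4)
vis = depth_to_uint8(pred)
```

The resulting `uint8` array can be written out with any image library (e.g. Pillow's `Image.fromarray`).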

🔥 Training

  1. Initialize your Accelerate environment with:

    accelerate config --config_file=$PATH_TO_ACCELERATE_CONFIG_FILE
    

    Please make sure you have installed the accelerate package. We have tested our training scripts with accelerate version 0.29.3.
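    For reference, an Accelerate config file is a small YAML document. The sketch below shows what $PATH_TO_ACCELERATE_CONFIG_FILE might contain for a single-GPU machine; the keys are standard Accelerate fields, but the values are illustrative assumptions — adjust them to your hardware:

```yaml
# Illustrative single-GPU config; values are assumptions, not the repo's defaults.
compute_environment: LOCAL_MACHINE
distributed_type: "NO"
mixed_precision: fp16
num_machines: 1
num_processes: 1
gpu_ids: all
use_cpu: false
```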

  2. Prepare your training data:

  • Hypersim:
    • Download this script into your $PATH_TO_RAW_HYPERSIM_DATA directory; it will be used to download the data.
    • Run the following command to download the data:
      cd $PATH_TO_RAW_HYPERSIM_DATA
      
      # Download the tone-mapped images
      python ./download.py --contains scene_cam_ --contains final_preview --contains tonemap.jpg --silent
      
      # Download the depth maps
      python ./download.py --contains scene_cam_ --contains geometry_hdf5 --contains depth_meters --silent
      
      # Download the normal maps
      python ./download.py --contains scene_cam_ --contains geometry_hdf5 --contains normal --silent
      
    • Download the split file from here and put it in the $PATH_TO_RAW_HYPERSIM_DATA directory.
    • Process the data with the command: bash utils/process_hypersim.sh.
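Note that Hypersim's `depth_meters` HDF5 files store the Euclidean distance from each pixel to the camera center, not planar depth along the optical axis, so preprocessing typically converts between the two. The sketch below shows the widely used conversion for Hypersim's 1024×768 images (focal length ≈ 886.81 px); it is an assumption about what `utils/process_hypersim.sh` does internally, not a copy of it:

```python
import numpy as np

def hypersim_distance_to_depth(npy_distance: np.ndarray,
                               flt_focal: float = 886.81) -> np.ndarray:
    """Convert Hypersim's 'depth_meters' (ray distance to the camera
    center) into planar depth (distance along the optical axis).

    The default focal length of 886.81 px corresponds to Hypersim's
    1024x768 renders.
    """
    h, w = npy_distance.shape
    # Pixel-plane coordinates, centered on the principal point.
    xs = np.linspace(-0.5 * w + 0.5, 0.5 * w - 0.5, w, dtype=np.float32)
    ys = np.linspace(-0.5 * h + 0.5, 0.5 * h - 0.5, h, dtype=np.float32)
    grid_x, grid_y = np.meshgrid(xs, ys)
    ray_norm = np.sqrt(grid_x ** 2 + grid_y ** 2 + flt_focal ** 2)
    # Project the ray distance onto the optical axis.
    return npy_distance / ray_norm * flt_focal

# Sanity check on a synthetic map of unit distances.
dist = np.ones((768, 1024), dtype=np.float32)
depth = hypersim_distance_to_depth(dist)
```

Planar depth is always less than or equal to the ray distance, and the two coincide at the image center — a quick way to sanity-check the conversion.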
  • Virtual KITTI:
    • Download the rgb, depth, and textgz archives into the $PATH_TO_VKITTI_DATA directory and unzip them.
    • Make sure the directory structure is as follows:
      SceneX/Y/frames/rgb/Camera_Z/rgb_%05d.jpg
      SceneX/Y/frames/depth/Camera_Z/depth_%05d.png
      SceneX/Y/colors.txt
      SceneX/Y/extrinsic.txt
      SceneX/Y/intrinsic.txt
      SceneX/Y/info.txt
      SceneX/Y/bbox.txt
      SceneX/Y/pose.txt
      
      where $X \in \{01, 02, 06, 18, 20\}$ represents one of the 5 different locations, and $Y \in \{\texttt{15-deg-left}, \texttt{15-deg-right}, \texttt{30-deg-left}, \texttt{30-deg-right}, \ldots\}$
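Once the archives are unpacked, the `depth_%05d.png` files still need decoding: by the common Virtual KITTI 2 convention (an assumption here — verify against your download), they are 16-bit PNGs storing depth in centimeters, saturating at 65535 (655.35 m, used for sky). A minimal decoding sketch, operating on an already-loaded `uint16` array:

```python
import numpy as np

def vkitti_depth_to_meters(depth_png: np.ndarray,
                           max_depth: float = 655.35) -> np.ndarray:
    """Decode a Virtual KITTI depth map loaded from depth_%05d.png.

    Assumes the VKITTI 2 convention: 16-bit PNG, depth in centimeters,
    saturated at 65535 (655.35 m) for sky. Saturated pixels are marked
    invalid here.
    """
    depth_m = depth_png.astype(np.float32) / 100.0  # cm -> m
    depth_m[depth_m >= max_depth] = np.inf          # mark sky/invalid
    return depth_m

# Synthetic 16-bit sample: 5 m, 123.45 m, and a saturated (sky) pixel.
raw = np.array([[500, 12345, 65535]], dtype=np.uint16)
meters = vkitti_depth_to_meters(raw)
```

The actual PNGs can be read into such an array with any 16-bit-aware loader (e.g. OpenCV with `cv2.IMREAD_ANYDEPTH`).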