# MoGe: Accurate Monocular Geometry Estimation

MoGe is a powerful model for recovering 3D geometry from monocular open-domain images, producing metric point maps, metric depth maps, normal maps, and the camera FOV. Check our websites (MoGe-1, MoGe-2) for videos and interactive results!

## 📖 Publications

### MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

<div align="center"> <a href="https://arxiv.org/abs/2507.02546"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a> <a href='https://wangrc.site/MoGe2Page/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a> <a href='https://huggingface.co/spaces/Ruicheng/MoGe-2'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo_(MoGe_v2)-blue'></a>

https://github.com/user-attachments/assets/8f9ae680-659d-4f7f-82e2-b9ed9d6b988a

</div>

### MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision (CVPR 2025 Oral)

<div align="center"> <a href="https://arxiv.org/abs/2410.19115"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a> <a href='https://wangrc.site/MoGePage/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a> <a href='https://huggingface.co/spaces/Ruicheng/MoGe'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Demo_(MoGe_v1)-blue'></a> </div> <img src="./assets/overview_simplified.png" width="100%" alt="Method overview" align="center">

## 🌟 Features

- **Accurate 3D geometry estimation**: estimates point maps, depth maps, and normal maps from single open-domain images with high precision, all in one model and one forward pass.
- **Optional ground-truth FOV input**: providing the true field of view further improves accuracy.
- **Flexible resolution support**: works seamlessly with various resolutions and aspect ratios, from 2:1 to 1:2.
- **Optimized for speed**: achieves about 60 ms latency per image (A100 or RTX 3090, FP16, ViT-L), and the inference resolution is adjustable for even faster runs (a rough timing sketch follows below).
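
The latency figure can be sanity-checked with a rough benchmark like the one below. This is an illustrative sketch, not the repository's own benchmarking code; it assumes a CUDA device and that wrapping the call in `torch.autocast` approximates the FP16 path (the CLI exposes a supported `--fp16` flag, see below):

```python
import time

import torch
from moge.model.v2 import MoGeModel

# Illustrative timing sketch; autocast FP16 here is an assumption, not the
# repository's documented FP16 path (the CLI uses --fp16).
device = torch.device("cuda")
model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl-normal").to(device)
image = torch.rand(3, 512, 512, device=device)  # dummy RGB input in [0, 1]

with torch.autocast("cuda", dtype=torch.float16):
    model.infer(image)  # warm-up pass
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.infer(image)
    torch.cuda.synchronize()
    print(f"latency: {(time.perf_counter() - t0) * 1000:.1f} ms")
```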

## ✨ News

**(2025-10-16)**

- Updated the training code for MoGe-2.

**(2025-06-10)**

- Released MoGe-2, a state-of-the-art model for monocular geometry estimation, bringing these new capabilities in one unified model:
  - point map prediction in metric scale;
  - performance comparable to, and often better than, MoGe-1;
  - significantly sharper visual detail;
  - high-quality normal map estimation;
  - lower inference latency.

## 📦 Installation

Install via pip:

```bash
pip install git+https://github.com/microsoft/MoGe.git
```

Or clone this repository:

```bash
git clone https://github.com/microsoft/MoGe.git
cd MoGe
pip install -r requirements.txt   # install the requirements
```

Note: MoGe should be compatible with most versions of its dependencies. Please check `requirements.txt` for details if you encounter any dependency issues.

## 🤗 Pretrained Models

Our pretrained models are available on the Hugging Face Hub:

<table>
  <thead>
    <tr> <th>Version</th> <th>Hugging Face Model</th> <th>Metric scale</th> <th>Normal</th> <th>#Params</th> </tr>
  </thead>
  <tbody>
    <tr> <td>MoGe-1</td> <td><a href="https://huggingface.co/Ruicheng/moge-vitl" target="_blank"><code>Ruicheng/moge-vitl</code></a></td> <td>-</td> <td>-</td> <td>314M</td> </tr>
    <tr> <td rowspan="4">MoGe-2</td> <td><a href="https://huggingface.co/Ruicheng/moge-2-vitl" target="_blank"><code>Ruicheng/moge-2-vitl</code></a></td> <td>✅</td> <td>-</td> <td>326M</td> </tr>
    <tr> <td><a href="https://huggingface.co/Ruicheng/moge-2-vitl-normal" target="_blank"><code>Ruicheng/moge-2-vitl-normal</code></a></td> <td>✅</td> <td>✅</td> <td>331M</td> </tr>
    <tr> <td><a href="https://huggingface.co/Ruicheng/moge-2-vitb-normal" target="_blank"><code>Ruicheng/moge-2-vitb-normal</code></a></td> <td>✅</td> <td>✅</td> <td>104M</td> </tr>
    <tr> <td><a href="https://huggingface.co/Ruicheng/moge-2-vits-normal" target="_blank"><code>Ruicheng/moge-2-vits-normal</code></a></td> <td>✅</td> <td>✅</td> <td>35M</td> </tr>
  </tbody>
</table>

NOTE: `moge-2-vitl-normal` provides the full set of capabilities, with almost the same level of performance as `moge-2-vitl` plus normal map estimation.

Import the `MoGeModel` class of the matching version, then load the pretrained weights via `MoGeModel.from_pretrained("HUGGING_FACE_MODEL_REPO_NAME")`; the checkpoint is downloaded automatically. To load a local checkpoint, replace the model name with the local path, as sketched below.
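
For example (a minimal sketch; the local path below is hypothetical):

```python
# Pick the import that matches the checkpoint version:
#   moge.model.v1 for Ruicheng/moge-vitl, moge.model.v2 for the moge-2-* models.
from moge.model.v2 import MoGeModel

# From the Hugging Face Hub (downloads the checkpoint automatically):
model = MoGeModel.from_pretrained("Ruicheng/moge-2-vits-normal")

# From a local checkpoint instead (hypothetical path):
# model = MoGeModel.from_pretrained("./checkpoints/moge-2-vits-normal.pt")
```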

For ONNX support, please refer to `docs/onnx.md`.

## 💡 Minimal Code Example

Here is a minimal example of loading the model and running inference on a single image.

```python
import cv2
import torch
# from moge.model.v1 import MoGeModel
from moge.model.v2 import MoGeModel  # Let's try MoGe-2

device = torch.device("cuda")

# Load the model from the Hugging Face Hub (or from a local path).
model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl-normal").to(device)

# Read the input image and convert it to a (3, H, W) tensor with RGB values in [0, 1].
input_image = cv2.cvtColor(cv2.imread("PATH_TO_IMAGE.jpg"), cv2.COLOR_BGR2RGB)
input_image = torch.tensor(input_image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)

# Infer
output = model.infer(input_image)
"""
`output` has keys "points", "depth", "mask", "normal" (optional) and "intrinsics".
All maps have the same size as the input image.
{
    "points": (H, W, 3),    # point map in the OpenCV camera coordinate system (x right, y down, z forward); metric scale for MoGe-2
    "depth": (H, W),        # depth map
    "normal": (H, W, 3),    # normal map in the OpenCV camera coordinate system (available for MoGe-2-normal)
    "mask": (H, W),         # binary mask of valid pixels
    "intrinsics": (3, 3),   # normalized camera intrinsics
}
"""
```

For more usage details, see the `MoGeModel.infer()` docstring.

## 💡 Usage

### Gradio demo | `moge app`

The demo for MoGe-1 is also available at our Hugging Face Space.

```bash
# Using the command-line tool
moge app        # runs the MoGe-2 demo by default

# From this repository
python moge/scripts/app.py   # add --share for public Gradio sharing
```

See also `moge/scripts/app.py`.

### Inference | `moge infer`

Run the script `moge/scripts/infer.py` via the following commands:

```bash
# Save the output [maps], [glb] and [ply] files
moge infer -i IMAGES_FOLDER_OR_IMAGE_PATH -o OUTPUT_FOLDER --maps --glb --ply

# Show the result in a window (requires pyglet < 2.0, e.g. pip install pyglet==1.5.29)
moge infer -i IMAGES_FOLDER_OR_IMAGE_PATH -o OUTPUT_FOLDER --show
```

For detailed options, run `moge infer --help`:

```text
Usage: moge infer [OPTIONS]

  Inference script

Options:
  -i, --input PATH            Input image or folder path. "jpg" and "png" are
                              supported.
  --fov_x FLOAT               If camera parameters are known, set the
                              horizontal field of view in degrees. Otherwise,
                              MoGe will estimate it.
  -o, --output PATH           Output folder path.
  --pretrained TEXT           Pretrained model name or path. If not provided,
                              the corresponding default model will be chosen.
  --version [v1|v2]           Model version. Defaults to "v2".
  --device TEXT               Device name (e.g. "cuda", "cuda:0", "cpu").
                              Defaults to "cuda".
  --fp16                      Use fp16 precision for much faster inference.
  --resize INTEGER            Resize the image(s) & output maps to a specific
                              size. Defaults to None (no resizing).
  --resolution_level INTEGER  An integer in [0-9] for the inference resolution
                              level. Higher values mean more tokens and finer
                              captured details, but slower inference. Defaults
                              to 9. Note that it does not affect the output
                              size, which always matches the input size.
                              `resolution_level` actually controls
                              `num_tokens`; see `num_tokens` for details.
  --num_tokens INTEGER        Number of tokens used for inference. An integer
                              in the (suggested) range of [1200, 2500].
                              `resolution_level` will be ignored if
                              `num_tokens` is provided. Default: None.
  --threshold FLOAT           Threshold for removing edges. Defaults to 0.01.
                              Smaller values remove more edges. "inf" means no
                              thresholding.
  --maps                      Whether to save the output maps (image, point
                              map, depth map, normal map, mask) and FOV.
  --glb                       Whether to save the output as a .glb file. The
                              color will be saved as a texture.
  --ply                       Whether to save the output as a .ply file. The
                              color will be saved as vertex colors.
  --show                      Whether to show the output in a window. Note
                              that this requires pyglet<2, as required by
                              trimesh.
  --help                      Show this message and exit.
```
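
For instance, an illustrative invocation combining several of these options (paths and values are placeholders, not from the original docs):

```bash
# fp16 inference on a folder with a known horizontal FOV, a reduced token
# budget for speed, saving the output maps and a textured GLB mesh.
moge infer -i ./images -o ./outputs --fp16 --fov_x 60 --num_tokens 1600 --maps --glb
```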