Wonder3D

Single Image to 3D using Cross-Domain Diffusion (CVPR 2024 Highlight)

Paper | Project page | Hugging Face Demo | Colab from @camenduru

Wonder3D reconstructs highly detailed textured meshes from a single-view image in only 2 to 3 minutes. Wonder3D first generates consistent multi-view normal maps with corresponding color images via a cross-domain diffusion model, and then uses a novel normal fusion method to achieve fast, high-quality reconstruction.

News

2024.12.22 We have extended Wonder3D to a more advanced version, Wonder3D++!
2024.08.29 Fixed an issue in '/mvdiffusion/pipelines/pipeline_mvdiffusion_image' where cross-domain attention did not work correctly during classifier-free guidance (CFG) inference, causing misalignment between the RGB and normal generation results. To fix this, the RGB and normal domain inputs must be placed in the first and second halves of the batch, respectively, before they are fed into the model. This differs from the typical CFG layout, which puts the unconditional and conditional inputs in the first and second halves of the batch. The results before and after the bug fix are shown below:

Fixed a severe training bug. The "zero_init_camera_projection" in 'configs/train/stage1-mix-6views-lvis.yaml' should be False. Otherwise, the domain control and pose control will be invalid in the training.
2024.03.19 Check out our new model GeoWizard, which jointly produces depth and normal maps with high fidelity from single images.
2024.05.24 We release a large 3D-native diffusion model, CraftsMan3D, which is trained directly on 3D representations and is therefore capable of producing complex structures.
2024.05.29 We release a more powerful multi-view cross-domain diffusion model, Era3D, which jointly produces 512x512 color images and normal maps. More importantly, Era3D can automatically estimate the focal length and elevation of the input image, which avoids geometry distortions.
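
The batch layout described in the 2024.08.29 note can be illustrated with plain labels instead of tensors. This is a toy sketch, not the repository's code: the function names cfg_major, domain_major and halves are hypothetical, and the order of unconditional and conditional entries inside each half is an assumption.

```python
# Toy illustration of the two batch layouts; labels stand in for tensors.
from itertools import product

DOMAINS = ("rgb", "normal")
GUIDANCE = ("uncond", "cond")


def cfg_major():
    """Typical CFG layout: unconditional inputs first, conditional inputs second."""
    return [(g, d) for g, d in product(GUIDANCE, DOMAINS)]


def domain_major():
    """Layout from the fix: RGB inputs in the first half, normal inputs in the second."""
    return [(g, d) for d, g in product(DOMAINS, GUIDANCE)]


def halves(batch):
    """Split a batch into its first and second half."""
    mid = len(batch) // 2
    return batch[:mid], batch[mid:]


first, second = halves(domain_major())
print(first)   # every entry is an RGB input
print(second)  # every entry is a normal input
```

Both layouts contain the same entries; only the order changes, and the order decides which inputs end up in the same half of the batch.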

Usage


# First clone the repo, and use the commands in the repo

import torch
import requests
from PIL import Image
import numpy as np
from torchvision.utils import make_grid, save_image
from diffusers import DiffusionPipeline  # only tested on diffusers[torch]==0.19.3, may have conflicts with newer versions of diffusers

def load_wonder3d_pipeline():
    pipeline = DiffusionPipeline.from_pretrained(
        'flamehaze1115/wonder3d-v1.0',  # or use local checkpoint './ckpts'
        custom_pipeline='flamehaze1115/wonder3d-pipeline',
        torch_dtype=torch.float16,
    )

    # enable xformers
    pipeline.unet.enable_xformers_memory_efficient_attention()

    if torch.cuda.is_available():
        pipeline.to('cuda:0')
    return pipeline

pipeline = load_wonder3d_pipeline()

# Download an example image.
cond = Image.open(requests.get("https://d.skis.ltd/nrp/sample-data/lysol.png", stream=True).raw)

# The object should be located in the center and resized to 80% of image height.
cond = Image.fromarray(np.array(cond)[:, :, :3])

# Run the pipeline!
images = pipeline(cond, num_inference_steps=20, output_type='pt', guidance_scale=1.0).images

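# make_grid takes nrow (images per row) but has no ncol argument;
# the number of rows follows from the batch size.
result = make_grid(images, nrow=6, padding=0, value_range=(0, 1))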

save_image(result, 'result.png')
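
The comment in the snippet above asks for the object to be centered and resized to 80% of the image height. A minimal sketch of that framing arithmetic follows. The helper name fit_object and its parameters are hypothetical and not part of the repository, and scaling by the longer side of the bounding box (so that wide objects still fit) is an assumption.

```python
def fit_object(bbox, canvas=256, ratio=0.8):
    """Return (new_w, new_h, left, top) for pasting a cropped object on a square canvas.

    bbox is (left, upper, right, lower) of the foreground in the source image.
    """
    left, upper, right, lower = bbox
    w, h = right - left, lower - upper
    if w <= 0 or h <= 0:
        raise ValueError("bbox must have positive width and height")
    # Assumption: scale by the longer side so the object never leaves the canvas.
    scale = ratio * canvas / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    # Center the resized object on the canvas.
    return new_w, new_h, (canvas - new_w) // 2, (canvas - new_h) // 2


print(fit_object((100, 50, 300, 450)))  # (102, 205, 77, 25)
```

For an object that is taller than it is wide, this reduces to the stated rule: the object height becomes 80% of the canvas height. With PIL, the bounding box could come from getbbox() on the alpha channel of a segmented image.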

Collaborations

Our overarching mission is to improve the speed, affordability, and quality of 3D AIGC, making the creation of 3D content accessible to all. While significant progress has been made in recent years, we acknowledge there is still a long way to go. We warmly invite you to join the discussion and explore potential collaborations in any capacity. If you're interested in connecting or partnering with us, please don't hesitate to reach out via email (xxlong@connect.hku.hk).

News

2024.02 We release the training code. You are welcome to train Wonder3D on your own data.
2023.10 We release the inference model and codes.

Preparation for inference

Linux System Setup.

conda create -n wonder3d
conda activate wonder3d
pip install -r requirements.txt
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
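
The import comment in the usage snippet says the pipeline was only tested with diffusers 0.19.3. A small check of the installed version, using only the standard library, might look like the sketch below. The helper names installed_version and diffusers_status are hypothetical.

```python
from importlib import metadata

TESTED_DIFFUSERS = "0.19.3"  # version named in the usage snippet's import comment


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def diffusers_status(found, tested=TESTED_DIFFUSERS):
    """Describe how the installed diffusers version relates to the tested one."""
    if found is None:
        return "diffusers is not installed"
    if found == tested:
        return f"diffusers {found} matches the tested version"
    return f"diffusers {found} differs from the tested version {tested}"


print(diffusers_status(installed_version("diffusers")))
```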

Windows System Setup.

Please switch to the main-windows branch for details of the Windows setup.

Docker Setup

See docker/README.MD.

Training

Here we provide two training scripts train_mvdiffusion_image.py and train_mvdiffusion_joint.py.

Training has two stages: 1) first train the multi-view attention, randomly choosing the normal or the color flag; 2) add cross-domain attention modules to the SD model and optimize only the newly added parameters.

Set root_dir in the config files configs/train/stage1-mix-6views-lvis.yaml and configs/train/stage2-joint-6views-lvis.yaml to the directory that contains your training data.

# stage 1:
accelerate launch --config_file 8gpu.yaml train_mvdiffusion_image.py --config configs/train/stage1-mix-6views-lvis.yaml

# stage 2
accelerate launch --config_file 8gpu.yaml train_mvdiffusion_joint.py --config configs/train/stage2-joint-6views-lvis.yaml

Prepare the training data

See render_codes/README.md.

Inference

Optional: if you have trouble connecting to Hugging Face, download the following models manually. Download the checkpoints and put them into the root folder.

If you are in mainland China, you may download them via Aliyun.

Wonder3D
|-- ckpts
    |-- unet
    |-- scheduler
    |-- vae
    ...

Then modify the file ./configs/mvdiffusion-joint-ortho-6views.yaml, set pretrained_model_name_or_path="./ckpts"
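
Before pointing pretrained_model_name_or_path at ./ckpts, it can help to confirm that the folder has the expected layout. The sketch below checks only the three sub-folders named in the tree above; the tree is elided after vae, so other entries may also be required. The helper name missing_ckpt_dirs is hypothetical.

```python
from pathlib import Path

# Only the sub-folders listed explicitly in the tree above; others may exist.
EXPECTED = ("unet", "scheduler", "vae")


def missing_ckpt_dirs(root="./ckpts", expected=EXPECTED):
    """Return the expected sub-folder names that are absent under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_ckpt_dirs()
    print("missing:", missing if missing else "none")
```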

Download the SAM model and put it in the sam_pt folder.

Wonder3D
|-- sam_pt
    |-- sam_vit_h_4b8939.pth

Predict the foreground mask and store it as the alpha channel. We use Clipdrop to segment the foreground object interactively. You may also use rembg to remove the background.

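# !pip install rembg
import rembg
from PIL import Image

image = Image.open('your_img_file')  # placeholder: path to your input image
result = rembg.remove(image)  # RGBA image; the alpha channel is the foreground mask
result.show()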

Run Wonder3D to produce multi-view-consistent normal maps and color images, then check the results in the folder ./outputs. (We use rembg to remove the backgrounds of the results, but the segmentations are not always perfect. Consider using Clipdrop to get masks for the generated normal maps and color images, since mask quality significantly influences the quality of the reconstructed mesh.)

accelerate launch --config_file 1gpu.yaml test_mvdiffusion_seq.py \
            --config configs/mvdiffusion-joint-ortho-6views.yaml validation_dataset.root_dir={your_data_path} \
            validation_dataset.filepaths=['your_img_file'] save_dir={your_save_path}

For example:

accelerate launch --config_file 1gpu.yaml test_mvdiffusion_seq.py \
            --config configs/mvdiffusion-joint-ortho-6views.yaml validation_dataset.root_dir=./example_images \
            validation_dataset.filepaths=['owl.png'] save_dir=./outputs
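
To run the same command from a script, for instance over several images, the argument list can be assembled in Python. This is a sketch under assumptions: the helper build_inference_cmd is hypothetical, and it assumes the override parser accepts a comma-separated list in brackets, which is the form the example above passes once the shell has removed the quotes.

```python
import shlex


def build_inference_cmd(root_dir, filenames, save_dir,
                        config="configs/mvdiffusion-joint-ortho-6views.yaml",
                        accel_config="1gpu.yaml"):
    """Return the argument list for the multi-view inference command."""
    # After shell quote removal the README command passes [owl.png]; several
    # names are joined with commas (assumption: the parser accepts such lists).
    filepaths = "[" + ",".join(filenames) + "]"
    return [
        "accelerate", "launch", "--config_file", accel_config,
        "test_mvdiffusion_seq.py", "--config", config,
        f"validation_dataset.root_dir={root_dir}",
        f"validation_dataset.filepaths={filepaths}",
        f"save_dir={save_dir}",
    ]


cmd = build_inference_cmd("./example_images", ["owl.png"], "./outputs")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) from the repository root, which avoids shell quoting of the brackets altogether.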

Interactive inference: run your local gradio demo. (Only generate normals and colors without reconstruction)

python gradio_app_mv.py   # generate multi-view normals and colors

Mesh Extraction

Instant-NSR Mesh Extraction

cd ./instant-nsr-pl
python launch.py --config configs/neuralangelo-ortho-wmask.yaml --gpu 0 --train dataset.root_dir=../{your_save_path}/cropsize-{crop_size}-cfg{guidance_scale:.1f}/ dataset.scene={scene}

For example:

cd ./instant-nsr-pl
python launch.py --config configs/neuralangelo-ortho-wmask.yaml --gpu 0 --train dataset.root_dir=../outputs/cropsize-192-cfg1.0/ dataset.scene=owl
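
The dataset.root_dir template above uses Python format syntax, so the folder name can be derived from the crop size and guidance scale. A small sketch follows, with a hypothetical helper name recon_root and the values from the owl example.

```python
from pathlib import PurePosixPath


def recon_root(save_dir, crop_size, guidance_scale):
    """Build the dataset.root_dir value, relative to ./instant-nsr-pl."""
    sub = f"cropsize-{crop_size}-cfg{guidance_scale:.1f}"
    return str(PurePosixPath("..") / save_dir / sub) + "/"


print(recon_root("outputs", 192, 1.0))  # ../outputs/cropsize-192-cfg1.0/
```

The :.1f format keeps one decimal place, so a guidance scale of 3 would give cfg3.0 in the folder name.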

Our generated normals and color images are defined in orthographic views, so the reconstructed mesh is also in orthographic camera space. If you use MeshLab to view the meshes, click Toggle Orthographic Camera in the View tab.
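
To see why the viewer setting matters, the toy sketch below compares an orthographic projection with a simple pinhole (perspective) projection. It is a generic illustration, not code from the repository, and the focal parameter is a hypothetical stand-in for camera intrinsics.

```python
def orthographic(point):
    """Project (x, y, z) by dropping depth; the image position ignores z."""
    x, y, _ = point
    return (x, y)


def perspective(point, focal=1.0):
    """Pinhole projection; the image position shrinks with depth z > 0."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


near, far = (0.5, 0.5, 2.0), (0.5, 0.5, 4.0)
print(orthographic(near), orthographic(far))  # same image position
print(perspective(near), perspective(far))    # the far point moves toward the center
```

A mesh reconstructed from orthographic views looks distorted when rendered with a perspective camera, which is why the orthographic toggle is recommended.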

Interactive inference: run your local gradio demo. (It first generates normals and colors, and then performs the reconstruction. There is no need to run gradio_app_mv.py first.)

python gradio_app_recon.py

NeuS-based Mesh Extraction

Since there are many complaints about the Windows setup of instant-nsr-pl, we also provide a NeuS-based reconstruction, which may avoid the dependency problems.

NeuS consumes less GPU memory and favors smooth surfaces without parameter tuning. However, NeuS takes more time and its textures may be less sharp. If you are not sensitive to time, we recommend NeuS for optimization due to its robustness.

cd ./NeuS
bash run.sh output_folder_path scene_name

Common questions

Q: Tips to get better results.

Wonder3D is sensitive to the facing direction of input images. In our experiments, front-facing images always lead to good reconstructions.
Limited by resources, the current implementation only supports a limited number of views (6) and a low resolution (256x256). Every image is first resized to 256x256 for generation, so images that still show clear, sharp features after this downsampling lead to good results.
Images with occlusions will cause
