Magic123

[ICLR24] Official PyTorch Implementation of Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [ICLR 2024]

arXiv | webpage

Guocheng Qian 1,2, Jinjie Mai 1, Abdullah Hamdi 3, Jian Ren 2, Aliaksandr Siarohin 2, Bing Li 1, Hsin-Ying Lee 2, Ivan Skorokhodov 1,2, Peter Wonka 1, Sergey Tulyakov 2, Bernard Ghanem 1

1 King Abdullah University of Science and Technology (KAUST), 2 Snap Inc., 3 Visual Geometry Group, University of Oxford

Training convergence of a demo example: <img src="docs/static/ironman-val-magic123.gif" width="800" />

Comparison of Magic123 without textual inversion against ablations that use only the 2D prior (SDS) or only the 3D prior (Zero123):

https://github.com/guochengqian/Magic123/assets/48788073/c91f4c81-8c2c-4f84-8ce1-420c12f7e886

Effects of Joint Prior. Increasing the strength of the 2D prior leads to more imagination and more detail, but less 3D consistency.

https://github.com/guochengqian/Magic123/assets/48788073/98cb4dd7-7bf3-4179-9b6d-e8b47d928a68

Official PyTorch Implementation of Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. The code is built upon the Stable-DreamFusion repo.

NEWS:

[2024/01/16] Magic123 is accepted to ICLR 2024
[2023/07/25] Code is available at GitHub
[2023/07/03] Paper is available at arXiv
[2023/06/25] Much better performance than the submitted version is achieved by (1) reimplementing Magic123 using Stable DreamFusion code, (2) fixing some gradient issues, and (3) leveraging the tricks listed under Tips and Tricks
[2023] Initial version of Magic123 submitted to conference

Install

We have only tested on Ubuntu. Make sure git, wget, and Eigen are installed.

apt update && apt upgrade
apt install git wget libeigen3-dev -y
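
A quick way to confirm the prerequisites before running install.sh is sketched below. The helper name missing_tools is hypothetical and not part of the repository, and the Eigen header path is an assumption based on where Ubuntu's libeigen3-dev package usually places its headers.

```shell
# Print each named command that is not found on PATH (prints nothing if all exist).
# Hypothetical helper, not part of the Magic123 repository.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools git wget)
if [ -n "$missing" ]; then
    echo "Missing tools: $missing"
fi

# Assumption: libeigen3-dev installs its headers under /usr/include/eigen3.
[ -d /usr/include/eigen3/Eigen ] || echo "Eigen headers not found in /usr/include/eigen3"
```

If nothing is printed, the three prerequisites appear to be in place.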

Install Environment

source install.sh

Note: install.sh uses a Python venv by default. If you prefer conda, uncomment the conda lines and comment out the venv lines in the file, then run the same command.

Download pre-trained models

Zero-1-to-3 for the 3D diffusion prior. We use 105000.ckpt by default. The reimplementation is borrowed from the Stable Diffusion repo and is available in guidance/zero123_utils.py.
```
cd pretrained/zero123
wget https://huggingface.co/cvlab/zero123-weights/resolve/main/105000.ckpt
cd ../../
```

MiDaS for depth estimation. We use dpt_beit_large_512.pt. Put it in folder pretrained/midas/

mkdir -p pretrained/midas
cd pretrained/midas
wget https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
cd ../../
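
After both downloads, it can help to confirm that the checkpoints landed where the paths above put them. The following is a minimal sketch run from the repository root. The function name missing_checkpoints is hypothetical, and the check only tests that each file exists and is non-empty, not that its contents are valid.

```shell
# Print every expected checkpoint that is missing or empty under the given root.
# Hypothetical helper; the two relative paths are the ones used in this README.
missing_checkpoints() {
    root=${1:-.}
    for rel in pretrained/zero123/105000.ckpt pretrained/midas/dpt_beit_large_512.pt; do
        [ -s "$root/$rel" ] || printf '%s\n' "$rel"
    done
}

# Run from the repository root; no output means both files were found.
missing_checkpoints .
```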

Usage

Preprocess [Optional]

We have included all preprocessed files in the ./data directory. Preprocessing is only necessary if you want to test on your own examples, and it takes only seconds.

Step 1: Extract depth

python preprocess_image.py --path /path/to/image

Step 2: Textual inversion [Optional]

Magic123 uses the default textual inversion from diffusers, which takes around 2 hours on a 32G V100. If you do not want to spend time on textual inversion, you can either (1) look for a faster textual inversion method, or (2) leave textual inversion out of the texture and shape consistency losses. To run textual inversion:

bash scripts/textual_inversion/textual_inversion.sh $GPU_IDX runwayml/stable-diffusion-v1-5 /path/to/example/rgba.png /path/to/save $token_name $init_token --max_train_steps 5000

$token_name is the special token, usually named after the example.
$init_token is a single token that describes the image in natural language.

For example:

bash scripts/textual_inversion/textual_inversion.sh 0 runwayml/stable-diffusion-v1-5 data/demo/a-full-body-ironman/rgba.png out/textual_inversion/ironman _ironman_ ironman --max_train_steps 3000

Don't forget to move the final learned_embeds.bin into data/demo/a-full-body-ironman/.
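
The move can be wrapped in a small guard so that a missing embedding is reported rather than silently skipped. This is only a sketch: install_embeds is a hypothetical helper, and it assumes the final learned_embeds.bin sits directly inside the save directory passed to the textual inversion script.

```shell
# Move learned_embeds.bin from a textual-inversion output directory into an
# example directory. Hypothetical helper, not part of the repository.
install_embeds() {
    src_dir=$1
    dst_dir=$2
    if [ ! -f "$src_dir/learned_embeds.bin" ]; then
        echo "learned_embeds.bin not found in $src_dir" >&2
        return 1
    fi
    mkdir -p "$dst_dir" && mv "$src_dir/learned_embeds.bin" "$dst_dir/"
}
```

For the ironman example above, the call would be `install_embeds out/textual_inversion/ironman data/demo/a-full-body-ironman`.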

Run

Run Magic123 for a single example

Takes ~40 mins for the coarse stage and ~20 mins for the second stage on a 32G V100.

bash scripts/magic123/run_both_priors.sh $GPU_NO $JOBNAME_First_Stage $JOBNAME_Second_Stage $PATH_to_Example_Directory $IMAGE_BASE_NAME $Enable_First_Stage $Enable_Second_Stage {More_Arguments}

For example, to run Magic123 on the dragon example with both stages on GPU 0, using nerf as the job name for the first stage and dmtet as the job name for the second stage:

bash scripts/magic123/run_both_priors.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1

Additional arguments (e.g. --lambda_guidance 1 40) can be appended to the command:

bash scripts/magic123/run_both_priors.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1 --lambda_guidance 1 40

Run Magic123 for a group of examples

To run all examples in a folder, see the script scripts/magic123/run_folder_both_priors.sh
To run all examples in a given list, see the script scripts/magic123/run_list_both_priors.sh
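
The idea behind a folder run can be sketched as a loop over example directories. This is not the content of the scripts named above. list_examples is a hypothetical helper, and it assumes each preprocessed example directory contains an rgba.png. The loop is a dry run that only prints one command per example, in the same form as the single-example command shown earlier.

```shell
# Print each example directory under $1 that contains a preprocessed rgba.png.
# Hypothetical helper, not part of the repository.
list_examples() {
    for dir in "$1"/*/; do
        [ -f "${dir}rgba.png" ] && printf '%s\n' "${dir%/}"
    done
    return 0
}

# Dry run: print the command for every example instead of executing it.
list_examples data/realfusion15 | while IFS= read -r example; do
    echo bash scripts/magic123/run_both_priors.sh 0 nerf dmtet "$example" 1 1
done
```

Removing the `echo` would launch the runs one after another on GPU 0.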

Run Magic123 on a single example without textual inversion

Textual inversion is tedious (it requires ~2.5 hours of optimization). If you want to test Magic123 quickly on your own example without textual inversion (which might degrade performance), try the following.

First, run foreground and depth estimation:

python preprocess_image.py --path data/demo/a-full-body-ironman/main.png

Run the Magic123 coarse stage without textual inversion (takes ~40 mins):

export RUN_ID='default-a-full-body-ironman'
export DATA_DIR='data/demo/a-full-body-ironman'
export IMAGE_NAME='rgba.png'
export FILENAME=$(basename $DATA_DIR)
export dataset=$(basename $(dirname $DATA_DIR))
CUDA_VISIBLE_DEVICES=0 python main.py -O \
--text "A high-resolution DSLR image of a full body ironman" \
--sd_version 1.5 \
--image ${DATA_DIR}/${IMAGE_NAME} \
--workspace out/magic123-${RUN_ID}-coarse/$dataset/magic123_${FILENAME}_${RUN_ID}_coarse \
--optim adam \
--iters 5000 \
--guidance SD zero123 \
--lambda_guidance 1.0 40 \
--guidance_scale 100 5 \
--latent_iter_ratio 0 \
--normal_iter_ratio 0.2 \
--t_range 0.2 0.6 \
--bg_radius -1 \
--save_mesh
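
The path variables in the command above are derived from DATA_DIR with basename and dirname. The snippet below repeats only that derivation so the resulting workspace path can be inspected before launching a run. The variable name workspace is introduced here for illustration.

```shell
DATA_DIR='data/demo/a-full-body-ironman'
RUN_ID='default-a-full-body-ironman'

# Last path component: the example name (a-full-body-ironman).
FILENAME=$(basename "$DATA_DIR")
# Parent directory name: the dataset (demo).
dataset=$(basename "$(dirname "$DATA_DIR")")

# Same pattern as the --workspace argument of the coarse stage.
workspace="out/magic123-${RUN_ID}-coarse/$dataset/magic123_${FILENAME}_${RUN_ID}_coarse"
echo "$workspace"
```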

Run the Magic123 fine stage without textual inversion (takes ~20 mins):

export RUN_ID='default-a-full-body-ironman'
export RUN_ID2='dmtet'
export DATA_DIR='data/demo/a-full-body-ironman'
export IMAGE_NAME='rgba.png'
export FILENAME=$(basename $DATA_DIR)
export dataset=$(basename $(dirname $DATA_DIR))
CUDA_VISIBLE_DEVICES=0 python main.py -O \
--text "A high-resolution DSLR image of a full body ironman" \
--sd_version 1.5 \
--image ${DATA_DIR}/${IMAGE_NAME} \
--workspace out/magic123-${RUN_ID}-${RUN_ID2}/$dataset/magic123_${FILENAME}_${RUN_ID}_${RUN_ID2} \
--dmtet --init_ckpt out/magic123-${RUN_ID}-coarse/$dataset/magic123_${FILENAME}_${RUN_ID}_coarse/checkpoints/magic123_${FILENAME}_${RUN_ID}_coarse.pth \
--iters 5000 \
--optim adam \
--known_view_interval 4 \
--latent_iter_ratio 0 \
--guidance SD zero123 \
--lambda_guidance 1e-3 0.01 \
--guidance_scale 100 5 \
--rm_edge \
--bg_radius -1 \
--save_mesh
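
The fine stage starts from the coarse-stage checkpoint passed via --init_ckpt, so it can be worth checking that the file exists before launching. The sketch below rebuilds the path with the same pattern as the command above. coarse_ckpt_path is a hypothetical helper, not part of the repository.

```shell
# Print the coarse-stage checkpoint path for an example directory and run id,
# following the --init_ckpt pattern used above. Hypothetical helper.
coarse_ckpt_path() {
    data_dir=$1
    run_id=$2
    name=$(basename "$data_dir")
    dataset=$(basename "$(dirname "$data_dir")")
    stem="magic123_${name}_${run_id}_coarse"
    printf '%s\n' "out/magic123-${run_id}-coarse/${dataset}/${stem}/checkpoints/${stem}.pth"
}

ckpt=$(coarse_ckpt_path data/demo/a-full-body-ironman default-a-full-body-ironman)
[ -f "$ckpt" ] || echo "Coarse checkpoint not found: $ckpt"
```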

Run ablation studies

Run Magic123 with only 2D prior with textual inversion (like RealFusion, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)
```
bash scripts/magic123/run_2dprior.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1
```
Run Magic123 with only 2D prior without textual inversion (like RealFusion, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)
```
bash scripts/magic123/run_2dprior_notextinv_ironman.sh 0 default 1 1
```
Note: change the path and the text prompt inside the script if you want to test another example.

Run Magic123 with only 3D prior (like Zero-1-to-3, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)
```
bash scripts/magic123/run_3dprior.sh 0 nerf dmtet data/demo/a-full-body-ironman 1 1
```

Tips and Tricks

Fix camera distance (radius_range) and FOV (fovy_range) and tune the camera polar range (theta_range). Note it is better to keep camera jittering to reduce grid artifacts.
Use a smaller range of time steps for the diffusion noise (t_range). We find that [0.2, 0.6] gives better performance for image-to-3D tasks.
Using normals as latent in the first 2000 iterations generally (but not always) improves the generated geometry a bit. We turn this on for the Magic123 coarse stage in the script via --normal_iter_ratio 0.2
We erode segmentation edges (shrinking the segmentation map by 2 pixels toward the interior) t
