Magic123

[ICLR24] Official PyTorch Implementation of Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors

Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [ICLR 2024]

arXiv | webpage

<img src="docs/static/magic123.gif" width="800" />

Guocheng Qian <sup>1,2</sup>, Jinjie Mai <sup>1</sup>, Abdullah Hamdi <sup>3</sup>, Jian Ren <sup>2</sup>, Aliaksandr Siarohin <sup>2</sup>, Bing Li <sup>1</sup>, Hsin-Ying Lee <sup>2</sup>, Ivan Skorokhodov <sup>1,2</sup>, Peter Wonka <sup>1</sup>, Sergey Tulyakov <sup>2</sup>, Bernard Ghanem <sup>1</sup>

<sup>1</sup> King Abdullah University of Science and Technology (KAUST), <sup>2</sup> Snap Inc., <sup>3</sup> Visual Geometry Group, University of Oxford

Training convergence of a demo example: <img src="docs/static/ironman-val-magic123.gif" width="800" />

Comparison of Magic123 without textual inversion against ablations using only the 2D prior (SDS) or only the 3D prior (Zero123):

https://github.com/guochengqian/Magic123/assets/48788073/c91f4c81-8c2c-4f84-8ce1-420c12f7e886

Effects of the joint prior. Increasing the strength of the 2D prior leads to more imagination and more details, but less 3D consistency.

<img src="docs/static/2d_3d.png" width="800" />

https://github.com/guochengqian/Magic123/assets/48788073/98cb4dd7-7bf3-4179-9b6d-e8b47d928a68

Official PyTorch Implementation of Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors. Code is built upon Stable-DreamFusion repo.

NEWS:

  • [2024/01/16] Magic123 was accepted to ICLR 2024
  • [2023/07/25] Code is available at GitHub
  • [2023/07/03] Paper is available at arXiv
  • [2023/06/25] Much better performance than the submitted version is achieved by 1) reimplementing Magic123 using the Stable DreamFusion code, 2) fixing some gradient issues, and 3) leveraging the tricks listed under Tips and Tricks
  • [2023] Initial version of Magic123 submitted to conference

Install

We have tested only on Ubuntu. Make sure git, wget, and Eigen are installed.

apt update && apt upgrade
apt install git wget libeigen3-dev -y

Install Environment

source install.sh

Note: install.sh uses python venv by default. If you prefer conda, uncomment the conda lines and comment out the venv lines in the file, then run the same command.

Download pre-trained models

  • Zero-1-to-3 for the 3D diffusion prior. We use 105000.ckpt by default. The reimplementation is borrowed from the Stable Diffusion repo and is available in guidance/zero123_utils.py.

    cd pretrained/zero123
    wget https://huggingface.co/cvlab/zero123-weights/resolve/main/105000.ckpt
    cd ../../
    
  • MiDaS for depth estimation. We use dpt_beit_large_512.pt. Put it in the folder pretrained/midas/.

    mkdir -p pretrained/midas
    cd pretrained/midas
    wget https://github.com/isl-org/MiDaS/releases/download/v3_1/dpt_beit_large_512.pt
    cd ../../
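Before launching a run, it can help to confirm that both checkpoints ended up where the steps above put them. The following is a minimal sketch and not part of the repository: the function name check_pretrained is hypothetical, and it assumes only the two relative paths given above, resolved against the repository root.

```shell
# Hypothetical helper (not part of the Magic123 repo): report which of the
# two checkpoints named above are missing or empty under a given root.
check_pretrained() {
    local root="${1:-.}"   # repository root; defaults to the current directory
    local missing=0
    local f
    for f in pretrained/zero123/105000.ckpt pretrained/midas/dpt_beit_large_512.pt; do
        if [ -s "$root/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=1
        fi
    done
    # Non-zero exit status if at least one checkpoint is absent.
    return "$missing"
}
```

From the repository root, a call such as `check_pretrained . || echo "download the missing checkpoints first"` would flag an interrupted download, since `-s` also rejects empty files.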
    

Usage

Preprocess [Optional]

All preprocessed files are included in the ./data directory. Preprocessing is only necessary if you want to test on your own examples, and it takes only seconds.

Step 1: Extract depth

python preprocess_image.py --path /path/to/image 

Step 2: Textual inversion [Optional]

Magic123 uses the default textual inversion from diffusers, which takes around 2 hours on a 32G V100. If you do not want to spend this time, you can (1) look for a faster textual inversion method, or (2) skip textual inversion in the texture and shape consistency losses. To run textual inversion:

bash scripts/textual_inversion/textual_inversion.sh $GPU_IDX runwayml/stable-diffusion-v1-5 /path/to/example/rgba.png /path/to/save $token_name $init_token --max_train_steps 5000

$token_name is the special token, usually named after the example.
$init_token is a single token that describes the image in natural language.

For example:

bash scripts/textual_inversion/textual_inversion.sh 0 runwayml/stable-diffusion-v1-5 data/demo/a-full-body-ironman/rgba.png out/textual_inversion/ironman _ironman_ ironman --max_train_steps 3000

Don't forget to move the final learned_embeds.bin under data/demo/a-full-body-ironman/
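The copy step can be wrapped in a small guard so that a failed or unfinished inversion run is noticed early. This is only a sketch under stated assumptions: install_embeds is a hypothetical helper, and it assumes that learned_embeds.bin is written directly under the save directory passed to textual_inversion.sh, which may not match your run.

```shell
# Hypothetical helper: place the learned embedding next to the example image.
# Assumes learned_embeds.bin sits directly under the save directory; adjust
# the source path if your textual inversion run writes it elsewhere.
install_embeds() {
    local save_dir="$1" example_dir="$2"
    local src="$save_dir/learned_embeds.bin"
    if [ ! -f "$src" ]; then
        echo "learned_embeds.bin not found in $save_dir" >&2
        return 1
    fi
    # Copy rather than move, so the original output of the run is kept.
    cp "$src" "$example_dir/"
}
```

For the ironman example this would be called as `install_embeds out/textual_inversion/ironman data/demo/a-full-body-ironman`.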

Run

Run Magic123 for a single example

Takes ~40 mins for the coarse stage and ~20 mins for the second stage on a 32G V100.

bash scripts/magic123/run_both_priors.sh $GPU_NO $JOBNAME_First_Stage $JOBNAME_Second_Stage $PATH_to_Example_Directory $IMAGE_BASE_NAME $Enable_First_Stage $Enable_Second_Stage {More_Arguments}

As an example, the following command runs Magic123 on the dragon example with both stages on GPU 0, using nerf as the jobname for the first stage and dmtet as the jobname for the second stage:

bash scripts/magic123/run_both_priors.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1 

More arguments (e.g. --lambda_guidance 1 40) can be appended to the command line such as:

bash scripts/magic123/run_both_priors.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1 --lambda_guidance 1 40
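To explore the trade-off between the 2D and 3D priors, the appended argument can be varied in a loop. The sketch below is a dry run that only prints commands. It assumes that the two --lambda_guidance values follow the order of --guidance SD zero123 (2D weight first, 3D weight second); the function name sweep_lambda_2d, the weights 0.5, 1 and 2, and the jobname suffixes are all hypothetical choices for illustration.

```shell
# Hypothetical sweep over the first --lambda_guidance value (assumed to be
# the 2D prior weight), keeping the second value fixed at 40.
# Prints one command per weight instead of running anything.
sweep_lambda_2d() {
    local data_dir="$1"; shift
    local w
    for w in "$@"; do
        # Distinct jobnames keep the outputs of each setting apart.
        echo bash scripts/magic123/run_both_priors.sh 0 "nerf-l2d-$w" "dmtet-l2d-$w" \
            "$data_dir" 1 1 --lambda_guidance "$w" 40
    done
}

sweep_lambda_2d data/realfusion15/metal_dragon_statue 0.5 1 2
```

Removing the echo turns the dry run into real, sequential runs; each one takes roughly the time quoted above for the two stages.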

Run Magic123 for a group of examples

  • To run all examples in a folder, see the script scripts/magic123/run_folder_both_priors.sh
  • To run all examples in a given list, see the script scripts/magic123/run_list_both_priors.sh
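For a rough idea of what a folder run involves, the sketch below loops over the subdirectories of a folder and prints one single-example command per directory. It is not the repository's script, which may behave differently: run_folder_dry is a hypothetical name, and the check for rgba.png assumes each example directory holds a preprocessed image under that name.

```shell
# Hypothetical dry run: print one run_both_priors.sh command for every
# subdirectory of a folder that contains a preprocessed rgba.png.
run_folder_dry() {
    local folder="$1"
    local d
    for d in "$folder"/*/; do
        d="${d%/}"                          # strip the trailing slash
        [ -f "$d/rgba.png" ] || continue    # skip directories without an image
        echo bash scripts/magic123/run_both_priors.sh 0 nerf dmtet "$d" 1 1
    done
}
```

Calling `run_folder_dry data/realfusion15` lists the commands that would be launched, which is a cheap way to check the folder layout before committing GPU time.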

Run Magic123 on a single example without textual inversion

Textual inversion is tedious (it requires ~2.5 hours of optimization). If you want to test Magic123 quickly on your own example without textual inversion (which might degrade performance), try the following:

  • First, run foreground and depth estimation

    python preprocess_image.py --path data/demo/a-full-body-ironman/main.png
    
  • Run Magic123 coarse stage without textual inversion, takes ~40 mins

    export RUN_ID='default-a-full-body-ironman'
    export DATA_DIR='data/demo/a-full-body-ironman'
    export IMAGE_NAME='rgba.png'
    export FILENAME=$(basename $DATA_DIR)
    export dataset=$(basename $(dirname $DATA_DIR))
    CUDA_VISIBLE_DEVICES=0 python main.py -O \
    --text "A high-resolution DSLR image of a full body ironman" \
    --sd_version 1.5 \
    --image ${DATA_DIR}/${IMAGE_NAME} \
    --workspace out/magic123-${RUN_ID}-coarse/$dataset/magic123_${FILENAME}_${RUN_ID}_coarse \
    --optim adam \
    --iters 5000 \
    --guidance SD zero123 \
    --lambda_guidance 1.0 40 \
    --guidance_scale 100 5 \
    --latent_iter_ratio 0 \
    --normal_iter_ratio 0.2 \
    --t_range 0.2 0.6 \
    --bg_radius -1 \
    --save_mesh
    
  • Run Magic123 fine stage without textual inversion, takes ~20 mins

    export RUN_ID='default-a-full-body-ironman'
    export RUN_ID2='dmtet'
    export DATA_DIR='data/demo/a-full-body-ironman'
    export IMAGE_NAME='rgba.png'
    export FILENAME=$(basename $DATA_DIR)
    export dataset=$(basename $(dirname $DATA_DIR))
    CUDA_VISIBLE_DEVICES=0 python main.py -O \
    --text "A high-resolution DSLR image of a full body ironman" \
    --sd_version 1.5 \
    --image ${DATA_DIR}/${IMAGE_NAME} \
    --workspace out/magic123-${RUN_ID}-${RUN_ID2}/$dataset/magic123_${FILENAME}_${RUN_ID}_${RUN_ID2} \
    --dmtet --init_ckpt out/magic123-${RUN_ID}-coarse/$dataset/magic123_${FILENAME}_${RUN_ID}_coarse/checkpoints/magic123_${FILENAME}_${RUN_ID}_coarse.pth \
    --iters 5000 \
    --optim adam \
    --known_view_interval 4 \
    --latent_iter_ratio 0 \
    --guidance SD zero123 \
    --lambda_guidance 1e-3 0.01 \
    --guidance_scale 100 5 \
    --rm_edge \
    --bg_radius -1 \
    --save_mesh 
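The fine stage finds the coarse checkpoint purely through the naming scheme built from RUN_ID and DATA_DIR, so a typo in either variable makes --init_ckpt point at a file that does not exist. The sketch below reproduces that derivation without launching any training; the function name magic123_coarse_paths is hypothetical, while the path patterns are taken from the two commands above.

```shell
# Derive the coarse-stage workspace and checkpoint path the same way the
# commands above do (basename / dirname of DATA_DIR), without training.
magic123_coarse_paths() {
    local run_id="$1" data_dir="$2"
    local filename dataset workspace
    filename=$(basename "$data_dir")                 # e.g. a-full-body-ironman
    dataset=$(basename "$(dirname "$data_dir")")     # e.g. demo
    workspace="out/magic123-${run_id}-coarse/${dataset}/magic123_${filename}_${run_id}_coarse"
    echo "workspace:  $workspace"
    echo "checkpoint: $workspace/checkpoints/magic123_${filename}_${run_id}_coarse.pth"
}

magic123_coarse_paths default-a-full-body-ironman data/demo/a-full-body-ironman
```

Checking that the printed checkpoint path exists after the coarse stage finishes is a quick sanity test before starting the fine stage.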
    

Run ablation studies

  • Run Magic123 with only the 2D prior, with textual inversion (like RealFusion, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)

    bash scripts/magic123/run_2dprior.sh 0 nerf dmtet data/realfusion15/metal_dragon_statue 1 1
    
  • Run Magic123 with only the 2D prior, without textual inversion (like RealFusion, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)

    bash scripts/magic123/run_2dprior_notextinv_ironman.sh 0 default 1 1
    

    Note: change the path and the text prompt inside the script if you want to test another example.

  • Run Magic123 with only the 3D prior (like Zero-1-to-3, but we achieve much better performance through training strategies and the coarse-to-fine pipeline)

    bash scripts/magic123/run_3dprior.sh 0 nerf dmtet data/demo/a-full-body-ironman 1 1
    

Tips and Tricks

  1. Fix camera distance (radius_range) and FOV (fovy_range) and tune the camera polar range (theta_range). Note it is better to keep camera jittering to reduce grid artifacts.
  2. Use a smaller range of time steps for the diffusion noise (t_range). We find [0.2, 0.6] gives better performance for image-to-3D tasks.
  3. Using normals as latent in the first 2000 iterations generally (but not always) improves the generated geometry a bit. We turn this on for the Magic123 coarse stage in the script via --normal_iter_ratio 0.2
  4. We erode segmentation edges (shrinking the segmentation map by 2 pixels toward the interior) t
