Inserting Anybody in Diffusion Models via Celeb Basis (NeurIPS 23)

<div> <a href="https://ygtxr1997.github.io/" target="_blank">Ge Yuan</a>1,2, <a href="http://vinthony.github.io/" target="_blank">Xiaodong Cun</a>2, <a href="https://yzhang2016.github.io" target="_blank">Yong Zhang</a>2, <a href="https://scholar.google.com/citations?user=ym_t6QYAAAAJ&hl=zh-CN&oi=sra" target="_blank">Maomao Li</a>2,*, <a href="https://chenyangqiqi.github.io/" target="_blank">Chenyang Qi</a>3,2, <a href="https://xinntao.github.io/" target="_blank">Xintao Wang</a>2, <a href="https://scholar.google.com/citations?hl=zh-CN&user=4oXBp9UAAAAJ" target="_blank">Ying Shan</a>2, <a href="https://scholar.google.com/citations?user=CCUQi50AAAAJ" target="_blank">Huicheng Zheng</a>1,* (* Corresponding Authors) </div> <div class="is-size-5 publication-authors"> 1 Sun Yat-sen University     2 Tencent AI Lab     3 HKUST </div>

TL;DR: Integrating a unique individual into a pre-trained diffusion model with:

✅ just one facial photograph
✅ only 1024 learnable parameters
✅ 3 minutes of tuning
✅ Textual-Inversion compatibility
✅ the ability to generate and interact with other (new person) concepts

Fig1

Updates

2023/10/11: Our paper has been accepted to NeurIPS'23!
2023/06/23: Code released!

How It Works

Fig2

First, we collect about 1,500 celebrity names as the initial collection. Then, we manually filter this collection down to m = 691 names, based on the synthesis quality of a text-to-image diffusion model (Stable Diffusion) when prompted with the corresponding name. Next, each filtered name is tokenized and encoded into a celeb embedding group. Finally, we conduct Principal Component Analysis to build a compact orthogonal basis.
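The basis construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: random vectors stand in for the real name embeddings, the 768-dimensional embedding size is an assumption, the function name `build_celeb_basis` is hypothetical, and only a single token position is handled. The values 691 and 512 come from the text and from `n_components` in the config.

```python
import numpy as np


def build_celeb_basis(name_embeddings: np.ndarray, n_components: int):
    """Return (mean, basis); the rows of basis are orthonormal principal directions."""
    mean = name_embeddings.mean(axis=0)
    centered = name_embeddings - mean
    # Thin SVD of the centered data: the rows of vt are the principal
    # directions, ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


rng = np.random.default_rng(0)
# Stand-in data: 691 filtered names, 768-dim embeddings (dimension is assumed).
fake_embeddings = rng.normal(size=(691, 768))
mean, basis = build_celeb_basis(fake_embeddings, n_components=512)
print(basis.shape)
```

Because the rows come from an SVD, `basis @ basis.T` is numerically the identity, which is what "orthogonal basis" means here.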

Fig4

We then personalize the model using the input photo. During training (left), we optimize the coefficients of the celeb basis with the help of a fixed face encoder. During inference (right), we combine the learned personalized weights with the shared celeb basis to generate images of the input identity.

More details can be found on our project page.
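To make the role of the coefficients concrete, here is a hedged sketch of how an identity embedding could be composed from a shared basis. Everything in it is illustrative: the basis is random (orthonormalized with QR), the 768-dimensional embedding size is assumed, `compose_identity` is a hypothetical name, and the face encoder and training loop are omitted. The sizes 2 and 512 mirror `num_embeds_per_token` and `n_components` from the config.

```python
import numpy as np

NUM_EMBEDS_PER_TOKEN = 2  # mirrors num_embeds_per_token in the config
N_COMPONENTS = 512        # mirrors n_components in the config
EMBED_DIM = 768           # assumed text-embedding dimension

rng = np.random.default_rng(0)
# Stand-in for the shared celeb basis: one orthonormal basis per token position.
basis = np.stack([
    np.linalg.qr(rng.normal(size=(EMBED_DIM, N_COMPONENTS)))[0].T
    for _ in range(NUM_EMBEDS_PER_TOKEN)
])  # shape (2, 512, 768)
mean = rng.normal(size=(NUM_EMBEDS_PER_TOKEN, EMBED_DIM))

# The per-identity coefficients are the only values that would be optimized.
coeffs = np.zeros((NUM_EMBEDS_PER_TOKEN, N_COMPONENTS))


def compose_identity(coeffs, basis, mean):
    """Mean embedding plus a weighted sum of basis directions, per token."""
    return mean + np.einsum("tk,tkd->td", coeffs, basis)


print(coeffs.size, compose_identity(coeffs, basis, mean).shape)
```

Under these assumptions the coefficient tensor holds 2 × 512 = 1024 values, which would be consistent with the parameter count quoted in the TL;DR.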

Setup

Our code is mainly based on Textual Inversion. We add the environment requirements for Face Alignment & Recognition to the original Textual Inversion environment. To set up the environment, please run:

conda env create -f environment.yaml
conda activate sd

The pre-trained weights used in this repo include Stable Diffusion v1-4 and CosFace R100 trained on Glint360K. Copy these pre-trained weights to ./weights so that the directory tree looks like:

CelebBasis/
  |-- weights/
      |--glint360k_cosface_r100_fp16_0.1/
          |-- backbone.pth (249MB)
      |--sd-v1-4-full-ema.ckpt (7.17GB)
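
Before training, it can help to confirm that the weights are where the tree above expects them. The following is a small optional helper, not part of the repository; the function name `missing_weights` is hypothetical, and the two relative paths are copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "weights/glint360k_cosface_r100_fp16_0.1/backbone.pth",
    "weights/sd-v1-4-full-ema.ckpt",
]


def missing_weights(repo_root):
    """Return the expected weight files that are absent under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


# Example: run from the CelebBasis/ directory.
for rel in missing_weights("."):
    print(f"missing: {rel}")
```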

We use PIPNet to align and crop the face. The PIPNet pre-trained weights can be downloaded from this link (provided by @justindujardin) or from our Baidu Yun Drive (extraction code: ygss). Please copy epoch59.pth and FaceBoxesV2.pth to CelebBasis/evaluation/face_align/PIPNet/weights/.

Usage

0. Face Alignment

To make the Face Recognition model work as expected, given an image of a person, we first align and crop the face following FFHQ-Dataset.

Assuming your image folder is /Your/Path/To/Images/ori/ and the output folder is /Your/Path/To/Images/ffhq/, you may run the following command to align and crop the images.

bash ./00_align_face.sh /Your/Path/To/Images/ori /Your/Path/To/Images/ffhq

A pickle file named ffhq.pickle, which stores absolute paths, is then generated under /Your/Path/To/Images/. It is used later to configure the training dataset. As an example, we provide the original and cropped StyleGAN-generated faces on Baidu Yun Drive (extraction code: ygss), where:

stylegan3-r-ffhq-1024x1024 contains the original images (/Your/Path/To/Images/ori)
stylegan3-r-ffhq-1024x1024_ffhq contains the cropped images (/Your/Path/To/Images/ffhq/)
stylegan3-r-ffhq-1024x1024_ffhq.pickle is the pickle list file (/Your/Path/To/Images/ffhq.pickle)

We also provide some cropped faces in ./infer_images/dataset_stylegan3_10id/ffhq as an example and reference.
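
Since the pickle file stores absolute paths, it is worth inspecting after moving the images or switching machines. The helper below is a generic sketch: the internal layout of ffhq.pickle is not documented here, so the code only reports the top-level type and length, and `summarize_pickle` is a hypothetical name. Only unpickle files you generated yourself or otherwise trust.

```python
import pickle


def summarize_pickle(pickle_path):
    """Return (type name, length or None) of the object stored in a pickle file."""
    with open(pickle_path, "rb") as f:
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


# Example usage (path is a placeholder):
# print(summarize_pickle("/Your/Path/To/Images/ffhq.pickle"))
```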

1. Personalization

The training config file is ./configs/stable-diffusion/aigc_id.yaml. The most important settings are listed as follows.

Important Data Settings

data:
  params:
    batch_size: 2  # We use batch_size 2
    train:
      target: ldm.data.face_id.FaceIdDatasetOneShot  # or ldm.data.face_id.FaceIdDatasetStyleGAN3
      params:
        pickle_path: /Your/Path/To/Images/ffhq.pickle  # pickle file generated by Face Alignment, consistent with 'target'
        num_ids: 2  # number of IDs used for joint training
        specific_ids: [1,2]  # you may specify the index of ID for training, e.g. [0,1,2,3,4,5,6,7,8,9], 0 means the first
    validation:
      target: ldm.data.face_id.FaceIdDatasetOneShot
      params:
        pickle_path: /Your/Path/To/Images/ffhq.pickle  # consistent with train.params.pickle_path

Important Model Settings

model:
  params:
    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManagerId
      params:
        max_ids: 10  # maximum number of jointly learned IDs, should be >= data.train.num_ids
        num_embeds_per_token: 2  # consistent with [cond_stage_config]
        meta_mlp_depth: 1  # single layer is ok
        meta_inner_dim: 512  # consistent with [n_components]
        test_mode: 'coefficient'  # coefficient/embedding/image/all
        momentum: 0.99  # momentum for updating the saved dictionary
        save_fp16: False  # save FP16, default is FP32

    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
      params:
        use_celeb: True  # use celeb basis
        use_svd: True  # use SVD version of PCA
        rm_repeats: True  # removing repeated words can be better
        celeb_txt: "./infer_images/wiki_names_v2.txt"  # celebs, wiki_names_v1, wiki_names_v2.txt
        n_components: 512  # consistent with [meta_inner_dim]
        use_flatten: False  # flattening means dropping the word position information
        num_embeds_per_token: 2  # consistent with [personalization_config]

Important Training Settings

lightning:
  modelcheckpoint:
    params:
      every_n_train_steps: 200  # 100x num of IDs
  callbacks:
    image_logger:
      params:
        batch_frequency: 600  # 300x num of IDs
  trainer:
    max_steps: 800  # 400x num of IDs
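
The comments above give each step count as a multiple of the number of IDs. As a convenience, the arithmetic can be written out as follows; `training_schedule` is a hypothetical helper, and the multipliers 100, 300 and 400 are taken directly from those comments.

```python
def training_schedule(num_ids):
    """Scale the step settings with the number of jointly trained IDs."""
    return {
        "every_n_train_steps": 100 * num_ids,  # checkpoint interval
        "batch_frequency": 300 * num_ids,      # image logging interval
        "max_steps": 400 * num_ids,            # total training steps
    }


print(training_schedule(2))
```

With num_ids = 2 this reproduces the values 200, 600 and 800 shown in the config.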

Training

bash ./01_start_train.sh ./weights/sd-v1-4-full-ema.ckpt

Consequently, a project folder named traininYYYY-MM-DDTHH-MM-SS_celebbasis is generated under ./logs.

2. Generation

Edit the prompt file ./infer_images/example_prompt.txt, where sks denotes the first identity and ks denotes the second identity.

Optionally, in ./02_start_test.sh, you may modify the following variables as needed:

step_list=(799)  # the step of trained '.pt' files, e.g. (99 199 299 399)
eval_id1_list=(0)  # the ID index of the 1st person, e.g. (0 1 2 3 4)
eval_id2_list=(1)  # the ID index of the 2nd person, e.g. (0 1 2 3 4)
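
The values in step_list appear to follow from the training settings: with a checkpoint every 200 steps and max_steps of 800, the last saved step is 799. The sketch below lists the candidate steps under the assumption that files are named after a zero-based step index, which is inferred from the examples above rather than stated by the source. Both function names are hypothetical.

```python
def checkpoint_steps(every_n_train_steps, max_steps):
    """Zero-based step indices at which a checkpoint would be saved (assumed naming)."""
    return list(range(every_n_train_steps - 1, max_steps, every_n_train_steps))


def as_bash_array(steps):
    """Format the steps as a line that can be pasted into 02_start_test.sh."""
    return "step_list=(" + " ".join(str(s) for s in steps) + ")"


print(as_bash_array(checkpoint_steps(200, 800)))
```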

Testing

bash ./02_start_test.sh "./weights/sd-v1-4-full-ema.ckpt" "./infer_images/example_prompt.txt" "traininYYYY-MM-DDTHH-MM-SS_celebbasis"

The generated images are under ./outputs/traininYYYY-MM-DDTHH-MM-SS_celebbasis.

3. (Optional) Extracting ID Coefficients

Optionally, you can extract the coefficients for each identity by running:

bash ./03_extract.sh "./weights/sd-v1-4-full-ema.ckpt" "traininYYYY-MM-DDTHH-MM-SS_celebbasis"

The extracted coefficients or embeddings are under ./weights/ti_id_embeddings/.

TODO

[x] release code
[x] release celeb basis names
[ ] simplify the pipeline
[ ] add diffusers support
[ ] add SDXL support
[ ] release Google Colab project
[ ] release WebUI extension
