ClipFace

[SIGGRAPH 2023] ClipFace: Text-guided Editing of Textured 3D Morphable Models

ClipFace: Text-guided Editing of Textured 3D Morphable Models<br><sub>Official PyTorch implementation of the SIGGRAPH 2023 paper</sub>

Teaser

ClipFace: Text-guided Editing of Textured 3D Morphable Models<br> Shivangi Aneja, Justus Thies, Angela Dai, Matthias Niessner<br> https://shivangi-aneja.github.io/projects/clipface <br>

Abstract: We propose ClipFace, a novel self-supervised approach for text-guided editing of textured 3D morphable model of faces. Specifically, we employ user-friendly language prompts to enable control of the expressions as well as appearance of 3D faces. We leverage the geometric expressiveness of 3D morphable models, which inherently possess limited controllability and texture expressivity, and develop a self-supervised generative model to jointly synthesize expressive, textured, and articulated faces in 3D. We enable high-quality texture generation for 3D faces by adversarial self-supervised training, guided by differentiable rendering against collections of real RGB images. Controllable editing and manipulation are given by language prompts to adapt texture and expression of the 3D morphable model. To this end, we propose a neural network that predicts both texture and expression latent codes of the morphable model. Our model is trained in a self-supervised fashion by exploiting differentiable rendering and losses based on a pre-trained CLIP model. Once trained, our model jointly predicts face textures in UV-space, along with expression parameters to capture both geometry and texture changes in facial expressions in a single forward pass. We further show the applicability of our method to generate temporally changing textures for a given animation sequence.
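The abstract mentions that training is driven by losses based on a pre-trained CLIP model. A common formulation in CLIP-guided editing is a *directional* loss that aligns the change in the rendered-image embedding with the change in the text embedding. The sketch below is a toy, dependency-free illustration of that idea only — the embeddings are tiny stand-in vectors, and `directional_clip_loss` is a hypothetical name, not code from this repository:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def directional_clip_loss(img_emb, edited_img_emb, txt_emb, edited_txt_emb):
    """Toy CLIP directional loss: 1 - cos(delta_image, delta_text).

    Encourages the change in the rendered-image embedding to align with
    the change in the text-prompt embedding (e.g. "a face" -> "an angry face").
    """
    d_img = [e - o for o, e in zip(img_emb, edited_img_emb)]
    d_txt = [e - o for o, e in zip(txt_emb, edited_txt_emb)]
    return 1.0 - cosine_similarity(d_img, d_txt)

# Stand-in 3-D "embeddings" (real CLIP embeddings are high-dimensional):
loss_aligned = directional_clip_loss([1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 2, 0])
loss_opposed = directional_clip_loss([1, 0, 0], [1, -1, 0], [0, 1, 0], [0, 2, 0])
print(loss_aligned, loss_opposed)  # the aligned edit gives the lower loss
```

In the actual method, the image embeddings come from differentiable renders of the textured FLAME mesh, so the loss gradient flows back into the texture and expression mapper networks.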

<br>

<a id="section1">1. Getting started</a>

Pre-requisites

  • Linux
  • NVIDIA GPU + CUDA 11.4
  • Python 3.8

Installation

  • Dependencies:
    It is recommended to install dependencies using pip. The dependencies for the environment are listed in requirements.txt. For differentiable rendering, we use NvDiffrast, which can also be installed via pip.
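A minimal setup sketch, assuming a CUDA-capable Linux machine with Python 3.8. The `git+https` route for NvDiffrast is its usual upstream install path, not a command taken from this README:

```shell
# Clone the repository and enter it
git clone https://github.com/shivangi-aneja/ClipFace.git
cd ClipFace

# (Optional) isolate the environment; Python 3.8 is the tested version
python3 -m venv venv && source venv/bin/activate

# Install the pinned dependencies
pip install -r requirements.txt

# NvDiffrast is typically installed straight from its GitHub repository
pip install git+https://github.com/NVlabs/nvdiffrast.git
```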

<a id="section2">2. Pre-trained Models required for training ClipFace</a>

Please download these models, as they will be required for experiments.

| Path | Description |
| :--- | :--- |
| FLAME | We use the FLAME 3DMM in our experiments. FLAME takes shape, pose, and expression blendshape parameters as input and predicts mesh vertices. We used the FLAME 2020 generic model for our experiments; using any other FLAME model might lead to wrong mesh predictions in the expression manipulation experiments. Please download the model from the official website by signing their user agreement. Copy the generic model to data/flame/generic_model.pkl and the FLAME template to data/flame/head_template.obj in the project directory. |
| DECA | The DECA model predicts FLAME parameters for an RGB image. It is used during training of the StyleGAN-based texture generator and is available for download here. This can be skipped if you don't intend to train the texture generator and use our pre-trained texture generator instead. |
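Since both downloads must land at specific paths, a small sanity-check helper can confirm the layout before training. This script is a hypothetical convenience, not part of the repository; only the two asset paths are taken from the table above:

```python
from pathlib import Path

# Expected asset locations, as described in the table above
REQUIRED_ASSETS = [
    "data/flame/generic_model.pkl",   # FLAME 2020 generic model
    "data/flame/head_template.obj",   # FLAME head template mesh
]

def missing_assets(project_root="."):
    """Return the list of required FLAME assets missing under project_root."""
    root = Path(project_root)
    return [p for p in REQUIRED_ASSETS if not (root / p).is_file()]

if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        print("All FLAME assets found.")
```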

<a id="section3">3. Training</a>

The code is well-documented and should be easy to follow.

  • Source Code: Clone this repository (git clone) and install the dependencies from requirements.txt. The source code is implemented in PyTorch Lightning, with differentiable rendering via NvDiffrast, so familiarity with both is expected.
  • Dataset: We used the FFHQ dataset to train our texture generator; it is publicly available here. All images are resized to 512 × 512 for our experiments.
  • Data Generation: From the original FFHQ dataset (70,000 images), we first remove images with headwear and eyewear. This gives us a clean, filtered FFHQ dataset (~45,000 images), which we use to train our StyleGAN-based texture generator. We use the DECA model to predict FLAME parameters for each image in this filtered dataset; these FLAME parameters are pre-computed prior to training the generator model. We then use FLAME to predict mesh vertices for each image. Finally, we render the mesh with texture maps produced by our generator using differentiable rendering. For real images, we mask out the background and mouth interior using alpha masks extracted from DECA. For simplicity, we provide the filtered image list, alpha masks, and FLAME parameters for the filtered dataset in Section 4. For texture manipulation of video sequences, we provide the pre-computed FLAME parameters for the video sequences (Laughing & Angry) in Section 4.
  • Training: Run the corresponding scripts depending on whether you want to train the texture generator or perform text-guided manipulation. The scripts for training are available in trainer/ directory.
    • Texture Generator: We use the StyleGAN2 generator with adaptive discriminator augmentation (StyleGAN-ADA) to generate UV maps, due to its faster convergence. To train, run the following command:
    python -m trainer.trainer_stylegan.train_stylegan_ada_texture_patch
    
    • Text-guided Manipulation: We perform text-guided manipulation on textures generated by the pre-trained texture generator trained above. We first pre-train the mapper networks to predict zero offsets before performing text-guided manipulation; this pre-trained checkpoint is available in Section 4. Run these scripts to perform text-guided manipulation.
    # To train only for texture manipulation
    python -m trainer.trainer_texture_expression.train_mlp_texture
    
    # To train for both texture and expression manipulation
    python -m trainer.trainer_texture_expression.train_mlp_texture_expression
    
  • Text-guided Video Manipulation: In the paper, we show results for temporally changing textures guided by text prompts. As above, we pre-train the mapper network to predict zero offsets before performing text-guided manipulation; here, however, the mapper is conditioned on the image latent code as well as the expression and pose codes. Checkpoints are available in Section 4. Run the following script to manipulate temporally changing textures.
    # To synthesize temporal textures for given video sequence
    python -m trainer.trainer_video.train_video_mlp_texture
    
  • Path Configuration: The configuration for training texture generator is configs/stylegan_ada.yaml and for text-guided manipulation is configs/clipface.yaml. Please refer to these files to configure the data paths and model paths for training.
    • Refer to configs/stylegan_ada.yaml to define the necessary data paths and model paths for training texture generator.
    • Refer to configs/stylegan_ada_clip_mlp.py to define the necessary data paths and model paths for training texture and expression mappers for text-guided manipulation. Update the text prompt for manipulation in this file, defined as altered_prompt.
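As an illustration only, a text-guided manipulation config might contain entries like the following. Apart from altered_prompt and exp_codes_pth, which this README names explicitly, every key and path below is a placeholder to be matched against the shipped config files:

```yaml
# Illustrative sketch only -- align key names with the shipped config files.
altered_prompt: "an angry face"        # target text prompt for manipulation
exp_codes_pth: data/video/laughing/    # pre-computed FLAME parameters (Section 4)

# Placeholder entries for the remaining paths (names are hypothetical):
texture_generator_ckpt: checkpoints/texture_generator.ckpt
latent_codes_pth: data/uv_latents.pkl
output_dir: results/clipface/
```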

<a id="section4">4. ClipFace Pretrained Models and Dataset Assets</a>

| Path | Description |
| :--- | :--- |
| Filtered FFHQ Dataset | Download the filenames of the filtered FFHQ dataset, the alpha masks, and the FLAME-space mesh vertices predicted using DECA. This can be skipped if you don't intend to train the texture generator and use our pre-trained texture generator instead. |
| Texture Generator | The pre-trained texture generator to synthesize UV texture maps. |
| UV Texture Latent Codes | The latent codes generated from the texture generator, used to train the text-guided mapper networks. |
| Text-Manipulation Assets | The FLAME parameters and vertices for a neutral template face; these will be used to perform CLIP-guided manipulation. Copy them to the data/clip/ directory. |
| Video Manipulation Dataset | In the paper, we show temporal-texture results for two text prompts (laughing and angry). Here we provide the pre-computed FLAME parameters for these sequences. Download and extract them to the appropriate directory, and configure the path for the key exp_codes_pth in config/clipface.yaml. |
| Pretrained Zero-Offset Mapper | Pre-trained mappers to predict zero offsets for text-guided manipulation. |
| Pretrained Texture & Expression Manipulation Models | Pre-trained ClipFace checkpoints for the different texture and expression styles shown in the paper. Texture manipulation models can be downloaded from here; expression manipulation models can be downloaded from here. |
| Pretrained Zero-Offset Video Mapper | Pre-trained mappers to predict zero offsets for text-guided video manipulation. |
| Pretrained Video Manipulation Models | P |
