DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer

Paper | Demo

Update

[30/03/2024]: The evaluation code is updated.
[07/02/2024]: The inference script is released.
[06/02/2024]: The model weights are released.

Get started

Environment Setup

conda create --name diffspeaker python=3.9
conda activate diffspeaker

Install the MPI-IS mesh package by following the instructions in the MPI-IS repository. If the boost headers are located in /usr/include/boost/, the commands are likely to be:

git clone https://github.com/MPI-IS/mesh.git
cd mesh
sudo apt-get install libboost-dev
python -m pip install pip==20.2.4
BOOST_INCLUDE_DIRS=/usr/include/boost/ make all
python -m pip install --upgrade pip
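
The make command above assumes the boost headers live in /usr/include/boost/. As a minimal sketch, the check below reports whether that directory exists before you build. The helper name has_boost_headers is hypothetical and is not part of MPI-IS mesh.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPI-IS mesh): succeeds if the given
# directory exists. Defaults to the path used in the make command above.
has_boost_headers() {
    [ -d "${1:-/usr/include/boost}" ]
}

if has_boost_headers; then
    echo "boost headers found in /usr/include/boost"
else
    echo "boost headers not found; install libboost-dev or set BOOST_INCLUDE_DIRS to their location"
fi
```

If the headers are installed elsewhere, pass that path to has_boost_headers and use the same path for BOOST_INCLUDE_DIRS.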

Then install the rest of the dependencies.

cd ..
git clone https://github.com/theEricMa/DiffSpeaker.git
cd DiffSpeaker
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install imageio-ffmpeg
pip install -r requirements.txt

Model Weights

You can download the model weights by clicking here. Place the checkpoints folder in the root directory of the project. It contains the models trained on the BIWI and vocaset datasets, using wav2vec and hubert as backbones.
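
Before running inference, it may help to confirm that the checkpoints folder is where the scripts presumably expect it. This is a sketch only: the helper name has_checkpoints is hypothetical, and it checks just that the folder exists and is non-empty, not that the files inside are complete.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if <root>/checkpoints exists and contains
# at least one entry. <root> defaults to the current directory.
has_checkpoints() {
    root="${1:-.}"
    [ -d "$root/checkpoints" ] && [ -n "$(ls -A "$root/checkpoints" 2>/dev/null)" ]
}

# Run from the root directory of the DiffSpeaker repository.
if has_checkpoints .; then
    echo "checkpoints folder found"
else
    echo "checkpoints folder missing or empty; download it and place it in the project root"
fi
```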

Prediction

For the BIWI model, use the script below to perform inference on your chosen audio files. Specify the audio file using the --example argument.

sh scripts/demo/demo_biwi.sh

For the vocaset model, run the following script.

sh scripts/demo/demo_vocaset.sh

Evaluation

To obtain the metrics reported in the paper, use the scripts in scripts/diffusion/biwi_evaluation and scripts/diffusion/vocaset_evaluation. For example, to evaluate DiffSpeaker on the BIWI dataset with the hubert backbone, use the following script.

sh scripts/diffusion/biwi_evaluation/diffspeaker_hubert_biwi.sh
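
To see which evaluation scripts are available, you can list the two directories named above. This is a sketch that assumes every evaluation script carries a .sh extension, as the example does. The helper name list_eval_scripts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every .sh file directly inside a directory,
# one path per line. Prints nothing if the directory has no .sh files.
list_eval_scripts() {
    for script in "$1"/*.sh; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -f "$script" ] && echo "$script"
    done
    return 0
}

# Run from the root directory of the DiffSpeaker repository.
list_eval_scripts scripts/diffusion/biwi_evaluation
list_eval_scripts scripts/diffusion/vocaset_evaluation
```

Any path printed can then be run with sh, in the same way as the example above.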

Training

Data Preparation

Model Training

mkdir experiments
