# DeepAVFusion

Official codebase for "Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling".
Masked Autoencoders enable strong Audio-Visual Early Fusion
<div align="center"> <img width="100%" alt="DeepAVFusion Illustration" src="assets/deepavfusion.png"> </div>

Official codebase and pre-trained models for our DeepAVFusion framework as described in the paper:
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo, Pedro Morgado<br> IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
## Setup
### Environment
Our environment was created as follows:

```shell
conda create -n deepavfusion python=3.10
conda activate deepavfusion
conda install pytorch=2.0 torchvision=0.15 torchaudio=2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install submitit hydra-core av wandb tqdm scipy scikit-image scikit-learn timm mir_eval jupyter matplotlib
```

Alternatively, simply run `conda env create -f requirements.yml` to replicate it.
### Datasets
In this work, we used a variety of datasets, including VGGSound, AudioSet, MUSIC and AVSBench. We assume that you have downloaded all datasets. The expected data format is briefly described in DATASETS.md. The commands below refer to the dataset locations through the following variables:

```shell
PATH2VGGSOUND="/path/to/vggsound"
PATH2AUDIOSET="/path/to/audioset"
PATH2MUSIC="/path/to/music"
PATH2AVSBENCH="/path/to/avsbench"
```
## DeepAVFusion Pre-training
We release two models based on the ViT-Base architecture, trained on the VGGSound and AudioSet datasets, respectively. The models were trained with the following commands.
```shell
# Pre-training on VGGSound
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_vggsound_ep\${opt.epochs} \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  model.fusion.layers=all model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=1 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0

# Pre-training on AudioSet
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_as2m_ep\${opt.epochs} \
  data.dataset=audioset data.data_path=${PATH2AUDIOSET} \
  model.fusion.layers=all model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=4 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0
```
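The `model.fusion.attn_ratio` and `model.fusion.mlp_ratio` overrides scale the width of the fusion layers relative to the ViT-Base embedding dimension (768). A minimal sketch of how such ratios translate into layer widths — the `fusion_widths` helper below is hypothetical, for illustration only, not the repository's actual code:

```python
def fusion_widths(embed_dim=768, num_heads=12, attn_ratio=0.25, mlp_ratio=1.0):
    """Hypothetical helper: derive fusion-layer widths from ratios
    relative to a ViT-Base backbone (embed_dim=768, 12 heads)."""
    attn_dim = int(embed_dim * attn_ratio)   # width of the fusion attention
    mlp_hidden = int(embed_dim * mlp_ratio)  # hidden width of the fusion MLP
    head_dim = attn_dim // num_heads         # per-head width
    return attn_dim, mlp_hidden, head_dim

# VGGSound config (attn_ratio=0.25, mlp_ratio=1.0)
print(fusion_widths())                                # (192, 768, 16)
# AudioSet config (attn_ratio=1.0, mlp_ratio=4.0)
print(fusion_widths(attn_ratio=1.0, mlp_ratio=4.0))   # (768, 3072, 64)
```

Smaller ratios make the dense fusion blocks cheaper, which is why the VGGSound configuration above uses a quarter-width fusion attention.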
The nearest neighbor training curve of the model trained on VGGSound can be seen below. The retrieval performance of fusion tokens is substantially better than uni-modal representations, suggesting that fusion tokens aggregate high-level semantics, while uni-modal representations encode the low-level details required for masked reconstruction.
<div align="center"> <img width="50%" alt="DeepAVFusion training curve" src="assets/training_curve.png"> </div>

The pre-trained models are available in the checkpoints/ directory.
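Nearest-neighbor retrieval of this kind can be scored with cosine similarity over extracted features. A self-contained sketch of one way to compute the top-1 retrieval accuracy (illustrative names only, not the repository's evaluation code):

```python
import numpy as np

def nn_top1_accuracy(query_feats, gallery_feats, query_labels, gallery_labels):
    """Top-1 nearest-neighbor retrieval accuracy under cosine similarity.
    Illustrative sketch; not the repository's evaluation code."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    nn_idx = (q @ g.T).argmax(axis=1)  # nearest gallery item per query
    return float((gallery_labels[nn_idx] == query_labels).mean())

# Toy example: two well-separated classes retrieve perfectly
rng = np.random.default_rng(0)
gallery = np.vstack([rng.normal(0, 0.1, (5, 8)) + 1,
                     rng.normal(0, 0.1, (5, 8)) - 1])
query = np.vstack([rng.normal(0, 0.1, (3, 8)) + 1,
                   rng.normal(0, 0.1, (3, 8)) - 1])
g_labels = np.array([0] * 5 + [1] * 5)
q_labels = np.array([0] * 3 + [1] * 3)
print(nn_top1_accuracy(query, gallery, q_labels, g_labels))  # 1.0
```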
## Downstream tasks
We evaluate our model on a variety of downstream tasks. In each case, the pre-trained model is used for feature extraction (with or without fine-tuning, depending on the evaluation protocol) and a task-specific decoder is trained from scratch to carry out the task.
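For instance, under the linear-probe protocol the backbone stays frozen and only a linear classifier is fit on the extracted features. A minimal numpy sketch, with closed-form ridge regression standing in for the SGD-trained probe the commands below actually use:

```python
import numpy as np

def linear_probe(train_feats, train_labels, test_feats, num_classes, reg=1e-3):
    """Fit a linear classifier on frozen features via ridge regression.
    Sketch only: the repository trains its probe with SGD, not closed form."""
    Y = np.eye(num_classes)[train_labels]  # one-hot targets
    X = train_feats
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return (test_feats @ W).argmax(axis=1)  # predicted classes

# Toy check: linearly separable features are classified correctly
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (20, 4)) + 1,
                   rng.normal(0, 0.1, (20, 4)) - 1])
labels = np.array([0] * 20 + [1] * 20)
preds = linear_probe(feats, labels, feats, num_classes=2)
print((preds == labels).mean())  # 1.0
```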
### Audio Event Recognition
| Dataset | Eval Protocol | Pre-trained Model | Top1 Acc | |
|:--------:|:-------------:|:-----------------:|:--------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| VGGSound | Linear Probe | VGGSound-200ep | 53.08 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1</details> |
| VGGSound | Linear Probe | AudioSet2M-200ep | 53.08 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 env.ngpu=4 env.world_size=1</details> |
| VGGSound | Fine-tuning | VGGSound-200ep | 58.19 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_vggsound pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1</details> |
| VGGSound | Fine-tuning | AudioSet2M-200ep | 57.91 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=finetune job_name=finetune_vggsound pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=vggsound data.data_path=${PATH2VGGSOUND} opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1</details> |
| Dataset | Eval Protocol | Pre-trained Model | mAP | |
|:------------:|:-------------:|:-----------------:|:----:|---|
| AudioSet-Bal | Linear Probe | VGGSound-200ep | 53.08 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1</details> |
| AudioSet-Bal | Linear Probe | AudioSet2M-200ep | 53.08 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 env.ngpu=2 env.world_size=1</details> |
| AudioSet-Bal | Fine-tuning | VGGSound-200ep | 58.19 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1</details> |
| AudioSet-Bal | Fine-tuning | AudioSet2M-200ep | 57.91 | <details><summary>CMD</summary>PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 env.ngpu=4 env.world_size=1</details> |
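The AudioSet results above are reported as average precision, since AudioSet clips carry multiple labels. A minimal sketch of the per-class average-precision metric (illustrative only, not the repository's evaluation code):

```python
import numpy as np

def average_precision(scores, targets):
    """AP for one class: mean of precision at each positive, ranked by score.
    Illustrative sketch of the multi-label AP metric, not the repo's code."""
    order = np.argsort(-scores)                  # rank by descending score
    hits = targets[order].astype(float)          # 1 where a positive was ranked
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / max(hits.sum(), 1))

scores = np.array([0.9, 0.8, 0.3, 0.1])
targets = np.array([1, 0, 1, 0])
print(average_precision(scores, targets))  # (1/1 + 2/3) / 2 = 0.8333...
```

The reported number is the mean of this quantity over all classes (mAP).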
### Visually Guided Source Separation
| Dataset | Pre-training | SDR | SIR | SAR | |
|:--------------:|:----------------:|:----:|:----:|:-----:|---|
