SSV2A
Gotta Hear Them All: Towards Sound Source Aware Audio Generation
Flexibly generate sounds by composing visual, text, and audio sound source prompts.
This work is accepted at AAAI 2026.
To run our code, please clone the repository and follow these instructions to set up a virtual environment:

```shell
conda create -n SSV2A python=3.10
conda activate SSV2A
pip install -r requirements.txt
```
The ssv2a module implements SSV2A. We also provide scripts for its major functions below.
Scheduled Releases
- [ ] Distribute the VGG Sound Single Source (VGGS3) dataset.
- [x] Upload code for multimodal inference.
- [x] Upload code for vision-to-audio inference.
Pretrained Weights
We provide pretrained weights of SSV2A modules at this Google Drive link, with the following contents:

| File | Comment |
|------------|--------------------------------------------------------------------------------------|
| ssv2a.json | Configuration file of SSV2A |
| ssv2a.pth | Pretrained checkpoint of SSV2A |
| agg.pth | Pretrained checkpoint of the Temporal Aggregation module (for video-to-audio generation) |
Please download them according to your usage cases.
Because SSV2A relies on YOLOv8 for visual sound source detection,
inference also requires a pretrained YOLO checkpoint. We recommend yolov8x-oi7,
pretrained on the Open Images V7 dataset. After downloading this model, paste its path into the "detection-model" field of ssv2a.json.
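If you prefer to set the checkpoint path programmatically rather than editing the file by hand, a minimal sketch (the "detection-model" field name comes from the instructions above; the helper itself is ours, not part of the repo):

```python
import json

def set_detection_model(cfg_path: str, yolo_path: str) -> None:
    """Write the YOLO checkpoint path into the "detection-model" field of an SSV2A config."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["detection-model"] = yolo_path  # field name as documented above
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
```

For example, `set_detection_model("ssv2a.json", "/path/to/yolov8x-oi7.pt")` updates the config in place and leaves all other fields untouched.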
Inference
There are several hyperparameters you can adjust to control the generation fidelity/diversity/relevance. We list them here:
| Parameter | Default Value | Comment |
|-------------------|----|-------------------------------------------------------------------------------------------------------------------------------|
| --var_samples | 64 | Number of variational samples drawn in each generation and averaged. Higher number increases fidelity and decreases diversity. |
| --cycle_its | 64 | Number of Cycle Mix iterations. Higher number increases generation relevance to given conditions. |
| --cycle_samples | 64 | Number of variational samples drawn in each Cycle Mix iteration. Higher number increases fidelity and decreases diversity. |
| --duration | 10 | Length of generated audio in seconds. |
| --seed | 42 | Random seed for generation. |
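These flags can be appended to any of the inference scripts below. For instance, a sketch that trades some diversity for higher fidelity by raising the variational sample count (the file paths are placeholders for your own):

```shell
python infer_i2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --image_dir "./images" \
    --out_dir "./output" \
    --var_samples 128 \
    --cycle_its 64 \
    --seed 123
```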
Image to Audio Generation
Navigate to the root directory of this repo and execute the following script:
```shell
python infer_i2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --image_dir "./images" \
    --out_dir "./output"
```
Replace the arguments with the actual path names on your machine.
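Before launching a long generation run, it can help to sanity-check which files the script will actually pick up. A small sketch (the extension set is our assumption; check the repo's loader for the authoritative list):

```python
from pathlib import Path

# Assumed image extensions; not taken from the SSV2A codebase.
VALID_EXTS = {".png", ".jpg", ".jpeg"}

def list_images(image_dir: str) -> list[str]:
    """Return sorted paths of images under image_dir that inference would consume."""
    return sorted(str(p) for p in Path(image_dir).iterdir()
                  if p.suffix.lower() in VALID_EXTS)
```

Running `list_images("./images")` before inference surfaces typos in the directory path and stray non-image files early.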
Video to Audio Generation
Navigate to the root directory of this repo and execute the following script:
```shell
python infer_v2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --agg_ckpt "agg.pth" \
    --image_dir "./images" \
    --out_dir "./output"
```
Replace the arguments with the actual path names on your machine.
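Since the V2A script reads from an `--image_dir`, footage stored as a video file must first be turned into frames. A hedged sketch using ffmpeg (the 4 fps sampling rate and the frame naming pattern are our assumptions, not taken from the repo):

```shell
mkdir -p ./images
# Sample the video at 4 frames per second into numbered PNGs.
ffmpeg -i input.mp4 -vf fps=4 ./images/frame_%04d.png
```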
Multimodal Sound Source Composition
SSV2A accepts multimodal conditions where you describe sound sources as image, text, or audio.
To bridge the modality gap of text conditions in CLIP, you first need to download the DALLE-2 Prior module. We recommend this version pretrained by LAION. You can also download it from our drive:
| Item | File |
|--------------------|------|
| Configuration File | dalle2_prior_config.json |
| Checkpoint | dalle2_prior.pth |
When these are ready, navigate to the root directory of this repo and execute the following script:
```shell
python infer_v2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --dalle2_cfg "dalle2_prior_config.json" \
    --dalle2_ckpt "dalle2_prior.pth" \
    --images "talking_man.png" "dog.png" \
    --texts "raining heavily" "street ambient" \
    --audios "thunder.wav" \
    --out_dir "./output/audio.wav"
```
Here are some argument specifications:
- `--images` takes visual conditions as a list of `.png` or `.jpg` image files.
- `--texts` takes text conditions as a list of strings.
- `--audios` takes audio conditions as a list of `.wav`, `.flac`, or `.mp3` files.
Note that this script, unlike our I2A and V2A codes, only supports single-sample inference instead of batches. We support a maximum of 64 sound source condition slots in total per generation. You can leave any modality blank, or supply a single modality only, such as texts.
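The slot budget above can be checked before invoking the script. A minimal sketch (the 64-slot total comes from the note above; the requirement of at least one condition is our assumption):

```python
MAX_SLOTS = 64  # total sound source condition slots, per the note above

def count_source_slots(images=(), texts=(), audios=()):
    """Validate a multimodal prompt set against the 64-slot budget."""
    total = len(images) + len(texts) + len(audios)
    if total == 0:  # assumption: at least one condition is needed somewhere
        raise ValueError("supply at least one condition in some modality")
    if total > MAX_SLOTS:
        raise ValueError(f"{total} conditions exceed the {MAX_SLOTS}-slot limit")
    return total
```

For the example command above, `count_source_slots(images=["talking_man.png", "dog.png"], texts=["raining heavily", "street ambient"], audios=["thunder.wav"])` uses 5 of the 64 slots.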
Feel free to play with this feature and let your imagination run wild :)
Cite this work
If you find our work useful, please consider citing:
@inproceedings{SS2A,
title={Gotta Hear Them All: Towards Sound Source Aware Audio Generation},
author={Guo, Wei and Wang, Heng and Ma, Jianbo and Cai, Weidong},
booktitle={AAAI},
year={2026}
}
References
SSV2A has made friends with several models. We list major references in our code here:
- AudioLDM, by Haohe Liu
- AudioLDM2, by Haohe Liu
- LAION-Audio-630K, by LAION
- CLAP, by LAION
- frechet-audio-distance, by Hao Hao Tan
- DALLE2-pytorch, by Phil Wang
- CLIP, by OpenAI
Thank you for the excellent works! Other references are commented inline.
