SSV2A
Gotta Hear Them All: Towards Sound Source Aware Audio Generation
Flexibly generate sounds by composing visual, text, and audio sound source prompts.
This work is accepted at AAAI 2026.
To run our code, please clone the repository and follow these instructions to set up a virtual environment:

```shell
conda create -n SSV2A python=3.10
conda activate SSV2A
pip install -r requirements.txt
```
The ssv2a module implements SSV2A. We also provide scripts for its major functions below.
Scheduled Releases
- [ ] Distribute the VGG Sound Single Source (VGGS3) dataset.
- [x] Upload code for multimodal inference.
- [x] Upload code for vision-to-audio inference.
Pretrained Weights
We provide pretrained weights of SSV2A modules at this Google Drive link, with the following contents:

| File | Comment |
|------------|--------------------------------------------------------------------------------------|
| ssv2a.json | Configuration file of SSV2A |
| ssv2a.pth | Pretrained checkpoint of SSV2A |
| agg.pth | Pretrained checkpoint of the Temporal Aggregation module (for video-to-audio generation) |
Please download them according to your usage cases.
Because SSV2A relies on YOLOv8 for visual sound source detection,
inference also requires a pretrained YOLO checkpoint. We recommend yolov8x-oi7,
pretrained on the Open Images V7 dataset. After downloading this model, paste its path into the "detection-model" field of ssv2a.json.
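If you prefer to set the checkpoint path programmatically rather than editing the file by hand, a minimal sketch (the "detection-model" field name comes from the instructions above; the helper itself is ours, not part of the repo):

```python
import json

def set_detection_model(cfg_path: str, yolo_path: str) -> None:
    """Write the YOLO checkpoint path into the "detection-model" field of an SSV2A config."""
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["detection-model"] = yolo_path  # field name as documented above
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
```

For example, `set_detection_model("ssv2a.json", "/path/to/yolov8x-oi7.pt")` updates the config in place and leaves all other fields untouched.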
Inference
There are several hyperparameters you can adjust to control the generation fidelity/diversity/relevance. We list them here:
| Parameter | Default Value | Comment |
|-------------------|----|-------------------------------------------------------------------------------------------------------------------------------|
| --var_samples | 64 | Number of variational samples drawn in each generation and averaged. Higher number increases fidelity and decreases diversity. |
| --cycle_its | 64 | Number of Cycle Mix iterations. Higher number increases generation relevance to given conditions. |
| --cycle_samples | 64 | Number of variational samples drawn in each Cycle Mix iteration. Higher number increases fidelity and decreases diversity. |
| --duration | 10 | Length of generated audio in seconds. |
| --seed | 42 | Random seed for generation. |
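These flags can be appended to any of the inference scripts below. For instance, a sketch that trades some diversity for higher fidelity by raising the variational sample count (the file paths are placeholders for your own):

```shell
python infer_i2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --image_dir "./images" \
    --out_dir "./output" \
    --var_samples 128 \
    --cycle_its 64 \
    --seed 123
```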
Image to Audio Generation
Navigate to the root directory of this repo and execute the following script:
```shell
python infer_i2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --image_dir "./images" \
    --out_dir "./output"
```
Replace the arguments with the actual path names on your machine.
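Before launching a long generation run, it can help to sanity-check which files the script will actually pick up. A small sketch (the extension set is our assumption; check the repo's loader for the authoritative list):

```python
from pathlib import Path

# Assumed image extensions; not taken from the SSV2A codebase.
VALID_EXTS = {".png", ".jpg", ".jpeg"}

def list_images(image_dir: str) -> list[str]:
    """Return sorted paths of images under image_dir that inference would consume."""
    return sorted(str(p) for p in Path(image_dir).iterdir()
                  if p.suffix.lower() in VALID_EXTS)
```

Running `list_images("./images")` before inference surfaces typos in the directory path and stray non-image files early.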
Video to Audio Generation
Navigate to the root directory of this repo and execute the following script:
```shell
python infer_v2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --agg_ckpt "agg.pth" \
    --image_dir "./images" \
    --out_dir "./output"
```
Replace the arguments with the actual path names on your machine.
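Since the V2A script reads from an `--image_dir`, footage stored as a video file must first be turned into frames. A hedged sketch using ffmpeg (the 4 fps sampling rate and the frame naming pattern are our assumptions, not taken from the repo):

```shell
mkdir -p ./images
# Sample the video at 4 frames per second into numbered PNGs.
ffmpeg -i input.mp4 -vf fps=4 ./images/frame_%04d.png
```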
Multimodal Sound Source Composition
SSV2A accepts multimodal conditions where you describe sound sources as image, text, or audio.
To bridge the modality gap of text conditions in CLIP, you first need to download the DALLE-2 Prior module. We recommend this version pretrained by LAION. You can also download it from our drive:
| Item | File |
|--------------------|------|
| Configuration File | dalle2_prior_config.json |
| Checkpoint | dalle2_prior.pth |
When these are ready, navigate to the root directory of this repo and execute the following script:
```shell
python infer_v2a.py \
    --cfg "ssv2a.json" \
    --ckpt "ssv2a.pth" \
    --dalle2_cfg "dalle2_prior_config.json" \
    --dalle2_ckpt "dalle2_prior.pth" \
    --images "talking_man.png" "dog.png" \
    --texts "raining heavily" "street ambient" \
    --audios "thunder.wav" \
    --out_dir "./output/audio.wav"
```
Here are some argument specifications:
- `--images` takes visual conditions as a list of `.png` or `.jpg` image files.
- `--texts` takes text conditions as a list of strings.
- `--audios` takes audio conditions as a list of `.wav`, `.flac`, or `.mp3` files.
Note that this script, unlike our I2A and V2A codes, only supports single-sample inference instead of batches. We support a maximum of 64 sound source condition slots in total per generation. You can leave any modality blank, or supply a single modality only, such as texts.
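The slot budget above can be checked before invoking the script. A minimal sketch (the 64-slot total comes from the note above; the requirement of at least one condition is our assumption):

```python
MAX_SLOTS = 64  # total sound source condition slots, per the note above

def count_source_slots(images=(), texts=(), audios=()):
    """Validate a multimodal prompt set against the 64-slot budget."""
    total = len(images) + len(texts) + len(audios)
    if total == 0:  # assumption: at least one condition is needed somewhere
        raise ValueError("supply at least one condition in some modality")
    if total > MAX_SLOTS:
        raise ValueError(f"{total} conditions exceed the {MAX_SLOTS}-slot limit")
    return total
```

For the example command above, `count_source_slots(images=["talking_man.png", "dog.png"], texts=["raining heavily", "street ambient"], audios=["thunder.wav"])` uses 5 of the 64 slots.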
Feel free to play with this feature and let your imagination run wild :)
Cite this work
If you find our work useful, please consider citing:
@inproceedings{SS2A,
title={Gotta Hear Them All: Towards Sound Source Aware Audio Generation},
author={Guo, Wei and Wang, Heng and Ma, Jianbo and Cai, Weidong},
booktitle={AAAI},
year={2026}
}
References
SSV2A has made friends with several models. We list major references in our code here:
- AudioLDM, by Haohe Liu
- AudioLDM2, by Haohe Liu
- LAION-Audio-630K, by LAION
- CLAP, by LAION
- frechet-audio-distance, by Hao Hao Tan
- DALLE2-pytorch, by Phil Wang
- CLIP, by OpenAI
Thank you for the excellent works! Other references are commented inline.
