# SoundSCaper
Automatic soundscape captioner (SoundSCaper): Soundscape Captioning using Sound Affective Quality Network and Large Language Model
Paper link: IEEE TMM
ResearchGate: SoundSCaper
## Acknowledgement
We appreciate Dr. Francesco Aletta for valuable discussions and Dr. Gunnar Cerwen for professional soundscape captions. We also appreciate the following 32 audio/soundscape experts and general users who participated in the human evaluation experiments, including Boyan Zhang, Prof. Dr. Catherine Lavandier, Dr. Huizhong Zhang, Hupeng Wu, Jiayu Xie, Dr. Karlo Filipan, Kening Guo, Kenneth Ooi, Xiaochao Chen, Xiang Fang, Yi Yuan, Yanzhao Bi, Zibo Liu, Xinyi Che, Xin Shen, Chen Fang, Yanru Wu, Yuze Li, Zexi Lu, Shiheng Zhang, Xuefeng Yang, Tong Ye, Zeyu Xu, as well as three anonymous experts and six anonymous general users. We thank Prof. Jian Guan and Feiyang Xiao for their valuable support in revising this paper and assisting with model training. <br />
If relevant, please feel free to consider citing our paper:

```bibtex
@ARTICLE{TMM_SoundScaper,
  author={Hou, Yuanbo and Ren, Qiaoqiao and Mitchell, Andrew and Wang, Wenwu and Kang, Jian and Belpaeme, Tony and Botteldooren, Dick},
  journal={IEEE Transactions on Multimedia},
  title={Soundscape Captioning Using Sound Affective Quality Network and Large Language Model},
  year={2026},
  volume={28},
  number={},
  pages={2186-2200},
  doi={10.1109/TMM.2026.3651023}}
```
- SoundSCaper
  - Introduction
  - Figure
    - 1. Overall framework of the automatic soundscape captioner (SoundSCaper)
    - 2. The acoustic model SoundAQnet simultaneously models acoustic scene (AS), audio event (AE), and emotion-related affective quality (AQ)
    - 3. Process of the LLM part in the SoundSCaper
    - 4. Spearman's rho correlation between different AQs and AEs predicted by SoundAQnet
    - 5. Spearman's rho correlation between different AEs and 8D AQs predicted by SoundAQnet
  - Run Sound-AQ models to predict the acoustic scene, audio event, and human-perceived affective qualities
  - More supplementary experiments' codes and results
  - Case study
## Introduction
### 0. SoundAQnet training steps (optional)
1) Dataset preparation

- Download and place the ARAUS dataset (ARAUS_repository) into the `Dataset_all_ARAUS` directory or the `Dataset_training_validation_test` directory (recommended).
- Follow the ARAUS steps (ARAUS_repository) to generate the raw audio dataset. The dataset is about 53 GB, so please reserve enough space when preparing it. (In WAV format, it may be about 134 GB.)
- Split the raw audio dataset according to the training, validation, and test audio file IDs in the `Dataset_training_validation_test` directory. The labels of our annotated acoustic scenes and audio events for the audio clips in the ARAUS dataset are placed in the `Dataset_all_ARAUS` and `Dataset_training_validation_test` directories.
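The split step above can be sketched as follows. This is a minimal illustration only: the file IDs and split names are made up here, whereas the real ID lists live in the `Dataset_training_validation_test` directory.

```python
import shutil, tempfile
from pathlib import Path

# Hypothetical ID lists; the real ones come from Dataset_training_validation_test
splits = {
    "training":   {"clip_00001.wav", "clip_00002.wav"},
    "validation": {"clip_00003.wav"},
    "test":       {"clip_00004.wav"},
}

def split_dataset(raw_dir: Path, out_dir: Path) -> dict:
    """Copy each clip in raw_dir into out_dir/<split> based on its file ID."""
    counts = {name: 0 for name in splits}
    for wav in raw_dir.glob("*.wav"):
        for name, ids in splits.items():
            if wav.name in ids:
                dest = out_dir / name
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy(wav, dest / wav.name)
                counts[name] += 1
    return counts

# Demo with empty files standing in for the ARAUS audio clips
root = Path(tempfile.mkdtemp())
raw = root / "raw"; raw.mkdir()
for ids in splits.values():
    for fid in ids:
        (raw / fid).touch()
print(split_dataset(raw, root / "splits"))  # {'training': 2, 'validation': 1, 'test': 1}
```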
2) Acoustic feature extraction
- **Log Mel spectrogram**: use the code in `Feature_log_mel` to extract log mel features.
  - Place the dataset into the `Dataset` folder.
  - If the audio files are not in `.wav` format, run `convert_flac_to_wav.py` first. (This may generate ~132 GB of data as WAV files.)
  - Run `log_mel_spectrogram.py`.
- **Loudness features (ISO 532-1)**: use the code in `Feature_loudness_ISO532_1` to extract the ISO 532-1:2017 standard loudness features.
  - Download the `ISO_532-1.exe` file (already placed in the `ISO_532_bin` folder).
  - Place the audio clips to be processed, in `.wav` format, into the `Dataset_wav` folder.
  - If the audio files are not in `.wav` format, use `convert_flac_to_wav.py` to convert them. (This may generate ~132 GB of data as WAV files.)
  - Run `ISO_loudness.py`.
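For reference, the log mel extraction step can be sketched in pure NumPy. The frame length, hop size, and 64 mel bands below are illustrative choices, not necessarily the settings used in `log_mel_spectrogram.py`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular mel filters spanning 0..sr/2
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(y, sr=44100, n_fft=1024, hop=512, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)

# One second of noise as a stand-in for an audio clip
feat = log_mel(np.random.randn(44100))
print(feat.shape)  # (85, 64)
```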
3) Training SoundAQnet

- Prepare the training, validation, and test sets according to the corresponding files in the `Dataset_training_validation_test` directory.
- Modify the `DataGenerator_Mel_loudness_graph` function to load the dataset.
- Run `Training.py` in the `application` directory.
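A generator in the spirit of `DataGenerator_Mel_loudness_graph` can be sketched as below; the feature shapes and batch size are illustrative assumptions, not the function's actual signature.

```python
import numpy as np

def mel_loudness_generator(mel_feats, loudness_feats, labels,
                           batch_size=4, shuffle=True):
    """Yield aligned (mel, loudness, label) batches for training.
    Shapes here are illustrative stand-ins for the real features."""
    idx = np.arange(len(labels))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield mel_feats[batch], loudness_feats[batch], labels[batch]

# Synthetic stand-ins: 10 clips, 85-frame/64-band log mels, per-frame loudness
mel = np.random.randn(10, 85, 64)
loud = np.random.randn(10, 85, 1)
y = np.random.randint(0, 7, size=10)  # e.g. acoustic scene class indices
for m, l, t in mel_loudness_generator(mel, loud, y):
    print(m.shape, l.shape, t.shape)
```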
## 1. Use SoundAQnet to infer soundscape audio clips for LLMs
The Inferring_soundscape_clips_for_LLM part converts soundscape audio clips into predicted audio event probabilities, acoustic scene labels, and ISOP, ISOE, and emotion-related PAQ values. It bridges the acoustic model and the language model, organising the output of the acoustic model as input for the language model.
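The bridging step might look something like the sketch below, which flattens one clip's predictions into a text summary. The event names, scene label, and values are made-up examples, and the field layout is an assumption rather than the repository's actual output format.

```python
def format_for_llm(scene, isop, isoe, aq8, event_probs, top_k=3):
    """Summarise one clip's SoundAQnet-style predictions as plain text."""
    top = sorted(event_probs.items(), key=lambda kv: -kv[1])[:top_k]
    events = ", ".join(f"{name} ({p:.2f})" for name, p in top)
    aq = ", ".join(f"{dim}={v:.2f}" for dim, v in aq8.items())
    return (f"Acoustic scene: {scene}. "
            f"ISO Pleasantness: {isop:.2f}; ISO Eventfulness: {isoe:.2f}. "
            f"Affective qualities: {aq}. "
            f"Most likely audio events: {events}.")

# Illustrative values only, not real model output
summary = format_for_llm(
    scene="park",
    isop=0.42, isoe=-0.10,
    aq8={"pleasant": 3.8, "vibrant": 3.1, "eventful": 2.9, "chaotic": 1.7,
         "annoying": 1.5, "monotonous": 2.2, "uneventful": 2.4, "calm": 3.6},
    event_probs={"bird": 0.91, "human voice": 0.64, "traffic": 0.22, "wind": 0.18},
)
print(summary)
```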
1) Data preparation

- Place the log Mel feature files from the `Feature_log_mel` directory into the `Dataset_mel` directory.
- Place the ISO 532-1 loudness feature files from the `Feature_loudness_ISO532_1` directory into the `Dataset_wav_loudness` directory.
2) Run the inference script

- `cd application`, then run `python Inference_for_LLM.py`.
- The results, which will be fed into the LLM, are automatically saved into the corresponding directories: `SoundAQnet_event_probability` and `SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs`.
- There are four similar SoundAQnet models in the `system/model` directory; please feel free to use them:
  - SoundAQnet_ASC96_AEC94_PAQ1027.pth
  - SoundAQnet_ASC96_AEC94_PAQ1039.pth
  - SoundAQnet_ASC96_AEC94_PAQ1041.pth
  - SoundAQnet_ASC96_AEC95_PAQ1052.pth
3) Inference with other models
This part Inferring_soundscape_clips_for_LLM uses SoundAQnet to infer the values of audio events, acoustic scenes, and emotion-related AQs.
If you want to replace SoundAQnet with another model to generate the soundscape captions,
- replace
using_model = SoundAQnetinInference_for_LLM.pywith the code for that model, - and place the corresponding trained model into the
system/modeldirectory.
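The swap amounts to pointing `using_model` at a class with the same interface. Both classes below are hypothetical stand-ins; SoundAQnet's real constructor and prediction API may differ.

```python
# Hypothetical sketch of the model-swap pattern in Inference_for_LLM.py
class SoundAQnet:
    def predict(self, clip):
        return {"scene": "park", "events": [0.9, 0.1], "aq8": [3.5] * 8}

class MyOtherModel:
    """A stand-in acoustic model exposing the same predict() interface."""
    def predict(self, clip):
        return {"scene": "street", "events": [0.2, 0.8], "aq8": [2.0] * 8}

# The swap: point `using_model` at the replacement and keep the rest unchanged
using_model = MyOtherModel  # was: using_model = SoundAQnet
model = using_model()
print(model.predict(clip=None)["scene"])  # street
```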
4) Demonstration
Please see details here.
## 2. Generate soundscape captions using a generic LLM
This part, LLM_scripts_for_generating_soundscape_caption, loads the acoustic scene, audio events, and PAQ 8-dimensional affective quality values predicted by SoundAQnet for a soundscape audio clip, and then outputs the corresponding soundscape description. <br> Please fill in your OpenAI API key in LLM_GPT_soundscape_caption.py.
1) Data preparation

- Place the matrix file of audio event probabilities predicted by SoundAQnet into the `SoundAQnet_event_probability` directory.
- Place the SoundAQnet prediction file, including the predicted acoustic scene label, ISOP value, ISOE value, and the 8D AQ values, into the `SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs` directory.
2) Generate soundscape caption

- Replace "YOUR_API_KEY_HERE" in line 26 of the `LLM_GPT_soundscape_caption.py` file with your OpenAI API key.
- Run `LLM_GPT_soundscape_caption.py`.
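Conceptually, the script assembles the predicted scene, events, and affective qualities into a chat prompt. The sketch below shows only the prompt-assembly half; the wording is assumed, not the actual prompt in `LLM_GPT_soundscape_caption.py`, and the example stops before the API call.

```python
def build_messages(scene, top_events, aq_summary):
    """Build chat messages for a soundscape caption request (illustrative)."""
    system = ("You are a soundscape expert. Write one concise caption describing "
              "the acoustic environment and how it likely feels to a listener.")
    user = (f"Acoustic scene: {scene}. Salient audio events: {', '.join(top_events)}. "
            f"Perceived affective qualities: {aq_summary}.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_messages(
    scene="public square",
    top_events=["human voices", "footsteps", "distant traffic"],
    aq_summary="fairly pleasant, moderately eventful, slightly vibrant",
)
print(messages[1]["content"])
# These messages would then be sent via the OpenAI chat completions API.
```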
3) Demonstration
Please see details here.
## 3. Expert evaluation of soundscape caption quality
Human_assessment contains:

- a call for experiment
- assessment audio dataset
- participant instruction file
- local and online questionnaires
- [Expert assessment results analysis](Huma
