# SoundSCaper
Automatic soundscape captioner (SoundSCaper): Soundscape Captioning using Sound Affective Quality Network and Large Language Model
Paper link: IEEE TMM
ResearchGate: SoundSCaper
## Acknowledgement
We appreciate Dr. Francesco Aletta for valuable discussions and Dr. Gunnar Cerwen for professional soundscape captions. We also appreciate the following 32 audio/soundscape experts and general users who participated in the human evaluation experiments, including Boyan Zhang, Prof. Dr. Catherine Lavandier, Dr. Huizhong Zhang, Hupeng Wu, Jiayu Xie, Dr. Karlo Filipan, Kening Guo, Kenneth Ooi, Xiaochao Chen, Xiang Fang, Yi Yuan, Yanzhao Bi, Zibo Liu, Xinyi Che, Xin Shen, Chen Fang, Yanru Wu, Yuze Li, Zexi Lu, Shiheng Zhang, Xuefeng Yang, Tong Ye, Zeyu Xu, as well as three anonymous experts and six anonymous general users. We thank Prof. Jian Guan and Feiyang Xiao for their valuable support in revising this paper and assisting with model training. <br />
If relevant, please feel free to consider citing our paper:

```bibtex
@ARTICLE{TMM_SoundScaper,
  author={Hou, Yuanbo and Ren, Qiaoqiao and Mitchell, Andrew and Wang, Wenwu and Kang, Jian and Belpaeme, Tony and Botteldooren, Dick},
  journal={IEEE Transactions on Multimedia},
  title={Soundscape Captioning Using Sound Affective Quality Network and Large Language Model},
  year={2026},
  volume={28},
  number={},
  pages={2186-2200},
  doi={10.1109/TMM.2026.3651023}}
```
- SoundSCaper
  - Introduction
  - Figure
    - 1. Overall framework of the automatic soundscape captioner (SoundSCaper)
    - 2. The acoustic model SoundAQnet simultaneously models acoustic scene (AS), audio event (AE), and emotion-related affective quality (AQ)
    - 3. Process of the LLM part in the SoundSCaper
    - 4. Spearman's rho correlation between different AQs and AEs predicted by SoundAQnet
    - 5. Spearman's rho correlation between different AEs and 8D AQs predicted by SoundAQnet
  - Run Sound-AQ models to predict the acoustic scene, audio event, and human-perceived affective qualities
  - More supplementary experiments' codes and results
  - Case study
## Introduction
### 0. SoundAQnet training steps (optional)
1) Dataset preparation

- Download and place the ARAUS dataset (ARAUS_repository) into the `Dataset_all_ARAUS` directory or the `Dataset_training_validation_test` directory (recommended).
- Follow the ARAUS steps (ARAUS_repository) to generate the raw audio dataset. The dataset is about 53 GB, so please reserve enough space when preparing it. (In WAV format, it may be about 134 GB.)
- Split the raw audio dataset according to the training, validation, and test audio file IDs in the `Dataset_training_validation_test` directory. The labels of our annotated acoustic scenes and audio events for the audio clips in the ARAUS dataset are placed in the `Dataset_all_ARAUS` and `Dataset_training_validation_test` directories.
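The split step above can be sketched as follows. This is a minimal illustration only: the file IDs and split names are made up here, whereas the real ID lists live in the `Dataset_training_validation_test` directory.

```python
import shutil, tempfile
from pathlib import Path

# Hypothetical ID lists; the real ones come from Dataset_training_validation_test
splits = {
    "training":   {"clip_00001.wav", "clip_00002.wav"},
    "validation": {"clip_00003.wav"},
    "test":       {"clip_00004.wav"},
}

def split_dataset(raw_dir: Path, out_dir: Path) -> dict:
    """Copy each clip in raw_dir into out_dir/<split> based on its file ID."""
    counts = {name: 0 for name in splits}
    for wav in raw_dir.glob("*.wav"):
        for name, ids in splits.items():
            if wav.name in ids:
                dest = out_dir / name
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy(wav, dest / wav.name)
                counts[name] += 1
    return counts

# Demo with empty files standing in for the ARAUS audio clips
root = Path(tempfile.mkdtemp())
raw = root / "raw"; raw.mkdir()
for ids in splits.values():
    for fid in ids:
        (raw / fid).touch()
print(split_dataset(raw, root / "splits"))  # {'training': 2, 'validation': 1, 'test': 1}
```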
2) Acoustic feature extraction
- **Log Mel spectrogram**: use the code in `Feature_log_mel` to extract log mel features.
  - Place the dataset into the `Dataset` folder.
  - If the audio files are not in `.wav` format, run `convert_flac_to_wav.py` first. (This may generate ~132 GB of data as WAV files.)
  - Run `log_mel_spectrogram.py`.
- **Loudness features (ISO 532-1)**: use the code in `Feature_loudness_ISO532_1` to extract the ISO 532-1:2017 standard loudness features.
  - Download the `ISO_532-1.exe` file (already placed in the `ISO_532_bin` folder).
  - Place the audio clips to be processed, in `.wav` format, into the `Dataset_wav` folder.
  - If the audio files are not in `.wav` format, use `convert_flac_to_wav.py` to convert them. (This may generate ~132 GB of data as WAV files.)
  - Run `ISO_loudness.py`.
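For reference, the log mel extraction step can be sketched in pure NumPy. The frame length, hop size, and 64 mel bands below are illustrative choices, not necessarily the settings used in `log_mel_spectrogram.py`.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular mel filters spanning 0..sr/2
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(y, sr=44100, n_fft=1024, hop=512, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # shape: (n_frames, n_mels)

# One second of noise as a stand-in for an audio clip
feat = log_mel(np.random.randn(44100))
print(feat.shape)  # (85, 64)
```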
3) Training SoundAQnet

- Prepare the training, validation, and test sets according to the corresponding files in the `Dataset_training_validation_test` directory.
- Modify the `DataGenerator_Mel_loudness_graph` function to load the dataset.
- Run `Training.py` in the `application` directory.
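A generator in the spirit of `DataGenerator_Mel_loudness_graph` can be sketched as below; the feature shapes and batch size are illustrative assumptions, not the function's actual signature.

```python
import numpy as np

def mel_loudness_generator(mel_feats, loudness_feats, labels,
                           batch_size=4, shuffle=True):
    """Yield aligned (mel, loudness, label) batches for training.
    Shapes here are illustrative stand-ins for the real features."""
    idx = np.arange(len(labels))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        yield mel_feats[batch], loudness_feats[batch], labels[batch]

# Synthetic stand-ins: 10 clips, 85-frame/64-band log mels, per-frame loudness
mel = np.random.randn(10, 85, 64)
loud = np.random.randn(10, 85, 1)
y = np.random.randint(0, 7, size=10)  # e.g. acoustic scene class indices
for m, l, t in mel_loudness_generator(mel, loud, y):
    print(m.shape, l.shape, t.shape)
```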
## 1. Use SoundAQnet to infer soundscape audio clips for LLMs
The Inferring_soundscape_clips_for_LLM part converts soundscape audio clips into predicted audio event probabilities, acoustic scene labels, and ISOP, ISOE, and emotion-related PAQ values. It bridges the acoustic model and the language model, organising the output of the acoustic model as input for the language model.
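The bridging step might look something like the sketch below, which flattens one clip's predictions into a text summary. The event names, scene label, and values are made-up examples, and the field layout is an assumption rather than the repository's actual output format.

```python
def format_for_llm(scene, isop, isoe, aq8, event_probs, top_k=3):
    """Summarise one clip's SoundAQnet-style predictions as plain text."""
    top = sorted(event_probs.items(), key=lambda kv: -kv[1])[:top_k]
    events = ", ".join(f"{name} ({p:.2f})" for name, p in top)
    aq = ", ".join(f"{dim}={v:.2f}" for dim, v in aq8.items())
    return (f"Acoustic scene: {scene}. "
            f"ISO Pleasantness: {isop:.2f}; ISO Eventfulness: {isoe:.2f}. "
            f"Affective qualities: {aq}. "
            f"Most likely audio events: {events}.")

# Illustrative values only, not real model output
summary = format_for_llm(
    scene="park",
    isop=0.42, isoe=-0.10,
    aq8={"pleasant": 3.8, "vibrant": 3.1, "eventful": 2.9, "chaotic": 1.7,
         "annoying": 1.5, "monotonous": 2.2, "uneventful": 2.4, "calm": 3.6},
    event_probs={"bird": 0.91, "human voice": 0.64, "traffic": 0.22, "wind": 0.18},
)
print(summary)
```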
1) Data preparation

- Place the log Mel feature files from the `Feature_log_mel` directory into the `Dataset_mel` directory.
- Place the ISO 532-1 loudness feature files from the `Feature_loudness_ISO532_1` directory into the `Dataset_wav_loudness` directory.
2) Run the inference script

- `cd application`, then run `python Inference_for_LLM.py`.
- The results, which will be fed into the LLM, are automatically saved into the corresponding directories: `SoundAQnet_event_probability` and `SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs`.
- There are four similar SoundAQnet models in the `system/model` directory; please feel free to use them:
  - SoundAQnet_ASC96_AEC94_PAQ1027.pth
  - SoundAQnet_ASC96_AEC94_PAQ1039.pth
  - SoundAQnet_ASC96_AEC94_PAQ1041.pth
  - SoundAQnet_ASC96_AEC95_PAQ1052.pth
3) Inference with other models
This part Inferring_soundscape_clips_for_LLM uses SoundAQnet to infer the values of audio events, acoustic scenes, and emotion-related AQs.
If you want to replace SoundAQnet with another model to generate the soundscape captions,
- replace
using_model = SoundAQnetinInference_for_LLM.pywith the code for that model, - and place the corresponding trained model into the
system/modeldirectory.
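The swap amounts to pointing `using_model` at a class with the same interface. Both classes below are hypothetical stand-ins; SoundAQnet's real constructor and prediction API may differ.

```python
# Hypothetical sketch of the model-swap pattern in Inference_for_LLM.py
class SoundAQnet:
    def predict(self, clip):
        return {"scene": "park", "events": [0.9, 0.1], "aq8": [3.5] * 8}

class MyOtherModel:
    """A stand-in acoustic model exposing the same predict() interface."""
    def predict(self, clip):
        return {"scene": "street", "events": [0.2, 0.8], "aq8": [2.0] * 8}

# The swap: point `using_model` at the replacement and keep the rest unchanged
using_model = MyOtherModel  # was: using_model = SoundAQnet
model = using_model()
print(model.predict(clip=None)["scene"])  # street
```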
4) Demonstration
Please see details here.
## 2. Generate soundscape captions using a generic LLM
This part, LLM_scripts_for_generating_soundscape_caption, loads the acoustic scene, audio events, and PAQ 8-dimensional affective quality values predicted by SoundAQnet for a soundscape audio clip, and then outputs the corresponding soundscape description. <br> Please fill in your OpenAI API key in LLM_GPT_soundscape_caption.py.
1) Data preparation

- Place the matrix file of audio event probabilities predicted by SoundAQnet into the `SoundAQnet_event_probability` directory.
- Place the SoundAQnet prediction file, including the predicted acoustic scene label, ISOP value, ISOE value, and the 8D AQ values, into the `SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs` directory.
2) Generate soundscape caption

- Replace "YOUR_API_KEY_HERE" in line 26 of the `LLM_GPT_soundscape_caption.py` file with your OpenAI API key.
- Run `LLM_GPT_soundscape_caption.py`.
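Conceptually, the script assembles the predicted scene, events, and affective qualities into a chat prompt. The sketch below shows only the prompt-assembly half; the wording is assumed, not the actual prompt in `LLM_GPT_soundscape_caption.py`, and the example stops before the API call.

```python
def build_messages(scene, top_events, aq_summary):
    """Build chat messages for a soundscape caption request (illustrative)."""
    system = ("You are a soundscape expert. Write one concise caption describing "
              "the acoustic environment and how it likely feels to a listener.")
    user = (f"Acoustic scene: {scene}. Salient audio events: {', '.join(top_events)}. "
            f"Perceived affective qualities: {aq_summary}.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

messages = build_messages(
    scene="public square",
    top_events=["human voices", "footsteps", "distant traffic"],
    aq_summary="fairly pleasant, moderately eventful, slightly vibrant",
)
print(messages[1]["content"])
# These messages would then be sent via the OpenAI chat completions API.
```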
3) Demonstration
Please see details here.
## 3. Expert evaluation of soundscape caption quality
Human_assessment contains:

- a call for experiment
- assessment audio dataset
- participant instruction file
- local and online questionnaires
- [Expert assessment results analysis](Huma
