Automatic soundscape captioner (SoundSCaper): Soundscape Captioning using Sound Affective Quality Network and Large Language Model

Paper link: IEEE TMM

ResearchGate: SoundSCaper

Acknowledgement

We thank Dr. Francesco Aletta for valuable discussions and Dr. Gunnar Cerwen for professional soundscape captions. We also thank the 32 audio/soundscape experts and general users who participated in the human evaluation experiments, including Boyan Zhang, Prof. Dr. Catherine Lavandier, Dr. Huizhong Zhang, Hupeng Wu, Jiayu Xie, Dr. Karlo Filipan, Kening Guo, Kenneth Ooi, Xiaochao Chen, Xiang Fang, Yi Yuan, Yanzhao Bi, Zibo Liu, Xinyi Che, Xin Shen, Chen Fang, Yanru Wu, Yuze Li, Zexi Lu, Shiheng Zhang, Xuefeng Yang, Tong Ye, Zeyu Xu, as well as three anonymous experts and six anonymous general users. We thank Prof. Jian Guan and Feiyang Xiao for their valuable support in revising this paper and assisting with model training.

If you find this work relevant, please consider citing our paper:

@ARTICLE{TMM_SoundScaper,
  author={Hou, Yuanbo and Ren, Qiaoqiao and Mitchell, Andrew and Wang, Wenwu and Kang, Jian and Belpaeme, Tony and Botteldooren, Dick},
  journal={IEEE Transactions on Multimedia}, 
  title={Soundscape Captioning Using Sound Affective Quality Network and Large Language Model}, 
  year={2026},
  volume={28},
  number={},
  pages={2186-2200}, 
  doi={10.1109/TMM.2026.3651023}}

Introduction

0. SoundAQnet training steps (Optional)

1) Dataset preparation

Our annotated acoustic scene and audio event labels for the audio clips in the ARAUS dataset are placed in the Dataset_all_ARAUS directory and the Dataset_training_validation_test directory.

2) Acoustic feature extraction

  • Log Mel spectrogram: use the code in Feature_log_mel to extract log mel features.

    • Place the dataset into the Dataset folder.
    • If the audio files are not in .wav format, run convert_flac_to_wav.py first. (This may generate ~132 GB of data as WAV files.)
    • Run log_mel_spectrogram.py
  • Loudness features (ISO 532-1): use the code in Feature_loudness_ISO532_1 to extract the ISO 532-1:2017 standard loudness features.

    • Download the ISO_532-1.exe file (it has already been placed in the ISO_532_bin folder).
    • Place the audio clips to be processed, in .wav format, into the Dataset_wav folder.
    • If the audio files are not in .wav format, use convert_flac_to_wav.py to convert them. (This may generate ~132 GB of data as WAV files.)
    • Run ISO_loudness.py
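
Before running either feature script, it can help to check which clips still need conversion and roughly how much disk space the WAV files will take. The sketch below is not part of the repository: the helper names are hypothetical, it assumes the non-WAV clips are FLAC files (as the name convert_flac_to_wav.py suggests), and the duration, sample rate, and channel count passed to the size helper are placeholders to be replaced with the real values of your clips.

```python
from pathlib import Path


def files_needing_conversion(dataset_dir):
    """Return the .flac files found anywhere under dataset_dir, sorted by path."""
    # Assumption: every clip that is not yet a WAV file is stored as FLAC.
    root = Path(dataset_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".flac"
    )


def wav_size_bytes(duration_s, sample_rate, channels, bytes_per_sample=2):
    """Size of the uncompressed PCM payload of one WAV file (header excluded)."""
    return int(duration_s * sample_rate * channels * bytes_per_sample)
```

For example, len(files_needing_conversion("Dataset")) * wav_size_bytes(duration_s, sample_rate, channels) gives a rough estimate of the space needed before converting.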

3) Training SoundAQnet

  • Prepare the training, validation, and test sets according to the corresponding files in the Dataset_training_validation_test directory
  • Modify the DataGenerator_Mel_loudness_graph function to load the dataset
  • Run Training.py in the application directory

1. Use SoundAQnet to infer soundscape audio clips for LLMs

This part, Inferring_soundscape_clips_for_LLM, converts the soundscape audio clips into predicted audio event probabilities, acoustic scene labels, and the ISOP, ISOE, and emotion-related PAQ values.

This part bridges the acoustic model and the language model, organising the output of the acoustic model in preparation for the input of the language model.
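
For readers unfamiliar with ISOP and ISOE: SoundAQnet predicts these values directly, but the sketch below illustrates how the two coordinates are commonly derived from the eight PAQ ratings, following the projection described in ISO/TS 12913-3. It assumes ratings on a 1 to 5 scale; the function name is hypothetical, and whether the ISOP/ISOE targets in this repository use exactly this normalisation is an assumption.

```python
import math

# The eight perceived affective quality (PAQ) attributes.
PAQ_NAMES = ("pleasant", "vibrant", "eventful", "chaotic",
             "annoying", "monotonous", "uneventful", "calm")


def iso_coordinates(paq):
    """Project eight PAQ ratings (1..5 scale assumed) onto (ISOP, ISOE)."""
    missing = [name for name in PAQ_NAMES if name not in paq]
    if missing:
        raise KeyError(f"missing PAQ ratings: {missing}")
    c = math.cos(math.radians(45))
    norm = 4 + math.sqrt(32)  # scales both coordinates into [-1, 1]
    isop = ((paq["pleasant"] - paq["annoying"])
            + c * (paq["calm"] - paq["chaotic"])
            + c * (paq["vibrant"] - paq["monotonous"])) / norm
    isoe = ((paq["eventful"] - paq["uneventful"])
            + c * (paq["chaotic"] - paq["calm"])
            + c * (paq["vibrant"] - paq["monotonous"])) / norm
    return isop, isoe
```

A clip rated 3 on every attribute maps to the origin, while maximal pleasant, calm, and vibrant ratings combined with minimal annoying, chaotic, and monotonous ratings push ISOP to its upper bound.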

1) Data preparation

  • Place the log Mel features files from the Feature_log_mel directory into the Dataset_mel directory

  • Place the ISO 532-1 loudness feature files from the Feature_loudness_ISO532_1 directory into the Dataset_wav_loudness directory
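
The two copy steps above can be scripted. This is a minimal sketch, assuming the feature files sit directly inside the source directories; stage_features is a hypothetical helper, not a function from the repository, and the exact sub-folder that holds the extracted features may differ in your checkout.

```python
import shutil
from pathlib import Path


def stage_features(src_dir, dst_dir):
    """Copy every file directly inside src_dir into dst_dir; return the copied names."""
    src, dst = Path(src_dir), Path(dst_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"feature directory not found: {src}")
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file():  # sub-directories are skipped, not copied
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return copied
```

Under these assumptions, stage_features("Feature_log_mel", "Dataset_mel") and stage_features("Feature_loudness_ISO532_1", "Dataset_wav_loudness") would cover both bullets.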

2) Run the inference script

  • Change into the application directory (cd application), then run python Inference_for_LLM.py

  • The results, which will be fed into the LLM, are saved automatically into the corresponding directories: SoundAQnet_event_probability and SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs

  • Four similar SoundAQnet models are provided in the system/model directory; any of them can be used:

    • SoundAQnet_ASC96_AEC94_PAQ1027.pth
    • SoundAQnet_ASC96_AEC94_PAQ1039.pth
    • SoundAQnet_ASC96_AEC94_PAQ1041.pth
    • SoundAQnet_ASC96_AEC95_PAQ1052.pth
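
To check which of these checkpoints are present before running inference, something like the following could be used. It is only a sketch: the helper names are hypothetical, the default path assumes the script is run from the directory that contains system/model, and the numeric tags in the file names are extracted as-is because their exact meaning is not stated here.

```python
import re
from pathlib import Path

# The four checkpoint names listed above.
CHECKPOINTS = [
    "SoundAQnet_ASC96_AEC94_PAQ1027.pth",
    "SoundAQnet_ASC96_AEC94_PAQ1039.pth",
    "SoundAQnet_ASC96_AEC94_PAQ1041.pth",
    "SoundAQnet_ASC96_AEC95_PAQ1052.pth",
]

_TAGS = re.compile(r"SoundAQnet_ASC(\d+)_AEC(\d+)_PAQ(\d+)\.pth$")


def parse_tags(filename):
    """Extract the three numeric tags from a checkpoint file name."""
    match = _TAGS.search(filename)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {filename}")
    asc, aec, paq = (int(g) for g in match.groups())
    return {"ASC": asc, "AEC": aec, "PAQ": paq}


def available_checkpoints(model_dir="system/model"):
    """Return the listed checkpoints that actually exist in model_dir."""
    root = Path(model_dir)
    return [name for name in CHECKPOINTS if (root / name).is_file()]
```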

3) Inference with other models

This part, Inferring_soundscape_clips_for_LLM, uses SoundAQnet to infer the values of audio events, acoustic scenes, and emotion-related AQs.

If you want to replace SoundAQnet with another model to generate the soundscape captions,

  • replace using_model = SoundAQnet in Inference_for_LLM.py with the code for that model,
  • and place the corresponding trained model into the system/model directory.

4) Demonstration

Please see details here.


2. Generate soundscape captions using a generic LLM

This part, LLM_scripts_for_generating_soundscape_caption, loads the acoustic scene, audio events, and PAQ 8-dimensional affective quality values corresponding to the soundscape audio clip predicted by SoundAQnet, and then outputs the corresponding soundscape descriptions.

Please fill in your OpenAI credentials in LLM_GPT_soundscape_caption.py, as described in step 2 below.

1) Data preparation

  • Place the matrix file of audio event probabilities predicted by SoundAQnet into the SoundAQnet_event_probability directory

  • Place the SoundAQnet prediction file, including the predicted acoustic scene label, ISOP value, ISOE value, and the 8D AQ values, into the SoundAQnet_scene_ISOPl_ISOEv_PAQ8DAQs directory
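
To make the role of these two inputs concrete, the sketch below shows one way the predictions could be turned into a text prompt for an LLM. It is purely illustrative: the actual prompt and file parsing in LLM_GPT_soundscape_caption.py are not shown here, and the function names, the top-k and threshold values, and the prompt wording are all assumptions.

```python
def top_events(event_probs, k=3, threshold=0.5):
    """Return up to k (label, probability) pairs at or above threshold, most probable first."""
    kept = [(label, p) for label, p in event_probs.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


def build_prompt(scene, event_probs, isop, isoe, paq):
    """Assemble a plain-text prompt from the acoustic model's predictions."""
    events = top_events(event_probs)
    event_text = ", ".join(f"{label} ({p:.2f})" for label, p in events)
    paq_text = ", ".join(f"{name}: {value:.2f}" for name, value in paq.items())
    return (
        f"Acoustic scene: {scene}\n"
        f"Audio events: {event_text or 'none above threshold'}\n"
        f"ISO pleasantness: {isop:.2f}; ISO eventfulness: {isoe:.2f}\n"
        f"Perceived affective qualities: {paq_text}\n"
        "Describe this soundscape in one or two sentences."
    )
```

Keeping only the most probable events keeps the prompt short; the cut-off is a design choice to tune, not a value taken from the paper.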

2) Generate soundscape caption

  • Replace "YOUR_API_KEY_HERE" on line 26 of the LLM_GPT_soundscape_caption.py file with your OpenAI API key

  • Run LLM_GPT_soundscape_caption.py
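
If you prefer not to store the key in the source file, one alternative is to read it from an environment variable. This is a suggestion rather than how the repository's script works: load_api_key is a hypothetical helper, and you would need to edit the script yourself to call it in place of the hard-coded string.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and reject empty or placeholder values."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(f"set the {env_var} environment variable to your OpenAI API key")
    return key
```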

3) Demonstration

Please see details here.


3. Expert evaluation of soundscape caption quality

Human_assessment contains

  1. a call for experiment

  2. assessment raw materials

  3. [Expert assessment results analysis](Huma
