AudioTrust: Benchmarking the Multi-faceted Trustworthiness of Audio Large Language Models
AudioTrust is a large-scale benchmark designed to evaluate the multifaceted trustworthiness of Audio Large Language Models (ALLMs). It examines model behavior across six critical dimensions: hallucination, robustness, authentication, privacy, fairness, and safety.
💥 News
- [2026-01-26] AudioTrust got accepted to ICLR'26! 🚀
- [2025-09-30] Added support for Kimi-Audio, Step-Fun, Step-Audio2, OpenS2S, and Qwen2.5-Omni.
- [2025-05-16] We release the AudioTrust benchmark! 🚀
📌 Table of Contents
- 🔍 Overview
- 📁 Repository Structure
- 📦 Dataset Description
- 🧪 Scripts Overview
- 🚀 Quick Start
- 📊 Benchmark Tasks
- 📌 Citation
- 🙏 Acknowledgements
- 📬 Contact
🔍 Overview
- 🎯 Hallucination: Fabricating content unsupported by audio
- 🛡️ Robustness: Performance under audio degradation
- 🧑‍💻 Authentication: Resistance to speaker spoofing/cloning
- 🕵️ Privacy: Avoiding leakage of personal/private content
- ⚖️ Fairness: Consistency across demographic factors
- 🚨 Safety: Generating safe, non-toxic, legal content

The benchmark provides:
- ✅ Expert-annotated prompts across six sub-datasets
- 🔬 Model-vs-model evaluation with judge LLMs (e.g., GPT-4o)
- 📈 Baseline results and reproducible evaluation scripts
📁 Repository Structure
AudioTrust/
├── assets/ # Logo and visual assets
├── audio_evals/ # Core evaluation engine
│ ├── agg/ # Metric aggregation logic
│ ├── dataset/ # Dataset preprocessing
│ ├── evaluator/ # Scoring logic
│ ├── process/, models/, prompt/, lib/ # Support code
│ ├── eval_task.py # Evaluation controller
│ ├── isolate.py # Single model inference
│ ├── recorder.py # Output logging
│ ├── registry.py # Registry entrypoint
│ └── utils.py # Shared utilities
│
├── registry/ # Modular registry structure
│ ├── agg/, dataset/, eval_task/, evaluator/, model/, prompt/, process/, recorder/
│
├── scripts/ # Shell scripts per task
│ └── hallucination/
│ ├── inference/
│ └── evaluation/
├── data/ # Organized audio files by task
│ ├── hallucination/, robustness/, privacy/, fairness/, authentication/, safety/
├── res/ # Outputs and logs
├── tests/, utils/ # Tests and preprocessing
├── main.py # Main execution entry
├── requirments.txt
├── requirments-offline-model.txt
└── README.md
📦 Dataset Description
- Language: English
- Audio Format: WAV, mono, 16 kHz
- Size: ~10.4GB across 6 sub-datasets
Each sample includes:
- Audio: decoded waveform (if using the Hugging Face loader)
- AudioPath: path to the original WAV file
- InferencePrompt: prompt used for model response generation
- EvaluationPrompt: prompt for the evaluator model
- Ref: reference (expected) answer for scoring
Sub-datasets:
{hallucination, robustness, authentication, privacy, fairness, safety}
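The per-sample schema can be illustrated with a minimal sketch. The keys below follow the field list above; the values are hypothetical placeholders, not real benchmark data:

```python
# Minimal sketch of the AudioTrust per-sample schema described above.
# Keys follow the README; the values here are hypothetical placeholders.
sample = {
    "AudioPath": "data/hallucination/example.wav",        # path to original WAV
    "InferencePrompt": "Describe what you hear.",          # prompt for the target model
    "EvaluationPrompt": "Score the response for groundedness.",  # prompt for the judge
    "Ref": "Spoken English, no background music.",         # reference answer
}

# Typical access pattern when iterating a loaded split:
for key in ("AudioPath", "InferencePrompt", "EvaluationPrompt", "Ref"):
    print(f"{key}: {sample[key]}")
```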
🧪 Scripts Overview
Each subtask contains:
| Folder | Purpose |
| ------------- | ----------------------------------------------------------------- |
| inference/ | Use a target model (e.g., Gemini) to generate responses |
| evaluation/ | Use an evaluator model (e.g., GPT-4o) to assess generated outputs |
This supports model-vs-model evaluation pipelines.
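In outline, the two stages compose as follows. This is a conceptual sketch with hypothetical stand-in functions, not the repository's actual pipeline (which is wired through `main.py` and the registry):

```python
# Two-stage model-vs-model evaluation, in outline.
# `run_target_model` and `run_judge_model` are hypothetical stand-ins
# for the inference/ and evaluation/ stages described above.
def evaluate_sample(sample, run_target_model, run_judge_model):
    # Stage 1: the target model (e.g. Gemini) answers the inference prompt.
    response = run_target_model(sample["AudioPath"], sample["InferencePrompt"])
    # Stage 2: the judge model (e.g. GPT-4o) scores the response against
    # the evaluation prompt and the reference answer.
    verdict = run_judge_model(sample["EvaluationPrompt"], response, sample["Ref"])
    return {"response": response, "verdict": verdict}
```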
🧩 Example: Hallucination Task
scripts/hallucination/
├── inference/
│ └── gemini-2.5-pro.sh
└── evaluation/
└── gpt-4o.sh
🚀 Quick Start
1. Install Dependencies
git clone https://github.com/JusperLee/AudioTrust.git
cd AudioTrust
pip install -r requirments.txt
Or for offline model use:
pip install -r requirments-offline-model.txt
2. Load Dataset from Hugging Face
from datasets import load_dataset
dataset = load_dataset("JusperLee/AudioTrust", split="hallucination")
Materialize the HF dataset to the project data/ layout
If you plan to run the evaluation scripts that expect a local data/ folder, first materialize the Hugging Face dataset into the required directory structure:
python utils/materialize_hf_audio.py --dataset-path JusperLee/AudioTrust
3. Run Inference and Evaluation
# Make sure your API keys are set before running:
export OPENAI_API_KEY=your-openai-api-key
export GOOGLE_API_KEY=your-google-api-key
# Step 1: Run inference with Gemini
bash scripts/hallucination/inference/gemini-2.5-pro.sh
# Step 2: Run evaluation using GPT-4o
bash scripts/hallucination/evaluation/gpt-4o.sh
Or directly with Python:
export OPENAI_API_KEY=your-openai-api-key
python main.py \
--dataset hallucination-content_mismatch \
--prompt hallucination-inference-content-mismatch-exp1-v1 \
--model gemini-1.5-pro
📊 Benchmark Tasks
| Task | Metric | Description |
| ----------------------- | ------------------- | --------------------------------------- |
| Hallucination Detection | Accuracy / Recall | Groundedness of response in audio |
| Robustness Evaluation | Accuracy / Δ Score | Performance drop under corruption |
| Authentication Testing | Attack Success Rate | Resistance to spoofing / voice cloning |
| Privacy Leakage | Leakage Rate | Does the model leak private content? |
| Fairness Auditing | Bias Index | Demographic response disparity |
| Safety Assessment | Violation Score | Generation of unsafe or harmful content |
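For the rate-style metrics above (Attack Success Rate, Leakage Rate), aggregation reduces to the fraction of judge verdicts flagged positive. A minimal illustration follows; it is not the repository's own aggregation code, which lives under `audio_evals/agg/`:

```python
def rate(verdicts):
    """Fraction of judge verdicts flagged positive (e.g. a privacy leak
    or a successful spoofing attack). Simple illustration of rate-style
    metrics; not the benchmark's actual aggregation implementation."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)

# e.g. privacy leakage over five judged responses:
print(rate([True, False, False, True, False]))  # 0.4
```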
📌 Citation
@inproceedings{li2025audiotrust,
  title={AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models},
  author={Li, Kai and Shen, Can and Liu, Yile and Han, Jirui and Zheng, Kelong and Zou, Xuechao and Wang, Zhe and Du, Xingjian and Zhang, Shun and Luo, Hanjun and others},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
🙏 Acknowledgements
We gratefully acknowledge UltraEval-Audio for providing the core infrastructure that inspired and supported parts of this benchmark.
📬 Contact
For questions or collaboration inquiries:
- Kai Li: tsinghua.kaili@gmail.com, Xinfeng Li: lxfmakeit@gmail.com
- Project Page — Coming Soon
