# VGGSounder

VGGSounder is a multi-label audio-visual classification dataset with per-label modality annotations.
## 📰 News

- [11.06.2025] 📃 Released the technical report of VGGSounder, with a detailed discussion of how we built the first multimodal benchmark for video tagging with complete per-modality annotations for every class.
## 🌟 Introduction

VGGSounder is a re-annotated benchmark built upon the VGGSound dataset, designed to rigorously evaluate audio-visual foundation models and to analyze how they use each modality. VGGSounder introduces:
- 🔍 Per-label modality tags (audible / visible / both) for every label in each sample
- 🎵 Meta labels for background music, voice-over, and static images
- 📊 Multiple labels per sample
## 🚀 Installation

The VGGSounder dataset is available as a Python package. Install it via pip:

```shell
pip install vggsounder
```

Or install from source using uv:

```shell
git clone https://github.com/bizilizi/vggsounder.git
cd vggsounder
uv build
pip install dist/vggsounder-*.whl
```
## 🐍 Python Package Usage

### Quick Start

```python
import vggsounder

# Load the dataset
labels = vggsounder.VGGSounder()

# Access video data by ID
video_data = labels["--U7joUcTCo_000000"]
print(video_data.labels)       # List of labels for this video
print(video_data.meta_labels)  # Meta labels (background_music, static_image, voice_over)
print(video_data.modalities)   # Modality for each label (A, V, AV)

# Get dataset statistics
stats = labels.stats()
print(f"Total videos: {stats['total_videos']}")
print(f"Unique labels: {stats['unique_labels']}")

# Search functionality
piano_videos = labels.get_videos_with_labels("playing piano")
voice_over_videos = labels.get_videos_with_meta(voice_over=True)
```
### Downloading Dataset Samples

You can optionally download the underlying video/audio samples and attach them to each `VideoData` item by passing `download_samples=True`. This uses the HuggingFace dataset under the hood.

```python
from vggsounder.labels import VGGSounder

# Enable sample download
vggsounder = VGGSounder(download_samples=True)

# Access a sample by index or video_id
sample = vggsounder[0]
print(sample.video_id)
print(sample.video is not None, sample.audio is not None)
```
To preview samples in a notebook:

```python
import base64

from IPython.display import HTML, display

# Embed the raw MP4 bytes as a base64 data URI
video_b64 = base64.b64encode(sample.video).decode("utf-8")
video_html = f'''
<h4>Video</h4>
<video width="480" height="360" controls>
    <source src="data:video/mp4;base64,{video_b64}" type="video/mp4">
    Your browser does not support the video tag.
</video>
'''
display(HTML(video_html))
```
### Advanced Usage

```python
# Dict-like interface
print(len(labels))           # Number of videos
print("video_id" in labels)  # Check whether a video exists
for video_id in labels:      # Iterate over video IDs
    video_data = labels[video_id]

# Get all unique labels
all_labels = labels.get_all_labels()

# Complex queries
static_speech_videos = labels.get_videos_with_meta(
    static_image=True, voice_over=True
)
```
## 🏷️ Label Format

VGGSounder annotations are stored as CSV files at `vggsounder/data/vggsounder.csv` and `vggsounder/data/vggsounder+background-music.csv`. Each row corresponds to a single label for a specific video sample. The dataset supports multi-label, multi-modal classification with additional meta information for robust evaluation.

### Columns

- `video_id`: Unique identifier for a 10-second video clip.
- `label`: Human-readable label representing a sound or visual category (e.g. `male singing`, `playing timpani`).
- `modality`: The modality in which the label is perceivable:
  - `A` = Audible
  - `V` = Visible
  - `AV` = Both audible and visible
- `background_music`: `True` if the video contains background music.
- `static_image`: `True` if the video consists of a static image.
- `voice_over`: `True` if the video contains voice-over narration.
### Example
| video_id | label | modality | background_music | static_image | voice_over |
|--------------------|------------------|----------|------------------|--------------|------------|
| ---g-f_I2yQ_000001 | male singing | A | True | False | False |
| ---g-f_I2yQ_000001 | people crowd | AV | True | False | False |
| ---g-f_I2yQ_000001 | playing timpani | A | True | False | False |
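The rows above can also be loaded and regrouped directly with pandas. A minimal sketch using a toy DataFrame with the same schema (in practice you would `pd.read_csv` one of the files above):

```python
import pandas as pd

# Toy rows mirroring the CSV schema shown in the example table
df = pd.DataFrame(
    [
        ("---g-f_I2yQ_000001", "male singing", "A", True, False, False),
        ("---g-f_I2yQ_000001", "people crowd", "AV", True, False, False),
        ("---g-f_I2yQ_000001", "playing timpani", "A", True, False, False),
    ],
    columns=["video_id", "label", "modality",
             "background_music", "static_image", "voice_over"],
)

# Group rows into one multi-label record per clip
per_clip = df.groupby("video_id")["label"].apply(list)
print(per_clip["---g-f_I2yQ_000001"])  # ['male singing', 'people crowd', 'playing timpani']

# Select only annotations that are audible but not visible
audio_only = df[df["modality"] == "A"]
print(len(audio_only))  # 2
```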
## 🧪 Benchmark Evaluation
VGGSounder provides a comprehensive benchmarking system to evaluate audio-visual foundation models across multiple modalities and metrics. The benchmark supports both discrete predictions and continuous logits-based evaluation.
### Supported Modalities

- `a` (Audio): includes samples with an audio component (A + AV)
- `v` (Visual): includes samples with a visual component (V + AV)
- `av` (Audio-Visual): samples with both modalities (AV only)
- `a only` (Audio-only): pure audio samples (excludes AV samples)
- `v only` (Visual-only): pure visual samples (excludes AV samples)
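The subset definitions above can be sketched as a test over the modality tags of a sample's ground-truth labels. This is a hypothetical helper for illustration, not the package's own selection logic:

```python
def in_subset(modalities, subset):
    """Decide whether a sample belongs to an evaluation subset, given a
    mapping from each of its ground-truth labels to "A", "V", or "AV".
    Illustrative only; the package may select samples differently."""
    mods = set(modalities.values())
    if subset == "a":        # any audio component (A + AV)
        return bool(mods & {"A", "AV"})
    if subset == "v":        # any visual component (V + AV)
        return bool(mods & {"V", "AV"})
    if subset == "av":       # at least one label in both modalities
        return "AV" in mods
    if subset == "a only":   # purely audible, no AV labels
        return "A" in mods and "AV" not in mods
    if subset == "v only":   # purely visible, no AV labels
        return "V" in mods and "AV" not in mods
    raise ValueError(f"unknown subset: {subset}")

sample = {"male singing": "A", "people crowd": "AV"}
print(in_subset(sample, "a"))       # True
print(in_subset(sample, "a only"))  # False: the sample has an AV label
```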
### Available Metrics

The benchmark computes a comprehensive set of metrics:

- Top-k metrics: `hit_rate@k`, `f1@k`, `accuracy@k`, `precision@k`, `recall@k`, `jaccard@k` (for k = 1, 3, 5, 10)
- Aggregate metrics: `f1`, `f1_macro`, `accuracy`, `precision`, `recall`, `jaccard`, `hit_rate`
- AUC metrics: `auc_roc`, `auc_pr` (ROC-AUC and Precision-Recall AUC)
- Modality confusion: `mu` (measures when single modalities succeed where multimodal fails)
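To make the top-k family concrete, here is a minimal sketch of the `hit_rate@k` idea computed from logits. The package's own implementation may aggregate and tie-break differently:

```python
import numpy as np

def hit_rate_at_k(logits, true_indices, k):
    """Return 1.0 if any ground-truth class index appears among the
    k highest-scoring classes, else 0.0. Per-sample sketch only."""
    top_k = np.argsort(logits)[::-1][:k]  # indices of the k largest scores
    return float(any(idx in top_k for idx in true_indices))

logits = np.array([0.1, 0.8, 0.3, 0.05])
print(hit_rate_at_k(logits, true_indices=[2], k=1))  # 0.0 (top-1 is class 1)
print(hit_rate_at_k(logits, true_indices=[2], k=2))  # 1.0 (class 2 is in the top-2)
```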
### Model Results Format
Model predictions should be saved as pickle files with the following structure:
```python
{
    "video_id": {
        "predictions": {                      # Optional: discrete predictions
            "a": ["label1", "label2", ...],   # Audio predictions
            "v": ["label1", "label3", ...],   # Visual predictions
            "av": ["label1", "label2", ...],  # Audio-visual predictions
        },
        "logits": {                           # Optional: continuous scores
            "a": [0.1, 0.8, 0.3, ...],        # Audio logits (310 classes)
            "v": [0.2, 0.1, 0.9, ...],        # Visual logits (310 classes)
            "av": [0.4, 0.6, 0.2, ...],       # Audio-visual logits (310 classes)
        },
    },
    # ... more video_ids
}
```
**Note:** Either `predictions` or `logits` (or both) must be provided. Logits enable more detailed top-k and AUC analysis.
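Writing such a results file is plain `pickle` serialization. A toy example with made-up predictions, following the structure above:

```python
import pickle

# A minimal results dict in the documented format (labels are illustrative)
results = {
    "--U7joUcTCo_000000": {
        "predictions": {
            "a": ["male singing"],
            "v": ["people crowd"],
            "av": ["male singing", "people crowd"],
        },
        # "logits" could be added here as 310-dimensional score lists
    }
}

with open("my_model.pkl", "wb") as f:
    pickle.dump(results, f)

# Round-trip check
with open("my_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["--U7joUcTCo_000000"]["predictions"]["av"])
```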
### Running the Benchmark

#### Quick Start
```python
from vggsounder.benchmark import benchmark

# Define model display names
display_names = {
    "cav-mae": "CAV-MAE",
    "deepavfusion": "DeepAVFusion",
    "equiav": "Equi-AV",
    "gemini-1.5-flash": "Gemini 1.5 Flash",
    "gemini-1.5-pro": "Gemini 1.5 Pro",
}

# Specify metrics and modalities to evaluate
metrics = [
    ("accuracy", ["a", "v", "av"]),
    ("f1", ["a", "v", "av", "a only", "v only"]),
    ("hit_rate", ["a", "v", "av"]),
    ("mu", ["a", "v", "av"]),  # Modality confusion
]

# Run the benchmark
results_table = benchmark(
    models_path="path/to/model/pickles",
    display_names=display_names,
    metrics=metrics,
)
print(results_table)
```
For a detailed example of how we generate the tables used in our paper, please see the example notebook.
### Detailed Modality Confusion Analysis
VGGSounder provides a specialized function for analyzing modality confusion at the sample level, helping you understand why certain samples exhibit confusion between unimodal and multimodal predictions.
```python
from vggsounder import VGGSounder
from vggsounder.benchmark import analyze_modality_confusion_detailed

# Analyze modality confusion for a specific model
confusion_analysis = analyze_modality_confusion_detailed(
    models_path="path/to/model/pickles",
    model_name="gemini-1.5-flash",  # Model name without the .pkl extension
    vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None),
)
print(f"Found {len(confusion_analysis)} samples with modality confusion")

# Filter by specific confusion types
audio_confused = confusion_analysis[confusion_analysis["confused_a"] == True]
visual_confused = confusion_analysis[confusion_analysis["confused_v"] == True]
combined_confused = confusion_analysis[confusion_analysis["confused_av"] == True]

print(f"Audio confusion: {len(audio_confused)} samples")
print(f"Visual confusion: {len(visual_confused)} samples")
print(f"Combined confusion: {len(combined_confused)} samples")

# Examine specific confused samples
display_cols = ["id", "ground_truth", "pred_a", "pred_v", "pred_av",
                "confused_a", "confused_v", "confused_av"]
print("\nFirst 3 audio-confused samples:")
print(audio_confused[display_cols].head(3))
```
