# VGGSounder

VGGSounder is a multi-label audio-visual classification dataset with per-label modality annotations.
## 📰 News

- [11.06.2025] 📃 Released the technical report of VGGSounder, with a detailed discussion of how we built the first multimodal benchmark for video tagging with complete per-modality annotations for every class.
## 🌟 Introduction

VGGSounder is a re-annotated benchmark built upon the VGGSound dataset, designed to rigorously evaluate audio-visual foundation models and to analyze how they use each modality. VGGSounder introduces:
- 🔍 Per-label modality tags (audible / visible / both) for every label in each sample
- 🎵 Meta labels for background music, voice-over, and static images
- 📊 Multiple labels per sample
## 🚀 Installation

The VGGSounder dataset is available as a Python package. Install it via pip:

```shell
pip install vggsounder
```

Or install from source using uv:

```shell
git clone https://github.com/bizilizi/vggsounder.git
cd vggsounder
uv build
pip install dist/vggsounder-*.whl
```
## 🐍 Python Package Usage

### Quick Start

```python
import vggsounder

# Load the dataset
labels = vggsounder.VGGSounder()

# Access video data by ID
video_data = labels["--U7joUcTCo_000000"]
print(video_data.labels)       # List of labels for this video
print(video_data.meta_labels)  # Meta labels (background_music, static_image, voice_over)
print(video_data.modalities)   # Modality for each label (A, V, AV)

# Get dataset statistics
stats = labels.stats()
print(f"Total videos: {stats['total_videos']}")
print(f"Unique labels: {stats['unique_labels']}")

# Search functionality
piano_videos = labels.get_videos_with_labels("playing piano")
voice_over_videos = labels.get_videos_with_meta(voice_over=True)
```
### Downloading Dataset Samples

You can optionally download the underlying video/audio samples and attach them to each `VideoData` item by passing `download_samples=True`. This uses the HuggingFace dataset under the hood.

```python
from vggsounder.labels import VGGSounder

# Enable sample download
vggsounder = VGGSounder(download_samples=True)

# Access a sample by index or video_id
sample = vggsounder[0]
print(sample.video_id)
print(sample.video is not None, sample.audio is not None)
```
To preview samples in a notebook:

```python
import base64

from IPython.display import HTML, display

# Embed the raw MP4 bytes as a base64 data URI
video_b64 = base64.b64encode(sample.video).decode("utf-8")
video_html = f'''
<h4>Video</h4>
<video width="480" height="360" controls>
    <source src="data:video/mp4;base64,{video_b64}" type="video/mp4">
    Your browser does not support the video tag.
</video>
'''
display(HTML(video_html))
```
### Advanced Usage

```python
# Dict-like interface
print(len(labels))           # Number of videos
print("video_id" in labels)  # Check whether a video exists
for video_id in labels:      # Iterate over video IDs
    video_data = labels[video_id]

# Get all unique labels
all_labels = labels.get_all_labels()

# Complex queries
static_speech_videos = labels.get_videos_with_meta(
    static_image=True, voice_over=True
)
```
## 🏷️ Label Format

VGGSounder annotations are stored as CSV files at `vggsounder/data/vggsounder.csv` and `vggsounder/data/vggsounder+background-music.csv`. Each row corresponds to a single label for a specific video sample. The dataset supports multi-label, multi-modal classification with additional meta information for robust evaluation.

### Columns

- `video_id`: Unique identifier for a 10-second video clip.
- `label`: Human-readable label representing a sound or visual category (e.g. `male singing`, `playing timpani`).
- `modality`: The modality in which the label is perceivable:
  - `A` = Audible
  - `V` = Visible
  - `AV` = Both audible and visible
- `background_music`: `True` if the video contains background music.
- `static_image`: `True` if the video consists of a static image.
- `voice_over`: `True` if the video contains voice-over narration.
### Example
| video_id | label | modality | background_music | static_image | voice_over |
|--------------------|------------------|----------|------------------|--------------|------------|
| ---g-f_I2yQ_000001 | male singing | A | True | False | False |
| ---g-f_I2yQ_000001 | people crowd | AV | True | False | False |
| ---g-f_I2yQ_000001 | playing timpani | A | True | False | False |
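The rows above can also be loaded and regrouped directly with pandas. A minimal sketch using a toy DataFrame with the same schema (in practice you would `pd.read_csv` one of the files above):

```python
import pandas as pd

# Toy rows mirroring the CSV schema shown in the example table
df = pd.DataFrame(
    [
        ("---g-f_I2yQ_000001", "male singing", "A", True, False, False),
        ("---g-f_I2yQ_000001", "people crowd", "AV", True, False, False),
        ("---g-f_I2yQ_000001", "playing timpani", "A", True, False, False),
    ],
    columns=["video_id", "label", "modality",
             "background_music", "static_image", "voice_over"],
)

# Group rows into one multi-label record per clip
per_clip = df.groupby("video_id")["label"].apply(list)
print(per_clip["---g-f_I2yQ_000001"])  # ['male singing', 'people crowd', 'playing timpani']

# Select only annotations that are audible but not visible
audio_only = df[df["modality"] == "A"]
print(len(audio_only))  # 2
```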
## 🧪 Benchmark Evaluation
VGGSounder provides a comprehensive benchmarking system to evaluate audio-visual foundation models across multiple modalities and metrics. The benchmark supports both discrete predictions and continuous logits-based evaluation.
### Supported Modalities

- `a` (Audio): includes samples with an audio component (A + AV)
- `v` (Visual): includes samples with a visual component (V + AV)
- `av` (Audio-Visual): samples with both modalities (AV only)
- `a only` (Audio-only): pure audio samples (excludes AV samples)
- `v only` (Visual-only): pure visual samples (excludes AV samples)
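The subset definitions above can be sketched as a test over the modality tags of a sample's ground-truth labels. This is a hypothetical helper for illustration, not the package's own selection logic:

```python
def in_subset(modalities, subset):
    """Decide whether a sample belongs to an evaluation subset, given a
    mapping from each of its ground-truth labels to "A", "V", or "AV".
    Illustrative only; the package may select samples differently."""
    mods = set(modalities.values())
    if subset == "a":        # any audio component (A + AV)
        return bool(mods & {"A", "AV"})
    if subset == "v":        # any visual component (V + AV)
        return bool(mods & {"V", "AV"})
    if subset == "av":       # at least one label in both modalities
        return "AV" in mods
    if subset == "a only":   # purely audible, no AV labels
        return "A" in mods and "AV" not in mods
    if subset == "v only":   # purely visible, no AV labels
        return "V" in mods and "AV" not in mods
    raise ValueError(f"unknown subset: {subset}")

sample = {"male singing": "A", "people crowd": "AV"}
print(in_subset(sample, "a"))       # True
print(in_subset(sample, "a only"))  # False: the sample has an AV label
```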
### Available Metrics

The benchmark computes a comprehensive set of metrics:

- Top-k metrics: `hit_rate@k`, `f1@k`, `accuracy@k`, `precision@k`, `recall@k`, `jaccard@k` (for k = 1, 3, 5, 10)
- Aggregate metrics: `f1`, `f1_macro`, `accuracy`, `precision`, `recall`, `jaccard`, `hit_rate`
- AUC metrics: `auc_roc`, `auc_pr` (ROC-AUC and Precision-Recall AUC)
- Modality confusion: `mu` (measures when single modalities succeed where multimodal fails)
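To make the top-k family concrete, here is a minimal sketch of the `hit_rate@k` idea computed from logits. The package's own implementation may aggregate and tie-break differently:

```python
import numpy as np

def hit_rate_at_k(logits, true_indices, k):
    """Return 1.0 if any ground-truth class index appears among the
    k highest-scoring classes, else 0.0. Per-sample sketch only."""
    top_k = np.argsort(logits)[::-1][:k]  # indices of the k largest scores
    return float(any(idx in top_k for idx in true_indices))

logits = np.array([0.1, 0.8, 0.3, 0.05])
print(hit_rate_at_k(logits, true_indices=[2], k=1))  # 0.0 (top-1 is class 1)
print(hit_rate_at_k(logits, true_indices=[2], k=2))  # 1.0 (class 2 is in the top-2)
```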
### Model Results Format
Model predictions should be saved as pickle files with the following structure:
```python
{
    "video_id": {
        "predictions": {                      # Optional: discrete predictions
            "a": ["label1", "label2", ...],   # Audio predictions
            "v": ["label1", "label3", ...],   # Visual predictions
            "av": ["label1", "label2", ...],  # Audio-visual predictions
        },
        "logits": {                           # Optional: continuous scores
            "a": [0.1, 0.8, 0.3, ...],        # Audio logits (310 classes)
            "v": [0.2, 0.1, 0.9, ...],        # Visual logits (310 classes)
            "av": [0.4, 0.6, 0.2, ...],       # Audio-visual logits (310 classes)
        },
    },
    # ... more video_ids
}
```
**Note:** Either `predictions` or `logits` (or both) must be provided. Logits enable more detailed top-k and AUC analysis.
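Writing such a results file is plain `pickle` serialization. A toy example with made-up predictions, following the structure above:

```python
import pickle

# A minimal results dict in the documented format (labels are illustrative)
results = {
    "--U7joUcTCo_000000": {
        "predictions": {
            "a": ["male singing"],
            "v": ["people crowd"],
            "av": ["male singing", "people crowd"],
        },
        # "logits" could be added here as 310-dimensional score lists
    }
}

with open("my_model.pkl", "wb") as f:
    pickle.dump(results, f)

# Round-trip check
with open("my_model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["--U7joUcTCo_000000"]["predictions"]["av"])
```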
### Running the Benchmark

#### Quick Start
```python
from vggsounder.benchmark import benchmark

# Define model display names
display_names = {
    "cav-mae": "CAV-MAE",
    "deepavfusion": "DeepAVFusion",
    "equiav": "Equi-AV",
    "gemini-1.5-flash": "Gemini 1.5 Flash",
    "gemini-1.5-pro": "Gemini 1.5 Pro",
}

# Specify metrics and modalities to evaluate
metrics = [
    ("accuracy", ["a", "v", "av"]),
    ("f1", ["a", "v", "av", "a only", "v only"]),
    ("hit_rate", ["a", "v", "av"]),
    ("mu", ["a", "v", "av"]),  # Modality confusion
]

# Run the benchmark
results_table = benchmark(
    models_path="path/to/model/pickles",
    display_names=display_names,
    metrics=metrics,
)
print(results_table)
```
For a detailed example of how we generate the tables used in our paper, please see the example notebook.
### Detailed Modality Confusion Analysis
VGGSounder provides a specialized function for analyzing modality confusion at the sample level, helping you understand why certain samples exhibit confusion between unimodal and multimodal predictions.
```python
from vggsounder import VGGSounder
from vggsounder.benchmark import analyze_modality_confusion_detailed

# Analyze modality confusion for a specific model
confusion_analysis = analyze_modality_confusion_detailed(
    models_path="path/to/model/pickles",
    model_name="gemini-1.5-flash",  # Model name without the .pkl extension
    vggsounder=VGGSounder(background_music=None, voice_over=None, static_image=None),
)
print(f"Found {len(confusion_analysis)} samples with modality confusion")

# Filter by specific confusion types
audio_confused = confusion_analysis[confusion_analysis["confused_a"] == True]
visual_confused = confusion_analysis[confusion_analysis["confused_v"] == True]
combined_confused = confusion_analysis[confusion_analysis["confused_av"] == True]

print(f"Audio confusion: {len(audio_confused)} samples")
print(f"Visual confusion: {len(visual_confused)} samples")
print(f"Combined confusion: {len(combined_confused)} samples")

# Examine specific confused samples
display_cols = ["id", "ground_truth", "pred_a", "pred_v", "pred_av",
                "confused_a", "confused_v", "confused_av"]
print("\nFirst 3 audio-confused samples:")
print(audio_confused[display_cols].head(3))
```
