Wilder Active Speaker Detection (WASD) Dataset
This repository contains the code and data for our paper (TBIOM 2025):
WASD: A Wilder Active Speaker Detection Dataset
Tiago Roxo, Joana Cabral Costa, Pedro R. M. Inácio, and Hugo Proença
For further details about WASD, please visit our dataset website
⭐ What's New
Last updated: 2026-03-19
| | Update | Description |
|---|---|---|
| 📥 | New download method | We will update this repository with direct links to preprocessed WASD |
| 💻 | Code update | We fixed minor errors in preparing and downloading WASD content |
Wilder Active Speaker Detection (WASD) dataset has increased difficulty by targeting the two key components of current Active Speaker Detection: audio and face. Grouped into 5 categories, ranging from optimal conditions to surveillance settings, WASD contains incremental challenges for Active Speaker Detection with tactical impairment of audio and face data.
Considered categories of WASD, with relative audio and face quality represented. Categories range from low (Optimal Conditions) to high (Surveillance Settings) ASD difficulty by varying audio and face quality. Easier categories contain similar characteristics to AVA-ActiveSpeaker (AVA-like), while harder ones are the novelty of WASD.
Categories
- Optimal Conditions: People talking in an alternate manner, with minor interruptions, cooperative poses, and face availability;

- Speech Impairment: Frontal pose subjects either talking via video conference call (Delayed Speech) or in a heated discussion, with potential talking overlap (Speech Overlap), but ensuring face availability;

- Face Occlusion: People talking with at least one of the subjects having partial facial occlusion, while keeping good speech quality (no delayed speech and minor communication overlap);

- Human Voice Noise: Communication between speakers where another human voice is playing in the background, with face availability and subject cooperation ensured;

- Surveillance Settings: Speaker communication in scenarios of video surveillance, with varying audio and image quality, without any guarantee of face access, speech quality, or subject cooperation.

📊 State-of-the-art Results
Models Trained on AVA-ActiveSpeaker
Comparison of AVA-ActiveSpeaker trained state-of-the-art models on AVA-ActiveSpeaker and the categories of WASD, using the mAP metric. We train and evaluate each model following the authors' implementation. OC refers to Optimal Conditions, SI to Speech Impairment, FO to Face Occlusion, HVN to Human Voice Noise, and SS to Surveillance Settings. AVA refers to AVA-ActiveSpeaker.
| Model | AVA | OC | SI | FO | HVN | SS | WASD | Pretrained |
|:-----------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:----------:|
| ASC | 83.6 | 86.4 | 84.8 | 69.9 | 66.4 | 51.1 | 74.6 | Download |
| MAAS | 82.0 | 83.3 | 81.3 | 68.6 | 65.6 | 46.0 | 70.7 | Download |
| ASDNet | 91.1 | 91.1 | 90.4 | 78.2 | 74.9 | 48.1 | 79.2 | Download |
| TalkNet | 91.8 | 91.6 | 93.0 | 86.4 | 77.2 | 64.6 | 85.0 | Download |
| TS-TalkNet | 92.7 | 91.1 | 93.7 | 88.6 | 79.2 | 64.0 | 85.7 | Download |
| Light-ASD | 93.4 | 93.1 | 93.8 | 88.7 | 80.1 | 65.2 | 86.2 | Download |
Models Trained on WASD
Comparison of state-of-the-art models on the different categories of WASD, using the mAP metric. OC refers to Optimal Conditions, SI to Speech Impairment, FO to Face Occlusion, HVN to Human Voice Noise, and SS to Surveillance Settings.
| Model | OC | SI | FO | HVN | SS | WASD | Pretrained |
|:-----------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:----------:|
| ASC | 91.2 | 92.3 | 87.1 | 66.8 | 72.2 | 85.7 | Download |
| MAAS | 90.7 | 92.6 | 87.0 | 67.0 | 76.5 | 86.4 | Download |
| ASDNet | 96.5 | 97.4 | 92.1 | 77.4 | 77.8 | 92.0 | Download |
| TalkNet | 95.8 | 97.5 | 93.1 | 81.4 | 77.5 | 92.3 | Download |
| TS-TalkNet | 96.8 | 97.9 | 94.4 | 84.0 | 79.3 | 93.1 | Download |
| Light-ASD | 97.8 | 98.3 | 95.4 | 84.7 | 77.9 | 93.7 | Download |
| | | | | | | | |
| BIAS | 97.8 | 98.4 | 95.9 | 85.6 | 82.5 | 94.5 | Download |
| ASDnB | 98.7 | 98.9 | 97.2 | 89.5 | 82.7 | 95.6 | Download |
🗂️ Download Dataset
The dataset can be obtained in two ways:
Option A - Direct Download
| Format | Size | Description | Link |
|--------|------|-------------|------|
| clips_audios.zip | 7 GB | Preprocessed audio files | Coming soon |
| clips_videos_body.zip | 246 GB | Preprocessed body frames | Coming soon |
| clips_videos.zip | 46 GB | Preprocessed face frames | Coming soon |
| csv.zip | 0.3 GB | CSV files | Coming soon |
Option B - Preprocessing from Source
⚠️ The preprocessing downloads the WASD source videos from a Google Drive link (see the `download_videos` function in `prepare_setup.py`). If you have trouble downloading from this link, we provide a direct alternative link in Coming soon.
- Download the content of this GitHub repository;
- Execute `python3 prepare_setup.py` to create the `WASD` directory and necessary subfolders;
- Execute `python3 create_dataset.py` to extract audio and face data;
- (OPTIONAL) If you want to obtain body data, execute `python3 create_dataset.py --body`.
Expected WASD Folder Structure
In the end you should have the following WASD folder structure:
|-- WASD
| |-- clips_audios
| | |-- ...
| |-- clips_videos
| | |-- ...
| |-- clips_videos_body
| | |-- ...
| |-- csv
| |-- train_body_loader.csv
| |-- train_body_orig.csv
| |-- train_loader.csv
| |-- train_orig.csv
| |-- val_body_loader.csv
| |-- val_body_orig.csv
| |-- val_loader.csv
| |-- val_orig.csv
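After preprocessing, you can sanity-check that the layout above was created correctly. The sketch below is an illustrative helper (`check_wasd_layout` is not part of this repository) that reports any missing folders or CSV files under the `WASD` root:

```python
import os

# Top-level folders and CSV files expected under the WASD root,
# taken from the folder structure shown above.
EXPECTED_DIRS = ["clips_audios", "clips_videos", "clips_videos_body", "csv"]
EXPECTED_CSVS = [
    "train_body_loader.csv", "train_body_orig.csv",
    "train_loader.csv", "train_orig.csv",
    "val_body_loader.csv", "val_body_orig.csv",
    "val_loader.csv", "val_orig.csv",
]

def check_wasd_layout(root):
    """Return a list of missing paths under the WASD root (empty if complete)."""
    missing = []
    for d in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(root, d)):
            missing.append(d)
    for f in EXPECTED_CSVS:
        if not os.path.isfile(os.path.join(root, "csv", f)):
            missing.append(os.path.join("csv", f))
    return missing
```

Running `check_wasd_layout("WASD")` after Option A or Option B should return an empty list; any entries it returns point to the preprocessing step that needs to be re-run.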
The following folders are created during Option B - Preprocessing from Source; they are not necessary for ASD and can be deleted from the WASD folder if you wish:
- `orig_videos`;
- `orig_audios`;
- `WASD_videos`.
(OPTIONAL) If you wish to use the dataset in a format compatible with ASC, ASDNet, and MAAS, execute `python3 convert_dataset.py`.
⚠️ Note: This will convert the WASD folder to that format in place. If you want to keep both formats available, make a backup of the original WASD folder first.
Evaluate Models on WASD
To evaluate models, we use the official AVA-ActiveSpeaker implementation to compute active speaker detection performance:
python3 -O WASD_evaluation.py -g $GT -p $PRED
where `$GT` is the ground-truth CSV (`*val_orig.csv`) and `$PRED` is the predictions CSV produced by the evaluated model.
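The mAP numbers reported in the tables above are means of per-class average precision. The sketch below illustrates the average-precision computation on ranked speaking scores; it is a simplified illustration, not the official `WASD_evaluation.py` / AVA evaluation code:

```python
# Illustrative average precision (AP) over per-face speaking scores.
# labels: 0/1 ground-truth speaking annotations; scores: model confidences.
def average_precision(labels, scores):
    # Rank predictions by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = 0.0
    total_pos = sum(labels)
    for i in order:
        if labels[i]:
            tp += 1
            ap += tp / (tp + fp)  # precision at this recall point
        else:
            fp += 1
    return ap / total_pos if total_pos else 0.0
```

For example, a model that ranks every speaking face above every non-speaking one gets AP = 1.0, while ranking errors lower the score in proportion to how early the false positives appear.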