<div align="center"> <h1> EmoBox </h1> <p> This repository holds code, processed meta-data, and benchmark for <br> <b><em>EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark</em></b> </p> <p> <img src="docs/logo.png" alt="emobox Logo" style="width: 580px; height: 200px;"> </p> <p> </p> <a href="https://emo-box.github.io/index.html"><img src="https://img.shields.io/badge/Benchmark-link-lightgrey" alt="version"></a> <a href="https://arxiv.org/abs/2406.07162"><img src="https://img.shields.io/badge/Paper-link-orange" alt="version"></a> <a href="https://github.com/emo-box/EmoBox"><img src="https://img.shields.io/badge/License-MIT-red.svg" alt="version"></a> </div>

Guides

EmoBox is an out-of-the-box multilingual multi-corpus speech emotion recognition (SER) toolkit, along with a benchmark for both intra-corpus and cross-corpus settings on mainstream pre-trained foundation models. We hope that our toolkit and benchmark can facilitate SER research in the community.

Datasets

We include 32 speech emotion datasets spanning 14 distinct languages, with download links; some of them require a license or registration. We provide data preparation and partitioning for each dataset. Refer to the paper for more details.

| Dataset        | Source    | Lang     | Emo | Spk  | #Utts  | #Hrs  |
| -------------- | --------- | -------- | --- | ---- | ------ | ----- |
| AESDD          | Act       | Greek    | 5   | 5    | 604    | 0.7   |
| ASED           | Act       | Amharic  | 5   | 65   | 2474   | 2.1   |
| ASVP-ESD       | Media     | Mix      | 12  | 131  | 13964  | 18.0  |
| CaFE           | Act       | French   | 7   | 12   | 936    | 1.2   |
| CASIA          | Act       | Mandarin | 6   | 4    | 1200   | 0.6   |
| CREMA-D        | Act       | English  | 6   | 91   | 7442   | 5.3   |
| EMNS           | Act       | English  | 8   | 1    | 1181   | 1.9   |
| EmoDB          | Act       | German   | 7   | 10   | 535    | 0.4   |
| EmoV-DB        | Act       | English  | 5   | 4    | 6887   | 9.5   |
| EMOVO          | Act       | Italian  | 7   | 6    | 588    | 0.5   |
| Emozionalmente | Act       | Italian  | 7   | 431  | 6902   | 6.3   |
| eNTERFACE      | Act       | English  | 6   | 44   | 1263   | 1.1   |
| ESD            | Act       | Mix      | 5   | 20   | 35000  | 29.1  |
| IEMOCAP        | Act       | English  | 5   | 10   | 5531   | 7.0   |
| JL-Corpus      | Act       | English  | 5   | 4    | 2400   | 1.4   |
| M3ED           | TV        | Mandarin | 7   | 626  | 24437  | 9.8   |
| MEAD           | Act       | English  | 8   | 48   | 31729  | 37.3  |
| MELD           | TV        | English  | 7   | 304  | 13706  | 12.1  |
| MER2023        | TV        | Mandarin | 6   | /    | 5030   | 5.9   |
| MESD           | Act       | Spanish  | 6   | 11   | 862    | 0.2   |
| MSP-Podcast    | Podcast   | English  | 8   | 1273 | 73042  | 113.6 |
| Oreau          | Act       | French   | 7   | 32   | 434    | 0.3   |
| PAVOQUE        | Act       | German   | 5   | 1    | 7334   | 12.2  |
| Polish         | Act       | Polish   | 3   | 5    | 450    | 0.1   |
| RAVDESS        | Act       | English  | 8   | 24   | 1440   | 1.5   |
| RESD           | Act       | Russian  | 7   | 200  | 1396   | 2.3   |
| SAVEE          | Act       | English  | 7   | 4    | 480    | 0.5   |
| ShEMO          | Act       | Persian  | 6   | 87   | 2838   | 3.3   |
| SUBESCO        | Act       | Bangla   | 7   | 20   | 7000   | 7.8   |
| TESS           | Act       | English  | 7   | 2    | 2800   | 1.6   |
| TurEV-DB       | Act       | Turkish  | 4   | 6    | 1735   | 0.5   |
| URDU           | Talk show | Urdu     | 4   | 29   | 400    | 0.3   |
| Total          | --        | --       | --  | 3510 | 262020 | 294.4 |

Benchmark

Intra-corpus Benchmark

Intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets spanning 14 distinct languages with EmoBox data partitioning. Refer to the intra-corpus benchmark and the paper for more details.

Cross-corpus Benchmark

Cross-corpus SER results of 10 pre-trained speech models on 4 EmoBox fully balanced test sets. Refer to the cross-corpus benchmark and the paper for more details.

Play with EmoBox

Prepare datasets

You need to download the datasets and put them into the downloads folder. Make sure the paths to your downloaded audio files match the audio paths in the jsonl files in data/.

For example, download the iemocap dataset so that its audio files sit at relative paths such as downloads/iemocap/Session1/sentences/wav/Ses01F_impro04/Ses01F_impro04_F000.wav. These paths should match the audio paths in data/iemocap/iemocap.jsonl.

Metadata

We prepare metadata files for each dataset in several formats: json, jsonl, .... For example, the metadata in the jsonl files has the following format:

[
    {
        "key": "Ses01M_impro01_F000",
        "dataset": "iemocap",
        "wav": "downloads/iemocap/Session1/sentences/wav/Ses01M_impro01/Ses01M_impro01_F000.wav",
        "type": "raw", # raw, feature
        "sample_rate": 16000,
        "length": 3.2,
        "task": "category", # category, valence, arousal
        "emo": "hap",
        "channel": 1
    },
    ...,
    {...}
]
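
As a rough sanity check for the dataset preparation step, the sketch below loads a metadata file and lists the entries whose audio file is not on disk. It relies only on the "wav" field shown in the format above. The helper names `load_metadata` and `find_missing_audio` are hypothetical and not part of EmoBox, and the loader assumes the file is either a single JSON array or one JSON object per line.

```python
import json
from pathlib import Path


def load_metadata(meta_path):
    """Read metadata entries from a JSON array file or a one-object-per-line file."""
    text = Path(meta_path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        # The whole file is a single JSON array of entries.
        return json.loads(text)
    # Otherwise treat every non-empty line as one JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def find_missing_audio(entries, root="."):
    """Return the "wav" paths (relative to root) that do not exist on disk."""
    root = Path(root)
    return [e["wav"] for e in entries if not (root / e["wav"]).is_file()]


# Example usage (paths taken from the text above):
# entries = load_metadata("data/iemocap/iemocap.jsonl")
# missing = find_missing_audio(entries, root="./")
# print(f"{len(missing)} of {len(entries)} audio files are missing")
```

An empty result from `find_missing_audio` suggests the downloads folder matches the metadata; any path it returns points at a file that still needs to be downloaded or moved.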

Some datasets (e.g. iemocap) require merging labels, so we provide a label_map.json file for this purpose.
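
To illustrate what merging labels means in practice, here is a minimal sketch. It assumes label_map.json is a flat mapping from raw corpus labels to merged class names; the mapping and the `merge_label` helper below are made up for illustration and are not the contents of the real file.

```python
def merge_label(raw_label, label_map):
    """Map a raw corpus label to its merged class, or None if the label is not kept."""
    return label_map.get(raw_label)


# Hypothetical mapping for illustration only; the real one lives in label_map.json.
example_label_map = {
    "hap": "happy",
    "exc": "happy",  # two raw labels merged into one class
    "ang": "angry",
    "sad": "sad",
    "neu": "neutral",
}

# Merged classes get stable integer indices by sorting their names.
classes = sorted(set(example_label_map.values()))
class_to_index = {name: i for i, name in enumerate(classes)}

raw_labels = ["hap", "exc", "fru", "neu"]
merged = [merge_label(r, example_label_map) for r in raw_labels]
kept = [class_to_index[m] for m in merged if m is not None]

print(merged)  # ['happy', 'happy', None, 'neutral']
print(kept)    # [1, 1, 2]
```

Raw labels that are absent from the mapping ("fru" here) come back as None, so they can be filtered out before training.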

Quick Start

EmoBox provides a torch dataset class EmoDataset and an evaluation class EmoEval. You may train your own models using any recipes or toolkits.

Using EmoDataset and EmoEval, it is easy to compare results from any model trained by any recipe or toolkit. Results can be submitted to our benchmark.
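
For readers who want to sanity-check their numbers independently, the sketch below computes three metrics that are commonly reported for SER: weighted accuracy (overall accuracy), unweighted accuracy (mean per-class recall), and macro F1. This is not EmoEval's implementation; the function name `ser_metrics` and its input format (parallel lists of class indices) are assumptions made for illustration.

```python
from collections import Counter


def ser_metrics(targets, preds):
    """Compute WA, UA and macro F1 from parallel lists of class indices."""
    if len(targets) != len(preds) or not targets:
        raise ValueError("targets and preds must be non-empty and the same length")

    # Averages are taken over the classes that occur in the reference labels.
    classes = sorted(set(targets))
    true_pos = Counter(t for t, p in zip(targets, preds) if t == p)
    support = Counter(targets)    # reference items per class
    predicted = Counter(preds)    # predictions per class

    wa = sum(true_pos.values()) / len(targets)
    recalls, f1s = [], []
    for c in classes:
        recall = true_pos[c] / support[c]
        precision = true_pos[c] / predicted[c] if predicted[c] else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)

    return {
        "wa": wa,
        "ua": sum(recalls) / len(classes),
        "f1": sum(f1s) / len(classes),
    }


scores = ser_metrics(targets=[0, 0, 0, 1], preds=[0, 0, 1, 1])
print(scores["wa"])  # 0.75
```

On imbalanced corpora UA and WA can differ noticeably, which is why it helps to look at both rather than at overall accuracy alone.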

We provide an example pipeline code using EmoDataset and EmoEval:

from EmoBox.EmoDataset import EmoDataset
from EmoBox.EmoEval import EmoEval
import json
from pathlib import Path

dataset = "iemocap"
folds = 5 # different datasets have different numbers of folds, which can be found in data/
user_data_dir = "./" # path to EmoBox
meta_data_dir = "data/" # path to data folder

def load_label_map(meta_data_dir: Path, dataset: str):
    # Wrap in Path(): meta_data_dir may be passed as a str, and str / str raises TypeError
    lm_path = Path(meta_data_dir) / dataset / "label_map.json"
    if not lm_path.exists():
        raise FileNotFoundError(f"label_map.json not found at {lm_path}")
    label_map = json.loads(lm_path.read_text(encoding="utf-8"))
    labels = sorted(set(label_map.values()))
    label2idx = {lab: i for i, lab in enumerate(labels)}
    return labels, label2idx, label_map

labels, label2idx, label_map = load_label_map(meta_data_dir, dataset)

# Take n-fold cross-validation as an example
for fold in range(1, folds+1):
	
	train = EmoDataset(dataset, user_data_dir, meta_data_dir, label_map, fold=fold, split="train")
	val = EmoDataset(dataset, user_data_dir, meta_data_dir, label_map, fold=fold, split="valid")
	
	"""
		Train your model
	"""
	for data in train:
		audio = data["audio"] # raw wav tensor
		label = data["label"] # label, e.g. 'hap'
	
	
	"""
		Evaluate on test data
	"""	
	test = EmoDataset(dataset, user_data_dir, meta_data_dir, label_map, fold=fold, split="test")
	test_pred = [
