ISDrama

Dataset and evaluation code of ISDrama (ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting.

ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting

Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao | Zhejiang University

We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model driven by multimodal prompting.

We provide the evaluation code in this repository.

Moreover, you can visit our Demo Page for the audio samples of our dataset as well as the results of our model.

Updates

2025.07: We released the evaluation code of MRSDrama!
2025.07: We released the full dataset of MRSDrama!
2025.07: ISDrama was accepted by ACM-MM 2025!

Key Features

We develop MRSDrama, the first multimodal recorded spatial drama dataset, accompanied by videos, scripts, alignments, positions, and textual prompts.
We introduce ISDrama, the first immersive spatial drama generation model through multimodal prompting. We design the Multimodal Pose Encoder to extract pose from multimodal inputs, while the Immersive Drama Transformer produces binaural speech.
Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.

Dataset

Where to download

Our full dataset (videos, scripts, alignments, positions, and textual prompts) is freely available on Hugging Face. We hope the data is helpful for your research.

Besides, we also provide our dataset on .

Please note that by using MRSDrama, you accept the terms of the license.

Data Architecture

Our dataset is organized hierarchically.

Each top-level folder contains a set of dramas. Each drama folder contains a subfolder of cut WAV files, an MP4 video file, and a JSON file with all annotation information. In addition, the geometric_pose subdirectory stores NumPy (.npy) sequences: listener‑centric 3D positions, head-orientation quaternions, and radial velocities with respect to the left and right ears. These sequences are aligned at the frame level and were generated at a 48 kHz sample rate with a 256-sample hop size.
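The frame timing implied by the stated 48 kHz sample rate and 256-sample hop size can be sketched as follows. This is a minimal illustration of the arithmetic only. The helper names are hypothetical and not part of this repository, and the sketch assumes that frame 0 starts at sample 0, which the README does not state.

```python
SAMPLE_RATE = 48_000  # Hz, as stated in the dataset description
HOP_SIZE = 256        # samples between consecutive pose frames, as stated


def frames_per_second(sample_rate: int = SAMPLE_RATE, hop_size: int = HOP_SIZE) -> float:
    """Number of pose frames per second of audio."""
    return sample_rate / hop_size


def frame_to_seconds(frame_index: int, sample_rate: int = SAMPLE_RATE,
                     hop_size: int = HOP_SIZE) -> float:
    """Start time of a frame, assuming frame 0 starts at sample 0 (assumption)."""
    return frame_index * hop_size / sample_rate


def seconds_to_frame(seconds: float, sample_rate: int = SAMPLE_RATE,
                     hop_size: int = HOP_SIZE) -> int:
    """Index of the frame whose hop contains the given time (same assumption)."""
    return int(seconds * sample_rate) // hop_size
```

Under these assumptions each frame covers 256 / 48000 s (about 5.33 ms), which gives 187.5 frames per second.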

Evaluation of ISDrama

The evaluation process is based on the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models".

Dependencies

A suitable conda environment named isdrama_eva can be created and activated with:

conda env create -f environment.yml
bash timm_patch/patch.sh
conda activate isdrama_eva

Preparation

Checkpoint Preparation

Please download the finetuned BAT encoder checkpoint and place it at:

./evaluation/ckpt/finetuned.pth

Make sure the path exists (create the `ckpt` directory if necessary).

Data Preparation

For evaluation, you must prepare paired ground‑truth audio and generated audio. Place them respectively in:

./evaluation/data/gt
./evaluation/data/infer

The expected directory layout is:

.
├── gt
│   ├── 0000.wav
│   ├── 0001.wav
│   ├── 0002.wav
│   └── 0003.wav
└── infer
    ├── 0000.wav
    ├── 0001.wav
    ├── 0002.wav
    └── 0003.wav

Important:

The files inside gt and infer must correspond one‑to‑one.
Filenames and counts must match exactly (e.g., gt/0002.wav pairs with infer/0002.wav).
Ensure sampling rates and channel configurations are consistent if required by downstream metrics.
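The pairing requirements above can be checked before running the evaluation. The following is a minimal sketch and not part of the repository: check_pairs and wav_info are hypothetical helper names, and the sketch uses Python's standard-library wave module, which reads only uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def wav_info(path):
    """Return (sample_rate, num_channels) of a PCM WAV file."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def check_pairs(gt_dir, infer_dir):
    """Return a list of human-readable pairing problems (empty if none)."""
    gt_dir, infer_dir = Path(gt_dir), Path(infer_dir)
    gt_names = {p.name for p in gt_dir.glob("*.wav")}
    infer_names = {p.name for p in infer_dir.glob("*.wav")}

    problems = []
    # Files present on one side only break the one-to-one correspondence.
    for name in sorted(gt_names - infer_names):
        problems.append(f"missing in infer: {name}")
    for name in sorted(infer_names - gt_names):
        problems.append(f"missing in gt: {name}")
    # For matched files, compare sample rate and channel count.
    for name in sorted(gt_names & infer_names):
        gt_info = wav_info(gt_dir / name)
        infer_info = wav_info(infer_dir / name)
        if gt_info != infer_info:
            problems.append(
                f"{name}: gt (rate, channels)={gt_info}, infer={infer_info}"
            )
    return problems
```

For example, check_pairs("./evaluation/data/gt", "./evaluation/data/infer") returns an empty list when every file has a partner with the same sample rate and channel count.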

Metrics

Semantic & Acoustic Metrics

Character Error Rate (CER): Assesses transcript/content accuracy.
Cosine Similarity (SIM): Measures speaker timbre similarity between the generated audio and the prompt/reference audio (e.g., via speaker embeddings).
F0 Frame Error (FFE): Evaluates prosody fidelity by comparing voiced/unvoiced decisions and pitch (F0) frames.
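The arithmetic behind CER and FFE can be sketched as below. This is an illustration under assumptions, not the repository's implementation: the README does not say which ASR model produces the transcripts or which pitch extractor produces the F0 tracks, so both are taken as given inputs. The 20% gross-pitch-error tolerance is the commonly used convention and is an assumption here, as is marking unvoiced frames with an F0 of 0.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def ffe(ref_f0, gen_f0, tolerance=0.2):
    """F0 Frame Error: fraction of frames with a voicing or gross pitch error.

    Unvoiced frames are marked with F0 == 0 (assumption). A voiced frame
    counts as a pitch error when it deviates from the reference by more
    than `tolerance` (relative, assumed 20%).
    """
    if not ref_f0 or len(ref_f0) != len(gen_f0):
        raise ValueError("F0 tracks must be non-empty and equally long")
    errors = 0
    for ref, gen in zip(ref_f0, gen_f0):
        ref_voiced, gen_voiced = ref > 0, gen > 0
        if ref_voiced != gen_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(gen - ref) > tolerance * ref:
            errors += 1  # gross pitch error
    return errors / len(ref_f0)
```

Lower values are better for both metrics. In practice the two F0 tracks must first be brought to the same number of frames, which this sketch leaves to the caller.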

Spatial Metrics

IPD MAE: Mean Absolute Error between ground‑truth and generated Interaural Phase Differences.
ILD MAE: Mean Absolute Error between ground‑truth and generated Interaural Level Differences.
Angle Cosine Similarity (ANG Cos): Cosine similarity between ground‑truth and generated direction (azimuth / elevation) angle embeddings.
Distance Cosine Similarity (Dis Cos): Cosine similarity between ground‑truth and generated distance embeddings.

Note: Cosine‑based spatial scores are in the range [-1, 1], with higher values indicating closer alignment of spatial embeddings.
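The spatial metrics above reduce to three small computations, sketched here with the standard library only. This is a hedged illustration rather than the repository's code: the function names are hypothetical, the per-bin IPD/ILD definitions are the textbook ones, and the actual feature extraction (STFT settings, phase wrapping, averaging) used by evaluate/eval.sh is not described in this README.

```python
import cmath
import math

EPS = 1e-8  # guards against division by zero and log(0) in silent frames


def dft_bin(frame, k):
    """k-th DFT coefficient of one real-valued frame (naive summation)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(frame))


def ipd_ild(left, right, k):
    """Interaural phase difference (radians) and level difference (dB) at bin k."""
    left_k, right_k = dft_bin(left, k), dft_bin(right, k)
    ipd = cmath.phase(left_k * right_k.conjugate())  # in [-pi, pi]
    ild = 20.0 * math.log10((abs(left_k) + EPS) / (abs(right_k) + EPS))
    return ipd, ild


def mae(reference, generated):
    """Mean absolute error between two equally long feature sequences.

    Uses the plain absolute difference; whether phase values should be
    wrapped before comparison is not specified in the source.
    """
    if not reference or len(reference) != len(generated):
        raise ValueError("sequences must be non-empty and equally long")
    return sum(abs(r - g) for r, g in zip(reference, generated)) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

IPD MAE and ILD MAE would then apply mae to the per-bin features of the ground-truth and generated audio, while ANG Cos and Dis Cos would apply cosine_similarity to the angle and distance embeddings produced by the BAT encoder.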

Running the Evaluation

Run the following script to perform the evaluation pipeline:

cd evaluation
bash ./evaluate/eval.sh

The script evaluate/eval.sh executes the following three stages:

Extract angle and distance embeddings using the BAT encoder.
Extract IPD & ILD features from paired ground‑truth and generated stereo audio.
Compute metrics: MAE (for IPD / ILD) and cosine similarities (for angle and distance).

Ensure that ground‑truth and generated audio files are correctly paired and preprocessed before running the script.

Citations

If you find this code useful in your research, please cite our work:

@article{zhang2025isdrama,
  title={ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting},
  author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Jin, Tao and Zhao, Zhou},
  journal={arXiv preprint arXiv:2504.20630},
  year={2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
