ISDrama
Dataset and evaluation code of ISDrama (ACM-MM 2025): Immersive Spatial Drama Generation through Multimodal Prompting
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang*, Wenxiang Guo*, Changhao Pan*, Zhiyuan Zhu*, Tao Jin, Zhou Zhao | Zhejiang University
We construct MRSDrama, the first multimodal recorded spatial drama dataset, containing binaural drama audio, scripts, videos, geometric poses, and textual prompts. We then propose ISDrama, the first immersive spatial drama generation model through multimodal prompting.
We provide the evaluation code in this repository.
Moreover, you can visit our Demo Page for the audio samples of our dataset as well as the results of our model.
Updates
- 2025.07: We released the evaluation code of MRSDrama!
- 2025.07: We released the full dataset of MRSDrama!
- 2025.07: ISDrama was accepted by ACM MM 2025!
Key Features
- We develop MRSDrama, the first multimodal recorded spatial drama dataset, accompanied by videos, scripts, alignments, positions, and textual prompts.
- We introduce ISDrama, the first immersive spatial drama generation model through multimodal prompting. We design the Multimodal Pose Encoder to extract pose from multimodal inputs, while the Immersive Drama Transformer produces binaural speech.
- Experimental results show that ISDrama outperforms baseline models on objective and subjective metrics.
Dataset
Where to download
Our full dataset (videos, scripts, alignments, positions, and textual prompts) is freely available on Hugging Face. We hope our data is helpful for your research.
Besides, we also provide our dataset on .
Please note that by using MRSDrama, you accept the terms of the license.
Data Architecture
Our dataset is organized hierarchically.
Each top-level folder contains a set of dramas. Each drama folder contains a subfolder with cut WAV files, an MP4 video file, and a JSON file containing all annotation information. Additionally, the geometric_pose subdirectory stores NumPy (.npy) sequences: listener-centric 3D positions, head-orientation quaternions, and radial velocities with respect to the left and right ears. These sequences are aligned at the frame level and were generated with a 48 kHz sample rate and a 256-sample hop size.
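The frame timing implied by these parameters can be worked out directly: 48,000 / 256 = 187.5 pose frames per second, or roughly 5.33 ms per frame. The sketch below shows that arithmetic and a plain `np.load` call. The helper names are hypothetical, and the array shapes and file names inside geometric_pose are not specified here, so inspect the loaded arrays before relying on any particular layout.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated in the dataset description
HOP_SIZE = 256        # samples between consecutive pose frames


def frame_rate(sample_rate: int = SAMPLE_RATE, hop_size: int = HOP_SIZE) -> float:
    """Number of pose frames per second of audio."""
    return sample_rate / hop_size


def frame_times(n_frames: int, sample_rate: int = SAMPLE_RATE,
                hop_size: int = HOP_SIZE) -> np.ndarray:
    """Start time, in seconds, of each pose frame."""
    return np.arange(n_frames) * hop_size / sample_rate


def load_pose(path: str) -> np.ndarray:
    """Load one pose sequence. The path is supplied by the caller; the
    layout of the returned array is an assumption to verify, not a spec."""
    return np.load(path)
```

For example, `frame_rate()` gives 187.5, and `frame_times(len(pose))` gives a time axis that can be used to line up a loaded pose sequence with its audio.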
Evaluation of ISDrama
The evaluation process is based on the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models".
Dependencies
A suitable conda environment named isdrama_eva can be created and activated with:
conda env create -f environment.yml
bash timm_patch/patch.sh
conda activate isdrama_eva
Preparation
Checkpoint Preparation
Please download the finetuned BAT encoder checkpoint and place it at:
./evaluation/ckpt/finetuned.pth
Make sure the path exists (create the `ckpt` directory if necessary).
Data Preparation
For evaluation, you must prepare paired ground‑truth audio and generated audio. Place them respectively in:
./evaluation/data/gt
./evaluation/data/infer
The expected directory layout is:
.
├── gt
│   ├── 0000.wav
│   ├── 0001.wav
│   ├── 0002.wav
│   └── 0003.wav
└── infer
    ├── 0000.wav
    ├── 0001.wav
    ├── 0002.wav
    └── 0003.wav
Important:
- The files inside gt and infer must correspond one‑to‑one.
- Filenames and counts must match exactly (e.g., gt/0002.wav pairs with infer/0002.wav).
- Ensure sampling rates and channel configurations are consistent if required by downstream metrics.
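These pairing requirements can be checked before running the evaluation. The following is a minimal sketch using only the Python standard library. It is not part of the repository, and `check_pairs` and `wav_format` are hypothetical helper names. Note that the `wave` module reads PCM WAV headers only, so floating-point WAV files would need a different reader.

```python
import wave
from pathlib import Path


def wav_format(path):
    """Return (sample_rate, channels) read from a PCM WAV header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getframerate(), wf.getnchannels()


def check_pairs(gt_dir, infer_dir):
    """Return a list of problems; an empty list means the folders pair up."""
    gt_dir, infer_dir = Path(gt_dir), Path(infer_dir)
    gt_names = {p.name for p in gt_dir.glob("*.wav")}
    infer_names = {p.name for p in infer_dir.glob("*.wav")}

    problems = []
    # Files present on one side only break the one-to-one correspondence.
    for name in sorted(gt_names - infer_names):
        problems.append(f"missing in infer: {name}")
    for name in sorted(infer_names - gt_names):
        problems.append(f"missing in gt: {name}")
    # For matched names, compare sample rate and channel count.
    for name in sorted(gt_names & infer_names):
        gt_fmt = wav_format(gt_dir / name)
        infer_fmt = wav_format(infer_dir / name)
        if gt_fmt != infer_fmt:
            problems.append(
                f"format mismatch for {name}: gt={gt_fmt}, infer={infer_fmt}"
            )
    return problems
```

Calling `check_pairs("./evaluation/data/gt", "./evaluation/data/infer")` and printing the returned list gives a quick report; an empty list means every file has a partner with the same sample rate and channel count.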
Metrics
Semantic & Acoustic Metrics
- Character Error Rate (CER): Assesses transcript/content accuracy.
- Cosine Similarity (SIM): Measures speaker timbre similarity between the generated audio and the prompt/reference audio (e.g., via speaker embeddings).
- F0 Frame Error (FFE): Evaluates prosody fidelity by comparing voiced/unvoiced decisions and pitch (F0) frames.
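As an illustration of the first metric, CER is commonly defined as the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below implements that common definition. The ASR model and text normalization used to obtain the transcript are not specified here, and the function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition a perfect transcript scores 0.0, and the value can exceed 1.0 when the hypothesis is much longer than the reference.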
Spatial Metrics
- IPD MAE: Mean Absolute Error between ground‑truth and generated Interaural Phase Differences.
- ILD MAE: Mean Absolute Error between ground‑truth and generated Interaural Level Differences.
- Angle Cosine Similarity (ANG Cos): Cosine similarity between ground‑truth and generated direction (azimuth / elevation) angle embeddings.
- Distance Cosine Similarity (Dis Cos): Cosine similarity between ground‑truth and generated distance embeddings.
Note: Cosine‑based spatial scores are in the range [-1, 1], with higher values indicating closer alignment of spatial embeddings.
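To make the spatial metrics concrete, here is a minimal NumPy sketch of one common way to compute IPD and ILD from a stereo signal, together with MAE and cosine similarity. It is an illustration under stated assumptions, not the repository's implementation: the STFT size, hop, window, and epsilon are assumed values, and the angle and distance embeddings in the actual pipeline come from the BAT encoder, which is not reproduced here.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Hann-windowed STFT of a mono signal (needs len(x) >= n_fft).
    Returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def ipd_ild(stereo, n_fft=1024, hop=256, eps=1e-8):
    """stereo: float array of shape (samples, 2), left channel first.
    Returns IPD in radians and ILD in dB, one value per time-frequency bin."""
    left = stft(stereo[:, 0], n_fft, hop)
    right = stft(stereo[:, 1], n_fft, hop)
    ipd = np.angle(left * np.conj(right))  # phase(L) - phase(R), wrapped
    ild = 20.0 * np.log10((np.abs(left) + eps) / (np.abs(right) + eps))
    return ipd, ild


def mae(a, b):
    """Mean absolute error after cropping to the shared number of frames."""
    n = min(len(a), len(b))
    return float(np.mean(np.abs(a[:n] - b[:n])))


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two flattened embeddings, in [-1, 1]."""
    u, v = np.ravel(u), np.ravel(v)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))
```

With `ipd_gt, ild_gt = ipd_ild(gt_audio)` and the same call on the generated audio, `mae(ipd_gt, ipd_gen)` and `mae(ild_gt, ild_gen)` give the two error values. One simplification to be aware of: taking the plain difference of wrapped phases ignores phase circularity, so a more careful version would wrap the IPD difference before averaging.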
Running the Evaluation
Run the following script to perform the evaluation pipeline:
cd evaluation
bash ./evaluate/eval.sh
The script evaluate/eval.sh executes the following three stages:
- Extract angle and distance embeddings using the BAT encoder.
- Extract IPD & ILD features from paired ground‑truth and generated stereo audio.
- Compute metrics: MAE (for IPD / ILD) and cosine similarities (for angle and distance).
Ensure that ground‑truth and generated audio files are correctly paired and preprocessed before running the script.
Citations
If you find this code useful in your research, please cite our work:
@article{zhang2025isdrama,
title={ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Jin, Tao and Zhao, Zhou},
journal={arXiv preprint arXiv:2504.20630},
year={2025}
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
