# VoxBox

A large-scale speech corpus introduced in Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.
## 📊 VoxBox Dataset Overview

### Chinese Datasets
| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| Total (Chinese) | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |
### English Datasets
| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| Total (English) | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |
### Overall
| Data | #Utterance | Male (h) | Female (h) | Total (h) |
|------|------------|----------|------------|-----------|
| Overall Total | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
## Dataset Structure
The dataset is organized as follows:
```
.
├── audios/
│   ├── aishell-3/              # Audio files (organized by sub-corpus)
│   └── ...
└── metadata/
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl   # JSONL metadata files
```
Each JSONL file corresponds to a specific sub-corpus and contains metadata records for individual audio samples.
### Metadata Format
Each line in the JSONL files is a JSON object detailing an audio sample. For example:
```json
{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}
```
Key fields include:
- `index`: Unique identifier for the audio sample.
- `split`: Dataset split (e.g., train, test).
- `language`: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
- `age`, `gender`, `emotion`: Speaker attributes.
- `pitch`, `pitch_std`, `speed`: Acoustic features.
- `duration`: Total duration of the audio sample in seconds.
- `speech_duration`: Duration excluding silence at both ends.
- `syllable_num`: Number of syllables in the utterance.
- `text`: Transcription of the utterance.
- `syllables`: Syllable-level transcription.
- `wav_path`: Relative path to the audio file within the dataset.
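As a quick illustration, the sketch below shows one way to read a per-corpus JSONL metadata file with standard Python. It is not part of the original release; the file name and the `split` filter are assumptions chosen for the example.

```python
import json

# Hypothetical example: one of the per-corpus metadata files
metadata_file = "metadata/vctk.jsonl"

samples = []
with open(metadata_file, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)         # one JSON object per line
        if record["split"] == "train":    # e.g., keep only training samples
            samples.append(record)

print(f"Loaded {len(samples)} training samples")
print(samples[0]["text"], samples[0]["wav_path"])
```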
## 📥 Download Data
You can download the VoxBox dataset via the Hugging Face Datasets Hub:
### 1️⃣ Download the Full Dataset
You can clone the entire dataset repository (metadata + all audio files):
```bash
git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox
```
> ⚠️ The full dataset is large (5.82 TB) and may require considerable download time and storage.
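Alternatively, the full repository can be fetched without Git LFS using `huggingface_hub.snapshot_download`. This is a minimal sketch, not from the original card; the target directory is an assumption.

```python
from huggingface_hub import snapshot_download

# Downloads the entire dataset repository (~5.82 TB) -- make sure you have the space.
snapshot_download(
    repo_id="SparkAudio/voxbox",
    repo_type="dataset",
    local_dir="./voxbox",  # hypothetical target directory
)
```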
### 2️⃣ Download Specific Subsets
```python
from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz files for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]

    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
```
#### ✅ Directory Structure of the Downloaded Results
```
voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
```
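The downloaded audio archives still need to be unpacked before the `wav_path` entries in the metadata can be resolved. Below is a minimal sketch using Python's standard `tarfile` module; extracting each archive into its own subset directory is an assumption about the archive layout, so paths may need adjusting to line up with `wav_path`.

```python
import tarfile
from pathlib import Path

base = Path("./voxbox_subset")

# Unpack every downloaded archive into its subset directory (assumed layout)
for archive in sorted(base.glob("audios/*/*.tar.gz")):
    subset_dir = archive.parent  # e.g., voxbox_subset/audios/casia
    print(f"Extracting {archive} -> {subset_dir}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=subset_dir)
```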
## Label Your Own Data

To annotate your own audio with the same metadata fields, run:
```bash
python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'
```
## License
Please refer to the original licenses of each sub-corpus. This dataset aggregates and annotates the metadata in a unified structure for research purposes.
## Citation
If you use this dataset in your research, please consider citing:
```bibtex
@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
```