# VoxBox

A large-scale speech corpus introduced in Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.
## 📊 VoxBox Dataset Overview

### Chinese Datasets
| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| Total (Chinese) | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |
### English Datasets
| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| Total (English) | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |
### Overall
| Data | #Utterance | Male (h) | Female (h) | Total (h) |
|------|------------|----------|------------|-----------|
| Overall Total | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
## Dataset Structure
The dataset is organized as follows:
```
.
├── audios/
│   ├── aishell-3/              # Audio files (organized by sub-corpus)
│   └── ...
└── metadata/
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl   # JSONL metadata files
```
Each JSONL file corresponds to a specific sub-corpus and contains metadata records for individual audio samples.
### Metadata Format
Each line in the JSONL files is a JSON object detailing an audio sample. For example:
```json
{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}
```
Key fields include:
- `index`: Unique identifier for the audio sample.
- `split`: Dataset split (e.g., train, test).
- `language`: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
- `age`, `gender`, `emotion`: Speaker attributes.
- `pitch`, `pitch_std`, `speed`: Acoustic features.
- `duration`: Total duration of the audio sample in seconds.
- `speech_duration`: Duration excluding silence at both ends.
- `syllable_num`: Number of syllables in the utterance.
- `text`: Transcription of the utterance.
- `syllables`: Syllable-level transcription.
- `wav_path`: Relative path to the audio file within the dataset.
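As a quick illustration, the sketch below shows one way to read a per-corpus JSONL metadata file with standard Python. It is not part of the original release; the file name and the `split` filter are assumptions chosen for the example.

```python
import json

# Hypothetical example: one of the per-corpus metadata files
metadata_file = "metadata/vctk.jsonl"

samples = []
with open(metadata_file, "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)         # one JSON object per line
        if record["split"] == "train":    # e.g., keep only training samples
            samples.append(record)

print(f"Loaded {len(samples)} training samples")
print(samples[0]["text"], samples[0]["wav_path"])
```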
## 📥 Download Data
You can download the VoxBox dataset via the Hugging Face Datasets Hub:
### 1️⃣ Download the Full Dataset
You can clone the entire dataset repository (metadata + all audio files):
```bash
git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox
```
> ⚠️ The full dataset is large (5.82 TB) and may require considerable download time and storage.
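Alternatively, the full repository can be fetched without Git LFS using `huggingface_hub.snapshot_download`. This is a minimal sketch, not from the original card; the target directory is an assumption.

```python
from huggingface_hub import snapshot_download

# Downloads the entire dataset repository (~5.82 TB) -- make sure you have the space.
snapshot_download(
    repo_id="SparkAudio/voxbox",
    repo_type="dataset",
    local_dir="./voxbox",  # hypothetical target directory
)
```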
### 2️⃣ Download Specific Subsets
```python
from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz files for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]

    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
            local_dir_use_symlinks=False,
        )
```
#### ✅ Directory Structure of the Downloaded Results
```
voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
```
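The downloaded audio archives still need to be unpacked before the `wav_path` entries in the metadata can be resolved. Below is a minimal sketch using Python's standard `tarfile` module; extracting each archive into its own subset directory is an assumption about the archive layout, so paths may need adjusting to line up with `wav_path`.

```python
import tarfile
from pathlib import Path

base = Path("./voxbox_subset")

# Unpack every downloaded archive into its subset directory (assumed layout)
for archive in sorted(base.glob("audios/*/*.tar.gz")):
    subset_dir = archive.parent  # e.g., voxbox_subset/audios/casia
    print(f"Extracting {archive} -> {subset_dir}")
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path=subset_dir)
```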
## Label Your Own Data

To annotate your own audio with the same metadata fields, run:
```bash
python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'
```
## License
Please refer to the original licenses of each sub-corpus. This dataset aggregates and annotates the metadata in a unified structure for research purposes.
## Citation
If you use this dataset in your research, please consider citing:
```bibtex
@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
```