VoxBox

A large-scale speech corpus introduced in Spark-TTS, built from diverse open-source datasets for training text-to-speech (TTS) systems.

📊 VoxBox Dataset Overview

Chinese Datasets

| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| AISHELL-3 | Chinese | 88,035 | 16.01 | 69.61 | 85.62 |
| CASIA | Chinese | 857 | 0.25 | 0.2 | 0.44 |
| Emilia-CN | Chinese | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| ESD | Chinese | 16,101 | 6.69 | 7.68 | 14.37 |
| HQ-Conversations | Chinese | 50,982 | 35.77 | 64.23 | 100 |
| M3ED | Chinese | 253 | 0.04 | 0.06 | 0.1 |
| MAGICDATA | Chinese | 609,474 | 360.31 | 393.81 | 754.13 |
| MER2023 | Chinese | 1,667 | 0.86 | 1.07 | 1.93 |
| NCSSD-CL-CN | Chinese | 98,628 | 53.83 | 59.21 | 113.04 |
| NCSSD-RC-CN | Chinese | 21,688 | 7.05 | 22.53 | 29.58 |
| WenetSpeech4TTS | Chinese | 8,856,480 | 7,504.19 | 4,264.3 | 11,768.49 |
| Total (Chinese) | | 25,373,406 | 30,002.56 | 17,624.59 | 47,627.15 |

English Datasets

| Data | Language | #Utterance | Male (h) | Female (h) | Total (h) |
|------|----------|------------|----------|------------|-----------|
| CREMA-D | English | 809 | 0.3 | 0.27 | 0.57 |
| Dailytalk | English | 23,754 | 10.79 | 10.86 | 21.65 |
| Emilia-EN | English | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
| EMNS | English | 918 | 0 | 1.49 | 1.49 |
| EmoV-DB | English | 3,647 | 2.22 | 2.79 | 5 |
| Expresso | English | 11,595 | 5.47 | 5.39 | 10.86 |
| Gigaspeech | English | 6,619,339 | 4,310.19 | 2,885.66 | 7,195.85 |
| Hi-Fi TTS | English | 323,911 | 133.31 | 158.38 | 291.68 |
| IEMOCAP | English | 2,423 | 1.66 | 1.31 | 2.97 |
| JL-Corpus | English | 893 | 0.26 | 0.26 | 0.52 |
| Librispeech | English | 230,865 | 393.95 | 367.67 | 761.62 |
| LibriTTS-R | English | 363,270 | 277.87 | 283.03 | 560.9 |
| MEAD | English | 3,767 | 2.26 | 2.42 | 4.68 |
| MELD | English | 5,100 | 2.14 | 1.94 | 4.09 |
| MLS-English | English | 6,319,002 | 14,366.25 | 11,212.92 | 25,579.18 |
| MSP-Podcast | English | 796 | 0.76 | 0.56 | 1.32 |
| NCSSD-CL-EN | English | 62,107 | 36.84 | 32.93 | 69.77 |
| NCSSD-RL-EN | English | 10,032 | 4.18 | 14.92 | 19.09 |
| RAVDESS | English | 950 | 0.49 | 0.48 | 0.97 |
| SAVEE | English | 286 | 0.15 | 0.15 | 0.31 |
| TESS | English | 1,956 | 0 | 1.15 | 1.15 |
| VCTK | English | 44,283 | 16.95 | 24.51 | 41.46 |
| Total (English) | | 22,332,806 | 33,290.8 | 21,582.31 | 54,873.11 |

Overall

| Data | #Utterance | Male (h) | Female (h) | Total (h) |
|------|------------|----------|------------|-----------|
| Overall Total | 47,706,212 | 63,293.36 | 39,206.9 | 102,500.26 |
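
As a quick sanity check on the tables above, the two per-language total rows can be summed and compared with the overall row. The sketch below hard-codes those three rows; the variable names are illustrative and not part of the dataset tooling.

```python
# Total rows copied from the tables above:
# (utterances, male hours, female hours, total hours)
chinese = (25_373_406, 30_002.56, 17_624.59, 47_627.15)
english = (22_332_806, 33_290.80, 21_582.31, 54_873.11)
overall = (47_706_212, 63_293.36, 39_206.90, 102_500.26)

# Add the two language rows column by column, rounding away float noise.
summed = tuple(round(c + e, 2) for c, e in zip(chinese, english))

# Shares of the overall duration, in percent.
male_share = round(100 * overall[1] / overall[3], 1)
chinese_share = round(100 * chinese[3] / overall[3], 1)

print("rows add up:", summed == overall)
print("male share (%):", male_share)
print("Chinese share (%):", chinese_share)
```

The same pattern can be used to check any single row, since each Total (h) value should be close to the sum of its Male (h) and Female (h) columns up to rounding.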

Dataset Structure

The dataset is organized as follows:

.
├── audios/
│   ├── aishell-3/                      # Audio files (organized by sub-corpus)
│   └── ...
└── metadata/
    ├── aishell-3.jsonl
    ├── casia.jsonl
    ├── commonvoice_cn.jsonl
    ├── ...
    └── wenetspeech4tts.jsonl          # JSONL metadata files

Each JSONL file corresponds to a specific sub-corpus and contains metadata records for individual audio samples.

Metadata Format

Each line in the JSONL files is a JSON object detailing an audio sample. For example:

{
  "index": "VCTK_0000044280",
  "split": "train",
  "language": "en",
  "age": "Youth-Adult",
  "gender": "female",
  "emotion": "UNKNOWN",
  "pitch": 180.626,
  "pitch_std": 0.158,
  "speed": 4.2,
  "duration": 3.84,
  "speech_duration": 3.843,
  "syllable_num": 16,
  "text": "Clearly, the need for a personal loan is written in the stars.",
  "syllables": "K-L-IH1-R L-IY0 DH-AH0 N-IY1-D F-AO1 R-AH0 P-ER1 S-IH0 N-IH0-L L-OW1 N-IH1 Z-R-IH1 T-AH0 N-IH0-N DH-AH0-S T-AA1-R-Z",
  "wav_path": "vctk/VCTK_0000044280.flac"
}

Key fields include:

  • index: Unique identifier for the audio sample.
  • split: Dataset split (e.g., train, test).
  • language: Language of the audio sample (e.g., "en" for English, "zh" for Chinese).
  • age, gender, emotion: Speaker attributes.
  • pitch, pitch_std, speed: Acoustic features.
  • duration: Total duration of the audio sample in seconds.
  • speech_duration: Duration of the speech in seconds, excluding leading and trailing silence.
  • syllable_num: Number of syllables in the utterance.
  • text: Transcription of the utterance.
  • syllables: Syllable-level transcription.
  • wav_path: Relative path to the audio file within the dataset.
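
A minimal sketch of how these fields might be used to select samples, assuming the metadata files have been downloaded locally. The helper names (`iter_records`, `select`, `audio_path`) are hypothetical, and resolving `wav_path` against an `audios/` directory is an assumption based on the layout shown earlier, not something the dataset guarantees.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(jsonl_path: Path) -> Iterator[dict]:
    """Yield one metadata record per non-empty line of a JSONL file."""
    with jsonl_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def select(records: Iterable[dict], *, gender=None, language=None,
           max_duration=None) -> Iterator[dict]:
    """Keep records that match the given attributes and duration cap (seconds)."""
    for rec in records:
        if gender is not None and rec.get("gender") != gender:
            continue
        if language is not None and rec.get("language") != language:
            continue
        if max_duration is not None and rec.get("duration", 0.0) > max_duration:
            continue
        yield rec


def audio_path(root: Path, rec: dict) -> Path:
    """Resolve wav_path against <root>/audios/ (assumed layout)."""
    return root / "audios" / rec["wav_path"]


# In-memory demo using a subset of the example record above.
sample = {
    "index": "VCTK_0000044280",
    "language": "en",
    "gender": "female",
    "duration": 3.84,
    "wav_path": "vctk/VCTK_0000044280.flac",
}
kept = list(select([sample], gender="female", language="en", max_duration=10.0))
print(len(kept), audio_path(Path("voxbox"), kept[0]))
```

For a real file, `select(iter_records(Path("metadata/vctk.jsonl")), ...)` streams the records lazily, which matters for the larger sub-corpora with millions of lines.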

📥 Download Data

The VoxBox dataset is hosted on the Hugging Face Hub and can be downloaded in full or by subset:

1️⃣ Download the Full Dataset

You can clone the entire dataset repository (metadata + all audio files):

git lfs install
git clone https://huggingface.co/datasets/SparkAudio/voxbox

⚠️ The full dataset is large (5.82 TB); cloning it takes considerable time and storage.
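
Before starting a full clone, it may be worth checking that the target filesystem has enough free space. The sketch below uses only the standard library; reading "5.82 TB" as decimal terabytes is an assumption, and `has_room` is a hypothetical helper name. A Git LFS clone typically keeps a second copy of each object under `.git/lfs`, so the real requirement may be noticeably higher than the figure used here.

```python
import shutil

# 5.82 TB from the note above, interpreted as decimal terabytes (assumption).
REQUIRED_BYTES = 5_820 * 1000**3


def has_room(path: str = ".", required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required` free bytes."""
    return shutil.disk_usage(path).free >= required


print("enough free space for a full download:", has_room("."))
```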

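2️⃣ Download Specific Subsets

To fetch only selected sub-corpora, list the repository files and download the matching metadata and audio archives with huggingface_hub: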


from huggingface_hub import HfApi, hf_hub_download

# ✅ Specify the subsets you want to download
target_subsets = ['casia', 'cremad', 'emns']

REPO_ID = "SparkAudio/voxbox"
REPO_TYPE = "dataset"

api = HfApi()
dataset_info = api.dataset_info(repo_id=REPO_ID)

# Get all available file paths (rfilename)
all_paths = [s.rfilename for s in dataset_info.siblings]

for subset in target_subsets:
    print(f"\n🔽 Downloading subset: {subset}")

    # Download metadata file
    metadata_path = f"metadata/{subset}.jsonl"
    if metadata_path in all_paths:
        print(f"📄 Metadata found: {metadata_path}")
        # local_dir_use_symlinks is omitted: recent huggingface_hub releases
        # deprecate it and ignore its value.
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=metadata_path,
            local_dir="./voxbox_subset",
        )
    else:
        print(f"⚠️ Metadata not found: {metadata_path}")

    # Match all audio tar.gz files for the subset
    audio_tars = [f for f in all_paths if f.startswith(f"audios/{subset}/") and f.endswith(".tar.gz")]
    if not audio_tars:
        print(f"⚠️ No audio files found for {subset}")
        continue

    for tar_file in audio_tars:
        print(f"🎧 Downloading audio: {tar_file}")
        hf_hub_download(
            repo_id=REPO_ID,
            repo_type=REPO_TYPE,
            filename=tar_file,
            local_dir="./voxbox_subset",
        )

✅ Directory Structure of the Downloaded Results

voxbox_subset/
├── audios
│   ├── casia
│   │   └── casia_0000.tar.gz
│   ├── cremad
│   │   └── cremad_0000.tar.gz
│   └── emns
│       └── emns_0000.tar.gz
└── metadata
    ├── casia.jsonl
    ├── cremad.jsonl
    └── emns.jsonl
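
The audio arrives as `.tar.gz` archives, so it has to be unpacked before the files named in `wav_path` can be opened. The following is a minimal sketch using the standard `tarfile` module; `extract_archives` is a hypothetical helper, and extracting each archive into its own subset directory is an assumption, since the internal layout of the archives is not described here.

```python
import tarfile
from pathlib import Path


def extract_archives(root: Path) -> list[Path]:
    """Unpack every audios/<subset>/*.tar.gz next to the archive.

    Returns the list of archives that were processed.
    """
    done = []
    for archive in sorted(root.glob("audios/*/*.tar.gz")):
        with tarfile.open(archive, mode="r:gz") as tar:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction filter, available on recent Python versions.
                tar.extractall(path=archive.parent, filter="data")
            else:
                tar.extractall(path=archive.parent)
        done.append(archive)
    return done
```

Calling `extract_archives(Path("./voxbox_subset"))` after the download script processes all three example subsets. Check where the extracted files land before relying on `wav_path`, and adjust the target directory if the archives carry their own top-level folder.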

Label Your Own Data

python -m tools.annotation \
    --audio_path 'path to the audio' \
    --text 'transcription of the audio'
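
To label several files, the command above can be assembled programmatically. This sketch only builds and prints the command lines. It assumes the tool is run from the repository root and accepts exactly the two flags shown above; the input pairs are hypothetical.

```python
import shlex
import sys


def annotation_command(audio_path: str, text: str) -> list[str]:
    """Build the argument list for one call to the annotation tool shown above."""
    return [
        sys.executable, "-m", "tools.annotation",
        "--audio_path", audio_path,
        "--text", text,
    ]


# Hypothetical (audio file, transcription) pairs.
items = [
    ("samples/a.wav", "Hello there."),
    ("samples/b.wav", "It's a nice day."),
]

for audio, text in items:
    cmd = annotation_command(audio, text)
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(cmd))
```

Passing the arguments as a list (for example to `subprocess.run(cmd, check=True)`) avoids shell-quoting problems when a transcription contains apostrophes or other special characters.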

License

Please refer to the original licenses of each sub-corpus. This dataset aggregates and annotates the metadata in a unified structure for research purposes.

Citation

If you use this dataset in your research, please consider citing:

@article{wang2025spark,
  title={Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens},
  author={Wang, Xinsheng and Jiang, Mingqi and Ma, Ziyang and Zhang, Ziyu and Liu, Songxiang and Li, Linqin and Liang, Zheng and Zheng, Qixi and Wang, Rui and Feng, Xiaoqin and others},
  journal={arXiv preprint arXiv:2503.01710},
  year={2025}
}
