<img src="assets/capspeech_logo.png"> <h3 align="center">🧢 CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech</h3> <p align="center"> 📄 <a href="https://arxiv.org/abs/2506.02863"><strong>Paper</strong></a> &nbsp;|&nbsp; 🌐 <a href="https://wanghelin1997.github.io/CapSpeech-demo/"><strong>Project Page</strong></a> &nbsp;|&nbsp; 🗂 <a href="https://huggingface.co/datasets/OpenSound/CapSpeech"><strong>Dataset</strong></a> &nbsp;|&nbsp; 🤗 <a href="https://huggingface.co/OpenSound/CapSpeech-models/"><strong>Models</strong></a> &nbsp;|&nbsp; 🚀 <a href="https://huggingface.co/spaces/OpenSound/CapSpeech-TTS/"><strong>Live Demo</strong></a> </p> <p align="center"> <!-- <img src="https://visitor-badge.laobi.icu/badge?page_id=WangHelin1997.CapSpeech" alt="Visitor Statistics" /> --> <img src="https://img.shields.io/github/stars/WangHelin1997/CapSpeech" alt="GitHub Stars" /> <img alt="Static Badge" src="https://img.shields.io/badge/license-CC%20BY--NC%204.0-blue.svg" /> </p>

Introduction

🧢 CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. CapSpeech defines a new benchmark covering five tasks:

  1. CapTTS: style-captioned TTS

  2. CapTTS-SE: text-to-speech synthesis with sound effects

  3. AccCapTTS: accent-captioned TTS

  4. EmoCapTTS: emotion-captioned TTS

  5. AgentTTS: text-to-speech synthesis for chat agents
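In every task above, the model conditions synthesis on a natural-language style caption alongside the transcript. A minimal sketch of that pairing follows; the tag format and field names are illustrative assumptions, not CapSpeech's actual input format, and the commented-out `datasets` call assumes the Hub dataset's default split name:

```python
def caption_to_prompt(text: str, caption: str) -> str:
    """Join a transcript with its free-form style caption into one
    conditioning string (tag layout is purely illustrative)."""
    return f"<caption> {caption} <text> {text}"

print(caption_to_prompt("Hello there.", "A calm, deep male voice."))

# To inspect the real schema, stream the corpus rather than downloading
# all 10M+ pairs (requires `pip install datasets` and network access):
# from datasets import load_dataset
# ds = load_dataset("OpenSound/CapSpeech", split="train", streaming=True)
# print(next(iter(ds)).keys())
```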


Usage

⚡ Quick Start

Explore CapSpeech directly in your browser via the Live Demo linked above; no installation is needed.

🛠️ Local Deployment

Install and run CapSpeech locally.
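A typical setup might look like the following sketch; the `requirements.txt` path is an assumption about the repo layout, so check the repository's own instructions for the exact commands.

```shell
# Sketch only: file names below are assumptions, not confirmed by this README.
git clone https://github.com/WangHelin1997/CapSpeech.git
cd CapSpeech
pip install -r requirements.txt   # assumes a requirements file at the repo root
```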

Development

Please refer to the following documents to prepare the data, train the model, and evaluate its performance.

Main Contributors

Citation

If you find this work useful, please consider contributing to this repo and citing this work:

@misc{wang2025capspeechenablingdownstreamapplications,
      title={CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech}, 
      author={Helin Wang and Jiarui Hai and Dading Chong and Karan Thakkar and Tiantian Feng and Dongchao Yang and Junhyeok Lee and Laureano Moro Velazquez and Jesus Villalba and Zengyi Qin and Shrikanth Narayanan and Mounya Elhilali and Najim Dehak},
      year={2025},
      eprint={2506.02863},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2506.02863}, 
}

License

All datasets, listening samples, source code, pretrained checkpoints, and the evaluation toolkit are licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
See the LICENSE file for details.

Acknowledgements

This implementation is based on Parler-TTS, F5-TTS, SSR-Speech, Data-Speech, EzAudio, and Vox-Profile. We appreciate their awesome work.

🌟 Like This Project?

If you find this repo helpful or interesting, consider dropping a ⭐ — it really helps and means a lot!
