VoiceCloning

Generative voice cloning model using TTS synthesis with state-of-the-art zero-shot multi-speaker functionality. A web API built on the YourTTS model to clone voices and generate realistic audio waveforms.

Voice Cloning Model with Zero-Shot Attention-Based TTS

The model behind this API is YourTTS, a zero-shot multi-speaker TTS model for generative audio modeling.

The paper that proposed the YourTTS model is the central building block of the API. YourTTS is a multilingual approach to zero-shot multi-speaker TTS: it can be trained on multilingual audio data and builds on the earlier VITS model.
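As a rough illustration of zero-shot cloning with YourTTS, the sketch below wraps the Coqui `TTS` Python package. It is not this repository's API code. The function name `clone_voice`, the file paths, and the default output name are hypothetical, and the model name and language codes are assumed to match the YourTTS release shipped with the installed Coqui TTS version.

```python
from pathlib import Path

# Assumed identifiers for the Coqui-hosted YourTTS release; verify them
# against the installed TTS version before relying on them.
MODEL_NAME = "tts_models/multilingual/multi-dataset/your_tts"
SUPPORTED_LANGUAGES = ("en", "fr-fr", "pt-br")


def clone_voice(text, speaker_wav, language="en", out_path="cloned.wav"):
    """Synthesize `text` in the voice of the speaker heard in `speaker_wav`.

    Hypothetical helper: a short reference recording is the only speaker
    information given to the model, which is what makes this zero-shot.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )

    # Imported lazily so the argument check above works without the
    # (large) TTS dependency installed.
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)  # downloads the checkpoint on first use
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
    )
    return Path(out_path)
```

A call such as `clone_voice("Hello there.", "reference.wav")` would write the synthesized audio to `cloned.wav`, assuming `reference.wav` is a clean recording of the target speaker.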

Reference implementations used to study TTS concepts can be found here.

The models studied are the open-source checkpoints provided by Coqui:

| Model | URL |
|---|---|
| Speaker Encoder | link |
| Exp 1. YourTTS-EN(VCTK) | link |
| Exp 1. YourTTS-EN(VCTK) + SCL | link |
| Exp 2. YourTTS-EN(VCTK)-PT | link |
| Exp 2. YourTTS-EN(VCTK)-PT + SCL | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR | link |
| Exp 3. YourTTS-EN(VCTK)-PT-FR SCL | link |
| Exp 4. YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL | link |

TTS Retraining Data

The audio samples used for the MOS evaluation are available here. The MOS for each audio sample is also available here.

Default TTS Audio Sources:

LibriTTS (test clean): 1188, 1995, 260, 1284, 2300, 237, 908, 1580, 121 and 1089

VCTK: p261, p225, p294, p347, p238, p234, p248, p335, p245, p326 and p302

MLS Portuguese: 12710, 5677, 12249, 12287, 9351, 11995, 7925, 3050, 4367 and 1306

Citation


@ARTICLE{2021arXiv211202418C,
  author = {{Casanova}, Edresson and {Weber}, Julian and {Shulby}, Christopher and {Junior}, Arnaldo Candido and {G{\"o}lge}, Eren and {Antonelli Ponti}, Moacir},
  title = "{YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone}",
  journal = {arXiv e-prints},
  keywords = {Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing},
  year = 2021,
  month = dec,
  eid = {arXiv:2112.02418},
  pages = {arXiv:2112.02418},
  archivePrefix = {arXiv},
  eprint = {2112.02418},
  primaryClass = {cs.SD},
  adsurl = {https://ui.adsabs.harvard.edu/abs/2021arXiv211202418C},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
