VoiceCrafter

Dockerized VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

Disclaimer from the VoiceCraft GitHub repo

Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

Description

A Dockerized version of VoiceCraft (CUDA only) with a Gradio interface, built on the VoiceCraft GitHub repo and inspired by this webio implementation.

Screenshot

<img width="1313" alt="image" src="https://github.com/pselvana/VoiceCrafter/assets/1414489/831bcf0e-4682-454c-8f8c-18462f4b328a">

Installation: Create Docker image (5 minutes+)

# git clone https://github.com/pselvana/VoiceCrafter
# cd VoiceCrafter
# docker build -t voicecrafter .
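Because the image is CUDA only, it may help to confirm the host tools are present before starting the build. The sketch below is not part of VoiceCrafter; `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and treating `nvidia-smi` on the PATH as a sign that an NVIDIA driver is installed is an assumption.

```python
import shutil

# Assumed prerequisites: git and docker come from the steps above; nvidia-smi
# is used here only as a rough signal that an NVIDIA driver is installed.
REQUIRED_TOOLS = ("git", "docker", "nvidia-smi")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("git, docker and nvidia-smi were all found on PATH")
```

The `which` parameter is only there so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.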

Instructions

  • Run the command below to start your instance. You must complete the Installation steps above first.
# docker run --gpus=all -p 7860:7860 -it voicecrafter
  • Visit the gradio.live link printed in the output, or the local link, commonly localhost:7860

    Note: the interface is not currently authenticated, so anyone with the link can use it

  • Click the "Original Audio" tile to upload a clear clip, roughly 5-10 seconds long, in which only the subject is speaking.

    Tip: Trim anything longer, and choose audio with no background noise, crackles, or pops (file formats: mp3, m4a, wav)

  • Update "original_transcript" with the transcript of the uploaded audio, or leave the Autotranscribe input checkbox checked if you want Whisper to detect the text

  • Update "target_transcript" with the one or two sentences of text you want to generate

  • Click "Run" to generate audio

  • Click the play button next to "Generated Audio" to hear the clip, and the "..." button to download it
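The tip about trimming the reference clip can be sketched with Python's standard `wave` module. This is a minimal illustration, not part of VoiceCrafter: `wav_duration`, `trim_wav` and `MAX_SECONDS` are hypothetical names, and the sketch assumes an uncompressed WAV file (mp3 and m4a would need a different tool).

```python
import wave

MAX_SECONDS = 10.0  # upper end of the 5-10 second range suggested above


def wav_duration(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def trim_wav(src, dst, max_seconds=MAX_SECONDS):
    """Copy src to dst, keeping at most the first max_seconds of audio.

    Returns the duration, in seconds, of the audio written to dst.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(), int(max_seconds * reader.getframerate()))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        writer.writeframes(frames)
    return keep / params.framerate
```

For example, `trim_wav("speaker.wav", "speaker_10s.wav")` (both file names are placeholders) would write a clip of at most 10 seconds that can then be uploaded through the "Original Audio" tile.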

Models

| Model | Parameters | Memory | Runs on |
| -------- | ------- | ------- | ------- |
| fast-whisper | | | CPU |
| voicecraft | 330M | 4GB+ VRAM | GPU |
| voicecraft | 830M | 6GB+ VRAM | GPU |
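As a rough guide to reading the memory column, the sketch below picks the largest VoiceCraft checkpoint whose minimum VRAM fits a given GPU. The thresholds are copied from the table; the function name `largest_model_that_fits` and the `MIN_VRAM_GB` mapping are hypothetical and not part of VoiceCrafter.

```python
# Minimum VRAM per VoiceCraft checkpoint, taken from the Models table (in GB).
MIN_VRAM_GB = {"330M": 4, "830M": 6}


def largest_model_that_fits(vram_gb, requirements=MIN_VRAM_GB):
    """Return the checkpoint with the highest VRAM need that is <= vram_gb, or None."""
    fitting = [name for name, need in requirements.items() if need <= vram_gb]
    return max(fitting, key=requirements.get) if fitting else None
```

The table lists these figures as minimums ("4GB+", "6GB+"), so a card that only just meets a threshold may still be tight in practice.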

Original VoiceCraft License

The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some of the code from other repositories that are under different licenses: ./models/codebooks_patterns.py is under MIT license; ./models/modules, ./steps/optim.py, data/tokenizer.py are under Apache License, Version 2.0; the phonemizer we used is under GNU 3.0 License.

Please refer to the below for latest:
