VoiceCrafter
Dockerized VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
Disclaimer from the VoiceCraft GitHub repo
Any organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
Description
A Dockerized version of VoiceCraft (CUDA only) that offers a Gradio interface. It builds on the VoiceCraft GitHub repository and is inspired by this webio implementation.
Screenshot
<img width="1313" alt="image" src="https://github.com/pselvana/VoiceCrafter/assets/1414489/831bcf0e-4682-454c-8f8c-18462f4b328a">
Installation: Create Docker image (5 minutes+)
# git clone https://github.com/pselvana/VoiceCrafter
# cd VoiceCrafter
# docker build -t voicecrafter .
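Because the image is CUDA only, it can help to confirm the host has the needed tools before building. The sketch below is an optional pre-flight check and is not part of VoiceCrafter: `missing_tools` is a hypothetical helper, and it assumes that finding `nvidia-smi` on PATH is a rough proxy for an installed NVIDIA driver. It does not verify the NVIDIA Container Toolkit that `--gpus=all` relies on.

```python
import shutil


def missing_tools(tools=("git", "docker", "nvidia-smi")):
    """Return the names in `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("git, docker, and nvidia-smi were all found on PATH")
```

An empty result only means the executables exist; the build or the later `docker run` can still fail for other reasons.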
Instructions
- Run the command below to start your instance (you must complete the Installation steps above first):
# docker run --gpus=all -p 7860:7860 -it voicecrafter
- Visit the gradio.live link printed in the output, or the local link, commonly localhost:7860.
  Note: the interface is not currently authenticated, so anyone with the link can use it.
- Click the "Original Audio" tile to upload a clear recording, roughly 5-10 seconds long, of only the subject speaking.
  Tip: Trim out anything longer and choose audio with no background noise, crackles, or pops (file formats: mp3, m4a, wav).
- Update "original_transcript" with the transcript of the uploaded audio, or leave the Autotranscribe input checkbox checked if you want Whisper to detect the text.
- Update "target_transcript" with the sentence or two of text you want to generate.
- Click "Run" to generate audio.
- Click the play button next to "Generated Audio" to hear the clip, and the "..." to download it.
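The 5-10 second guidance for the reference clip can be checked before uploading. The following is a minimal sketch that uses only the Python standard library and therefore handles uncompressed WAV files only; mp3 and m4a clips would need another tool. The function names and the hard 5.0/10.0 bounds are assumptions made for illustration, since the instructions describe the length only as "on the order of" 5-10 seconds.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_length(path, min_s=5.0, max_s=10.0):
    """True if the clip falls inside the suggested 5-10 second window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A clip that fails the check can be trimmed in any audio editor before it is uploaded to the "Original Audio" tile.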
Models
| Model | Parameters | Memory | Runs on |
| -------- | ------- | ------- | ------- |
| fast-whisper | | | CPU |
| voicecraft | 330M | 4GB+ VRAM | GPU |
| voicecraft | 830M | 6GB+ VRAM | GPU |
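The VRAM figures in the table can be turned into a simple rule of thumb for choosing a checkpoint. This is only an illustrative sketch: `pick_voicecraft_model` is a hypothetical helper, not something VoiceCrafter provides, and it assumes the listed values are minimums to be compared against the GPU's total VRAM in GB.

```python
# Minimum VRAM in GB per VoiceCraft model, copied from the Models table.
VRAM_REQUIREMENTS_GB = {"830M": 6, "330M": 4}


def pick_voicecraft_model(vram_gb):
    """Return the largest model whose listed minimum fits, or None."""
    by_need = sorted(VRAM_REQUIREMENTS_GB.items(), key=lambda kv: kv[1], reverse=True)
    for name, needed_gb in by_need:
        if vram_gb >= needed_gb:
            return name
    return None
```

For example, `pick_voicecraft_model(8)` selects the 830M model, while a GPU with less than 4 GB returns `None`, meaning neither listed model is expected to fit.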
Original VoiceCraft License
The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that are under different licenses: ./models/codebooks_patterns.py is under the MIT license; ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; the phonemizer we used is under the GNU 3.0 License.
Please refer to the original VoiceCraft repository for the latest license terms.
