Epub2tts

Turn an epub or text file into an audiobook

NOTE: If you are using this with Coqui for voice cloning, I urge you to check out epub2tts-chatterbox! It is SO much better! It requires a GPU and is kind of slow, but the quality is amazing. No hallucinations, way better emotion, it is a HUGE improvement!

Also take a look at epub2tts-vibevoice, which uses Microsoft VibeVoice. That TTS engine handles long text really well, and the output is very high quality.

epub2tts is a free and open source Python app to easily create a full-featured audiobook from an epub or text file using realistic text-to-speech from Coqui AI TTS, OpenAI or MS Edge. Also check out epub2tts-edge and epub2tts-kokoro for lighter-weight engine-specific versions. The kokoro engine is especially good and fast!

🚀 Features

- [x] Creates standard format M4B audiobook file
- [x] Automatic chapter break detection
- [x] Embeds cover art if specified
- [x] Can use MS Edge for free cloud-based TTS
- [x] Easy voice cloning with Coqui XTTS model
- [x] 58 studio quality voices from Coqui AI
- [x] Uses deepspeed if available for faster processing
- [x] Resumes where it left off if interrupted
- [x] NOTE: epub file must be DRM-free

NOTE: NEW MULTIPROCESSING FEATURE ADDED! You can now use --threads N to set how many chapters are processed in parallel! If you're using Edge or OpenAI you can set threads to as many chapters as you've got and they can all be processed at the same time. When using TTS/XTTS, you'll need to do some experimenting to see what your system can handle.

NOTE: Check out epub2tts-edge for a VERY fast lightweight alternative that only works with MS Edge. That version reads multiple sentences in parallel and goes much quicker!

📖 Usage

<details> <summary> Usage instructions</summary>

Extract epub contents to text:

epub2tts mybook.epub --export txt
Edit mybook.txt, replacing # Part 1 etc. with the desired chapter names, and removing front matter like the table of contents and anything else you do not want read. Note: The first two lines can be Title: and Author: to use those values in the audiobook metadata. Also note: after Title/Author, the book copy MUST start with a chapter or section marked by a line with a hash mark at the beginning (like # Introduction).
The speaker can be changed per chapter by appending % <speaker> after the chapter name, for instance # Chapter One % en-US-AvaMultilingualNeural. See the file multi-speaker-sample-edge.txt for an example. Note: This only works with the Coqui TTS multi-speaker engine (default) or --engine edge.
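
The text-file conventions above can be checked before starting a long TTS run. The following is a minimal sketch, not epub2tts's own parser: `parse_book_text` and the shape of its return value are hypothetical names chosen for illustration. It assumes the optional metadata lines are written exactly as `Title:` and `Author:`, and that every line starting with `#` opens a new chapter.

```python
def parse_book_text(text: str) -> dict:
    """Split an exported text file into metadata and chapters.

    Hypothetical helper for illustration; epub2tts's real parsing may differ.
    """
    lines = text.splitlines()
    meta = {"title": None, "author": None}
    idx = 0
    # Optional metadata: only the first two lines are considered.
    while idx < len(lines) and idx < 2:
        line = lines[idx]
        if line.startswith("Title:"):
            meta["title"] = line[len("Title:"):].strip()
        elif line.startswith("Author:"):
            meta["author"] = line[len("Author:"):].strip()
        else:
            break
        idx += 1
    # Skip blank lines between the metadata and the first chapter marker.
    while idx < len(lines) and not lines[idx].strip():
        idx += 1
    if idx >= len(lines) or not lines[idx].startswith("#"):
        raise ValueError("book copy must start with a line beginning with '#'")

    chapters = []
    for line in lines[idx:]:
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
            # "# Chapter One % en-US-AvaMultilingualNeural" -> name + speaker
            name, sep, speaker = heading.partition("%")
            chapters.append({
                "name": name.strip(),
                "speaker": speaker.strip() if sep else None,
                "lines": [],
            })
        else:
            chapters[-1]["lines"].append(line)
    return {"meta": meta, "chapters": chapters}
```

Calling `parse_book_text(open("mybook.txt", encoding="utf-8").read())` on an edited export raises `ValueError` if the copy does not begin with a `#` line, which is the mistake the note above warns about.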

Default audiobook, fairly quick:

Using VITS model, all defaults, no GPU required:

epub2tts mybook.epub
To change the speaker (e.g. p307 for a good male voice with Coqui TTS), add: --speaker p307

Kokoro

Uses Kokoro, really high quality TTS.

Specify a speaker with --speaker <speaker>. Check here for available voices. The default speaker is af_sky if --speaker is not specified.
epub2tts mybook.txt --engine kokoro --speaker am_michael --speed 1.3
NOTE: Speed config is ignored for now, will fix at some point :)

MS Edge Cloud TTS:

Uses Microsoft Edge TTS in the cloud, FREE, only minimal CPU required, and it's pretty fast (100 minutes for 7hr book for instance). Many voices and languages to choose from, and the quality is really good (listen to sample-en-US-AvaNeural-edge.m4b for an example).

List available voices with edge-tts --list-voices. The default speaker is en-US-AndrewNeural if --speaker is not specified.
epub2tts mybook.txt --engine edge --speaker en-US-AvaNeural --cover cover-image.jpg --sayparts

XTTS with Coqui Studio voice:

Choose a studio voice, samples here
epub2tts mybook.txt --engine xtts --speaker "Damien Black" --cover cover-image.jpg --sayparts

XTTS using your own voice clone:

Run epub2tts mybook.epub --scan and determine which parts to start and end on, so you can skip the TOC, etc.
Obtain 1-3 clips, about 30 seconds each, of a speaker you really like (voice-1.wav, etc.)
epub2tts mybook.epub --start 4 --end 20 --xtts voice-1.wav,voice-2.wav,voice-3.wav --cover cover-image.jpg
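
Before cloning, it can help to confirm that each clip is roughly the length suggested above. This is a minimal sketch using only Python's standard-library `wave` module; `clip_seconds` and `check_clips` are hypothetical helper names, the 5-second tolerance is an arbitrary choice, and the code handles uncompressed WAV files only (not mp3).

```python
import wave


def clip_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clips(paths, target=30.0, tolerance=5.0):
    """Map each path to (duration_in_seconds, within_tolerance).

    The 30-second target comes from the step above; the tolerance
    is an assumption made for this sketch.
    """
    report = {}
    for path in paths:
        seconds = clip_seconds(path)
        report[path] = (seconds, abs(seconds - target) <= tolerance)
    return report
```

For example, `check_clips(["voice-1.wav", "voice-2.wav", "voice-3.wav"])` flags any clip that is much shorter or longer than 30 seconds before it is passed to --xtts.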

All options

-h, --help - show this help message and exit
--threads [N] - process N chapters in parallel. If you're using Edge or OpenAI you can basically use as many threads as you have chapters. With TTS or XTTS you'll need to experiment to see what works best in your environment. Default number of threads is 2.
--engine [ENGINE] - Which TTS engine to use [tts|xtts|openai|edge|kokoro]
--xtts [sample-1.wav,sample-2.wav] - Sample wave/mp3 file(s) for XTTS v2 training separated by commas
--openai OPENAI_API_KEY - OpenAI API key if engine is OpenAI
--model [MODEL] - TTS model to use, default: tts_models/en/vctk/vits
--speaker SPEAKER - Speaker to use (examples: p335 for VITS, onyx for OpenAI, "Damien Black" for XTTS v2, en-US-EricNeural for edge)
--scan - Scan the epub to show beginning of chapters, then exit
--start [START] - Chapter/part to start from
--end [END] - Chapter/part to end with
--language [LANGUAGE] - Language of the epub, default: en
--minratio [MINRATIO] - Minimum match ratio between text and transcript, 0 to disable whisper
--skiplinks - Skip reading any HTML links
--skipfootnotes - Try to skip reading footnotes
--skip-cleanup - Do not replace special characters with ","
--sayparts - Say each part number at start of section
--bitrate [BITRATE] - Specify bitrate for output file
--debug - Enable debug output
--export txt - Export epub contents to file (txt, md coming soon)
--parapause - when using --export txt, this option inserts %P% at each paragraph break. Then, when creating audio with --engine edge, a 1.2 second pause is inserted wherever %P% is found in the copy.
--no-deepspeed - Disable deepspeed
--cover image.jpg - jpg image to use for cover
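
When epub2tts is driven from a script, the options above can be assembled as an argument list. This is a minimal sketch: `build_command` is a hypothetical helper, not part of epub2tts, and it covers only a handful of the flags listed above.

```python
import shlex


def build_command(book, engine=None, speaker=None, cover=None,
                  threads=None, start=None, end=None, sayparts=False):
    """Assemble an epub2tts argument list (hypothetical helper)."""
    cmd = ["epub2tts", book]
    # Options that take a value; skipped when left as None.
    valued = [
        ("--engine", engine),
        ("--speaker", speaker),
        ("--cover", cover),
        ("--threads", threads),
        ("--start", start),
        ("--end", end),
    ]
    for flag, value in valued:
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switch that takes no value.
    if sayparts:
        cmd.append("--sayparts")
    return cmd


cmd = build_command("mybook.txt", engine="edge", speaker="en-US-AvaNeural",
                    cover="cover-image.jpg", sayparts=True)
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

Passing the list to `subprocess.run(cmd, check=True)` would run the conversion, provided epub2tts is installed and on the PATH. Using a list rather than a single string avoids shell quoting problems with speaker names that contain spaces, such as "Damien Black".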

</details>

🐞 Reporting bugs

<details> <summary>How to report bugs/issues</summary>

Thank you in advance for reporting any bugs/issues you encounter! If you are having issues, first please search existing issues to see if anyone else has run into something similar previously.

If you've found something new, please open an issue and be sure to include:

The full command you executed
The platform (Linux, Windows, OSX, Docker)
Your Python version if not using Docker
Try running the command again with --debug --minratio 0 added on, to get more information
Relevant output around the crash, including the sentence (should be in debug output) if it crashed during a TTS step

</details>

🗒️ Release notes

<details> <summary>Release notes </summary>

20250216: Added Kokoro engine, still need to fix using speed parameter
20241005: A few new releases thanks to excellent contributions from https://github.com/calledit - this includes significant refactoring to improve the code base, adding --threads N feature for multiprocessing, and support for NCX files that improves detection of how text is separated in an epub.
20240403: Added support for specifying speaker per chapter, https://github.com/aedocw/epub2tts/issues/229
20240320: Added MS Edge cloud TTS support
20240301: Added --skip-cleanup option to skip replacement of special characters with ","
20240222: Implemented pause between sentences, https://github.com/aedocw/epub2tts/issues/208 and https://github.com/aedocw/epub2tts/issues/153
20240131: Repaired missing pause between chapters
20240114: Updated README
20240111: Added support for Title & Author in text files
20240110: Added support for "--cover image.jpg"

</details>

Performance

<details> <summary>Some benchmarks</summary>

The VITS model is the fastest and does not require a GPU, but it does not sound as good as XTTS. We have not done any comparative benchmarks with the VITS model.

Typical inference times to expect for xtts_v2, averaged over 4 processing chunks (about 4 sentences each):

| Hardware                            | Inference Time |
|-------------------------------------|----------------|
| 20x CPU Xeon E5-2630 (without AVX)  | 3.7x realtime  |
| 20x CPU Xeon Silver 4214 (with AVX) | 1.7x realtime  |
| 8x CPU Xeon Silver 4214 (with AVX)  | 2.0x realtime  |
| 2x CPU Xeon Silver 4214 (with AVX)  | 2.9x realtime  |
| Intel N4100 Atom (NAS)              | 4.7x realtime  |
| GPU RTX A2000 4GB (w/o deepspeed)   | 0.4x realtime  |
| GPU RTX A2000 4GB (w deepspeed)     | 0.15x realtime |
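
To turn these figures into a rough time estimate for a whole book, the arithmetic can be sketched as follows. This assumes that "Nx realtime" means processing takes N times the duration of the resulting audio, which is consistent with the GPU rows being below 1 but is not stated explicitly in the table. The function names are hypothetical.

```python
def processing_hours(audio_hours: float, factor: float) -> float:
    """Estimated wall-clock hours, assuming factor = processing time / audio time."""
    return audio_hours * factor


def realtime_factor(processing_minutes: float, audio_minutes: float) -> float:
    """Realtime factor implied by a measured run."""
    return processing_minutes / audio_minutes


# A few factors copied from the table above.
factors = {
    "20x CPU Xeon E5-2630 (without AVX)": 3.7,
    "20x CPU Xeon Silver 4214 (with AVX)": 1.7,
    "GPU RTX A2000 4GB (w/o deepspeed)": 0.4,
    "GPU RTX A2000 4GB (w deepspeed)": 0.15,
}
for name, factor in factors.items():
    hours = processing_hours(7, factor)
    print(f"{name}: about {hours:.2f} h for a 7-hour book")

# The MS Edge section quotes 100 minutes for a 7-hour book.
print(f"edge: about {realtime_factor(100, 7 * 60):.2f}x realtime")
```

Under that assumption, a 7-hour audiobook would take roughly an hour on the GPU with deepspeed and about a day on the slowest Xeon listed. Actual times will vary with text, chunking and system load.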

</details>

📦 Install

Required Python version is 3.11.

<details> <summary>MAC INSTALLATION</summary>

This installation requires Python < 3.12 and Homebrew (I use homebrew to install espeak, pyenv and ffmpeg). Per this bug, mecab should also be installed via homebrew.

Voice models will be saved locally in ~/.local/share/tts

#install dependencies
brew install espeak pyenv ffmpeg mecab
#install epub2tts
git clone https://github.com/aedocw/epub2tts
cd epub2tts
pyenv install 3.11
pyenv local 3.11
#OPTIONAL but recommended - install this in a virtual environment
python -m venv .venv && source .venv/bin/activate
pip install coqui-tts --only-binary spacy
pip install .

</details> <details> <summary>LINUX INSTALLATION</summary>

These instruction
