Pandrator: a multilingual GUI audiobook, subtitle and dubbing generator with voice cloning and translation

[!TIP] TL;DR:

Pandrator is not an AI model itself, but a GUI framework for Text-to-Speech, subtitle generation and translation projects. It can generate audiobooks and subtitles/dubbing by leveraging several AI tools, custom workflows and algorithms. It works on Windows out of the box. It also works on Linux, but for now that requires a manual installation.

The easiest way to use it is to download one of the precompiled archives - simply unpack it and use the included launcher. See this table for their contents and sizes.

You can talk to me or share tips/workflows/ideas on the Discord server.

Quick Demonstration

This video shows the process of launching Pandrator, selecting a source file, starting generation, stopping it and previewing the saved file. It has not been sped up as it's intended to illustrate the real performance (you may skip the first 35s when the XTTS server is launching, and please remember to turn on the sound).

https://github.com/user-attachments/assets/7cab141a-e043-4057-8166-72cb29281c50

And here you can see the dubbing workflow - from a YouTube video, through transcription, translation and speech generation, to synchronisation.

https://github.com/user-attachments/assets/dfd4b6e8-3eda-49e4-bff4-f1683ec4cf21

About Pandrator

Pandrator aspires to be easy to use and install - it has a one-click installer and a graphical user interface. It is a tool designed to perform two tasks:

transform text, PDF (including see-through cropping), EPUB and SRT files into spoken audio in multiple languages, based chiefly on open source software run locally. Preprocessing makes the generated speech sound as natural as possible by, among other things, splitting the text into paragraphs, sentences and smaller logical text blocks (clauses), which the TTS models can process with minimal artifacts. Each sentence can be regenerated if the first attempt is not satisfactory, and sentences can be marked for regeneration using mouse or keyboard actions while listening back to the generation. Voice cloning is possible for models that support it, and text can be additionally preprocessed using LLMs (for example, to remove OCR artifacts or spell out things that the TTS models struggle with, like Roman numerals and abbreviations),
generate dubbing either directly from a video file, including transcription (using WhisperX), or from an .srt file. It includes a complete workflow from a video file to a dubbed video file with subtitles - including translation using a variety of APIs and techniques to improve the quality of translation. Subdub, a companion app developed for this purpose, can also be used on its own. You can also correct or translate subtitles without generating audio.
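
The text splitting described in the first task can be illustrated with a minimal sketch. This is not Pandrator's actual code: the function names, the regular expressions and the 200-character limit are assumptions chosen for illustration, and a real splitter needs language-aware rules (abbreviations, quotation marks and so on).

```python
import re

# Hypothetical limit; the chunk length a given TTS model handles well varies.
MAX_LEN = 200


def split_sentences(paragraph: str) -> list[str]:
    """Naively split a paragraph after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def split_long_sentence(sentence: str, max_len: int = MAX_LEN) -> list[str]:
    """Break an over-long sentence at clause punctuation (',', ';', ':')."""
    if len(sentence) <= max_len:
        return [sentence]
    clauses = re.split(r"(?<=[,;:])\s+", sentence)
    chunks, current = [], ""
    for clause in clauses:
        candidate = f"{current} {clause}".strip()
        if current and len(candidate) > max_len:
            # Adding this clause would exceed the limit: close the chunk.
            chunks.append(current)
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def preprocess(text: str, max_len: int = MAX_LEN) -> list[list[str]]:
    """Return one list of TTS-sized chunks per paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        [
            chunk
            for sentence in split_sentences(" ".join(p.split()))
            for chunk in split_long_sentence(sentence, max_len)
        ]
        for p in paragraphs
    ]
```

Paragraphs are split on blank lines, then into sentences, and only sentences longer than the limit are broken further at clause boundaries. A single clause longer than the limit is kept whole in this sketch.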

At the moment, it uses XTTS (for its exceptional multilingual capabilities, good quality and easy fine-tuning) and Silero for text-to-speech conversion and voice cloning, RVC_CLI for quality improvement and better voice cloning results, and NISQA for audio quality evaluation. It also uses Text Generation Webui's API for local LLM-based text pre-processing, which enables a wide range of text manipulations before audio generation.

Supported Languages

XTTS supports English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh-cn), Japanese (ja), Hungarian (hu) and Korean (ko).
Silero supports English, German, Russian, Spanish, French, Hindi, Tatar, Ukrainian, Uzbek and Kalmyk.

[!NOTE] Please note that Pandrator is still in an alpha stage and I'm not an experienced developer (I'm a noob, in fact), so the code is far from perfect in terms of optimisation, features and reliability. Please keep this in mind and contribute, if you want to help me make it better.

Samples

The samples were generated using the minimal settings - no LLM text processing, RVC or TTS evaluation, and no sentences were regenerated. Both XTTS and Silero generations were faster than playback speed, and Silero used only one CPU core.

https://github.com/user-attachments/assets/1c763c94-c66b-4c22-a698-6c4bcf3e875d

https://github.com/lukaszliniewicz/Pandrator/assets/75737665/118f5b9c-641b-4edd-8ef6-178dd924a883

Dubbing sample, including translation (video source):

https://github.com/user-attachments/assets/1ba8068d-986e-4dec-a162-3b7cc49052f4

Requirements

Hardware Requirements

| TTS Model | CPU Requirements | GPU Requirements |
|-----------|------------------|------------------|
| XTTS | A reasonably modern CPU with 4+ cores (for CPU-only generation) | NVIDIA GPU with 4GB+ of VRAM for good performance |
| Silero | Performs well on most CPUs regardless of core count | N/A |
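
If you are unsure which engine suits your machine, a rough check along the following lines may help. It is only a sketch and not part of Pandrator: the function names are hypothetical, the thresholds simply mirror the hardware table, and the VRAM query assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on your PATH. Note that `os.cpu_count()` reports logical cores, which may be more than the number of physical cores.

```python
import os
import shutil
import subprocess


def detect_vram_mb() -> int:
    """Return total VRAM of the first NVIDIA GPU in MiB, or 0 if none is found."""
    if shutil.which("nvidia-smi") is None:
        return 0
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
        return int(out.splitlines()[0].strip())
    except (subprocess.SubprocessError, OSError, ValueError, IndexError):
        return 0


def suggest_engine(cpu_cores: int, vram_mb: int) -> str:
    """Map the hardware table to a suggestion (thresholds are assumptions)."""
    if vram_mb >= 4096:   # "4GB+ of VRAM"
        return "XTTS (GPU)"
    if cpu_cores >= 4:    # "4+ cores (for CPU-only generation)"
        return "XTTS (CPU-only) or Silero"
    return "Silero"


if __name__ == "__main__":
    print(suggest_engine(os.cpu_count() or 1, detect_vram_mb()))
```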

Dependencies

This project relies on several APIs and services (running locally) and libraries, notably:

Required

XTTS API Server by daswer123 for Text-to-Speech (TTS) generation using Coqui XTTSv2 OR Silero API Server by ouoertheo for TTS generation using the Silero models.
FFmpeg for audio encoding.
Sentence Splitter by mediacloud for splitting .txt files into sentences, customtkinter by TomSchimansky, num2words by savoirfairelinux, and many others. For a full list, see requirements.txt.
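
For a manual installation, a quick way to see which of the required pieces listed above are missing is sketched below. It is an illustration only, not part of Pandrator: the function name is hypothetical, and the module names are the usual import names of the three libraries mentioned above. It does not check whether a TTS API server is running.

```python
import importlib.util
import shutil

# Import names of the libraries named in the list above.
REQUIRED_MODULES = ("sentence_splitter", "customtkinter", "num2words")
# Command line tools expected on PATH.
REQUIRED_TOOLS = ("ffmpeg",)


def missing_dependencies(modules=REQUIRED_MODULES, tools=REQUIRED_TOOLS):
    """Return (missing Python modules, missing command line tools)."""
    missing_modules = [m for m in modules
                       if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = missing_dependencies()
    print("Missing Python modules:", mods or "none")
    print("Missing tools:", tools or "none")
```

Run it inside the Python environment you intend to use for Pandrator, since the result depends on which packages that environment contains.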

Optional

Subdub, a command line app that transcribes video files, translates subtitles and synchronises the generated speech with the video, made specially for Pandrator.
WhisperX by m-bain, an enhanced implementation of OpenAI's Whisper model with improved alignment, used for dubbing and XTTS training.
Easy XTTS Trainer, a command line app that enables XTTS fine-tuning using one or more audio files, made specially for Pandrator.
RVC Python by daswer123 for enhancing voice quality and cloning results with Retrieval Based Voice Conversion.
Text Generation Webui API by oobabooga for LLM-based text pre-processing.
NISQA by gabrielmittag for evaluating TTS generations (using the FastAPI implementation).

Installation

Self-contained packages

I've prepared packages (archives) that you can simply unpack - everything is preinstalled in its own portable conda environment. You can download them from here.

You can use the launcher to start Pandrator, update it and install new features.

| Package | Contents | Unpacked Size |
|---------|----------|---------------|
| 1 | Pandrator and Silero | 4GB |
| 2 | Pandrator and XTTS | 14GB |
| 3 | Pandrator, XTTS, RVC, WhisperX (for dubbing) and XTTS fine-tuning | 36GB |
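
Before downloading, you may want to check whether the target drive has room for the unpacked package. The following is a minimal sketch with hypothetical function names; the sizes are the unpacked sizes from the table, and it does not account for the space the downloaded archive itself takes up.

```python
import shutil

# Unpacked sizes in GB, taken from the package table.
PACKAGE_SIZES_GB = {1: 4, 2: 14, 3: 36}


def free_space_gb(path: str = ".") -> float:
    """Free space on the drive containing `path`, in GB (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


def packages_that_fit(free_gb: float, sizes=PACKAGE_SIZES_GB) -> list[int]:
    """Return the package numbers whose unpacked size fits in `free_gb`."""
    return [pkg for pkg, size in sorted(sizes.items()) if size <= free_gb]


if __name__ == "__main__":
    free = free_space_gb(".")
    print(f"Free space: {free:.1f} GB, packages that fit: {packages_that_fit(free)}")
```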

GUI Installer and Launcher (Windows)


Run pandrator_installer_launcher.exe with administrator privileges. You will find it under Releases. The executable was created using pyinstaller from pandrator_installer_launcher.py in the repository.

The file may be flagged as a threat by antivirus software, so you may have to add it as an exception. If you're not comfortable doing that, install C++ Build Tools and Calibre manually or perform a fully manual installation.

You can choose which TTS engines to install and whether to install the software that enables RVC voice cloning (RVC Python), dubbing (WhisperX) and XTTS fine-tuning (Easy XTTS Trainer). You may install more components later.

The Installer/Launcher performs the following tasks:

Creates the Pandrator folder
Installs necessary tools if not already present:
- C++ Build Tools
- Calibre
Installs Miniconda (locally, not system-wide)
Clones the following repositories:
- Pandrator
- Subdub
- PyPDFCrop
