PolyglotPDF

(eBook and PDF translation) A multilingual eBook processing tool supporting all eBook formats. It offers online and offline translation while preserving the original layout, and works with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.

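The Python package is not expected to be updated before version 2.2. Version 2.2 is expected to switch to layout logic that parses the lowest-level spans to obtain more information. This should fix three problems: inline formulas being misclassified as formula blocks, bold text being wrongly split into separate segments, and the insert_html method embedding font files repeatedly, which wastes computing resources and causes severe lag on PDFs with many pages.

As things stand, PolyglotPDF's parsing approach is still the best option for text-based PDFs; OCR and layout analysis are not always perfect. (Superscript and subscript handling is under consideration: most PDF files fake superscripts and subscripts by setting coordinates and font size, and one option is to replace them with the Unicode code points of real superscript and subscript characters, although that is not perfect either.) For report-style documents with tables, PolyglotPDF's results are very good, although complex vector math formulas inside tables still cannot be handled correctly.

Feedback on possible improvements is welcome. For text with complex color layouts, or bold text mixed with regular text, the following method is proposed: flow content can be parsed into HTML like this: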

<p style="color: red; display: inline;">ABSTRACT: </p>
<p style="display: inline;">
    The swine industry annually suffers significant economic losses caused by porcine reproductive and respiratory syndrome virus (PRRSV). Because the available commercial vaccines have limited protective efficacy against epidemic PRRSV, there is an urgent need for innovative solutions. Nanoparticle vaccines induce robust immune responses and have become a promising direction in vaccine development. In this study, we designed and produced a self-assembling nanoparticle vaccine derived from thermophilic archaeal ferritin to combat epidemic PRRSV. First, multiple T cell epitopes targeting viral structural proteins were identified by IFN-γ screening after PRRSV infection. Three different self-assembled nanoparticles with epitopes targeting viral GP3, GP4, and GP5.
</p>
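
As a rough illustration of the idea above, the sketch below turns a list of span dictionaries into the inline `<p>` markup shown. It assumes spans shaped like the ones PyMuPDF's `page.get_text("dict")` returns (`text`, `color` as an sRGB integer, `flags` with bit 4 set for bold). The function name `spans_to_html` is hypothetical and is not part of PolyglotPDF.

```python
from html import escape

BOLD_FLAG = 1 << 4  # PyMuPDF sets bit 4 of span["flags"] for bold fonts


def spans_to_html(spans):
    """Render span dicts as inline <p> elements that keep color and weight."""
    parts = []
    for span in spans:
        styles = []
        color = span.get("color", 0)
        if color:  # 0 is black, the default text color, so no style is needed
            styles.append(f"color: #{color:06x};")
        if span.get("flags", 0) & BOLD_FLAG:
            styles.append("font-weight: bold;")
        styles.append("display: inline;")
        style = " ".join(styles)
        parts.append(f'<p style="{style}">{escape(span["text"])}</p>')
    return "\n".join(parts)


# Hand-written spans standing in for real parser output (an assumption).
example = [
    {"text": "ABSTRACT: ", "color": 0xFF0000, "flags": 16},
    {"text": "The swine industry annually suffers ...", "color": 0, "flags": 0},
]
print(spans_to_html(example))
```

Emitting one element per span keeps the style boundaries explicit, so a translator only has to preserve the tags rather than guess where the formatting changes.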

Content parsed this way can only be translated by LLMs. The translation result looks like this:

<p style="color: red; display: inline;">摘要：</p>
<p style="display: inline;">
  猪产业每年因猪繁殖与呼吸综合征病毒（PRRSV）造成显著的经济损失。由于现有的商业疫苗对流行性PRRSV的保护效果有限，迫切需要创新的解决方案。纳米粒子疫苗能够引发强烈的免疫反应，已成为疫苗开发的一个有前景的方向。在本研究中，我们设计并生产了一种源自嗜热古细菌铁蛋白的自组装纳米粒子疫苗，以对抗流行性PRRSV。首先，通过PRRSV感染后的IFN-γ筛选，识别出针对病毒结构蛋白的多个T细胞表位。三种不同的自组装纳米粒子携带针对病毒GP3、GP4和GP5的表位。
</p>

Bold text can be carried over as well:

<p style="color: blue; font-weight: bold; display: inline;">摘要：</p>
<p style="display: inline;">
  猪产业每年因猪繁殖与呼吸综合征病毒（PRRSV）造成显著的经济损失。由于现有的商业疫苗对流行性PRRSV的保护效果有限，迫切需要创新的解决方案。纳米粒子疫苗能够引发强烈的免疫反应，已成为疫苗开发的一个有前景的方向。在本研究中，我们设计并生产了一种源自嗜热古细菌铁蛋白的自组装纳米粒子疫苗，以对抗流行性PRRSV。首先，通过PRRSV感染后的IFN-γ筛选，识别出针对病毒结构蛋白的多个T细胞表位。三种不同的自组装纳米粒子携带针对病毒GP3、GP4和GP5的表位。
</p>

This method comes very close to perfect handling, and it is currently being considered as an optional enhanced feature.
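
Because the translated markup comes back from an LLM, it may be worth checking that the tags and style attributes survived translation before the text is re-inserted. The following is a minimal sketch using only the standard library; `markup_skeleton` and `same_markup` are hypothetical helpers, not functions PolyglotPDF is known to provide.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Record start tags (with attributes) and end tags, ignoring text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def markup_skeleton(html_text):
    """Return the ordered list of tag events found in an HTML fragment."""
    collector = _TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.events


def same_markup(source_html, translated_html):
    """True when both fragments have identical tags and attributes in order."""
    return markup_skeleton(source_html) == markup_skeleton(translated_html)


source = '<p style="color: red; display: inline;">ABSTRACT: </p>'
kept = '<p style="color: red; display: inline;">摘要：</p>'
dropped = '<p>摘要：</p>'
print(same_markup(source, kept))     # True
print(same_markup(source, dropped))  # False
```

A failed check could trigger a retry or a fallback to plain-text translation for that block; which of the two is preferable is left open here.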

English | 简体中文 | 繁體中文 | 日本語 | 한국어

PolyglotPDF

Demo

🎬 Watch Full Video

LLMs are now the translation API of choice; Doubao, Qwen, DeepSeek-V3, and GPT-4o-mini are recommended. The color space error can be resolved by filling the white areas in PDF files. The old text-to-text translation API has been removed.

In addition, an arXiv search function and rendering of arXiv papers after LaTeX translation are under consideration.

Pages show

LLM API Application

302.AI

AI service aggregation platform supporting multiple international mainstream AI models:

Official Website: 302.AI
Registration: Sign up with invitation link (Use invitation code JBmCb1 to get $1 bonus)
Available Models: GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, DeepSeek-V3 and more
Features: Access multiple AI models with one account, pay-per-use pricing

Doubao & DeepSeek

Apply through Volcengine platform:

Application URL: Volcengine-Doubao
Available Models: Doubao, DeepSeek series models

Tongyi Qwen

Apply through Alibaba Cloud platform:

Application URL: Alibaba Cloud-Tongyi Qwen
Available Models: Qwen-Max, Qwen-Plus series models

Overview

PolyglotPDF is an advanced PDF processing tool that employs specialized techniques for ultra-fast text, table, and formula recognition in PDF documents, typically completing processing within 1 second. It features OCR capabilities and layout-preserving translation, with full document translations usually completed within 10 seconds (speed may vary depending on the translation API provider).

Features

Ultra-Fast Recognition: Processes text, tables, and formulas in PDFs within ~1 second
Layout-Preserving Translation: Maintains original document formatting while translating content
OCR Support: Handles scanned documents efficiently
Text-Based PDFs: No GPU required
Quick Translation: Complete PDF translation in approximately 10 seconds
Flexible API Integration: Compatible with various translation service providers
Web-based Comparison Interface: Side-by-side comparison of original and translated documents
Enhanced OCR Capabilities: Improved accuracy in text recognition and processing
Offline Translation Support: Uses a smaller translation model

Installation and Usage

<details> <summary>Standard Installation</summary>

Clone the repository:

git clone https://github.com/CBIhalsen/PolyglotPDF.git
cd PolyglotPDF

Install required packages:

pip install -r requirements.txt

Configure your API key in config.json. The alicloud translation API is not recommended.
Run the application:

python app.py

Access the web interface: Open your browser and navigate to http://127.0.0.1:8000

</details> <details> <summary>Docker Installation</summary>

Quick Start Without Persistence

If you want to quickly test PolyglotPDF without setting up persistent directories:

# Pull the image first
docker pull 2207397265/polyglotpdf:latest

# Run container without mounting volumes (data will be lost when container is removed)
docker run -d -p 12226:12226 --name polyglotpdf 2207397265/polyglotpdf:latest

This is the fastest way to try PolyglotPDF, but all uploaded PDFs and configuration changes will be lost when the container is removed.

Installation with Persistent Storage

# Create necessary directories
mkdir -p config fonts static/original static/target static/merged_pdf

# Create config file
nano config/config.json    # or use any text editor
# Copy configuration template from the project into this file
# Make sure to fill in your API keys and other configuration details

# Set permissions
chmod -R 755 config fonts static

Quick Start

Use the following commands to pull and run the PolyglotPDF Docker image:

# Pull image
docker pull 2207397265/polyglotpdf:latest

# Run container
docker run -d -p 12226:12226 --name polyglotpdf \
  -v ./config/config.json:/app/config.json \
  -v ./fonts:/app/fonts \
  -v ./static/original:/app/static/original \
  -v ./static/target:/app/static/target \
  -v ./static/merged_pdf:/app/static/merged_pdf \
  2207397265/polyglotpdf:latest

Access the Application

After the container starts, open in your browser:

http://localhost:12226
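
The container may need a moment before the web interface responds. The following is a small sketch of a readiness check that polls the URL above using only the standard library; `wait_until_up` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, timeout=30.0, interval=1.0):
    """Poll url until it answers over HTTP or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not reachable yet; retry
    return False
```

Calling `wait_until_up("http://localhost:12226")` returns `True` once the server answers and `False` if nothing responds within the timeout.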

Using Docker Compose

Create a docker-compose.yml file:

version: '3'
services:
  polyglotpdf:
    image: 2207397265/polyglotpdf:latest
    ports:
      - "12226:12226"
    volumes:
      - ./config/config.json:/app/config.json # Configuration file
      - ./fonts:/app/fonts # Font files
      - ./static/original:/app/static/original # Original PDFs
      - ./static/target:/app/static/target # Translated PDFs
      - ./static/merged_pdf:/app/static/merged_pdf # Merged PDFs
    restart: unless-stopped

Then run:

docker-compose up -d

Common Docker Commands

# Stop container
docker stop polyglotpdf

# Restart container
docker restart polyglotpdf

# View logs
docker logs polyglotpdf

</details>

Requirements

Python 3.8+
deepl==1.17.0
Flask==2.0.1
Flask-Cors==5.0.0
langdetect==1.0.9
Pillow==10.2.0
PyMuPDF==1.24.0
pytesseract==0.3.10
requests==2.31.0
tiktoken==0.6.0
Werkzeug==2.0.1
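
To see whether an existing environment already matches these pins, a check along the following lines may help. It is a sketch built on the standard library's `importlib.metadata` (available from Python 3.8); `parse_pins` and `find_mismatches` are hypothetical helper names, and the pin list is copied from above rather than read from requirements.txt.

```python
from importlib import metadata

PINS = """\
deepl==1.17.0
Flask==2.0.1
Flask-Cors==5.0.0
langdetect==1.0.9
Pillow==10.2.0
PyMuPDF==1.24.0
pytesseract==0.3.10
requests==2.31.0
tiktoken==0.6.0
Werkzeug==2.0.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed_or_None)} for every unmet pin."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


for name, (wanted, installed) in find_mismatches(parse_pins(PINS)).items():
    print(f"{name}: want {wanted}, found {installed or 'not installed'}")
```

The script prints nothing when every pinned package is installed at the listed version.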

Acknowledgments

This project leverages PyMuPDF's capabilities for efficient PDF processing and layout preservation.

Upcoming Improvements

PDF chat functionality
Academic PDF search integration
Optimization for even faster processing speeds

Known Issues

Issue Description: Error during text re-editing: code=4: only Gray, RGB, and CMYK colorspaces supported
Symptom: Unsupported color space encountered during text block editing
Current Workaround: Skip text blocks with unsupported color spaces
Proposed Solution: Switch to OCR mode for entire pages containing unsupported color spaces
Example: View PDF sample with unsupported color spaces
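
The current workaround and the proposed solution differ mainly in granularity, which the sketch below tries to make concrete. It assumes each text block has already been tagged with a PDF colorspace name; the block dictionaries, the `colorspace` key, and both function names are hypothetical and do not reflect PolyglotPDF's actual data structures.

```python
# Colorspaces the error message lists as supported (Gray, RGB, CMYK).
SUPPORTED = {"DeviceGray", "DeviceRGB", "DeviceCMYK"}


def editable_blocks(blocks):
    """Current workaround: drop blocks whose colorspace cannot be edited."""
    return [b for b in blocks if b["colorspace"] in SUPPORTED]


def page_mode(blocks):
    """Proposed solution: one unsupported block sends the whole page to OCR."""
    if all(b["colorspace"] in SUPPORTED for b in blocks):
        return "text"
    return "ocr"


page = [
    {"id": 0, "colorspace": "DeviceRGB"},
    {"id": 1, "colorspace": "Separation"},
]
print([b["id"] for b in editable_blocks(page)])  # [0]
print(page_mode(page))  # ocr
```

Skipping blocks leaves some text untranslated on the page, while the page-level switch trades the faster text path for complete coverage.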

TODO

□ Custom Terminology Database: Support custom terminology databases with prompts for domain-specific professional translation
□ AI Reflow Feature: Convert double-column PDFs to single-column HTML blog format for easier reading on mobile devices
□ Multi-format Export: Export translation results