PolyglotPDF
(eBook,PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.
Install / Use
/learn @CBIhalsen/PolyglotPDFREADME
python包在2.2版本之前预计不会更新,2.2版本预估采取解析最底层span获取更信息的布局逻辑解决,预估解决:行内公式错误判断为公式块,错误将粗体文本进行分段bug,以及insert_html方法重复嵌入字体文件导致处理页数较大pdf时浪费计算资源极其卡顿。 目前效果,对于基于文本的pdf,polyglotpdf的解析方式依旧是最优解。 ocr和布局分析并不总是完美。(考虑处理文本上下标问题,大部分pdf文件中上标下标文本通过指定坐标和字体大小实现伪上下标,考虑替换为真正的上下标文字对应的Unicode编码,但并不完美),对于报告型表格文档,polyglotpdf效果相当完美,当然表格中的复杂矢量数学公式依旧无法正确处理)。 寻求意见的改进方法,对于复杂的颜色布局文本或者粗体参杂常规字体文本,提出以下方法,对于流内容我们可以解析为html格式如下:
<p style="color: red; display: inline;">ABSTRACT: </p>
<p style="display: inline;">
The swine industry annually suffers significant economic losses caused by porcine reproductive and respiratory syndrome virus (PRRSV). Because the available commercial vaccines have limited protective efficacy against epidemic PRRSV, there is an urgent need for innovative solutions. Nanoparticle vaccines induce robust immune responses and have become a promising direction in vaccine development. In this study, we designed and produced a self-assembling nanoparticle vaccine derived from thermophilic archaeal ferritin to combat epidemic PRRSV. First, multiple T cell epitopes targeting viral structural proteins were identified by IFN-γ screening after PRRSV infection. Three different self-assembled nanoparticles with epitopes targeting viral GP3, GP4, and GP5.
</p>
这种解析内容只能由llms翻译,翻译结果如下:
<p style="color: red; display: inline;">摘要:</p>
<p style="display: inline;">
猪产业每年因猪繁殖与呼吸综合征病毒(PRRSV)造成显著的经济损失。由于现有的商业疫苗对流行性PRRSV的保护效果有限,迫切需要创新的解决方案。纳米粒子疫苗能够引发强烈的免疫反应,已成为疫苗开发的一个有前景的方向。在本研究中,我们设计并生产了一种源自嗜热古细菌铁蛋白的自组装纳米粒子疫苗,以对抗流行性PRRSV。首先,通过PRRSV感染后的IFN-γ筛选,识别出针对病毒结构蛋白的多个T细胞表位。三种不同的自组装纳米粒子携带针对病毒GP3、GP4和GP5的表位。
</p>
甚至包括粗体:
<p style="color: blue; font-weight: bold; display: inline;">摘要:</p>
<p style="display: inline;">
猪产业每年因猪繁殖与呼吸综合征病毒(PRRSV)造成显著的经济损失。由于现有的商业疫苗对流行性PRRSV的保护效果有限,迫切需要创新的解决方案。纳米粒子疫苗能够引发强烈的免疫反应,已成为疫苗开发的一个有前景的方向。在本研究中,我们设计并生产了一种源自嗜热古细菌铁蛋白的自组装纳米粒子疫苗,以对抗流行性PRRSV。首先,通过PRRSV感染后的IFN-γ筛选,识别出针对病毒结构蛋白的多个T细胞表位。三种不同的自组装纳米粒子携带针对病毒GP3、GP4和GP5的表位。
</p>
这种方法会无线接近于完美的处理,目前考虑将此方法作为强化功能选用
English | 简体中文 | 繁體中文 | 日本語 | 한국어
PolyglotPDF
Demo
<img src="https://github.com/CBIhalsen/PolyglotPDF/blob/main/static/demo.gif?raw=true" width="80%" height="40%">🎬 Watch Full Video
llms has been added as the translation api of choice, Doubao ,Qwen ,deepseek v3 , gpt4-o-mini are recommended. The color space error can be resolved by filling the white areas in PDF files. The old text to text translation api has been removed.
In addition, consider adding arxiv search function and rendering arxiv papers after latex translation.
Pages show
<div style="display: flex; margin-bottom: 20px;"> <img src="https://github.com/CBIhalsen/PolyglotPDF/blob/main/static/page1.png?raw=true" width="40%" height="20%" style="margin-right: 20px;"> <img src="https://github.com/CBIhalsen/PolyglotPDF/blob/main/static/page2.jpeg?raw=true" width="40%" height="20%"> </div> <div style="display: flex;"> <img src="https://github.com/CBIhalsen/PolyglotPDF/blob/main/static/page3.png?raw=true" width="40%" height="20%" style="margin-right: 20px;"> <img src="https://github.com/CBIhalsen/PolyglotPDF/blob/main/static/page4.png?raw=true" width="40%" height="20%"> </div>LLM API Application
302.AI
AI service aggregation platform supporting multiple international mainstream AI models:
- Official Website: 302.AI
- Registration: Sign up with invitation link (Use invitation code
JBmCb1to get $1 bonus) - Available Models: GPT-4o, GPT-4o-mini, Claude-3.7-Sonnet, DeepSeek-V3 and more
- Features: Access multiple AI models with one account, pay-per-use pricing
Doubao & Deepseek
Apply through Volcengine platform:
- Application URL: Volcengine-Doubao
- Available Models: Doubao, Deepseek series models
Tongyi Qwen
Apply through Alibaba Cloud platform:
- Application URL: Alibaba Cloud-Tongyi Qwen
- Available Models: Qwen-Max, Qwen-Plus series models
Overview
PolyglotPDF is an advanced PDF processing tool that employs specialized techniques for ultra-fast text, table, and formula recognition in PDF documents, typically completing processing within 1 second. It features OCR capabilities and layout-preserving translation, with full document translations usually completed within 10 seconds (speed may vary depending on the translation API provider).
Features
- Ultra-Fast Recognition: Processes text, tables, and formulas in PDFs within ~1 second
- Layout-Preserving Translation: Maintains original document formatting while translating content
- OCR Support: Handles scanned documents efficiently
- Text-based PDF:No GPU required
- Quick Translation: Complete PDF translation in approximately 10 seconds
- Flexible API Integration: Compatible with various translation service providers
- Web-based Comparison Interface: Side-by-side comparison of original and translated documents
- Enhanced OCR Capabilities: Improved accuracy in text recognition and processing
- Support for offline translation: Use smaller translation model
Installation and Usage
<details> <summary>Standard Installation</summary>- Clone the repository:
git clone https://github.com/CBIhalsen/PolyglotPDF.git
cd polyglotpdf
- Install required packages:
pip install -r requirements.txt
-
Configure your API key in config.json. The alicloud translation API is not recommended.
-
Run the application:
python app.py
- Access the web interface:
Open your browser and navigate to
http://127.0.0.1:8000
Quick Start Without Persistence
If you want to quickly test PolyglotPDF without setting up persistent directories:
# Pull the image first
docker pull 2207397265/polyglotpdf:latest
# Run container without mounting volumes (data will be lost when container is removed)
docker run -d -p 12226:12226 --name polyglotpdf 2207397265/polyglotpdf:latest
This is the fastest way to try PolyglotPDF, but all uploaded PDFs and configuration changes will be lost when the container stops.
Installation with Persistent Storage
# Create necessary directories
mkdir -p config fonts static/original static/target static/merged_pdf
# Create config file
nano config/config.json # or use any text editor
# Copy configuration template from the project into this file
# Make sure to fill in your API keys and other configuration details
# Set permissions
chmod -R 755 config fonts static
Quick Start
Use the following commands to pull and run the PolyglotPDF Docker image:
# Pull image
docker pull 2207397265/polyglotpdf:latest
# Run container
docker run -d -p 12226:12226 --name polyglotpdf \
-v ./config/config.json:/app/config.json \
-v ./fonts:/app/fonts \
-v ./static/original:/app/static/original \
-v ./static/target:/app/static/target \
-v ./static/merged_pdf:/app/static/merged_pdf \
2207397265/polyglotpdf:latest
Access the Application
After the container starts, open in your browser:
http://localhost:12226
Using Docker Compose
Create a docker-compose.yml file:
version: '3'
services:
polyglotpdf:
image: 2207397265/polyglotpdf:latest
ports:
- "12226:12226"
volumes:
- ./config.json:/app/config.json # Configuration file
- ./fonts:/app/fonts # Font files
- ./static/original:/app/static/original # Original PDFs
- ./static/target:/app/static/target # Translated PDFs
- ./static/merged_pdf:/app/static/merged_pdf # Merged PDFs
restart: unless-stopped
Then run:
docker-compose up -d
Common Docker Commands
# Stop container
docker stop polyglotpdf
# Restart container
docker restart polyglotpdf
# View logs
docker logs polyglotpdf
</details>
Requirements
- Python 3.8+
- deepl==1.17.0
- Flask==2.0.1
- Flask-Cors==5.0.0
- langdetect==1.0.9
- Pillow==10.2.0
- PyMuPDF==1.24.0
- pytesseract==0.3.10
- requests==2.31.0
- tiktoken==0.6.0
- Werkzeug==2.0.1
Acknowledgments
This project leverages PyMuPDF's capabilities for efficient PDF processing and layout preservation.
Upcoming Improvements
- PDF chat functionality
- Academic PDF search integration
- Optimization for even faster processing speeds
Known Issues
- Issue Description: Error during text re-editing:
code=4: only Gray, RGB, and CMYK colorspaces supported - Symptom: Unsupported color space encountered during text block editing
- Current Workaround: Skip text blocks with unsupported color spaces
- Proposed Solution: Switch to OCR mode for entire pages containing unsupported color spaces
- Example: View PDF sample with unsupported color spaces
TODO
- □ Custom Terminology Database: Support custom terminology databases with prompts for domain-specific professional translation
- □ AI Reflow Feature: Convert double-column PDFs to single-column HTML blog format for easier reading on mobile devices
- □ Multi-format Export: Export translation results
Related Skills
summarize
343.1kSummarize or extract text/transcripts from URLs, podcasts, and local files (great fallback for “transcribe this YouTube/video”).
feishu-doc
343.1k|
obsidian
343.1kWork with Obsidian vaults (plain Markdown notes) and automate via obsidian-cli.
openhue
343.1kControl Philips Hue lights and scenes via the OpenHue CLI.
