RapidOCRPDF
Extract PDF content with RapidOCR
<a href="https://huggingface.co/spaces/RapidAI/RapidOCRPDF" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97-Hugging Face Demo-blue"></a> <a href="https://www.modelscope.cn/studios/RapidAI/RapidOCRPDF/summary" target="_blank"><img src="https://img.shields.io/badge/魔搭-Demo-blue"></a> <a href=""><img src="https://img.shields.io/badge/Python->=3.6-aff.svg"></a> <a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg"></a> <a href="https://pypi.org/project/rapidocr-pdf/"><img alt="PyPI" src="https://img.shields.io/pypi/v/rapidocr-pdf"></a> <a href="https://pepy.tech/project/rapidocr-pdf"><img src="https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total&units=abbreviation&left_color=grey&right_color=blue&left_text=Downloads"></a> <a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a> <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a> <a href="https://choosealicense.com/licenses/apache-2.0/"><img alt="GitHub" src="https://img.shields.io/github/license/RapidAI/RapidOCRPDF"></a>
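</div>
Introduction
This repository builds on RapidOCR to quickly extract text from PDFs, including scanned PDFs, encrypted PDFs, and PDFs whose text can be copied directly.
Overall workflow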
flowchart LR
A(PDF) --> B{Can the content be extracted directly?} --Yes--> C(PyMuPDF)
B --No--> D(RapidOCR)
C & D --> E(Result)
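The flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: `extract_pages` and `fake_ocr` are hypothetical names, the per-page text layer is passed in as plain strings (standing in for what a direct extractor such as PyMuPDF would return), and the confidence of 1.0 for directly extracted text is a placeholder.

```python
from typing import Callable, List, Tuple

PageResult = Tuple[int, str, float]  # (page number, text, confidence)


def extract_pages(
    page_texts: List[str],
    ocr_page: Callable[[int], Tuple[str, float]],
    force_ocr: bool = False,
) -> List[PageResult]:
    """Route each page to direct extraction or to OCR (hypothetical sketch).

    page_texts: the text layer of each page; empty for scanned pages.
    ocr_page:   callable that runs OCR on a page number and returns
                (text, confidence); stands in for a RapidOCR call.
    """
    results: List[PageResult] = []
    for page_num, text in enumerate(page_texts):
        if text.strip() and not force_ocr:
            # The page has a usable text layer, so take it directly.
            # Assumption: 1.0 is only a placeholder confidence.
            results.append((page_num, text.strip(), 1.0))
        else:
            # No text layer (or OCR was forced): fall back to OCR.
            ocr_text, score = ocr_page(page_num)
            results.append((page_num, ocr_text, score))
    return results


def fake_ocr(page_num: int) -> Tuple[str, float]:
    """Stub OCR engine used only for this demonstration."""
    return f"ocr-text-{page_num}", 0.9


mixed = extract_pages(["direct text", "   "], fake_ocr)
forced = extract_pages(["direct text"], fake_ocr, force_ocr=True)
print(mixed)
print(forced)
```

Passing the OCR step in as a callable keeps the routing decision separate from the engine, which is why the sketch runs without either library installed.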
Installation
pip install rapidocr_pdf
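Usage
Use in a script
⚠️ Note: in rapidocr_pdf>=0.4.0, page_num_list accepts negative page numbers. For a PDF with 2 pages in total, the valid range is [-2, 1].
⚠️ Note: rapidocr_pdf>=0.3.0 adds the page_num_list parameter. It defaults to None, which extracts all pages. When it is specified, page numbers start at 0.
⚠️ Note: rapidocr_pdf>=0.2.0 is adapted to rapidocr>=2.0.0, so a different OCR inference engine can be selected through parameters to speed up extraction.
The ocr_params below are example parameters only. For details, see the official RapidOCR documentation: docs.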
from rapidocr_pdf import RapidOCRPDF
pdf_extracter = RapidOCRPDF(ocr_params={"Global.with_torch": True})
pdf_path = "tests/test_files/direct_and_image.pdf"
# page_num_list=[1]: extract only the second page (page numbers start at 0)
texts = pdf_extracter(pdf_path, force_ocr=False, page_num_list=[1])
print(texts)
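The page-number rules in the notes above (0-based indices, negative values allowed, `None` meaning all pages) can be illustrated with a small helper. This is a sketch of the described semantics only: `normalize_page_nums` is a hypothetical name, and raising `ValueError` for out-of-range values is an assumption, since how the package handles them is not stated here.

```python
from typing import List, Optional


def normalize_page_nums(
    page_num_list: Optional[List[int]], total_pages: int
) -> List[int]:
    """Map user-supplied page numbers to 0-based indices (hypothetical helper)."""
    if page_num_list is None:
        # None means "extract every page".
        return list(range(total_pages))

    normalized = []
    for num in page_num_list:
        # For total_pages == 2 the accepted range is [-2, 1].
        if not -total_pages <= num < total_pages:
            # Assumption: reject out-of-range values with an error.
            raise ValueError(
                f"page number {num} is outside [{-total_pages}, {total_pages - 1}]"
            )
        # Negative values count from the end, like Python list indexing.
        normalized.append(num % total_pages)
    return normalized


all_pages = normalize_page_nums(None, 2)
from_both_ends = normalize_page_nums([-2, -1, 0, 1], 2)
print(all_pages, from_both_ends)
```

With 2 pages, `-2` and `0` both refer to the first page, and `-1` and `1` both refer to the second.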
Use from the command line
$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [--dpi DPI] [-f] [--page_num_list [PAGE_NUM_LIST ...]] pdf_path
positional arguments:
pdf_path
options:
-h, --help show this help message and exit
--dpi DPI
-f, --force_ocr Whether to use ocr for all pages.
--page_num_list [PAGE_NUM_LIST ...]
Which pages will be extracted. e.g. 0 1 2.
$ rapidocr_pdf tests/test_files/direct_and_image.pdf --page_num_list 0 1
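To show how the arguments in the help text are parsed, here is an `argparse` parser reconstructed from that output. It is an approximation rather than the package's actual source: `build_parser` is a hypothetical name, the integer type for `--dpi` is an assumption, and no default is set for `--dpi` because the help text does not show one.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Approximate the CLI shown in the help text (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="rapidocr_pdf")
    parser.add_argument("pdf_path", type=str)
    # Assumption: DPI is an integer; its real default is not shown in the help.
    parser.add_argument("--dpi", type=int)
    parser.add_argument(
        "-f",
        "--force_ocr",
        action="store_true",
        help="Whether to use ocr for all pages.",
    )
    # nargs="*" produces the "[PAGE_NUM_LIST ...]" form seen in the usage line.
    parser.add_argument(
        "--page_num_list",
        type=int,
        nargs="*",
        default=None,
        help="Which pages will be extracted. e.g. 0 1 2.",
    )
    return parser


args = build_parser().parse_args(
    ["tests/test_files/direct_and_image.pdf", "--page_num_list", "0", "1"]
)
print(args.pdf_path, args.force_ocr, args.page_num_list)
```

In this sketch the space-separated values after `--page_num_list` are collected into a list of integers, and omitting the flag leaves it as `None`.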
Input and output
Input: Union[str, Path, bytes]
Output: a List of [page number, text content, confidence] entries, as in the following example:
[
[0, '人之初,性本善。性相近,习相远。', 0.8969868],
[1, 'Men at their birth, are naturally good.', 0.8969868],
]
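A result in this shape can be post-processed with plain Python. The sketch below groups text by page and optionally drops low-confidence entries. `text_by_page` and `min_score` are hypothetical names, and the code assumes the confidence is a float; it also tolerates several entries per page, in case a page produces more than one.

```python
from collections import defaultdict
from typing import Dict, List

# Sample data copied from the example output above.
results = [
    [0, '人之初,性本善。性相近,习相远。', 0.8969868],
    [1, 'Men at their birth, are naturally good.', 0.8969868],
]


def text_by_page(results: List[list], min_score: float = 0.0) -> Dict[int, str]:
    """Group extracted text by page, skipping entries below min_score."""
    pages = defaultdict(list)
    for page_num, text, score in results:
        # Assumption: score is a float confidence in [0, 1].
        if score >= min_score:
            pages[page_num].append(text)
    # Join multiple entries of one page with newlines, in page order.
    return {num: "\n".join(parts) for num, parts in sorted(pages.items())}


grouped = text_by_page(results)
confident_only = text_by_page(results, min_score=0.95)
print(grouped)
print(confident_only)
```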
