RapidOCRPDF

Extract text content from PDFs using RapidOCR.


<div align="center"> <div align="center"> <h1><b><i>RapidOCR 📄 PDF</i></b></h1> </div>

</div>

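Introduction

This repository builds on RapidOCR to quickly extract text from PDFs, including scanned PDFs, encrypted PDFs, and PDFs whose text can be copied directly.

Overall workflow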

flowchart LR

A(PDF) --> B{Can the content be extracted directly?} --Yes--> C(PyMuPDF)
B --No--> D(RapidOCR)

C & D --> E(Result)
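The routing in the flowchart can be sketched in plain Python. This is a minimal illustration, not the library's actual implementation: `extract_pages`, `direct_texts`, and the `ocr_page` callback are hypothetical names standing in for the PyMuPDF text layer and the RapidOCR call, and the confidence of 1.0 for directly extracted text is an assumption.

```python
from typing import Callable, List, Tuple


def extract_pages(
    direct_texts: List[str],
    ocr_page: Callable[[int], Tuple[str, float]],
    force_ocr: bool = False,
) -> List[list]:
    """Route each page to direct extraction or to OCR.

    direct_texts[i] is whatever the text layer yielded for page i
    (empty or whitespace for scanned pages). ocr_page is a hypothetical
    callback returning (text, confidence) for a page index.
    """
    results = []
    for page_num, text in enumerate(direct_texts):
        if text.strip() and not force_ocr:
            # Usable text layer: skip OCR. Confidence 1.0 is an assumption.
            results.append([page_num, text, 1.0])
        else:
            # No usable text layer (or OCR was forced): fall back to OCR.
            ocr_text, score = ocr_page(page_num)
            results.append([page_num, ocr_text, score])
    return results


def fake_ocr(page_num: int) -> Tuple[str, float]:
    # Stand-in for a real OCR engine, used only for this demonstration.
    return (f"ocr text for page {page_num}", 0.9)


pages = extract_pages(["Hello", "   "], fake_ocr)
print(pages)
```

Passing the OCR step as a callback keeps the routing logic testable without a PDF or an OCR model on hand.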

Installation

pip install rapidocr_pdf

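Usage

Use in a script

⚠️ Note: in rapidocr_pdf>=0.4.0, page_num_list accepts negative values. For example, if the PDF has 2 pages, valid values fall in the range [-2, 1].

⚠️ Note: rapidocr_pdf>=0.3.0 adds the page_num_list parameter. It defaults to None, which extracts all pages. If specified, page numbers start from 0.

⚠️ Note: rapidocr_pdf>=0.2.0 is adapted to rapidocr>=2.0.0, so you can pass parameters to select a different OCR inference engine for faster processing. The ocr_params below are example parameters; for details, see the official RapidOCR documentation: docs.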

from rapidocr_pdf import RapidOCRPDF

pdf_extracter = RapidOCRPDF(ocr_params={"Global.with_torch": True})

pdf_path = "tests/test_files/direct_and_image.pdf"

# page_num_list=[1]: extract only the second page
texts = pdf_extracter(pdf_path, force_ocr=False, page_num_list=[1])
print(texts)
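The page_num_list rules in the notes above (None means all pages, indices start at 0, negative values count from the end) can be illustrated with a small helper. This is a sketch of those semantics only: `normalize_page_nums` is a hypothetical function, not part of rapidocr_pdf, and raising `ValueError` for out-of-range values is an assumption about reasonable behaviour.

```python
from typing import List, Optional


def normalize_page_nums(
    page_num_list: Optional[List[int]], total_pages: int
) -> List[int]:
    """Map a page_num_list to zero-based, non-negative page indices."""
    if page_num_list is None:
        # None selects every page.
        return list(range(total_pages))

    normalized = []
    for num in page_num_list:
        # With total_pages == 2 the valid range is [-2, 1].
        if not -total_pages <= num < total_pages:
            raise ValueError(
                f"page number {num} is outside [{-total_pages}, {total_pages - 1}]"
            )
        # Negative values count from the end, like Python list indexing.
        normalized.append(num % total_pages)
    return normalized


print(normalize_page_nums(None, 2))
print(normalize_page_nums([-2, 1], 2))
print(normalize_page_nums([-1], 2))
```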

Command-line usage

$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [--dpi DPI] [-f] [--page_num_list [PAGE_NUM_LIST ...]] pdf_path

positional arguments:
  pdf_path

options:
  -h, --help            show this help message and exit
  --dpi DPI
  -f, --force_ocr       Whether to use ocr for all pages.
  --page_num_list [PAGE_NUM_LIST ...]
                        Which pages will be extracted. e.g. 0 1 2.

$ rapidocr_pdf tests/test_files/direct_and_image.pdf --page_num_list 0 1
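For readers who want to wrap the same options in their own script, the help text above maps onto a standard `argparse` definition. The parser below is reconstructed from that help output alone and is not taken from the rapidocr_pdf source; in particular, the `--dpi` default of `None` and the `int` argument types are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in `rapidocr_pdf -h`."""
    parser = argparse.ArgumentParser(prog="rapidocr_pdf")
    parser.add_argument("pdf_path", type=str)
    # The real default DPI is not shown in the help text; None is a placeholder.
    parser.add_argument("--dpi", type=int, default=None)
    parser.add_argument(
        "-f", "--force_ocr", action="store_true",
        help="Whether to use ocr for all pages.",
    )
    # nargs="*" matches the "[PAGE_NUM_LIST ...]" shape in the usage line.
    parser.add_argument(
        "--page_num_list", type=int, nargs="*", default=None,
        help="Which pages will be extracted. e.g. 0 1 2.",
    )
    return parser


args = build_parser().parse_args(
    ["tests/test_files/direct_and_image.pdf", "--page_num_list", "0", "1"]
)
print(args.pdf_path, args.force_ocr, args.page_num_list)
```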

Input and output

Input: Union[str, Path, bytes]

Output: a List of [page number, text content, confidence] entries; see the example below:

[
    [0, '人之初，性本善。性相近，习相远。', 0.8969868],
    [1, 'Men at their birth, are naturally good.', 0.8969868],
]
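A result in this shape is straightforward to post-process. The sketch below groups text by page and drops low-confidence entries; `texts_by_page` and its `min_score` threshold are hypothetical helpers for illustration, and the sample data is copied from the example above rather than produced by a real run.

```python
from collections import defaultdict
from typing import Dict, List

# Sample data in the documented [page number, text, confidence] shape.
results = [
    [0, "人之初，性本善。性相近，习相远。", 0.8969868],
    [1, "Men at their birth, are naturally good.", 0.8969868],
]


def texts_by_page(results: List[list], min_score: float = 0.5) -> Dict[int, str]:
    """Group extracted text by page, skipping entries below min_score."""
    pages = defaultdict(list)
    for page_num, text, score in results:
        # float() guards against scores that arrive as strings.
        if float(score) >= min_score:
            pages[page_num].append(text)
    # Join multiple entries for the same page with newlines, in page order.
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}


for page, text in texts_by_page(results).items():
    print(page, text)
```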

