PDFparser
A demo PDF parser that combines OCR and object-detection tools. It covers PDF layout module recognition, extraction of multi-level headings, and more.
Requirements
I strongly recommend testing on Linux.
pip install -r requirements
pip install "unstructured[pdf]"
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
# Install layoutparser together with its Detectron2 backend, which provides the CV models
pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"
# layoutparser also supports a PaddleDetection backend
pip install "layoutparser[paddledetection]"
For installing unstructured, please refer to its official installation guide. More details on the layout models are available in the layoutparser documentation.
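After running the install steps above, a quick check can confirm that the pieces are actually visible to Python. The following is a minimal sketch, not part of this repository: the `check_environment` helper and the two default lists are hypothetical names, and the module names are assumed to match the pip packages installed above.

```python
import importlib.util
import shutil

# Assumed import names for the packages installed above (hypothetical list).
REQUIRED_MODULES = ["unstructured", "layoutparser", "torchvision", "detectron2"]
# Command-line tools expected on PATH (tesseract-ocr installs "tesseract").
REQUIRED_BINARIES = ["tesseract"]


def check_environment(modules=REQUIRED_MODULES, binaries=REQUIRED_BINARIES):
    """Return a dict mapping each dependency name to True (found) or False."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:15s} {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING points to the install command that probably needs to be rerun.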
How to use
# Extraction of Multi-level Headings
python multi_title.py
# Extraction of other elements
python parser.py
# Note: the test files used in multi_title.py can be generated with the tools in parser.py ('23.2307.14893.json' is output from unstructured; 'test2_photo' comes from the pdf2image tool).
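To illustrate what extracting multi-level headings from the unstructured JSON could look like, here is a small sketch. It is not the logic in multi_title.py, which is only partly shared; it is a simple heuristic that infers the level from section numbering such as "2.1". It assumes the JSON file is a list of objects with "type" and "text" keys, and the function names `heading_level`, `extract_headings`, and `load_headings` are hypothetical.

```python
import json
import re

# Matches a leading section number such as "2", "2.1" or "2.1.3." followed by text.
NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


def heading_level(text):
    """Return the heading depth implied by the numbering, or None if unnumbered."""
    match = NUMBERING.match(text)
    if match is None:
        return None
    # "2" -> level 1, "2.1" -> level 2, "2.1.3" -> level 3
    return match.group(1).count(".") + 1


def extract_headings(elements):
    """Collect (level, text) pairs from elements whose type is "Title"."""
    headings = []
    for element in elements:
        if element.get("type") != "Title":
            continue
        text = element.get("text", "").strip()
        level = heading_level(text)
        if level is not None:
            headings.append((level, text))
    return headings


def load_headings(json_path):
    """Read a JSON file of elements and return its numbered headings."""
    with open(json_path, encoding="utf-8") as handle:
        return extract_headings(json.load(handle))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        {"type": "Title", "text": "1 Introduction"},
        {"type": "NarrativeText", "text": "Some body text."},
        {"type": "Title", "text": "2.1 Data"},
    ]
    for level, text in extract_headings(sample):
        print("  " * (level - 1) + text)
```

A numbering heuristic like this misses unnumbered headings and can misread titles that merely start with a number, which is one reason a layout-based approach using font size or detected regions may be preferable.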
Visualization of Extracted Multi-level Headings

Notes
Due to project constraints, I can only share a portion of the code. However, feel free to ask any questions.
References
- https://unstructured.io/
- https://github.com/Layout-Parser/layout-parser/tree/main
- https://github.com/PaddlePaddle/PaddleDetection
- https://github.com/PaddlePaddle/PaddleOCR
- https://github.com/tesseract-ocr/tesseract
