DocParse
A hybrid Computer Vision pipeline designed to transform high-variance smartphone document photos into structured, digital PDFs. It bridges the gap between raw pixel data and semantic document understanding using a combination of classical geometric processing and Deep Learning.
DocParse: Intelligent Document Reconstruction Pipeline
DocParse is a hybrid computer vision pipeline designed to transform high-variance smartphone photos of documents into structured, searchable, and geometrically corrected PDF files. It bridges the gap between raw pixel data and digital document reconstruction using a combination of classical geometric computer vision and state-of-the-art Deep Learning models.
🚀 Key Features
📐 Phase I: Geometric Correction (Classical CV)
- Automatic Corner Detection: Utilizes multi-channel edge detection and morphological processing to find document boundaries.
- Sub-pixel Refinement: Refines corner coordinates to decimal precision for superior rectification.
- Interactive Correction: A Streamlit-based UI allows users to manually adjust corners if automatic detection fails.
- Illumination Normalization: Applies CLAHE (Contrast Limited Adaptive Histogram Equalization) and adaptive thresholding to remove shadows and uneven lighting.
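As a rough illustration of the rectification step, the sketch below orders four detected corners and derives the output page size, which are the inputs a perspective warp typically needs. `order_corners` and `target_size` are hypothetical names, not functions from this repository, and the sum/difference ordering heuristic is a common convention assumed here, not the project's confirmed method.

```python
import math


def order_corners(pts):
    """Return corners as [top-left, top-right, bottom-right, bottom-left].

    Assumes image coordinates (y grows downward) and a page rotated by
    well under 45 degrees; the heuristic breaks down for stronger rotations.
    """
    by_sum = sorted(pts, key=lambda p: p[0] + p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]
    by_diff = sorted(pts, key=lambda p: p[0] - p[1])
    bottom_left, top_right = by_diff[0], by_diff[-1]
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(corners):
    """Pick the output width/height from the longest opposing edges."""
    tl, tr, br, bl = corners
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)


# Sub-pixel (float) corners work the same way as integer ones.
corners = order_corners([(100.0, 0.0), (0.0, 0.0), (0.0, 50.0), (100.0, 50.0)])
size = target_size(corners)
```

The ordered corners and target size could then be passed to a perspective transform such as OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective`.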
🧠 Phase II: Semantic Layout Analysis (Deep Learning)
- YOLO-DocLayNet Inference: Uses a YOLOv10 model fine-tuned on the DocLayNet dataset to segment the document into semantic regions: Title, Text, Table, Figure, Table Caption, Table Footer, Figure Caption, Formula, Formula Caption.
- Reading Order Resolution: Implements the Recursive XY-Cut algorithm to sort detected elements into a natural reading order (top-down, left-right), handling multi-column layouts correctly.
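To make the reading-order step concrete, here is a minimal sketch of a recursive XY-cut over bounding boxes given as `(x0, y0, x1, y1)` tuples. The function names, the `min_gap` parameter, and the choice to try column cuts before row cuts are assumptions for illustration; the code in `src/segmentation/` may differ.

```python
def _split(boxes, axis, min_gap):
    """Group boxes separated by empty gaps along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups, current = [], [ordered[0]]
    reach = ordered[0][axis + 2]  # farthest extent covered so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:
            groups.append(current)  # empty band found: close the group
            current = [box]
        else:
            current.append(box)
        reach = max(reach, box[axis + 2])
    groups.append(current)
    return groups


def xy_cut(boxes, min_gap=10):
    """Return boxes in reading order using recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try vertical cuts (columns) first, then horizontal cuts (rows).
    for axis in (0, 1):
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g, min_gap)]
    # No clean cut exists: fall back to a simple top-down, left-right sort.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


title = (50, 20, 550, 60)
left_1, left_2 = (50, 100, 280, 300), (50, 320, 280, 500)
right_1, right_2 = (320, 100, 550, 250), (320, 270, 550, 500)
reading_order = xy_cut([right_2, left_1, title, right_1, left_2])
```

In this example the full-width title blocks any column cut at the top level, so the page is first split into title and body, and only then is the body split into left and right columns.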
📝 Phase III: Hybrid OCR & Content Extraction
- Context-Aware OCR: Dynamically adjusts Tesseract Page Segmentation Modes (PSM) based on the semantic label (e.g., treating a "Title" differently from a "Table Cell").
- Table Structure Parsing: Uses morphological projection profiles to reconstruct the grid structure of tables, converting them into editable data rather than static images.
- Data Cleaning: Includes heuristic post-processing and spellchecking to correct common OCR errors (e.g., 0 vs. O).
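Two of the ideas above can be sketched in a few lines: choosing a Tesseract page segmentation mode from the layout label, and repairing digit/letter confusions. The label-to-PSM mapping, the helper names, and the majority-vote cleanup rule are assumptions for illustration, not the project's actual settings. The PSM values themselves are standard Tesseract modes (6 = single uniform block of text, 7 = single text line).

```python
import re

# Assumed mapping from layout label to Tesseract PSM.
PSM_BY_LABEL = {
    "Title": 7,       # treat as a single text line
    "Table Cell": 7,  # cells are usually one short line
    "Text": 6,        # a uniform block of text
}
DEFAULT_PSM = 6

TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})
TO_LETTER = str.maketrans({"0": "O"})


def tesseract_config(label):
    """Build the config string for a region with the given label."""
    return f"--psm {PSM_BY_LABEL.get(label, DEFAULT_PSM)}"


def fix_confusions(text):
    """Repair look-alike characters by majority vote within each token.

    A crude heuristic: mixed tokens such as 'Room101' can be damaged,
    so a real pipeline would likely add dictionary or context checks.
    """
    def repair(match):
        token = match.group(0)
        digits = sum(c.isdigit() for c in token)
        letters = sum(c.isalpha() for c in token)
        if digits > letters:
            return token.translate(TO_DIGIT)
        if letters > digits:
            return token.translate(TO_LETTER)
        return token

    return re.sub(r"[A-Za-z0-9]+", repair, text)


cleaned = fix_confusions("Invoice 2O24 total 1O0 for R0OM")
```

The config string is the kind of value that could be passed as the `config` argument of `pytesseract.image_to_string`, assuming that wrapper is the one in use.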
📄 Phase IV: PDF Synthesis
- Layout Preservation: Reconstructs the document on a digital canvas using the detected coordinates.
- Searchable Assets: Embeds a hidden text layer behind images and tables, ensuring the entire PDF is searchable (Ctrl+F compatible).
- Dynamic Typography: Automatically scales font sizes to fit the text within the original bounding boxes.
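The font-scaling idea can be approximated as a search for the largest size at which the wrapped text still fits its bounding box. The sketch below is a simplified model: `fit_font_size` is a hypothetical helper, and the average character width (`char_w`) and line height (`line_h`) factors are assumed constants. A real implementation would measure text with the PDF library's font metrics instead.

```python
import math


def fit_font_size(text, box_w, box_h, max_size=24.0, min_size=4.0,
                  char_w=0.5, line_h=1.2):
    """Return the largest font size (in 0.5 pt steps) that fits the box.

    Width is estimated as char_w * size per character and wrapping is
    done by character count, which ignores word boundaries.
    """
    size = max_size
    while size > min_size:
        chars_per_line = max(1, int(box_w / (char_w * size)))
        lines = math.ceil(len(text) / chars_per_line)
        if lines * line_h * size <= box_h:
            return size
        size -= 0.5
    return min_size


short_size = fit_font_size("Hi", box_w=200, box_h=30)
long_size = fit_font_size("x" * 100, box_w=200, box_h=30)
```

Short text keeps the maximum size, while longer text is shrunk until the estimated line count fits the box height.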
🛠️ Project Structure
DocParse/
├── data/
│ ├── raw/ # Input images
│ └── output/ # Generated PDFs and debug visuals
├── models/
│ └── weights/ # YOLOv10 .pt checkpoints
├── src/
│ ├── scanner/ # Phase I: Geometry, Hough, Filters
│ ├── segmentation/ # Phase II: Inference, XY-Cut Sorting
│ ├── ocr/ # Phase III: Tesseract Engine, Table Parser
│ ├── synthesis/ # Phase IV: PDF Generation (PyMuPDF)
│ └── utils/ # Config loader, image helpers
├── app.py # Streamlit Interactive Dashboard
├── config.yaml # Centralized configuration
└── requirements.txt # Python dependencies
💻 Installation
- Clone the Repository
  git clone https://github.com/GitGud-f/DocParse.git
  cd DocParse
- Install Dependencies: It is recommended to use a virtual environment.
  pip install -r requirements.txt
- Install Tesseract OCR
  - Linux: sudo apt-get install tesseract-ocr
  - Windows: Download the installer from UB-Mannheim and set the path in config.yaml.
- Model Weights: Download the YOLO weights (doclayout_yolo_docstructbench_imgsz1024.pt) and place them in models/weights/.
⚡ Usage
Running the Interactive Web App
The best way to experience the pipeline is via the Streamlit dashboard.
streamlit run app.py
- Sidebar: Select "Upload New Image" or choose a sample.
- Phase I: Verify the detected red corners. Drag them if necessary.
- Phase II: Click "Run Deep Layout Analysis" to see segmentation boxes.
- Phase III: Click "Extract Text & Data" to perform OCR.
- Phase IV: Click "Download PDF" to get the reconstructed document.
⚙️ Configuration
Modify config.yaml to tune parameters:
preprocessing:
  resize_height: 1024
segmentation:
  model:
    conf_threshold: 0.45
ocr:
  lang: "eng"
postprocessing:
  enable_spellcheck: True
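For reading these values in code, a small dotted-key lookup with defaults is one option. The sketch below uses a plain dictionary as a stand-in for the parsed file. The nesting shown is an assumption about how the keys are grouped, and `get_setting` is a hypothetical helper, not the loader in `src/utils/`.

```python
# Hypothetical stand-in for the parsed config.yaml; the nesting is assumed.
CONFIG = {
    "preprocessing": {"resize_height": 1024},
    "segmentation": {"model": {"conf_threshold": 0.45}},
    "ocr": {"lang": "eng"},
    "postprocessing": {"enable_spellcheck": True},
}


def get_setting(config, dotted_key, default=None):
    """Look up a nested value such as 'segmentation.model.conf_threshold'."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default  # missing key: fall back instead of raising
        node = node[part]
    return node


threshold = get_setting(CONFIG, "segmentation.model.conf_threshold")
psm = get_setting(CONFIG, "ocr.psm", default=6)
```

Returning a default for missing keys keeps the pipeline usable when a config file omits optional parameters.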
