DocParse: Intelligent Document Reconstruction Pipeline

DocParse is a hybrid computer vision pipeline that transforms high-variance smartphone photos of documents into structured, searchable, and geometrically corrected PDF files. It combines classical geometric computer vision with deep learning models to get from raw pixel data to a reconstructed digital document.

🚀 Key Features

📐 Phase I: Geometric Correction (Classical CV)

Automatic Corner Detection: Utilizes multi-channel edge detection and morphological processing to find document boundaries.
Sub-pixel Refinement: Refines corner coordinates to sub-pixel (fractional) precision for more accurate rectification.
Interactive Correction: A Streamlit-based UI allows users to manually adjust corners if automatic detection fails.
Illumination Normalization: Applies CLAHE (Contrast Limited Adaptive Histogram Equalization) and adaptive thresholding to remove shadows and uneven lighting.
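
As a rough illustration of the geometry step, the sketch below orders four detected corners and derives the size of the rectified page. It is a minimal standard-library sketch, not the project's actual scanner code; the function names `order_corners` and `target_size` and the sample coordinates are hypothetical.

```python
import math

def order_corners(pts):
    """Return corners as (top-left, top-right, bottom-right, bottom-left)."""
    # In image coordinates (y grows downward) the top-left corner has the
    # smallest x + y and the bottom-right the largest; the top-right has the
    # largest x - y and the bottom-left the smallest.
    tl = min(pts, key=lambda p: p[0] + p[1])
    br = max(pts, key=lambda p: p[0] + p[1])
    tr = max(pts, key=lambda p: p[0] - p[1])
    bl = min(pts, key=lambda p: p[0] - p[1])
    return tl, tr, br, bl

def target_size(tl, tr, br, bl):
    """Width and height of the rectified page, using the longer opposite edges."""
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)

# Hypothetical sub-pixel corners, in arbitrary order.
corners = [(412.6, 38.2), (30.4, 51.7), (455.1, 602.3), (12.9, 640.8)]
tl, tr, br, bl = order_corners(corners)
width, height = target_size(tl, tr, br, bl)
```

The ordered corners and the target size are the inputs a perspective warp would need, for example OpenCV's `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective`.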

🧠 Phase II: Semantic Layout Analysis (Deep Learning)

YOLO-DocLayNet Inference: Uses a YOLOv10 model fine-tuned on the DocLayNet dataset to segment the document into semantic regions:
- Title, Text, Table, Figure, Table Caption, Table Footer, Figure Caption, Formula, Formula Caption.
Reading Order Resolution: Implements the Recursive XY-Cut algorithm to sort detected elements into a natural reading order (top-down, left-right), handling multi-column layouts correctly.
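
The reading-order idea can be sketched in a few lines. This is a simplified take on Recursive XY-Cut, assuming boxes are `(x0, y0, x1, y1)` tuples; it tries a column cut first and falls back to a horizontal cut at the first whitespace gap. The names `find_gap` and `xy_cut` are hypothetical and the project's implementation may choose cuts differently.

```python
def find_gap(boxes, axis, min_gap=0):
    """Return a coordinate inside the first empty gap along `axis`, or None.

    axis=0 projects the boxes onto x (looks for a column gutter);
    axis=1 projects them onto y (looks for a horizontal band of whitespace).
    """
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    reach = spans[0][1]  # farthest coordinate covered so far
    for start, end in spans[1:]:
        if start > reach + min_gap:
            return (reach + start) / 2
        reach = max(reach, end)
    return None

def xy_cut(boxes, min_gap=0):
    """Sort boxes into reading order by recursively cutting at whitespace gaps."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (0, 1):  # prefer splitting into columns, then into rows
        cut = find_gap(boxes, axis, min_gap)
        if cut is not None:
            first = [b for b in boxes if b[axis + 2] < cut]
            second = [b for b in boxes if b[axis] > cut]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))

title = (0, 0, 100, 10)
left_1, left_2 = (0, 20, 45, 50), (0, 60, 45, 90)
right_1, right_2 = (55, 20, 100, 50), (55, 60, 100, 90)
boxes = [right_2, left_1, title, right_1, left_2]
ordered = xy_cut(boxes)
# ordered is: title, left_1, left_2, right_1, right_2
```

Because the full-width title blocks any column cut at the top level, the first cut is horizontal; the remaining boxes then split into two columns, each read top to bottom.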

📝 Phase III: Hybrid OCR & Content Extraction

Context-Aware OCR: Dynamically adjusts Tesseract Page Segmentation Modes (PSM) based on the semantic label (e.g., treating a "Title" differently from a "Table Cell").
Table Structure Parsing: Uses morphological projection profiles to reconstruct the grid structure of tables, converting them into editable data rather than static images.
Data Cleaning: Includes heuristic post-processing and spellchecking to correct common OCR errors (e.g., 0 vs O).
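
To make the context-aware OCR idea concrete, here is a minimal sketch of choosing a Tesseract page segmentation mode from the layout label. The label-to-PSM table is an illustrative guess, not the mapping DocParse ships with, and `tesseract_config` is a hypothetical helper. The PSM numbers themselves are standard Tesseract modes.

```python
# Hypothetical mapping from layout label to Tesseract PSM (illustrative values).
PSM_BY_LABEL = {
    "Title": 7,           # PSM 7: treat the image as a single text line
    "Text": 6,            # PSM 6: assume a single uniform block of text
    "Table Cell": 7,      # cells usually hold one short line
    "Figure Caption": 6,
    "Table Caption": 6,
}
DEFAULT_PSM = 3           # PSM 3: fully automatic page segmentation

def tesseract_config(label, lang="eng"):
    """Build keyword arguments for an OCR call based on the region's label."""
    psm = PSM_BY_LABEL.get(label, DEFAULT_PSM)
    return {"lang": lang, "config": f"--psm {psm}"}

title_args = tesseract_config("Title")
unknown_args = tesseract_config("Formula")
```

Assuming the pytesseract wrapper is used, a cropped region could then be read with `pytesseract.image_to_string(crop, **tesseract_config("Title"))`.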

📄 Phase IV: PDF Synthesis

Layout Preservation: Reconstructs the document on a digital canvas using the detected coordinates.
Searchable Assets: Embeds a hidden text layer behind images and tables, ensuring the entire PDF is searchable (Ctrl+F compatible).
Dynamic Typography: Automatically scales font sizes to fit the text within the original bounding boxes.
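
Font scaling of this kind is often done with a bisection search over the font size. The sketch below assumes a crude width model (average glyph width of 0.5 em, line height of 1.2 em) instead of real font metrics; `fits` and `fit_font_size` are hypothetical names, and a real implementation would measure text with the PDF library's own metrics (for instance PyMuPDF's `get_text_length`).

```python
import textwrap

def fits(text, box_w, box_h, size, char_w=0.5, line_h=1.2):
    """Estimate whether `text` fits a box (in points) at font `size`."""
    chars_per_line = int(box_w / (size * char_w))
    if chars_per_line < 1:
        return False
    # textwrap breaks over-long words, so no line exceeds chars_per_line.
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return len(lines) * size * line_h <= box_h

def fit_font_size(text, box_w, box_h, lo=4.0, hi=72.0, tol=0.25):
    """Return a size in [lo, hi] that fits, as large as bisection finds; None if `lo` fails."""
    if not fits(text, box_w, box_h, lo):
        return None
    if fits(text, box_w, box_h, hi):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(text, box_w, box_h, mid):
            lo = mid   # mid fits: keep it as the best known size
        else:
            hi = mid   # mid overflows: shrink the upper bound
    return lo

paragraph = "DocParse reconstructs the document on a digital canvas. " * 4
size = fit_font_size(paragraph, box_w=300, box_h=120)
```

The search keeps the invariant that `lo` always fits, so the returned size never overflows the box under the assumed width model.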

🛠️ Project Structure

DocParse/
├── data/
│   ├── raw/                  # Input images
│   └── output/               # Generated PDFs and debug visuals
├── models/
│   └── weights/              # YOLOv10 .pt checkpoints
├── src/
│   ├── scanner/              # Phase I: Geometry, Hough, Filters
│   ├── segmentation/         # Phase II: Inference, XY-Cut Sorting
│   ├── ocr/                  # Phase III: Tesseract Engine, Table Parser
│   ├── synthesis/            # Phase IV: PDF Generation (PyMuPDF)
│   └── utils/                # Config loader, image helpers
├── app.py                    # Streamlit Interactive Dashboard
├── config.yaml               # Centralized configuration
└── requirements.txt          # Python dependencies

💻 Installation

Clone the Repository

git clone https://github.com/GitGud-f/DocParse.git
cd DocParse

Install Dependencies

It is recommended to use a virtual environment.
```
pip install -r requirements.txt
```
Install Tesseract OCR
- Linux: sudo apt-get install tesseract-ocr
- Windows: Download the installer from UB-Mannheim and set the path in config.yaml.
Model Weights

Download the YOLO weights (doclayout_yolo_docstructbench_imgsz1024.pt) and place them in models/weights/.
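
A quick way to confirm the installation steps above is a small pre-flight check. This is an optional sketch using only the standard library; `check_setup` is a hypothetical helper and is not part of the repository.

```python
import shutil
from pathlib import Path

# Path from the installation steps, relative to the repository root.
WEIGHTS = Path("models/weights/doclayout_yolo_docstructbench_imgsz1024.pt")

def check_setup(root=".", tesseract_cmd="tesseract"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # shutil.which accepts a bare command name or a full path to the binary.
    if shutil.which(tesseract_cmd) is None:
        problems.append(f"Tesseract executable not found: {tesseract_cmd}")
    if not (Path(root) / WEIGHTS).is_file():
        problems.append(f"Missing model weights: {WEIGHTS}")
    return problems

if __name__ == "__main__":
    for problem in check_setup():
        print("WARNING:", problem)
```

On Windows, pass the path configured in config.yaml as `tesseract_cmd` if Tesseract is not on the PATH.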

⚡ Usage

Running the Interactive Web App

The best way to experience the pipeline is via the Streamlit dashboard.

streamlit run app.py

Sidebar: Select "Upload New Image" or choose a sample.
Phase I: Verify the detected red corners. Drag them if necessary.
Phase II: Click "Run Deep Layout Analysis" to see segmentation boxes.
Phase III: Click "Extract Text & Data" to perform OCR.
Phase IV: Click "Download PDF" to get the reconstructed document.

⚙️ Configuration

Modify config.yaml to tune parameters:

preprocessing:
  resize_height: 1024

segmentation:
  model:
    conf_threshold: 0.45

ocr:
  lang: "eng"
  postprocessing:
    enable_spellcheck: True
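
The nested keys above map naturally onto nested dictionaries once the file is parsed (typically with a YAML parser such as PyYAML's `yaml.safe_load`). The sketch below shows one way to layer user overrides on top of defaults and read a value by dotted path; `deep_merge` and `lookup` are hypothetical helpers, and the project's own config loader in src/utils/ may work differently.

```python
# Defaults mirroring the config.yaml excerpt above.
DEFAULTS = {
    "preprocessing": {"resize_height": 1024},
    "segmentation": {"model": {"conf_threshold": 0.45}},
    "ocr": {"lang": "eng", "postprocessing": {"enable_spellcheck": True}},
}

def deep_merge(base, override):
    """Return a new dict where nested keys in `override` replace those in `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def lookup(cfg, dotted_key):
    """Read a nested value, e.g. lookup(cfg, 'ocr.lang')."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node

# Example: raise the detection confidence threshold, keep everything else.
user_cfg = {"segmentation": {"model": {"conf_threshold": 0.6}}}
cfg = deep_merge(DEFAULTS, user_cfg)
```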
